VCS Nirvana

Being generally obsessed with source-control means that I’m always interested in seeing better VCS implementations than the ones we already have. Here’s my list of features that I’d want for the next VCS tech. Note: This is an update to a previous blog posting, and follows a new Git release which addresses something I’ve been talking about for a while - “sparse clone” (see below).

1. Unlimited Repo Size.

Subversion, Perforce and PlasticSCM can go up to terabytes of history quite easily, whereas Git has a hypothetical top limit. Git can’t go to terabytes yet. Sure, Git-LFS allows Git to function at a level where it can handle large amounts of video files more easily, but it’s not quite built-in.

All VCS systems keep all history on the server-side, of course. For Git you’d use --depth x to clone less history to your client, but you’re always going to clone at least one version of every dir/file (without other ‘sparse’ tricks). With the Subversion or Perforce clients it is implicit --depth 1 history at all times. PlasticSCM has modes of operation where it can function like Git or Subversion in this respect.

Unlimited size suits huge development organizations like Google who popularized monorepos, and games companies trying to target a big graphics thing to a number of platforms in a single branch. Monorepos go hand in hand with Trunk-Based Development in this way, but you can have extremely large repos without TBD. You can also have TBD without large repos, or indeed with many repos instead on a monorepo.

2. Separate read/write permissions down to dir/file level

Subversion, Perforce and PlasticSCM can maintain read and write permissions for each directory or file. They can also group users together to make for terser config for that. For Subversion the “Authz” technology is very confusing and error-prone in the hands of novices. This Authz config though text, is outside source-control. For Perforce there’s a similar mechanism that leverages globbing. That isn’t text or under source control either, even if it is queryable from so-authorized users on the client-side over the VCS’s APIs.

Subversion and Perforce can veto an attempt to commit a change to a resource that the user is not authorized to change. PlasticSCM can veto such changes when there’s an attempt to push commits back upstream, but can’t block what the user does on their own workspace.

3. Clone/checkout at a sub-directory (sub-tree) level

With Subversion you can “svn co” a subdirectory and that is all that comes down to your client (no parent directories). The same is true for Perforce and PlasticSCM (with workspace/client mapping and other tricks). Both of those are only the latest versions per branch/tag.

Git can’t do sub-directory clones/checkouts - you have to clone from the root folder for at least the branch in question. Note that GitHub added a Subversion client capability for their Git implementation, and if you switched to ‘svn’ for a checkout of a Git repo hosted there, you could do so from a sub-folder of the repo. That’s a trick, really though.

3.1 Sparse checkouts

Git, Perforce and Subversion have sparse checkout. They work slightly differently in each case, and Perforce does not refer to it as ‘sparse’. This is where you’ve set a mask on the client-side (client spec) and your operation to checkout working copy can be subset via masks/filters. For example, you could omit specific folders, or file types.

Git’s is easier to use, I think, but Git didn’t have sparse clones until v2.25. Because of that, there would not be a saving on the client’s storage for the local .git/ folder, even if the working copy modified as part of the checkout operation is reduced. Not only that but no saving on time taken to clone the repo in the first place (or keep abreast of changes upstream).

3.2 Sparse Clones

For Subversion and vanilla Perforce there isn’t a clone in the Git sense as ‘checkout’ was the metaphor for getting versions from an upstream repo. Git “sparse clone” was delivered with v2.25 though (see GitHub’s blog entry: bring your monorepo down to size with sparse checkout, so we’re in an exciting new era.

3.3 Google’s expanding/contracting way of working

For their monorepo (formerly Perforce, now Piper), the directory structure of the VCS checkout to your filesystem could change day to day. As an AdSense developer, you could have a sub-set of the whole 9 million source file monorepo that suited all that you needed and no more. You could still commit and push those changes back just fine. One day you could be asked to change something that affected the UI of AdSense and AdWords and you would expand your checkout to include the second application with a simple command that queries the Bazel BUILD files on the server. You’d load both into the same IDE and be able to make changes, including cross-cutting refactorings, at full speed. I go into that with Googlers Subset their Trunk and others in the same category.

4. Direct ‘update’ of specific files.

Hypothetically, Subversion, Perforce and Plastic can “put” a single file resource in the repo without having checked it out first (or any aspect of the repo at all). This is super useful for uses of VCS that are not developer’s regular “source project” uses.

It should be noted that Gitea, RhodeCode, and GitHub itself (after a fashion), have added the ability for you to effectively PUT a resource to a Git remote repo without first having cloned it. I have a few proofs of concept that utilize that:

5. Push/pull bottleneck

Subversion and Perforce do not have to be up to date, before committing back. Sure, both will object to the attempt to commit, if changes would smush into someone else’s changes that arrived upstream first, but in the case where the server-side can merge the changes without asking the user to arbitrate, they can do so.

Git needs you to pull (and resolve conflicts) before you push changes back. As a developer your workflow could be git-push, note clash, git-pull, arbitrate over any merge conflicts, git-commit, git-push. If someone else beats you again, you could be repeating that cycle.

6. Arbitrary branching models

Subversion and Perforce allows you to make branches at any point in the directory tree, but that’s a really sharp knife that you can hurt yourself with, and many enterprises have. Take a quick look at http://svn.apache.org/repos/asf/spamassassin/. ‘asf’ is the root of the repo, and the team looking after spamassassin have their own directory at the top level. Inside that, they have trunk/ tags/ and branches/ directories. Take a look at http://svn.apache.org/viewvc/axis and note that the Axis team has two roots of their own axis1 and axis2. Inside each of those, they have a ‘java’ and ‘c’ directory, and inside each of those they have trunk/ tags/ and branches/ directories. Complicated even if it is organized - branches and tags can be mounted anywhere. In a Monorepo configuration, nobody misses arbitrary branching - and all branches would be made from the true root. That notwithstanding and enforced Trunk-Based Development way of working.

PlasticSCM which has Perforce-scale as a design goal does not allow arbitrary branching. In that regard, it is the same as Git and Mercurial, in that the branch is created and maintained at the root directory (whole repo).

On balance, while this is a difference, I’m not sure is a huge plus for the VCS techs that support it, as most enterprises hurt themselves with it (inextricably linked with a slow release cadences).

7. Continuous Review & Patch queue built-in

Continuous Review was pioneered by Google internally in their Mondrian system, then delivered for the masses by GitHub from its 2008 launch. If you’re using GitHub (the public portal or the enterprise app that you install on-premises), it stores the code reviews in a Postgres database that you don’t have access to. Well, you do for the on-prem install (in an appliance) but you have to reverse engineer the schema in order to tamper with the record. It would be better if the reviews themselves were in source control, in my view.

Google’s Mondrian augmented Perforce with review (and other per-commit meta-into) in a relation scheme. Thus for them (and all at the mega-team high-throughput level), it is a patch queue. For smaller teams doing Trunk-Based Development variants, GitFlow and others then temporary branches (or forks in GitHub-land) are whether the changes themselves are stored.

I’ve made the case for code review commentary to be in source-control to the GitHub people. But I could not convince them. You’d get to take advantage of the same Merkle tree concepts to be able to slide smoothly into the audit cycle with a single SHA1 as the representation of the last code review commentary that made it to that release. If in the same repo/branch then the SHA1 would be for the source code itself and the reviews the same. Yes, as single SHA1 again.

The Subversion team has made a ‘shelve’ function that isn’t based on branches (lightweight or not), but it is early days for that technology. A build-in patch queue is long overdue for them, and the tools-makers have augmented their subversion offerings with review and queue functionality (RhodeCode, CollabNet, Assembla). That and independent OSS efforts like Rietveld, (Gerrit is the tech that followed that one) and Phabricator.

PlasticSCM has queue and review built-in for their tech, and Perforce have added it with their GitLab-based systems that bundle the fabled ‘p4d’ server side.

8. Direct access to hash representations of the Merkle tree

If you do git checkout <HASH> (after cloning), then Git takes you back in time to that moment in time quite quickly, regardless of where which branch or tag that may be on. Subversion and Perforce don’t do Merkle tree hashes per commit, and only keep sequential numbers for the commits. Were Subversion and Perforce able to do that, then they’d encounter a new problem: the Merkle tree would be different for each set of permissions for users/groups.

Hashes are SHA1 today (Git, Subversion for specific files), but at some point, Git will move to SHA-256 as for hashes, and there is parallel development of that in master with notes as to progress making it into Git’s release notes

9. Replication built-in.

Git allows cloning to achieve some sort of distribution. Your “server” could go up in smoke, and someone with the latest clone could use that to recreate a new upstream server for the rest of the team to use. Subversion perhaps more than Perforce feels to be from the old vertically-scaled world where backups and failover are the principal strategies. It would be great for members of a global dev team to be able to push/pull/clone/commit to their local server and that (behind the scenes) replicates the other servers around the world. Plastic can do this with scripts, and Perforce with config, but Git’s attempt to do this would hit the push:pull bottleneck I suspect.

I spied rqlite v5’s release a week ago, and thought that the Raft consensus algorithm squaring this circle would be cool. As it happens rqlite uses sqlite which substantially by Dr. Richard Hipp who also maintains a VCS called Fossil (Fossil is self-hosted, and Sqlite is maintained in Fossil). Fossil itself doesn’t do “Direct ‘update’ of specific files”, which is a shame. Otherwise, the tech is very interesting. Anyway, I was thinking what if there were a VCS on top of rqlite or at least the raft consensus protocol.

10. Integrated third-party app hosting

(Relates to #4)

You would be able to install apps on the server-side that give extra functionality alongside the core VCS functions. These apps would run in a sandboxed way (say docker), but per-request could leverage the “is or isn’t authenticated” determination of the VCS routing traffic to it. The apps themselves would be served up from the same HTTPS domain as the content (notwithstanding other protocols). For example a bug tracker or wiki that stores content on the same basis as the regular files under source control. I’ve coded a few apps like this before - hacked into existing tools: Rhodecode (2016), GitHub (2015), Perforce (2013), vanilla Git (2012) (as linked to earlier).

11. Redaction features.

(Jan 11th, 2021 additon)

Say I had a commit that added /foo/bar/Paul_McCartney.txt and had contents about him dying in October 1969 that was so salacious that #TeamMacca took me to court and in a settlement got me to agree to redact that “fact”. I should be able to go and write something to my nirvana-esque VCS that shows the item is gone for good, but the merkle tree is intact. The SHA1 of Paul_McCartney.txt is 57eac801b9ee82c62233e9e6f11a5c6c0f5969a3. If the contents were He died in 1969 and it has been a big cover-up ever since with doubles (SHA1 - 812ad73ea0a55d0b4f27d31f78ea469a2f35af2c), then I’d expect to see a .redactions.txt file like so …

57eac801b9ee82c62233e9e6f11a5c6c0f5969a3 812ad73ea0a55d0b4f27d31f78ea469a2f35af2c

… appear in that directory. Well, perhaps SHA256 instead of SHA1 and the contents would contain whatever one-way hashing info is needed to keep the history-retaining merkle tree intact. It gets complicated when you think the file might have received revisions and renames, but a system could be made to work. As well as the merkle tree remaining intact, no revision of Paul_McCartney.txt is retrievable. Nor is the name recorded in that form anywhere - just the SHA1/SHA256 remains. Of course, it is never just name and contents, it is date and other meta-data too (at least for Git). If someone somewhere else has hung on to the fact that the hash maps to the previous name/contents/other-meta, then that is their business and wholly outside the VCS system of record.

There would need to be a legal acceptance that one-way hashing of something is not legally the same as the unhashed aspects of the same. That seems obvious, but putting 10,000 PhDs on standby to explain it in court cases isn’t practical, so being written as accepted as law seems inevitable. There’d be some Pi should 4.0 not 3.14 for convenience reasons back and forth for a while before it gets accepted but it would ultimately get there.

← Previous Archive Next →

Published

January 19^th, 2020

Reads:

Paul Hammant's Blog: VCS Nirvana