I’m wishing for a rise in the use of formal source/version control for things other than source code, and I see signs that it is slowly happening.

I have been fascinated with source control (where the items stored are primarily textual) for some time. When I say source control, others would say version control, and I would too, but only where what is stored is primarily binary. I’m not a fan of proprietary version control though. You know, like where a wiki or a CMS has versions of artifacts, and a mechanism to navigate through their history and potentially compare revisions. What I don’t get with those is the ability to check artifacts out as a set (or as of a specific prior moment in time), operate on some or all of them, then check them in again (as a set and atomically). What I like is for tools that purport to support the versioning of artifacts to allow me to configure my own formal version-control backend. I also like such applications to use that facility as the primary store for their artifacts. Indeed, I don’t like anything that stores items in a relational schema when those items would be better suited to a version-control system.

I look at the likes of Mongo and similar document / key-value stores, and smile. They are so close. I love their advanced document indexing and querying. What I’d love more is the ability to work on sets of documents after a checkout. There are 30 years of PhD thinking that have pushed forward the science of version control (and merging and branching), and I think that work would have a happy home boosting the functionality of the document and key-value stores available today.

A hypothetical example

From the command line (of course):

hamgo checkout from inbox where content contains "the rise of VCS" and sender endswith @thoughtworks.com

I might want to do traditional grep-style unix work on that set after checking them out, to fuel my wish to crunch data and produce stats. I might also like to make changes to some of the documents, and do a commit afterwards:

cd inbox
perl -p -i -e 's/TAGS\[/TAGS[TW_VCS /g' *.email
hamgo commit -am "tag emails from TWers talking to me or mail-lists about my rise of VCS topic"
hamgo push
cd ..
rm -rf inbox # delete the checkout/clone thingy
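
The hamgo commands above are hypothetical, but the middle of that workflow can be approximated with plain Git today. What follows is a minimal sketch under stated assumptions: the emails have already been exported as one file each, the `From:` header and the `TAGS[]` line are made-up formats, and the file names and identity are placeholders. It selects by sender and content, tags the matches, and commits them as one atomic change.

```shell
#!/bin/sh
set -eu

# Hypothetical setup: a throwaway repository standing in for the "inbox" checkout.
work=$(mktemp -d)
cd "$work"
git init -q inbox
cd inbox
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

# Three made-up emails; only 1.email matches both the sender and the content criteria.
printf 'From: a@thoughtworks.com\nTAGS[]\n\nthe rise of VCS\n' > 1.email
printf 'From: b@example.org\nTAGS[]\n\nthe rise of VCS\n' > 2.email
printf 'From: c@thoughtworks.com\nTAGS[]\n\nlunch on Friday?\n' > 3.email
git add .
git commit -q -m "import inbox"

# Select: sender ends with @thoughtworks.com AND content mentions the topic.
matches=$(grep -l '^From: .*@thoughtworks\.com$' *.email | xargs grep -l 'the rise of VCS')

# Mutate: add a tag to each selected email (the "[" must be escaped in the pattern).
for f in $matches; do
  sed 's/TAGS\[/TAGS[TW_VCS /' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

# Commit every modified file as a single atomic change.
git commit -q -am "tag emails from TWers about the rise of VCS topic"

# List the files touched by that commit.
git diff --name-only HEAD~1 HEAD
```

With these sample files the last command lists only `1.email`. The commit is all-or-nothing, which covers the "check in as a set and atomically" part; what plain Git lacks is the query-driven checkout against a document store.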

Incidentally, do I think email should be under source control? Perhaps not, though there could be benefits to modifying your own emails. Deletion is an obvious function and, with the safety of history behind it, moves the idea from “nuts” to “dubious”. In a previous article I wished for a pervasive inbox, which called for rewriting of emails on the server side, amongst other things.

Note that Mongo does have a shell, but without the syntax I showed above and without a checkout/clone capability.

More things for formal source control

  • Publishing / Content management systems, as previously mentioned.
  • Internal collaborative document systems - SnirtLabs has enterprise-ready solutions for that.
  • Issue trackers / Story management tools - documents for issues/stories/tasks, that tend towards completion.

Document stores for .doc, .ppt (etc) will have to handle binary formats, and the version-control system backing them must be good at smart deltas for those file types. Stories/issues, if represented as text, are suitable, but for those the need for branching is very weak (see “Very Small Data” under Further reading below).
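
One existing mechanism that points in this direction is Git's `textconv` diff driver, which converts a file to text before diffing it. The following is only a sketch, assuming `unzip` is installed and the files are zipped-XML formats such as .docx (older .doc and .ppt files are not ZIP archives and would need a different converter). The driver name `docx` is arbitrary. It changes what `git diff` displays, not how Git stores the file.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the demo (hypothetical location).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Route .docx files through a diff driver named "docx".
printf '*.docx diff=docx\n' > .gitattributes

# The driver turns a .docx into text by printing its main XML part.
# Git appends the file path as an argument, which sh exposes as $0 here.
git config diff.docx.textconv "sh -c 'unzip -p \"\$0\" word/document.xml'"

# Confirm which diff driver Git would pick for a (not yet existing) file.
git check-attr diff -- report.docx
```

Once a .docx is committed and later changed, `git diff` would then compare the extracted XML rather than report that binary files differ.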

Further reading

I published before about SCM and Key-Value Store Convergence (2012).

There’s also Nearly All CMS Technologies Suck (2014) where I break down why, for CMS platforms in particular.

Very Small Data (2012) goes into a bunch of reasons as to why/when source-control generally.



Published

December 8th, 2014
Reads: 721

Syndicated by DZone.com
Reads: 17282

Comments formerly in Disqus, but exported and mounted statically ...


Wed, 07 Jan 2015 - Jeff Dickey

Good luck using "document stores for .doc, ppt (etc)" on non-Windows systems (actually the vast majority of computing devices that would want access to a VCS). Actually, the first VCS system I worked with was 1972-vintage SCCS. Like many in the craft, our team now makes extensive use of Git.

Or perhaps this was intended as snark?

Thu, 30 Jul 2015 - Mark Levison

Paul - I tried version control for all of my training materials, writing etc for about three years until I got a warning from BitBucket that my repo was larger than 1GB. That killed the experiment.

Issues
- no decent diff tools that understand .docx .xlsx files etc.
- no instant updates

Strangely Dropbox with a history function has worked better for me, although my software developer soul is sad.

Thu, 30 Jul 2015 - Jeff Dickey

Mark, see my comment below regarding binary files. Binary files are treated as unitary blobs in most VCSes.

.docx and .xlsx files are actually ZIP (compressed binary) files containing a number of XML files. BitBucket quite rightly saw those as binaries and refused to diff them, instead preserving each revision in its original binary form.

You can unzip a .docx or .xlsx into its own directory and add that to BitBucket (or update an existing repo), which should store each of the component XML files as text files in BB (which will then be happy to diff them, for what that's worth). To recover a previous version, check it out into a fresh directory and then create a new ZIP file from the contents, renaming that .zip file to .docx (etc) as appropriate. Word (or Excel) will then be able to open the file just as it had previously been saved.

Yes, that's a pain. No, that's not BitBucket's fault; it takes a lot of text files and revisions to same to consume 1 GB in a BitBucket repo. Blame Ballmer-era Microsoft for choosing reduced disk space over truth in advertising — OfficeOpen XML (as .docx and .xlsx are, which is not the same as "OpenOffice XML") is a binary vendor proprietary "standard", not an XML-based open standard.

Thu, 30 Jul 2015 - Mark Levison

I understand the problem albeit I thought that Git did implement binary differencing. Real XML files for Word et al? That might work well or be very very painful. Having worked on an SGML parser a hundred years ago I have a hate hate relationship with XML/SGML etc :-)

Fri, 31 Jul 2015 - Jeff Dickey

I fought the Parser Wars for about five years myself, back in the C++/dot-bomb days, and can painfully relate. I never said they were well-formed XML; just that, by being text files, they were far less actively hostile to VCSes like Git and Mercurial/BitBucket.

IIRC, Git does implement binary differencing, but abandons attempts to diff when it decides that the diff exceeds some threshold percentage (100%?) of the new binary. I dealt with binary diffs and merges back in the UCSD p-System days, and the memories that brings are far more visceral and terrifying than anything the Parser Wars did to/for/with me.

Cheers.

Fri, 31 Jul 2015 - paul_hammant

I'd pay for Git to be changed to silently unzip .docx, .xlsx and .pptx in the .git/ folder and only reconstitute them in the working copy. So I'm only interested in the carriage-return-delimited text diffs (incl XML) if I'm diffing at all.