I’d love Git to really understand MS Office docs, but it doesn’t. A previous blog entry on a startup’s effort to make MS Word play natively with Git was an indication of that interest. It came up again today at work, and coincidentally in the Disqus comments for an older blog entry of mine, “The rise of version control”.

Anyway, I thought I’d spike what Git would do were it reworked to silently unzip (on commit) and rezip (as it makes the working copy). Here’s a repo - git-word-diff-test. Here’s a commit of a simulated storage (the unzipped contents) of a Word doc made with the Mac version of Microsoft Office - Mary.docx - which just contains “Mary had a little lamb”, one word per line, and a single commit that changed that to “Mary had a little iPad”.
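The unzip-on-commit, rezip-on-checkout round trip can be approximated by hand. A minimal sketch, assuming the Info-ZIP `zip` and `unzip` tools are installed; `unpack_docx` and `repack_docx` are hypothetical helper names, not anything Git provides:

```shell
# Hypothetical helpers (illustrative names, not Git features).

# Explode a .docx (a zip archive) into a directory of plain files.
unpack_docx() {
    docx=$1
    dir=$2
    rm -rf "$dir" && mkdir -p "$dir" && unzip -q -o "$docx" -d "$dir"
}

# Rebuild a .docx from a directory previously produced by unpack_docx.
repack_docx() {
    dir=$1
    docx=$2
    # Resolve the output path first, because zip runs from inside $dir.
    out_dir=$(cd "$(dirname "$docx")" && pwd) || return 1
    out="$out_dir/$(basename "$docx")"
    # Remove any old archive, otherwise zip would update it in place.
    rm -f "$out"
    # -X omits extra file attributes, -D omits directory entries.
    (cd "$dir" && zip -q -X -D -r "$out" .)
}
```

Usage would be something like `unpack_docx Mary.docx Mary._docx` before committing and `repack_docx Mary._docx Mary.docx` after checkout. The rebuilt archive is unlikely to be byte-identical to what Word wrote, since compression settings and entry order can differ.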

It turns out that the commit is pretty noisy. Click the “Split” button on that page, and search in the page for “iPad”. That is the real change. It is a shame that a bunch of seemingly random and time-based stuff changes at the same time. Microsoft: please try to make these things idempotent. I’ve also reformatted the XML on the ‘pretty’ branch to make it easier to see the change between the two revisions. Microsoft: consider formatted/pretty XML - it makes no difference if you’re zipping it.
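Reformatting along the lines of the ‘pretty’ branch might look like the following. This is a sketch only, assuming `xmllint` (from libxml2) is installed; `pretty_xml_tree` is a hypothetical helper name, and this is not necessarily how that branch was produced:

```shell
# Hypothetical helper: re-indent every XML part of an unpacked Office
# document in place. Assumes xmllint (from libxml2) is installed.
pretty_xml_tree() {
    find "$1" -type f \( -name '*.xml' -o -name '*.rels' \) |
    while IFS= read -r part; do
        # Write to a temp file first so a parse failure leaves the original intact.
        xmllint --format "$part" > "$part.tmp" && mv "$part.tmp" "$part" || rm -f "$part.tmp"
    done
}
```

Running `pretty_xml_tree Mary._docx` over an unpacked document (the directory name is just an example) should leave each element on its own line, which gives line-based diffs something smaller to report.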

Despite that noise, the second commit was only a 170-byte addition to the .git/ blobs, when the .docx file is ordinarily 23 KB in size.

Byte diff calc for the HEAD commit:

# Sum the blob sizes in HEAD's tree and in its parent's tree, then print the difference in bytes.
COMMITSHA=$(git rev-parse HEAD)
CURRENTSIZE=$(git ls-tree -lr "$COMMITSHA" | awk '$2 == "blob" { sum += $4 } END { print sum + 0 }')
PREVSIZE=$(git ls-tree -lr "$COMMITSHA^" | awk '$2 == "blob" { sum += $4 } END { print sum + 0 }')
echo "$((CURRENTSIZE - PREVSIZE))"

(modified from a Stack Overflow answer)
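The total hides which files inside the archive actually changed size. A per-file breakdown could be sketched like this; `blob_deltas` is a hypothetical helper name, and the sketch assumes the commit has a parent:

```shell
# Hypothetical helper: list "<size change in bytes><TAB><path>" for every file
# that differs between a commit (default HEAD) and its first parent.
blob_deltas() {
    commit=${1:-HEAD}
    git diff-tree -r "$commit^" "$commit" |
    while read -r old_mode new_mode old_sha new_sha status path; do
        old_size=0
        new_size=0
        # An all-zero object name means the file is absent on that side.
        case $old_sha in *[!0]*) old_size=$(git cat-file -s "$old_sha") ;; esac
        case $new_sha in *[!0]*) new_size=$(git cat-file -s "$new_sha") ;; esac
        printf '%d\t%s\n' "$((new_size - old_size))" "$path"
    done
}
```

These are uncompressed blob sizes, so they show how much each file's content grew or shrank, not how many bytes Git ends up writing to disk after compression.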

As I sorta said in the Disqus comments, I’d pay for Git to be changed to silently unzip .docx, .xlsx and .pptx documents and only reconstitute them in the working copy when I check out. I’m only interested in the line-by-line text diffs (incl. XML), if I am diffing at all. Diffs on the binary form of such zips are just wrong.
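Stock Git does not do the silent unzip, but its existing textconv diff drivers cover part of the diffing half. A minimal sketch, assuming `unzip` is installed; `enable_docx_textconv` is a hypothetical helper name, and the driver name `docx` is arbitrary:

```shell
# Hypothetical helper: run inside a repository to make `git diff` compare the
# main XML part of .docx files instead of the zipped bytes.
enable_docx_textconv() {
    # Route *.docx files through a diff driver named "docx".
    printf '%s\n' '*.docx diff=docx' >> .gitattributes
    # Git appends the file path to the textconv command, so it arrives
    # as $0 of the inner sh -c.
    git config diff.docx.textconv "sh -c 'unzip -p \"\$0\" word/document.xml'"
}
```

This only changes what `git diff` displays. The repository still stores the zipped .docx as an opaque blob, so it does not give the storage saving measured above, and Word's long unformatted XML lines would still make for coarse diffs.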

Updates

Sept 1st, 2015

Inside the zip, there’s a vbaProject.bin for Word/Excel/PowerPoint docs that have VBA. This raises the bar on the idea, as it is binary. Luckily there is open-source know-how that will allow this otherwise less-than-open-standard piece to be unpacked too: Philippe Lagadec’s oletools.

Jul 13th, 2016

I discovered “Followup: Managing EXCEL with git - diff problem”, a preexisting conversation on the Git mailing list about the same problem.



Published

July 30th, 2015

Syndicated by DZone.com

Comments formerly in Disqus, but exported and mounted statically ...


Fri, 31 Jul 2015 Mark Levison

For better or worse I've moved past using Git for dealing with my business back office. Version control should simply be baked in to the OS. Dropbox is starting to get it right to a limited degree.

Sat, 29 Aug 2015 paul_hammant

Mark, what does Dropbox do in the "clash" situation I outline in https://paulhammant.com/2014... ?

Sun, 30 Aug 2015 Mark Levison

Funny thing is I don't know. I've been operating my business on Dropbox for ~18 months and have yet to create the problem. My wife has been doing so for far longer. Even more interesting, MS seems to be building support for the cloud into Office. I assume (untested) that this will allow synchronous editing.

I think Word documents and spreadsheets avoid the problem that code has, since we don't do large refactorings that span 100s of documents at one time.

I'm beginning to wonder if applying version control to documents is the right approach at all.

Sun, 30 Aug 2015 paul_hammant

Have a look at https://www.versionrocket.com/ - it thunks a Git-aware side panel into MS Word. Also http://sparkleshare.org/ (Dropbox meets GitHub), though it needs work to unzip-merge-rezip as I outline, but at least it has a single place where that could happen w/o the Git maintainers needing to agree with the strategy.

Mon, 31 Aug 2015 Mark Levison

Paul - thanks. Both tools look cool. Sadly VersionRocket doesn't support Macs yet.

Cheers
Mark

Sat, 29 Aug 2015 Oskar Gewalli

You could do it with Git hook scripts and perhaps a file watcher (that continually unzips) so that you can view the changes. Then add a .gitignore entry for the zipped files.

Sat, 29 Aug 2015 paul_hammant

I don't think that would work. That said I'm often wrong. Care to do a proof of concept for a blog entry?

Sat, 29 Aug 2015 Oskar Gewalli

Depends on what kind of behaviour you want. You could have a pre-commit hook to unzip into a folder and a post-checkout hook to zip folders into files. Then you need other hooks to handle rebase, merge etc. You could have some naming convention for the folders that should turn into Word files. You will need to run some script to install the hooks when cloning.

Sun, 30 Aug 2015 paul_hammant

Surely commit hook scripts can only work on things that are not in the .gitignore list?

Sun, 30 Aug 2015 Oskar Gewalli

Yes, that's why you have the folders named something like document._docx so that the scripts can recreate the file document.docx.

Fri, 11 Sep 2015 Dale Visser

You could always save in Open Document format. I assume (without actually having tested) that it has better idempotency properties: https://en.wikipedia.org/wi...