Paul Hammant's Blog: Git storing unzipped office docs
I’d love Git to grok MS Office docs, but it doesn’t really. It came up again today at work and, coincidentally, in the Disqus comments for an older blog entry of mine, The rise of version control.
Anyway, I thought I’d spike what Git would do, were it reworked to silently unzip (for commit) and rezip (as it makes the working copy). Here’s a repo: git-word-diff-test. Here’s a commit of a simulated storage of a Word doc made with the Mac version of Microsoft Office: Mary.docx, which contains just “Mary had a little lamb”, one word per line, followed by a single commit that changed that to “Mary had a little iPad”.
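To make the unzip-for-commit, rezip-for-working-copy idea concrete, here is a rough sketch of that round trip done by hand rather than by Git. It assumes the Info-ZIP `zip` and `unzip` tools are installed, and the function names are made up for illustration. Whether Office treats the repacked archive exactly like the original is not something this sketch checks.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Git.

# Unpack an Office document (a zip archive) into a directory tree.
unpack_office_doc() {
    doc=$1
    dir=$2
    mkdir -p "$dir" && unzip -q -o "$doc" -d "$dir"
}

# Rebuild the archive from a previously unpacked directory tree.
repack_office_doc() {
    dir=$1
    doc=$2
    # Resolve the output path before changing directory.
    case $doc in
        /*) out=$doc ;;
        *)  out=$PWD/$doc ;;
    esac
    rm -f "$out"
    # -r recurse, -X omit extra file attributes, -D skip directory entries
    (cd "$dir" && zip -q -r -X -D "$out" .)
}
```

Something like `unpack_office_doc Mary.docx Mary.docx.d` before committing the directory, and `repack_office_doc Mary.docx.d Mary.docx` afterwards, approximates by hand what the spike simulates.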
It turns out that the commit is pretty noisy. Click “Split” on that page and search in the page for “iPad”: that was the only real change. It is a shame that a whole bunch of seemingly random and time-dependent stuff changes at the same time. Microsoft: try to make idempotent things, please. I’ve also reformatted the XML on the ‘pretty’ branch to make it easier to see the change between the two revisions. Microsoft: consider formatted/pretty XML; it makes no difference if you’re zipping it.
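For a feel of why line-broken XML diffs better, here is a deliberately crude sketch that puts each tag boundary on its own line. It is not how the ‘pretty’ branch was produced, and the helper name is hypothetical. It inserts a newline between adjacent tags and does no indentation, so treat the output as something to read and diff, not to zip back into a document.

```shell
#!/bin/sh
# Hypothetical helper: break "><" into ">" newline "<" so that a line-based
# diff can isolate a one-word change instead of flagging one giant line.
split_xml_tags() {
    awk '{ gsub(/></, ">\n<"); print }'
}
```

Assuming `unzip` is available, `unzip -p Mary.docx word/document.xml | split_xml_tags` would print the main document part one tag per line. A real formatter such as `xmllint --format` is the better choice where it is installed.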
Despite that noise, the second commit was only a 170-byte addition to the .git/ blobs, when the .docx file is ordinarily 23 KB in size.
Byte diff calc for the HEAD commit:
# Resolve HEAD directly instead of scraping the output of `git log`
COMMITSHA=$(git rev-parse HEAD)
# Sum the size column of every blob in the tree (field 4 of `git ls-tree -l`)
CURRENTSIZE=$(git ls-tree -lr "$COMMITSHA" | awk '$2 == "blob" { sum += $4 } END { print sum + 0 }')
PREVSIZE=$(git ls-tree -lr "$COMMITSHA^" | awk '$2 == "blob" { sum += $4 } END { print sum + 0 }')
# Difference in total blob bytes between HEAD and its parent
echo $((CURRENTSIZE - PREVSIZE))
As I sorta said in the Disqus comments, I’d pay for Git to be changed to silently unzip .docx, .xlsx and .pptx documents and only reconstitute them in the working copy when I check out. I’m only interested in line-by-line text diffs (including XML), if I am diffing at all. Diffs on the binary aspects of such zips are just wrong.
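Git as it stands can get partway there on the diffing side through a `textconv` diff driver, which converts a file to text just for display. The sketch below is only that: storage is untouched (the zipped binary is still what gets committed), so it is not the unzip-on-commit behaviour wished for above. The driver name `docx` and the function name are arbitrary choices, `unzip` is assumed to be installed, and only the main Word part is shown. .xlsx and .pptx files keep their content under different part paths and would need their own drivers.

```shell
#!/bin/sh
# Hypothetical setup helper: run it inside an existing Git repository.
setup_docx_textconv() {
    # Route .docx files through a diff driver named "docx" (arbitrary name).
    printf '%s\n' '*.docx diff=docx' >> .gitattributes
    # Git appends the file path to the textconv command and reads text from
    # its stdout. Here that text is the main document part of the archive.
    git config diff.docx.textconv "sh -c 'unzip -p \"\$0\" word/document.xml'"
}
```

After running `setup_docx_textconv`, `git diff` and `git log -p` on a .docx should show changes to the XML of the main document part instead of “Binary files differ”.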
Updates
Sept 1st, 2015
Inside the zip, there’s a vbaProject.bin for Word/Excel/PowerPoint docs that have VBA. This raises the bar on the idea, as it is binary. Luckily there is open-source know-how that will allow this otherwise less-than-open-standard piece to be unpacked too: Philippe Lagadec’s oletools.
Jul 13th, 2016
I discovered “Followup: Managing EXCEL with git - diff problem”, a preexisting conversation on the Git mailing list around the same problem.