Paul Hammant's Blog: Corporate File Sync: Agony and Ecstasy
Snirt - a new technology for SCM backed file-sync
Seeing file history in the Snirt app, with a preview of the contents to the right:
Opening the file from the Snirt app launches MS Word, with a Snirt panel in the view to the right:
I’ve been playing with a file-sync technology designed for an “own cloud” solution that should be particularly attractive to corporations. It’s called Snirt, and it’s from a startup, Snirt Labs, who are slowly emerging from stealth. Right now it’s a Windows client that does file-sync to/from a remote repo, and facilitates team collaboration the same as Dropbox and others do. What is novel is that Snirt is also an extensive document management app and integrates with MS Word, Excel, and PowerPoint to make multi-person editing of documents safe. The specific scenario I want to be safe is where two people are editing a document and might have conflicting changes. Snirt allows you to merge the otherwise conflicting changes, and it works very well in that regard. The version that’s available now uses Git as a backing store, and even integrates with GitHub for a larger management experience. If the GitHub repo you’re pointing at is public, Snirt is available for free. Where this stuff is different:
- While editing a doc, you can choose when to save it back to the remote repo vs. save locally. The former is a commit; the latter is just an incremental save to the working copy, the habit of old-timers who never knew when MS Word would crash and lose their work.
- After editing a doc, and in a potential clash situation, you have a graphical workflow to arbitrate over “yours” and “theirs” choices, in order to make a single canonical version going forwards.
Pre-existing file-sync solutions may have rudimentary version control built in, but nothing as powerful as this, at least for the document types that are supported first class. Snirt is Windows-only for now, but there could well be Mac versions later (depending on additional rounds of VC funding, as well as customer demand). With prioritization, these guys could also hit alternate VCS backends. Snirt Labs are working on the Subversion provider/backend for a forthcoming release, as well as implementing a meaningful and semantically correct rename facility (which is hard) for both files and directories. I’d love for Perforce to be targeted as a backend because of its industrial strength. For that matter, though I don’t love it, TFS needs to be targeted too because that’s the type of thing Microsoft technology shops choose by default.
File Share/Sync for a previous client
Some months ago I had a corporate client in the finance space that used SharePoint, and needed a more sophisticated file share/sync allowing groups of users with different permissions in one “space”. Group one would be a team working incrementally on documents for publication. Groups two to six would be vendors who’d want to subscribe to those documents. The vendor groups would also have their own directories within the document application, but not be able to see each other’s directories (nor even the existence of other vendors or their employees).
We looked at Confluence, Jive, Alfresco and a fresher version of SharePoint than the client had installed at that moment, but none were right: these were more portals, and we needed something that centered on file share/sync.
In January, disgruntled with the pure-web strategy of the incumbent portal solutions, I penned “Browser Downloads Suck”. I would have written more directly about the client’s problem (office document management and dissemination) but they’d not have liked that, so I used a contrived case, my paystubs, and turned the blog entry into a positive story. The synopsis is that browser apps are very poor local file interop tools. When you upload a document, the browser can’t delete the source, yet that’s a needed workflow for such things. Sometimes you want to move a file from your Desktop to the implicit VCS. Snirt allows that (or copy - your choice).
SparkleShare is a suitable technology at the hobbyist end: it’s good, and it uses Git behind a file-syncing protocol. It doesn’t do anything special for arbitration on clashes though. We also looked at AeroFS, ownCloud, CmisSync (CMIS is trying to take over from WebDAV), BitTorrent Sync, and Seafile.
We couldn’t find a solution, so nothing changed, and I’m long gone from that client now. I suspect, though, I’m going to see the same problems again and again.
VCS systems as backends - further thoughts
MS document formats are zips.
Snirt uses the built-in Microsoft tools for a natural review/merge mechanism. The trouble is that .docx (etc) documents are stored inside Git as binaries that get no space optimization from deltas. What I mean by that is if you make a one-word change to a 100K Word document, the two revisions together take 200K on the server (and in any client-side Git clone).
Inside the .docx file it is actually a bunch of fairly regular XML. It would be great if Git had an option to “rezip” certain file suffixes. Meaning, as Git consumes a .docx file, it would unzip it, natively store the plain text, and casually reassemble the zip in the working copy folder. Git’s zipping algorithms would need to be compatible with MS Word’s unzipping algorithms, but that should be doable - right? This way a second revision of the 100K .docx file would only use 1K more storage, while both Word-compatible .docx revisions are still available for loading into Word. The Snirt guys can just wait for this feature of Git to become available, and instantly use it for space-saving advantage.
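The rezip idea can be tried outside Git. What follows is a minimal Python sketch under my own assumptions, not anything Git or Snirt does: `explode` and `reassemble` are hypothetical helper names, and the "document" in the usage note is any zip archive with XML members rather than a real Word file.

```python
import io
import zipfile
from typing import Dict


def explode(docx_bytes: bytes) -> Dict[str, bytes]:
    """Unzip a .docx-style archive into {member name: uncompressed bytes}.

    The uncompressed XML members are what a delta-friendly store would keep.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def reassemble(members: Dict[str, bytes]) -> bytes:
    """Zip the stored members back into one archive, in their original order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Calling `explode(reassemble(explode(data)))` gives back the same members, so the content survives the round trip. The rebuilt archive is not guaranteed to be byte-for-byte identical to what Word wrote (timestamps and compression details can differ), which is the compatibility question I raise above.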
Back in the day, Visual SourceSafe (VSS) had some enhancements for Word documents (pre the ‘x’ file suffixes) that made it better in ways that I forget. VSS was unsuitable for its core purpose of course, and even Microsoft didn’t use it.
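The rezip setting I’m wishing for might look like this (hypothetical: no such option exists in Git today):
git config --global version.zip.contents docx,pptx,xlsx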
Git of course has optimizations for Word docs, but they are only about diff. They do nothing to reduce the bytes stored.
If you’re using a VCS as a backend, it’ll have its own authentication system. That could be LDAP or Active Directory integrated; indeed it should be in any properly configured corporation. That suggests that the file share/sync agent should leverage the same. It definitely needs to differentiate activities from different users and ensure that the history makes sense given that.
Corporations have needs for directory permissions: both read and write, with managed groups of users sharing the same permissions for various things.
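To make the read/write and group requirement concrete, here is a small Python sketch. It is purely illustrative: the `Acl` class, its fields, and the `"r"`/`"rw"` mode strings are my own invention, not how Snirt, Perforce, or any other tool mentioned here models permissions.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Acl:
    # group name -> set of user names in that group
    groups: dict = field(default_factory=dict)
    # directory path -> {group name: "r" or "rw"}
    grants: dict = field(default_factory=dict)

    def _access(self, user, path):
        """Return the strongest mode granted on the path or any ancestor directory."""
        best = ""
        p = PurePosixPath(path)
        for directory in [p, *p.parents]:
            for group, mode in self.grants.get(str(directory), {}).items():
                if user in self.groups.get(group, set()) and len(mode) > len(best):
                    best = mode
        return best

    def can_read(self, user, path):
        return "r" in self._access(user, path)

    def can_write(self, user, path):
        return "w" in self._access(user, path)
```

With an authoring group granted `"rw"` on a publication directory, vendor groups granted `"r"` on it, and each vendor group granted `"rw"` only on its own directory, a vendor can read published documents but gets no access at all to another vendor’s directory. Hiding the very existence of other vendors would need more than this sketch covers.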
Perforce Commons uses the industrial-strength Perforce VCS as a back end. Pixar uses Perforce to version everything, so the base technology is a good match. Commons interoperates with Dropbox, which gives it a quick advantage in that there are stable Dropbox clients out there, and plenty of companies that have already rolled it out. Commons has a bunch of online tooling to allow for the management and arbitration of changes to files over a timeline, including the merging of Word and PowerPoint changes in the web interface.
Adding Commons could be easy for many companies. I’m not sure that feels right though. I think the Perforce people should release their own sync client to avoid something that feels like a mismatch - is the formal version control tool dominant or the sync agent? The Commons integration into the Dropbox world is watching for events in the Dropbox world and making revisions based on decisions thereafter. There’s one other problem in that many corporations want a totally private solution. That’s a space that Perforce has sold into successfully over the years, but Dropbox is definitely an in-public cloudy thing. That last point in itself is enough to significantly reduce the appeal. I’m not sure how Perforce’s sophisticated access control list setup for files/directories will work with Dropbox.
Bad Box Experience
Mid-month, I used Box for the very first time, and was bitten hard immediately. Box does “sync” well enough, but my first use case for any of these tools is “clash”. Box failed that test. Here’s my scenario, step by step:
- I invite colleague Pawel to a Box folder I created.
- I put an OmniGraffle diagram in there.
- He sees the file is sync’d, changes it and hits “save”
- In the web interface (me logged in), I see it is changed
- On my Mac, in Finder, I can’t see that changed revision
- I search for 10 minutes for a “sync now” option before giving up.
- I make a change to the same OmniGraffle diagram
- Box’s sync agent now wakes up and informs me that there’s a clash
- Box’s sync agent then goes on to unilaterally “fork” the document within the same folder and upload that fork.
- After that, I’m now in a detached state - I’m not sure how I get back to editing the original (it still has the original name in the title bar of OmniGraffle), but each save I do is going to make another “fork” (see below).
- Pawel isn’t explicitly notified of the clash.
- It’s non-obvious how to pull back from the fork situation.
Sync agents, if they know nothing about merge for that file type, should tell you they can’t save a document and give you choices:
- Abandon your change and accept theirs
- Assert your changes, and forcibly overwrite theirs (history still has theirs)
- Consent to do that fork thing that’s default with Box
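The three choices above are easy to sketch in Python. This is only an illustration of the decision, with made-up names (`Choice`, `resolve`, the “conflicted copy” suffix); it is not how Box or Snirt implements anything.

```python
from enum import Enum


class Choice(Enum):
    ACCEPT_THEIRS = "abandon my change and accept theirs"
    KEEP_MINE = "overwrite theirs, keeping theirs in history"
    FORK = "save my change as a separate copy"


def resolve(name, mine, theirs, history, choice):
    """Apply one choice; return {file name: content} that should exist afterwards.

    `history` is a list of earlier revisions of `name`, oldest first.
    It is appended to in place when theirs gets overwritten.
    """
    if choice is Choice.ACCEPT_THEIRS:
        return {name: theirs}
    if choice is Choice.KEEP_MINE:
        history.append(theirs)  # overwritten, but still recoverable
        return {name: mine}
    # Choice.FORK: keep theirs under the original name, mine alongside it
    stem, dot, ext = name.rpartition(".")
    fork_name = f"{stem} (conflicted copy){dot}{ext}" if dot else f"{name} (conflicted copy)"
    return {name: theirs, fork_name: mine}
```

The point of the sketch is that forking is just one branch out of three, and the agent has to ask before it can know which one the user wants.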
Sounds like a formal VCS thing, doesn’t it? I hope the Snirt guys change the field after getting rich. I also hope that they don’t sell out to a larger entity that reduces their service and tool/tech offerings in the years that follow.
My non-standard workflow with Snirt
I have employed a non-standard way to test the clash scenario. One client is a virtualized Windows instance on Skytap.com (courtesy of Snirt Labs). The other client is my Mac, working via the vanilla GitHub capability (that Snirt normally does the interop with for you). I wasn’t expecting Snirt to work in this scenario, but round-trip is a power-user feature, and Snirt is compatible with that mode of operation already, as you’ll see below.
Snirt tries hard to not allow a clash to happen. It looks for an update to a doc, immediately before launching an editor on that document. To force the clash scenario, you have to open in both, before you save & publish from one or the other.
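As I understand it, this is a check before editing paired with a second check at the moment of publication. A rough Python sketch of that logic follows. It reflects my reading of the behaviour, not Snirt’s actual code, and `Remote`, `needs_update_before_edit` and `can_publish` are names I made up.

```python
from dataclasses import dataclass


@dataclass
class Remote:
    head: str  # identifier of the newest revision on the server


def needs_update_before_edit(local_rev: str, remote: Remote) -> bool:
    """True when the server has a different (newer) revision than the local copy."""
    return local_rev != remote.head


def can_publish(base_rev: str, remote: Remote) -> bool:
    """Publishing is only safe if nobody has published since editing began."""
    return base_rev == remote.head
```

If both clients open the document while the server head is the same revision, neither sees a pending update. Once one of them publishes, the head moves on, and `can_publish` returns false for the other client’s base revision. That is the clash I am trying to force.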
Note also, that many of the pictures below are zoomed in to a section of a screen in order to reduce the file sizes for this article.
In Snirt’s Windows app, I can see two documents under version control for the repo in question:
I click on the Word document, which I’d added as Snirt set up the repo, and I can see a single revision with a preview:
Switching to the Mac and MS-Word (2011), I’ll add a preamble and ignore the spelling mistake (see the red highlighting):
To complete the round-trip aspect of this experiment, I have to commit and push from the command line on the Mac:
Snirt keeps meta-data in Git too, in a separate folder. In the case of the commit from my Mac there is no accompanying meta-data, but Snirt doesn’t seem to mind. It does mean that there is no publication of a preview picture though.
Back on the PC’s MS-Word (2013), I’ll fix some spelling and grammar (see the red highlighting):
I’ll prepare to publish by clicking “Save & Publish” and adding a commit message:
Ordinarily you expect the second “publish” button to push to GitHub. But this time I click it and the commit/push gets vetoed, as there’s a clash according to the latest from the server (Snirt checks at the moment of publication):
I’m offered a “resolve conflict” option, amongst others. Also, the Mac’s revision can now be seen on the PC:
After electing to resolve, I’m offered alternates as to how, with “merge” as the default:
I’m now presented with a Word doc, with the revisions presented for a standard Word “accept/reject & next/prev” workflow (see the red highlighting):
You then go on to save/publish the result for all to see. I’m looking forward to a future version that does automerge :)