Paul Hammant's Blog: Maven In A Google Style Monorepo
Google’s gigantic Monorepo:
- 86TB of history
- 9 million unique files
- one branch (Trunk Based Development)
- 25K developers all committing there
From my calcs (Googlers please correct me), there’s one commit to the trunk every 30 seconds, and their proprietary CI infrastructure keeps up with that on a per-commit basis. That’s a different story.
A Monorepo is where two or more teams with different deployable/shippable applications and/or services (and potentially different release cadences/schedules) exist in the same repo/branch. ‘The trunk’ of Trunk Based Development specifically.
Those checkouts could get big, right?
A level 2 Monorepo
(Martin Fowler will give this a better name hopefully)
The checkout to the developer’s workstation expands or contracts to the smallest amount of buildable/linkable modules needed to perform intended test and package operations. Not only that, but in a 100% provable way from a deep understanding of the directed graph of buildable things and the things that could use them.
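To make the "smallest amount of buildable modules" idea concrete, here is a minimal sketch of the graph walk involved. It assumes the dependency graph is already available as a plain mapping; the module names and the `modules_needed` helper are hypothetical, made up for illustration, and say nothing about how Blaze or Buck do it internally.

```python
from collections import deque

def modules_needed(wanted, depends_on):
    """Return the wanted modules plus everything they transitively depend on."""
    needed = set()
    queue = deque(wanted)
    while queue:
        module = queue.popleft()
        if module in needed:
            continue  # already visited via another path
        needed.add(module)
        queue.extend(depends_on.get(module, ()))
    return needed

# Hypothetical graph: module -> modules it depends on directly
DEPENDS_ON = {
    "app-a": ["lib-core", "lib-http"],
    "app-b": ["lib-core"],
    "lib-http": ["lib-core"],
    "lib-core": [],
}

print(sorted(modules_needed(["app-b"], DEPENDS_ON)))  # ['app-b', 'lib-core']
```

Working on `app-b` alone would leave `app-a` and `lib-http` out of the checkout entirely; asking for `app-a` would expand it to three modules.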
As ex-Googler, ex-Facebooker and Buck committer Simon Stewart points out, at some level this expand/contract thing is just a view or projection of a Monorepo. He also says:
“The amount of boilerplate required to get the Maven thing working is terrifying. Both Buck and Blaze let you just create a new directory, shovel classes into it, and you’re done. The “src” nonsense required to make mvn work without hoop jumping seriously raises the bar to multiple modules.”
He is right of course, but Maven is still the enterprise Gorilla.
The Maven challenge
Maven is a recursive build technology that forward declares child modules to build. Compare that to Google’s Blaze (partially open sourced as Bazel) and Facebook’s Buck, which are directed graph build systems where there are no forward declarations. It is no coincidence that expanding/contracting checkouts are easy with them, because that was the design goal of Blaze (the one that came first).
Maven’s forward module declarations look like this:
<modules>
  <!-- Maven works out the build order - phew! -->
  <module>aModule</module>
  <module>another-module</module>
  <module>this_one_has_child_modules_too</module>
</modules>
All those are directories within the current directory. Maven is going to take some coercing.
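One way to do that coercing, sketched here under assumptions rather than taken from any real script: treat the full POM as a template and drop every `<module>` entry whose directory is not in the working copy. The `prune_modules` helper and the regex approach are hypothetical, and a real POM (comments, profiles, nested module lists) would need more care.

```python
import re

# Matches one whole <module>...</module> line, including its indentation and newline
MODULE_RE = re.compile(r"[ \t]*<module>([^<]+)</module>[ \t]*\n?")

def prune_modules(pom_text, present_dirs):
    """Drop <module> entries whose directory is not in present_dirs."""
    def keep_or_drop(match):
        name = match.group(1).strip()
        return match.group(0) if name in present_dirs else ""
    return MODULE_RE.sub(keep_or_drop, pom_text)

POM = """<modules>
  <module>aModule</module>
  <module>another-module</module>
  <module>this_one_has_child_modules_too</module>
</modules>
"""

# In real use present_dirs might come from something like:
#   {d for d in os.listdir(".") if os.path.isdir(d)}
print(prune_modules(POM, {"aModule"}))
```

With only `aModule` checked out, the printed block lists a single module, so Maven no longer trips over directories that are not there.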
Ant and Gradle are the same, but I’ll focus on the Java enterprise gorilla - Maven.
Maven Monorepo proof of concept - in Git
I took Google’s Guava because it has a multi-module build that, although small, could be a surrogate for something with hundreds of modules: one that could represent a company’s entire set of (Java) deployable/shippable applications and/or services.
The default checkout of this repo doesn’t build as all the
No matter, run this:
Now you have POMs again.
mvn install runs as you’d expect now.
Running this on the command line:
mvn com.github.ferstl:depgraph-maven-plugin:aggregate -Dincludes=com.google.guava
Gives a Dot graph (GraphViz) that via some colorization in OmniGraffle looks like:
I’ve colored two items blue. Wouldn’t it be nice to modify the checkout (working copy) to have just those two modules, and for Maven not to choke because of missing modules?
Well that’s possible now. Do this:
git config core.sparsecheckout true
echo '/mr' > .git/info/sparse-checkout
echo '/README.md' >> .git/info/sparse-checkout
echo '/pom*' >> .git/info/sparse-checkout
echo '/guava/' >> .git/info/sparse-checkout
echo '/guava-testlib/' >> .git/info/sparse-checkout
mr/checkout.sh
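The same file contents could be generated from a list of module directories instead of typed by hand. This is a minimal sketch: the three always-included patterns are copied from the commands above, while the `sparse_checkout_lines` helper is hypothetical and only builds the text (writing it to `.git/info/sparse-checkout` is left to the caller).

```python
# Patterns kept in every checkout, as in the echo commands above
ALWAYS_INCLUDED = ["/mr", "/README.md", "/pom*"]

def sparse_checkout_lines(modules):
    """Build the text of a sparse-checkout file for the given module directories."""
    patterns = list(ALWAYS_INCLUDED)
    for module in modules:
        pattern = "/" + module.strip("/") + "/"
        if pattern not in patterns:  # avoid duplicate entries
            patterns.append(pattern)
    return "\n".join(patterns) + "\n"

print(sparse_checkout_lines(["guava", "guava-testlib"]), end="")
```

For `guava` and `guava-testlib` this prints the same five patterns that the commands above write.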
mvn install works on just two modules now, rather than the eight before. The dependency graph looks like this:
A Word of warning
If you are a Maven and Git using company, and you want to share code this aggressively in a Monorepo but worry about build times going up exponentially, then this expand/contract scripting is for you. Maybe even right now, without much extra tooling effort. You may have to do split history repos when the size of the .git folder goes above the recommended limit, but that is not much of a problem really, provided you have Git-LFS turned on. There’s nothing stopping you from going live with this immediately.
For the love of Turing though, have a lock-step version number for everything built in the monorepo. Maybe Maven’s classic 1.0-SNAPSHOT suffices, and in your CD-esque deployment technologies you designate something more meaningful in Jenkins (etc).
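A lock-step rule like that is easy to check mechanically. The sketch below is an assumption-laden illustration: `effective_version` and `versions_in_lock_step` are hypothetical helpers, the sample POMs are made up, and property-based versions such as `${revision}` are not resolved.

```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def effective_version(pom_xml):
    """Return the POM's own <version>, falling back to its <parent>'s version."""
    root = ET.fromstring(pom_xml)
    own = root.find("m:version", POM_NS)
    if own is not None and own.text:
        return own.text.strip()
    inherited = root.find("m:parent/m:version", POM_NS)
    if inherited is not None and inherited.text:
        return inherited.text.strip()
    return None

def versions_in_lock_step(pom_xmls):
    """True when every POM resolves to the same version string."""
    return len({effective_version(pom) for pom in pom_xmls}) <= 1

# Made-up POMs for illustration
HEAD = '<project xmlns="http://maven.apache.org/POM/4.0.0">'
PARENT = HEAD + "<artifactId>parent</artifactId><version>1.0-SNAPSHOT</version></project>"
CHILD = HEAD + ("<parent><artifactId>parent</artifactId><version>1.0-SNAPSHOT</version></parent>"
                "<artifactId>child</artifactId></project>")
ROGUE = HEAD + "<artifactId>rogue</artifactId><version>2.3</version></project>"

print(versions_in_lock_step([PARENT, CHILD]))         # True
print(versions_in_lock_step([PARENT, CHILD, ROGUE]))  # False
```

A check along these lines could run in CI and fail the build as soon as one module drifts to its own version.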
That’s easy, some more Python fu, that works with the first dot graph to allow you to conveniently modify .git/info/sparse-checkout. Like so:
mr/checkout.sh guava-testlib
# calculates that it needs guava too

# or

mr/checkout.sh guava-tests,guava-gwt
# calculates that it needs another 5: guava-testlib/test, guava,
# guava-testlib/test, guava-tests/test, and guava/test (and not
# guava-testlib at all)
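For a feel of what working with the dot graph involves, here is a minimal sketch that pulls the edges out of DOT text. It is not the script behind mr/checkout.sh: the `parse_dot_edges` helper is hypothetical, and the sample graph is a simplified, made-up stand-in for the plugin's real output, which carries fuller node identifiers and styling.

```python
import re
from collections import defaultdict

# Matches edges of the form "source" -> "target"
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

def parse_dot_edges(dot_text):
    """Return a mapping of node -> list of nodes it points at."""
    depends_on = defaultdict(list)
    for source, target in EDGE_RE.findall(dot_text):
        if target not in depends_on[source]:
            depends_on[source].append(target)
    return dict(depends_on)

# Simplified, hypothetical sample; real depgraph output is more verbose
SAMPLE_DOT = '''
digraph "guava-parent" {
  "guava-testlib" -> "guava"
  "guava-tests" -> "guava"
  "guava-tests" -> "guava-testlib"
}
'''

edges = parse_dot_edges(SAMPLE_DOT)
print(edges["guava-testlib"])  # ['guava']
```

From a mapping like this, a transitive walk over the edges gives the full set of directories to list in .git/info/sparse-checkout.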
A Monorepo in this style with Git will work if you don’t blow through the history size limits (the size of the .git folder). That is said to be about 1GB. If you go above that, you can always do the split history thing, and start a new repo with only the HEAD revisions of the former one seeding it, and have all that growing room again. That’ll make bisecting towards the root cause of an issue harder, but people will get over that. The other problem you have is that more than a certain number of commits a minute will be hard to keep up with, especially if you’re wanting to do a commit/push (and need people to kinda stop committing to facilitate that).
Well maybe I have not convinced anyone. Googlers and Xooglers were already convinced so they do not count. No matter, I am personally looking forward to having Git or Mercurial push their size boundaries to get to the place Perforce, Subversion and PlasticSCM can, so that companies can bank on Monorepo setups in this style.
While I have your attention
Can I direct your attention to a portal documenting ‘Trunk Based Development’ (incl Monorepos)? -> https://trunkbaseddevelopment.com
No Ads, no services being sold, mobile friendly. Well worth a look, in my opinion.