Google’s gigantic Monorepo:

  • 86TB of history
  • 9 million unique files
  • one branch (Trunk-Based Development)
  • 25K developers all committing there

From my calcs (Googlers please correct me), there’s one commit to the trunk every 30 seconds, and their proprietary CI infrastructure keeps up with that on a per-commit basis. That’s a different story.

Recapping what a Monorepo is

A Monorepo is where two or more teams with different deployable/shippable applications and/or services (and potentially different release cadences/schedules) share the same repo/branch: specifically, ‘the trunk’ of Trunk-Based Development.

Those checkouts could get big, right?

An Expanding/Contracting Monorepo (the next level up)

The checkout to the developer’s workstation expands or contracts to the smallest set of buildable/linkable modules needed to perform the intended test and package operations. Not only that, but it does so in a 100% provable way, derived from a deep understanding of the directed graph of buildable things and the things that could use them.
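
To make the "smallest set of modules" idea concrete, here is a minimal sketch in Python. The graph is made up for illustration: guava-testlib depending on guava matches the demo later in this post, while service-a and service-b are hypothetical names. It is not how Blaze or Buck do it, just a plain transitive walk over a dependency map.

```python
from collections import deque

# Hypothetical module graph: each module maps to the modules it depends on.
# "service-a" and "service-b" are invented names for illustration only.
DEPENDS_ON = {
    "guava": [],
    "guava-testlib": ["guava"],
    "service-a": ["guava"],
    "service-b": ["service-a", "guava-testlib"],
}


def modules_needed(targets, depends_on):
    """Return the targets plus everything they transitively depend on."""
    needed = set()
    queue = deque(targets)
    while queue:
        module = queue.popleft()
        if module in needed:
            continue  # already visited, so shared dependencies are walked once
        needed.add(module)
        queue.extend(depends_on.get(module, []))
    return sorted(needed)


print(modules_needed(["guava-testlib"], DEPENDS_ON))
print(modules_needed(["service-b"], DEPENDS_ON))
```

Asking for guava-testlib pulls in guava as well; asking for service-b pulls in all four modules. Anything not in the returned list could be left out of the checkout.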

At some level, as ex-Googler, ex-Facebooker and Buck committer Simon Stewart points out, this expand/contract thing is just a view or projection of a Monorepo. He also says:

“The amount of boilerplate required to get the Maven thing working is terrifying. Both Buck and Blaze let you just create a new directory, shovel classes into it, and you’re done. The “src” nonsense required to make mvn work without hoop jumping seriously raises the bar to multiple modules.”

He is right, of course, but Maven is still the enterprise gorilla.

The Maven challenge

Maven is a recursive build technology that forward declares child modules to build. Compare that to Google’s Blaze (partially open sourced as Bazel) and Facebook’s Buck, which are directed graph build systems where there are no forward declarations. It is no coincidence that expanding/contracting checkouts are easy with them, because that was the design goal of Blaze (the one that came first).

Maven’s forward module declarations look like this:

<modules>
    <!-- Maven works out the build order - phew! -->
    <module>aModule</module>
    <module>another-module</module>
    <module>this_one_has_child_modules_too</module>  
</modules>

All those are directories within the current directory. Maven is going to take some coercing.

Ant and Gradle are the same, but I’ll focus on the Java enterprise gorilla - Maven.

Maven Monorepo proof of concept - in Git

See github.com/paul-hammant/googles-monorepo-demo

I took Google’s Guava because it had a multi-module build that, although small, could be a surrogate for something with hundreds of modules, one that could represent a company’s entire set of (Java) deployable/shippable applications and/or services.

The default checkout of this repo doesn’t build, because all the pom.xml files were renamed pom-template.xml.

No matter, run this:

mr/checkout.sh

Now you have POMs again. mvn install runs as you’d expect now.
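
One way to picture what a regeneration step like this has to do: take the modules a template declares, keep only those whose directories are actually present in the working copy, and emit a `<modules>` block for them. The sketch below is an illustration under that assumption, not the actual content of mr/checkout.sh; the function names are hypothetical.

```python
import tempfile
from pathlib import Path


def present_modules(parent_dir, declared):
    """Keep only the declared modules whose directory exists on disk."""
    return [m for m in declared if (Path(parent_dir) / m).is_dir()]


def render_modules(modules):
    """Render a Maven <modules> element for the given module names."""
    lines = ["<modules>"]
    lines += [f"    <module>{m}</module>" for m in modules]
    lines.append("</modules>")
    return "\n".join(lines)


# Simulate a working copy that contains two of three declared modules.
with tempfile.TemporaryDirectory() as root:
    for name in ("guava", "guava-testlib"):
        (Path(root) / name).mkdir()
    declared = ["guava", "guava-testlib", "guava-gwt"]
    print(render_modules(present_modules(root, declared)))
```

With every module present, the generated block is the same as the full forward declaration, so Maven sees what it always saw.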

Running this on the command line:

mvn com.github.ferstl:depgraph-maven-plugin:aggregate -Dincludes=com.google.guava

Gives a Dot graph (GraphViz) that via some colorization in OmniGraffle looks like:

I’ve colored two items blue. Wouldn’t it be nice to modify the checkout (working copy) to have just those two modules, and have Maven not choke because of missing modules?

Well that’s possible now. Do this:

git config core.sparsecheckout true
echo '/mr' > .git/info/sparse-checkout
echo '/README.md' >> .git/info/sparse-checkout
echo '/pom*' >> .git/info/sparse-checkout
echo '/guava/' >> .git/info/sparse-checkout
echo '/guava-testlib/' >> .git/info/sparse-checkout
mr/checkout.sh
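
The echo lines above follow a simple pattern: a few paths that are always wanted, plus one `/name/` entry per module. A minimal Python sketch of generating that file from a module list might look like this. The helper names are hypothetical, and the sketch only writes the file; enabling core.sparsecheckout and refreshing the working copy still happen as shown above.

```python
from pathlib import Path

# Paths kept in every checkout, taken from the commands above.
ALWAYS_INCLUDED = ["/mr", "/README.md", "/pom*"]


def sparse_patterns(modules):
    """Build the sparse-checkout pattern list for the chosen modules."""
    return ALWAYS_INCLUDED + [f"/{m}/" for m in modules]


def write_sparse_checkout(repo_root, modules):
    """Write the patterns to .git/info/sparse-checkout under repo_root."""
    target = Path(repo_root) / ".git" / "info" / "sparse-checkout"
    # .git/info normally exists already in a Git working copy; creating it
    # here only keeps the sketch runnable against an empty directory.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(sparse_patterns(modules)) + "\n")
    return target


print("\n".join(sparse_patterns(["guava", "guava-testlib"])))
```
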

As promised, mvn install works on just two modules now, rather than the eight before. The dependency graph looks like this:

A word of warning

If you are a company using Maven and Git, and you want to share code this aggressively in a Monorepo but worry about build times going up exponentially, then this expand/contract scripting is for you. Maybe even right now, without much extra tooling effort. You may have to split history into a new repo when the size of the .git folder goes above the recommended limit, but that is not much of a problem really, provided you have Git-LFS turned on. There’s nothing stopping you from going live with this immediately.

For the love of Turing though, have a lock-step version number for everything built in the monorepo. Maybe Maven’s classic 1.0-SNAPSHOT suffices, and in your CD-esque deployment technologies you designate something more meaningful in Jenkins (etc).

Next steps?

That’s easy: some more Python fu that works with the first dot graph to let you conveniently modify .git/info/sparse-checkout. Like so:

mr/checkout.sh guava-testlib
# calculates that it needs guava too

# or

mr/checkout.sh guava-tests,guava-gwt
# calculates that it needs another 5: guava-testlib/test, guava,
# guava-testlib/test, guava-tests/test, and guava/test (and not
# guava-testlib at all)
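
A first step towards that would be turning the dot graph into something a script can walk. The sketch below pulls edges out of DOT text with a regular expression. The sample text is a hypothetical excerpt, and the assumption that node ids look like groupId:artifactId:... is mine; real plugin output may differ and would need checking before relying on this.

```python
import re

# Hypothetical DOT excerpt for illustration; not copied from real plugin output.
SAMPLE_DOT = '''
digraph "guava-parent" {
  "com.google.guava:guava-testlib:jar" -> "com.google.guava:guava:jar"
}
'''

# Matches edges written as: "source" -> "target"
EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def artifact_id(node):
    """Assume ids look like groupId:artifactId:...; take the second field."""
    parts = node.split(":")
    return parts[1] if len(parts) > 1 else node


def parse_edges(dot_text):
    """Return {module: set of modules it depends on} from DOT edge lines."""
    graph = {}
    for src, dst in EDGE.findall(dot_text):
        graph.setdefault(artifact_id(src), set()).add(artifact_id(dst))
        graph.setdefault(artifact_id(dst), set())  # leaf modules get an entry too
    return graph


print(parse_edges(SAMPLE_DOT))
```

The resulting mapping is the input a dependency walk needs in order to decide which directories belong in .git/info/sparse-checkout.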

Conclusion

A Monorepo in this style with Git will work if you don’t blow through the history size limits (the size of the .git folder). That is said to be about 1GB. If you go above that, you can always do the split history thing: start a new repo seeded with only the HEAD revisions of the former one, and have all that growing room again. That’ll make bisecting towards the root cause of an issue harder, but people will get over that. The other problem you have is that more than a certain number of commits a minute will be hard to keep up with, especially if you’re wanting to do a commit/push (and need people to kinda stop committing to facilitate that).

Well maybe I have not convinced anyone. Googlers and Xooglers were already convinced so they do not count. No matter, I am personally looking forward to having Git or Mercurial push their size boundaries to get to the place Perforce, Subversion and PlasticSCM can, so that companies can bank on Monorepo setups in this style.

While I have your attention

Can I direct your attention to a portal documenting ‘Trunk-Based Development’ (incl Monorepos)? -> https://trunkbaseddevelopment.com

No Ads, no services being sold, mobile friendly. Well worth a look, in my opinion.



Published

January 27th, 2017

Syndicated by DZone.com

Comments formerly in Disqus, but exported and mounted statically ...


Fri, 27 Jan 2017, Jon Forrest

"expands or contract" ->
"expands or contracts"

Sat, 28 Jan 2017, Sebastiano Pilla

The more I read about monorepo, the less I like the idea: what are the advantages of a monorepo that justify inventing special tooling for the current build systems, and having to worry about breakages?

Sun, 29 Jan 2017, paul_hammant

Google's experience is 'Trunk' is *never* broken. Not for 25K developers/QA-automators. Therefore they think theirs is effectively cost free for the high throughput they've achieved. One year soon, this sort of tooling will be off the shelf.

Tue, 31 Jan 2017, Markus Kohler

Cool stuff! I was playing around with the git sparse checkout feature lately as well, to reduce the size of some repos that contain a lot of cruft I don't need.
With regards to Gradle: I don't think it would be very difficult to set this up for Gradle. Since Gradle is very flexible, there are different ways to do it. I know you can set up Gradle so that all projects are checked out into the same folder (no deep hierarchy), dependencies are defined using ivy.xml files, and Gradle is able to compile an arbitrary subset of your projects. E.g. it will pull your dependency from the Ivy repo (pre-compiled jars) if you haven't checked out the project.
I think this should in principle work for a monorepo approach. Sure, if you need to check out 10000s of projects into the very same folder, that might cause problems depending on your filesystem, but most companies don't have that many projects (I guess). I'm also not sure whether the flat directory structure is really a Gradle limitation that couldn't be worked around.

With regards to Maven, yes, it is still the default for a lot of large companies. But honestly I wouldn't really recommend it for anyone using a monorepo approach. AFAIK Maven still has problems figuring out which (Java) projects really need to be re-compiled (maybe in the meantime there is some plugin that fixes this?). With a monorepo approach, you really want to make sure you only build what has to be built.
Gradle, in my experience, is pretty good at compiling only the projects that need to be compiled.
In addition, I guess you also really want a "distributed build cache" for building monorepos quickly. Gradle has had plans to support that for a while, but it is not there yet. Blaze has experimental support for such a cache.
Essentially this "distributed build cache" replaces repositories for "binaries" like Ivy/Maven Nexus in a transparent way.

Wed, 01 Feb 2017, paul_hammant

Markus, do you know of any GitHub repos that are multi-project? Something I could do the work on to make a Gradle equivalent of these two: https://github.com/paul-ham... ?