Paul Hammant's Blog: Maven Central as multiple Git repositories
How about all the jars (classes, source archives, and Javadocs) being up on a (rebooted) Maven Central, but as Git repositories instead of downloads over HTTP? As you ran your build locally, a suitably enhanced Maven would go and get jars from Git repositories online.
Not whole jars, but the classes within a jar, unzipped. Afterwards on your local system that is either still a
or a reconstituted Jar file again. There are ‘bare’ git clones too, so there would be much to decide about how it would
Why? Well, I like the idea of VCS storing unconventional things, and people being able to do dev-style workflows with them. In this case you’d never make changes and commit - it’d be an append-only type of thing, and subscription is where downstream teams benefit, especially corporates. And it’s smaller in use, as you’ll see, because I’ve done an experiment after writing a script:
My experiment is with XStream (0.01% of which I wrote). Here is the root folder for XStream in Maven Central today. The experiment was able to show byte savings:
Jars containing JavaDoc
| Measure | Value |
|---|---|
| Total size for 15 original versions | 24.1MB |
| The .git folder afterwards | 5.6MB |
| Raw/bare storage space saving | 76.7% |
| Time taken to make that via the script | 4m 45s |
See for yourself github.com/paul-hammant/mc-xs-javadocs.
Diffs are noisier than they should be in my opinion - timestamps in pages that are not actually needed.
Jars containing Source files
| Measure | Value |
|---|---|
| Total size for 16 original versions | 4.9MB |
| The .git folder afterwards | 0.9MB |
| Raw/bare storage space saving | 81.6% |
| Time taken to make that via the script | 38s |
Sometimes your IDE downloads these to help you navigate libraries you’re not making yourself.
See for yourself github.com/paul-hammant/mc-xs-sources.
Note this is NOT the same as each dev team’s own repo. It is only the released versions of their source.
If the commits are not done in the order of the releases, then the natural diffs between commits have no meaning. If you don’t care about that, you do not have to worry about matching the commits to the historical order of releases.
What is guaranteed is that any checkout of a specific tag produces exactly the same content regardless of the order of commits.
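The import step described above can be sketched roughly as follows. This is not the script used for the experiment, just a minimal illustration of the idea: unzip each released jar into a working tree, commit it, and tag the commit with the version. The function names, the committer identity, and the commit message are all made up for the example, and it assumes `git` is on the PATH.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path

# Hypothetical committer identity so the sketch works without global git config.
IDENTITY = ["-c", "user.name=importer", "-c", "user.email=importer@example.com"]


def git(repo: Path, *args: str) -> str:
    """Run git inside `repo` and return its standard output."""
    result = subprocess.run(["git", *IDENTITY, *args], cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout


def import_jars(jars: dict[str, Path], repo: Path) -> None:
    """Commit the unzipped contents of each jar and tag the commit with its version."""
    repo.mkdir(parents=True, exist_ok=True)
    git(repo, "init", "--quiet")
    for version, jar in jars.items():
        # Empty the working tree (keeping .git) so files dropped in this
        # release are recorded as deletions rather than lingering.
        for child in repo.iterdir():
            if child.name == ".git":
                continue
            if child.is_dir():
                shutil.rmtree(child)
            else:
                child.unlink()
        with zipfile.ZipFile(jar) as archive:
            archive.extractall(repo)
        git(repo, "add", "--all")
        git(repo, "commit", "--quiet", "--allow-empty", "-m", f"Release {version}")
        git(repo, "tag", version)
```

Because each commit is a full snapshot of one jar, checking out a tag gives back that release's files whatever order the dictionary was in.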
Jars containing class files
These are what you use on the classpath when you are building and making apps. Yes yes, it is more controversial to think of .class files in a Git repo. The trust/verification model would change to leverage the SHA1 hash of each commit. The current use of MD5/SHA1/GPG on Maven Central can be simplified with Git repositories used as the canonical store. Ignore that clash of PDFs causing a mini SHA1 crisis earlier this year - this type of usage of SHA1 is safe to trust.
| Measure | Value |
|---|---|
| Total size for 27 original versions | 8.4MB |
| The .git folder afterwards | 2.4MB |
| Raw/bare storage space saving | 71.4% |
| Time taken to make that via the script | 1m 42s |
See for yourself github.com/paul-hammant/mc-xs-classes.
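To make the SHA1 point above a little more concrete: Git names every stored file (a "blob") by the SHA-1 of a short header plus the file's bytes, and trees and commits are hashed over those names in turn. A minimal sketch of the blob part, with `git_blob_id` being a name invented for this example:

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object id Git assigns to a file with this exact content."""
    # Git hashes "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The well-known id of the empty blob:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

So a .class file with the same bytes gets the same id in every clone, which is what a commit-hash-based trust model would lean on.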
Pretty simple. Something like this is needed:
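    # Fetch only the 1.4.3 tag, with no history
    git clone https://github.com/paul-hammant/mc-xs-classes --depth 1 --branch 1.4.3
    cd mc-xs-classes
    # Drop the Git metadata so it does not end up in the jar
    rm -rf .git
    # M = do not generate a manifest; the original one is already in the checkout
    jar cvfM ../xstream-1.4.3.jar .
    cd ..
    rm -rf mc-xs-classes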
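The same re-jarring step can be sketched in Python for cases where the checkout should be left in place rather than having its .git folder deleted. This is an illustration only; `checkout_to_jar` is a hypothetical helper, and it assumes the output jar is written somewhere outside the checkout.

```python
import zipfile
from pathlib import Path


def checkout_to_jar(checkout: Path, jar_path: Path) -> None:
    """Zip a checkout into a jar, leaving out anything under .git."""
    files = sorted(
        p for p in checkout.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(checkout).parts
    )
    # Some readers (java.util.jar.JarInputStream, for one) only pick up the
    # manifest when it sits at the start of the archive, so write it first.
    manifest = checkout / "META-INF" / "MANIFEST.MF"
    if manifest in files:
        files.remove(manifest)
        files.insert(0, manifest)
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for path in files:
            jar.write(path, path.relative_to(checkout).as_posix())
```

A call like `checkout_to_jar(Path("mc-xs-classes"), Path("xstream-1.4.3.jar"))` would give a jar comparable to the one the shell commands make.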
Putting that into your ~/.m2/repository/ folder is trickier than just making a jar, but doable too.
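Part of that is knowing where the jar belongs. In the standard local repository layout, the dots in the group id become directories, followed by the artifact id, the version, and a file named from both. A small sketch, where `local_repo_path` is a name invented here, and the POM and other files Maven keeps next to the jar are not handled:

```python
from pathlib import Path
from typing import Optional


def local_repo_path(group_id: str, artifact_id: str, version: str,
                    repo_root: Path = Path.home() / ".m2" / "repository",
                    classifier: Optional[str] = None) -> Path:
    """Where a jar for these coordinates sits in a Maven local repository."""
    suffix = f"-{classifier}" if classifier else ""
    file_name = f"{artifact_id}-{version}{suffix}.jar"
    return repo_root.joinpath(*group_id.split("."), artifact_id, version, file_name)


print(local_repo_path("com.thoughtworks.xstream", "xstream", "1.4.3"))
```

For XStream 1.4.3 that ends in `com/thoughtworks/xstream/xstream/1.4.3/xstream-1.4.3.jar`; passing `classifier="sources"` gives the name the source jar would use.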
Leaving your local clones as bare would be convenient, but it would not help the compile/test/run/package parts of Maven, i.e. pretty much all of what Maven does. The dependency plugin would need to be changed to make jars in the target folder from those bare clones.
There’s another snafu: if you brought down a specific tag and later want another, it’s not simple.
    # Get one tag - say 1.4.3
    git clone https://github.com/paul-hammant/mc-xs-classes --depth 1 --branch 1.4.3 --bare xstream-bare-clone
    # Add another - say 1.4.4
    cd xstream-bare-clone
    git fetch -unf origin 1.4.4:refs/tags/1.4.4
    git repack -ad
    cd ..
I’m out of my depth with Git here really - I don’t know what the long term cost of adding more and more tags out of order to a local clone is.
“Cost” on Maven Central
The savings in the tables above are real: roughly 71% to 82%.
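For anyone wanting to check the arithmetic, the saving is just one minus the ratio of the .git folder size to the total size of the original jars. A quick sketch (`saving_percent` is a name made up here), using the rounded MB figures from the tables, so the results can differ from the published percentages by about a tenth of a point:

```python
def saving_percent(original_mb: float, git_mb: float) -> float:
    """Storage saved by the .git folder relative to the original jars, in percent."""
    return round((1 - git_mb / original_mb) * 100, 1)


# Sizes as published in the three tables (rounded MB).
for label, original, packed in [("javadocs", 24.1, 5.6),
                                ("sources", 4.9, 0.9),
                                ("classes", 8.4, 2.4)]:
    print(label, saving_percent(original, packed))
```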
There would be no checkout up on ‘central’. It would only be the ‘bare’ Git repositories. Granted, there are still many objects in the .git folder. The statistics for https://github.com/paul-hammant/mc-xs-classes are 39 directories and files in the .git folder and 478 in the HEAD of the checkout. You would still have to worry about inode upper limits being reached, but which version of Linux you are using, how you installed it, and what file system you have chosen factor into that too.
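Counts like those can be reproduced with a few lines of Python. This is only a sketch of one way to count; `count_entries` is a hypothetical helper and it may not match exactly how the numbers above were gathered:

```python
import os
from pathlib import Path


def count_entries(root: Path, skip: frozenset = frozenset()) -> int:
    """Count the directories and files below `root`, ignoring names in `skip`."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in skip]
        total += len(dirnames) + len(filenames)
    return total
```

`count_entries(repo / ".git")` would give the figure for the Git folder, and `count_entries(repo, skip=frozenset({".git"}))` the one for the checkout. A freshly repacked repository keeps most of its objects in a handful of pack files, which is why the first number can be so much lower than the second.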
An Index - the last piece of the puzzle
End users need to be able to subscribe to group and artifact permutations to know when new releases are available. That is best done in a single Git repository, in my opinion. Here is what that could look like: a meta.yaml file for each group. The XStream one is missing (as are thousands), so you should take a look at the one for Lucene if you want a bigger example, or see the general structure here:
    artifacts:
      an-artifact:
        lastUpdated: 20121224161832
        latest: 1.3
        release: 1.3
        versions:
        - 1.1
        - 1.2
        - 1.2.1
        - 1.3
      another-artifact:
        lastUpdated: 20121224161832
        latest: 1.3.1
        release: 1.3.1
        versions:
        - 0.9-alpha
        - 1.1
        - 1.2
        - 1.2.1
        - 1.3
        - 1.3.1
I’ll not share the Python for that because it is hacky and very slow. It is a function over the maven-metadata.xml files that are co-located with the POMs, JARs etc. As it is now, the .git folder is 7.2MB, but I really do not know how much of what is cataloged up on Maven Central is missing from it.
See org/apache/lucene/lucene-analyzers/maven-metadata.xml and others in adjacent directories to see what the Lucene meta.yaml was made from.
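The unshared script is not reproduced here, but the general shape of such a function might look like the sketch below. It assumes a maven-metadata.xml with no XML namespace and the usual `versioning` block; `artifact_entry` and `to_meta_yaml` are names invented for the example, and the YAML is written by hand to avoid any third-party dependency.

```python
import xml.etree.ElementTree as ET


def artifact_entry(metadata_xml: str) -> tuple[str, dict]:
    """Pull the artifact id and version details out of one maven-metadata.xml."""
    root = ET.fromstring(metadata_xml)
    versioning = root.find("versioning")
    return root.findtext("artifactId"), {
        "lastUpdated": versioning.findtext("lastUpdated"),
        "latest": versioning.findtext("latest"),
        "release": versioning.findtext("release"),
        "versions": [v.text for v in versioning.findall("versions/version")],
    }


def to_meta_yaml(entries: dict[str, dict]) -> str:
    """Render one group's artifacts in the meta.yaml structure shown above."""
    lines = ["artifacts:"]
    for artifact, entry in entries.items():
        lines.append(f"  {artifact}:")
        for key in ("lastUpdated", "latest", "release"):
            lines.append(f"    {key}: {entry[key]}")
        lines.append("    versions:")
        lines.extend(f"    - {version}" for version in entry["versions"])
    return "\n".join(lines) + "\n"
```

Calling `artifact_entry` on each maven-metadata.xml in a group and passing the collected results to `to_meta_yaml` gives one file per group. A more careful emitter would probably quote the version strings, since a YAML parser reads a bare `1.3` as a number.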
This sort of meta model would aid analytics and visualizations too. Indeed that’s how I first started thinking about Git as a storage tech for data historically held in a non-Merkle tree Maven Central.
Actually changing Maven Central to do this
The Maven ‘deploy’ workflow and plugin would invisibly do a commit to (or create) a dedicated Git repo up on central.
For XStream, a new deployment would not go into
more. Instead it would go into (say)
The maintainer for the group:artifact would not have to do anything different to exist in the new deploy world; all the heavy lifting is done in Maven Central and that future version of the deploy plugin. They would have to upgrade their project to that deploy plugin, of course.
It is not just Maven Central. It would be Artifactory, Nexus, Gradle (etc) technologies too.
GitHub conveniently makes a zip available for each ‘tag’ that’s available. See here. That would be fantastic if teams could host their own release repositories from there, but the GH ‘release’ zip has a root directory that is not compatible with Java’s execution.
That’s too bad. I bet the GitHub people cleverly implemented an LRU cache in association with their downloads implementation - if the zips are not used they’ll disappear, but they’ll come back as needed (being silently recreated). On first use, straight after the push, it took a split second to download that. That is impressive. Specifically, it is quicker than the archive can be made on my Mac.
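On the root directory problem mentioned above: a client could work around it by rewriting the downloaded zip so that entries lose their leading directory. A rough sketch, assuming every entry sits under a single top-level folder (as GitHub's tag archives do); `strip_root_dir` is a hypothetical helper, and anything found at the very top level is simply dropped:

```python
import io
import zipfile


def strip_root_dir(src_zip: bytes) -> bytes:
    """Return a copy of the zip with the first path segment removed from each entry."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_zip)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # "root/com/A.class" -> "com/A.class"; directory entries are skipped.
            _root, _sep, inner = info.filename.partition("/")
            if inner and not info.is_dir():
                dst.writestr(inner, src.read(info))
    return out.getvalue()
```

That costs a re-zip on the client, which is exactly the sort of work a release repository should not be pushing onto every build.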