Alternative to Maven Central for Jar publishing (multiple Git repositories)

How about all the jars (classes, source archives, and Javadocs) being up on a (rebooted) Maven Central, but as Git repositories instead of downloads over HTTP? As you ran your build locally, a suitably enhanced Maven would go and get jars from Git repositories online.

Not whole jars, but the classes within a jar, unzipped. Afterwards, on your local system, that is either still a .git/ folder or a reconstituted Jar file again. There are ‘bare’ git clones too, so there would be much to decide about how it would work.

Why? Well, I like the idea of a VCS storing unconventional things, and people being able to do dev-style workflows with them. In this case you’d never make changes and commit - it’d be an append-only type of thing, and subscription is where downstream teams benefit. Especially corporates. And it’s smaller in use, as you’ll see, because I’ve done an experiment after writing a script:

	# pip3 install sh
	from sh import wget, unzip, mkdir, cd, git, touch, rm, du

	root = "http://central.maven.org/maven2/"

	def doGAV(url, v, suffix):
	    # Commit and tag one version's jar contents; returns the jar size in KB (0 if no jar found)
	    op = wget("-qO-", root + url + "/" + v + "/index.html")
	    for line in op.splitlines():
	        if v + suffix + ".jar\"" in line:
	            fn = line.split("\"")[1].replace("/","")
	            print(fn)
	            # Clear out the previous version's files; this commit is amended below
	            git("rm", "-r", "*")
	            git("commit", "-m", v)
	            wget(root + url + "/" + v + "/" + fn)
	            sizeK = int(du("-s", "-k", fn).split("\t")[0])
	            unzip(fn)
	            rm(fn)
	            git("add", ".")
	            git("commit", "-m", v, "--amend")
	            git("tag", v)
	            return sizeK
	    return 0


	def doGA(g, a, suffix):
	    url = g.replace(".","/") + "/" + a
	    op = wget("-qO-", root + url + "/index.html")
	    versions = []
	    sizeK = []
	    # Each version is a link in the directory listing
	    for line in op.splitlines():
	        if "href" in line and ".." not in line and "maven-metadata.xml" not in line:
	            v = line.split("\"")[1].replace("/","")
	            versions.append(v)
	    # Sorting not needed - no change in the size of the .git folder
	    # - diffs might look a mess though
	    # - breaks anyway on alpha chars
	    # versions.sort(key=lambda s: [int(u) for u in s.split('.')])
	    print(str(versions))
	    for v in versions:
	        sizeK.append(doGAV(url, v, suffix))
	    # Count and total only the versions for which a jar was found
	    ct = 0
	    totSize = 0
	    for v in sizeK:
	        if v > 0:
	            ct += 1
	            totSize += v
	    print("For " + g + ":" + a + ", total size for " + str(ct) + " original versions: " + "{0:.{1}f}".format(totSize/1024,1) + "M")
	    git("repack", "-ad")
	    afterK = int(du("-s", "-k", ".git").split("\t")[0])
	    print("Afterwards the .git folder is: " + "{0:.{1}f}".format(afterK/1024,1) + "M")

	# Start from a fresh repository with one initial commit
	rm("-rf", "work")
	mkdir("work")
	cd("work")
	git("init")
	touch("dummy")
	git("add", ".")
	git("commit", "-m", "start")

	# TODO make it so that you could invoke doGA() multiple times in one script
	doGA("com.thoughtworks.xstream", "xstream", "")
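
The commented-out sort in the script fails on versions such as 0.9-alpha because int() cannot parse the qualifier. Here is a minimal sketch of a more tolerant sort key. The name version_key is made up for illustration, and the rule that a qualifier sorts before the plain release is an assumption of this sketch, not Maven's own version comparison.

```python
import re

def version_key(version):
    """Hypothetical sort key: numeric parts compare as ints, qualifiers as text."""
    key = []
    for token in re.findall(r"\d+|[A-Za-z]+", version):
        if token.isdigit():
            key.append((2, int(token)))     # numbers rank highest
        else:
            key.append((0, token.lower()))  # qualifiers such as "alpha" rank lowest
    key.append((1,))                        # end marker ranks between the two
    return tuple(key)

versions = ["1.10", "1.2", "1.2.1", "0.9-alpha", "1.3-beta", "1.3"]
print(sorted(versions, key=version_key))
```

The leading 0/1/2 in each tuple keeps Python from ever comparing an int with a string, which is what trips up the simpler key.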


My experiment is with XStream (0.01% of which I wrote). Here is the root folder for XStream in Maven Central today. The experiment was able to show byte savings:

Jars containing JavaDoc

Total size for 15 original versions	24.1MB
The .git folder afterwards	5.6MB
Raw/bare storage space saving	76.7%
Time taken to make that via the script	4m 45s

See for yourself github.com/paul-hammant/mc-xs-javadocs.

Diffs are noisier than they should be in my opinion - timestamps in pages that are not actually needed.

Jars containing Source files

Total size for 16 original versions	4.9MB
The .git folder afterwards	0.9MB
Raw/bare storage space saving	81.6%
Time taken to make that via the script	38s

Sometimes your IDE downloads these to help navigate libraries you’re not making yourself.

See for yourself github.com/paul-hammant/mc-xs-sources.

Note this is NOT the same as each dev team’s own repo. It is only the released versions of their source.

If the commits are not done in the order of the releases then the natural diffs between commits have no meaning. If you don’t care about that, you do not have to worry about matching the commits to the historical order of releases.

What is guaranteed is that any checkout of a specific tag produces exactly the same content regardless of the order of commits.
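
That guarantee follows from Git addressing content by hash. As a small illustration (standard library only, and not part of the experiment's script), this sketch shows how Git derives the ID of a single file, a 'blob', in a SHA-1 repository. The ID depends on the bytes alone, not on when or in what order they were committed. The function name git_blob_id is made up for this example.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file with these bytes."""
    # Git hashes a short header ("blob <size>\0") followed by the content
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# The same bytes give the same ID in every repository
print(git_blob_id(b""))
print(git_blob_id(b"hello\n"))
```

Trees and commits are hashed in a similar spirit over the IDs beneath them, which is why a tag pins down its whole checkout.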

Jars containing class files

These are what you use on the classpath when you are building and making apps. Yes yes, it is more controversial to think of .class files in a Git repo. The trust/verification model would change to leverage the SHA1 hash of each commit. The current use of MD5/SHA1/GPG on Maven Central can be simplified with Git repositories used as the canonical store. Ignore that clash of PDFs causing a mini SHA1 crisis earlier this year - this type of usage of SHA1 is safe to trust.

Total size for 27 original versions	8.4MB
The .git folder afterwards	2.4MB
Raw/bare storage space saving	71.4%
Time taken to make that via the script	1m 42s

See for yourself github.com/paul-hammant/mc-xs-classes.

Recreating Jars

Pretty simple. Something like this is needed:

git clone https://github.com/paul-hammant/mc-xs-classes --depth 1 --branch 1.4.3
cd mc-xs-classes
rm -rf .git
jar cvfM ../xstream-1.4.3.jar .
cd ..
rm -rf mc-xs-classes

Putting that into your ~/.m2/repository/ folder is trickier than just making a jar, but doable too.
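
As a rough sketch of that last step, assuming Maven's default local repository layout (group directories, then artifact, then version), the following Python builds a jar from a checked-out tag and works out where it would live. The function names are made up for illustration, and Maven normally expects the matching .pom next to the jar as well, which this sketch does not handle.

```python
import os
import zipfile

def m2_jar_path(group, artifact, version, repo_root="~/.m2/repository"):
    """Path where the default local repository layout keeps a jar."""
    return os.path.join(os.path.expanduser(repo_root),
                        *group.split("."), artifact, version,
                        f"{artifact}-{version}.jar")

def make_jar(src_dir, jar_path):
    """Zip every file under src_dir (skipping .git) into jar_path."""
    parent = os.path.dirname(jar_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for folder, dirs, files in os.walk(src_dir):
            # Prune .git so the repository metadata never ends up in the jar
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Jar entry names use forward slashes on every platform
                entry = os.path.relpath(full, src_dir).replace(os.sep, "/")
                jar.write(full, entry)
    return jar_path

print(m2_jar_path("com.thoughtworks.xstream", "xstream", "1.4.3"))
```

Calling make_jar("mc-xs-classes", m2_jar_path("com.thoughtworks.xstream", "xstream", "1.4.3")) after the clone would then stand in for the rm and jar steps.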

Snafus

Leaving your local clones as bare would be convenient, but it would not help the compile/test/run/package parts of Maven, i.e. pretty much all of what Maven does. The dependency plugin would need to be changed to make jars from those bare clones in the target folder: target/deps/<scope-name>.

There’s another snafu: if you brought down a specific tag and later want another, it’s not simple.

# Get one tag - say 1.4.3
git clone https://github.com/paul-hammant/mc-xs-classes --depth 1 --branch 1.4.3 --bare xstream-bare-clone
# Add another - say 1.4.4
cd xstream-bare-clone
git fetch -unf origin 1.4.4:refs/tags/1.4.4
git repack -ad
cd ..

I’m out of my depth with Git here really - I don’t know what the long term cost of adding more and more tags out of order to a local clone is.

“Cost” on Maven Central

The savings in the tables above are real: 70 to 80%.
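
The percentages can be re-derived from the sizes in the tables. This is a small arithmetic check rather than anything from the original script, and the function name is made up for illustration.

```python
def saving_percent(original_mb, git_mb):
    """Percentage of space saved by the repacked .git folder."""
    return 100.0 * (1.0 - git_mb / original_mb)

# (total size of the original jars, size of the .git folder), both in MB
tables = {
    "javadoc": (24.1, 5.6),
    "sources": (4.9, 0.9),
    "classes": (8.4, 2.4),
}
for name, (original_mb, git_mb) in tables.items():
    print(f"{name}: {saving_percent(original_mb, git_mb):.1f}%")
```

From these rounded inputs the Javadoc figure comes out at 76.8% rather than the table's 76.7%, presumably because the table was computed from unrounded sizes. The other two match.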

There would be no checkout up on ‘central’. It would only be the ‘bare’ Git repositories. Granted there are still many objects in the .git folder. The statistics for https://github.com/paul-hammant/mc-xs-classes are 39 directories and files in the .git folder and 478 in the HEAD of the checkout. You would still have to worry about inode upper limits being reached, but which version of Linux you are using, how you installed it, and what file system you have chosen factor into that too.

An Index - the last piece of the puzzle

End users need to be able to subscribe to group and artifact permutations to know when new releases are available. That is best done in a single Git repository, in my opinion. Here is what that could look like. One meta.yaml file for each group. The XStream one is missing (as are thousands), so you should take a look at the one for Lucene if you want a bigger example, or see the general structure here:

artifacts:
  an-artifact:
    lastUpdated: 20121224161832
    latest: 1.3
    release: 1.3
    versions:
      - 1.1
      - 1.2
      - 1.2.1
      - 1.3
  another-artifact:
    lastUpdated: 20121224161832
    latest: 1.3.1
    release: 1.3.1
    versions:
      - 0.9-alpha    
      - 1.1
      - 1.2
      - 1.2.1
      - 1.3
      - 1.3.1

I’ll not share the Python for that because it is hacky and very slow. It is a function over the maven-metadata.xml files that are co-located with the POMs, JARs etc. As it is now, the .git folder is 7.2MB, but I really do not know how much is missing from it that is cataloged up on Maven Central.

See org/apache/lucene/lucene-analyzers/maven-metadata.xml and others in adjacent directories to see what the Lucene meta.yaml was made from.
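
To give a feel for that conversion, here is a minimal sketch (not the hacky script mentioned above) that turns maven-metadata.xml documents for one group into the meta.yaml shape shown earlier. The function name and the sample XML are made up for illustration, and the YAML is written by hand to avoid a third-party dependency.

```python
import xml.etree.ElementTree as ET

def metadata_to_yaml(xml_texts):
    """Render one group's maven-metadata.xml documents as meta.yaml text."""
    lines = ["artifacts:"]
    for xml_text in xml_texts:
        root = ET.fromstring(xml_text)
        versioning = root.find("versioning")
        lines.append(f"  {root.findtext('artifactId')}:")
        for field in ("lastUpdated", "latest", "release"):
            value = versioning.findtext(field)
            if value is not None:  # older metadata may lack some fields
                lines.append(f"    {field}: {value}")
        lines.append("    versions:")
        for version in versioning.findall("versions/version"):
            lines.append(f"      - {version.text}")
    return "\n".join(lines) + "\n"

# Invented sample in the usual maven-metadata.xml shape
SAMPLE = """<metadata>
  <groupId>org.example</groupId>
  <artifactId>an-artifact</artifactId>
  <versioning>
    <latest>1.3</latest>
    <release>1.3</release>
    <versions>
      <version>1.1</version>
      <version>1.2</version>
      <version>1.2.1</version>
      <version>1.3</version>
    </versions>
    <lastUpdated>20121224161832</lastUpdated>
  </versioning>
</metadata>"""
print(metadata_to_yaml([SAMPLE]), end="")
```

One thing to watch with this layout: a YAML loader will typically read an unquoted 1.3 as a number and 1.2.1 as a string, so a consumer may prefer the versions to be quoted.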

This sort of meta model would aid analytics and visualizations too. Indeed that’s how I first started thinking about Git as a storage tech for data historically held in a non-Merkle tree Maven Central.

Actually changing Maven Central to do this

The Maven ‘deploy’ workflow and plugin would invisibly do a commit to (or creation of) a dedicated Git repo up on central.

For XStream, a new deployment would not go into http://central.maven.org/maven2/com/thoughtworks/xstream/xstream/ any more. Instead it would go into (say) git@central.maven.org:maven2/com/thoughtworks/xstream/xstream.git

The maintainer for the group:artifact would not have to do anything different to exist in the new deploy world; all the heavy lifting is done in Maven Central and that future version of the deploy plugin. They would have to upgrade their project to that deploy plugin, of course.

It is not just Maven Central. It would be Artifactory, Nexus, Gradle (etc) technologies too.

GitHub

GitHub conveniently makes a zip available for each ‘tag’ that’s available. See here. That would be fantastic if teams could host their own release repositories from there, but the GH ‘release’ zip has a root directory that is not compatible with Java’s execution.

That’s too bad. I bet the GitHub people cleverly implemented an LRU cache in association with their downloads implementation - if the zips are not used they’ll disappear, but they’ll come back as needed (being silently recreated). On first use, straight after the push, it took a split second to download that. That is impressive. Specifically, it is quicker than the archive can be made on my Mac.


Published

May 13th, 2017

Syndicated by DZone.com

Paul Hammant's Blog: Alternative to Maven Central for Jar publishing (multiple Git repositories)