Paul Hammant's Blog: Step Aside Blockchains, Hashgraphs Are Giving Plain Merkle Trees A Turbo Boost
This is a another new article in a mini-series that started with ‘Old-School’ Merkle Trees Rock, Merkle Trees In Pictures, Blockchains in pictures, Choosing Between Blockchains And Vanilla Merkle Trees, and now this one and another today Merkle Trees vs Blockchains vs Hashgraphs.
The genius of hashgraphs is the distributed consensus advances that rest on a protocol called gossip, and allow large number of participating nodes to agree that a transactions happened, and the order and time for each without requiring a double check before recording the transaction. In fact the minimum amount of TCP/IP is used in a provably correct model that scales smoothly. The algorithms are Byzantine as their is no concept of leader even temporarily or rotating on any basis, but remains fair. swirlds is the company leading the charge for the rollout of solutions.
Choosing one of the three for ledger solutions
I have been trying to work out which applications suit straight Merkle trees and ones that suit blockchains. Although blockchains are partially based on Merkle tree idea, it has always been possible to design Merkle tree solutions that are not blockchains. I’m not sure that’s clear to the majority of even technical people who have heard of both concepts though. Hashgraphs as stated are a compelling third choice. See Merkle Trees vs Blockchains vs Hashgraphs for a table breaking down the pros and cons.
Plain centralized Merkle trees are an order simpler to engineer than Blockchains, and therefore underrated (in my opinion), but even if they are a ledger and may have replicas, they are not conventionally distributed. They are also less capable in terms of coping with a high rate of change. Maybe two orders less capable. The hype around blockchain ensures it gets all the attention, though. Hashgraphs, though, take Merkle trees into a decentralized capability and as such make then very attractive for many application types. There is a bunch of science (Byzantine consensus primarily) to say hashgraphs achieve their compelling capabilities. Capabilities that center on:
- perfectly efficient distributed consensus (no waste)
- leaderless consensus that’s fair
- signed root Merkle nodes proving perfect replicas
Previously the big factor for deciding to use blockchain instead of vanilla Merkle trees was whether the solution was best as a decentralized versus centralized one. Centralized Merkle trees could have verified replicas at a modest cost of replication, but you still had to trust the centralized node, and trusting the decentralized collective was always going to be more attractive.
Another factor that feeds into that decision as to whether you could implement an idea in a vanilla blockchain is “rate of change” (as mentioned). Too high a rate of change and centralized Merkle trees are not viable (too hard for subscribers to keep up). But where they do work they might be an order simpler to implement than blockchains. Also an order simpler to administer and keep in sync with if that desired. Plain Merkle trees can represent a complex hierarchical model that has momentary stability. Stability long enough for the tree to be verified as “not tampered with”. And specifically verified on a subscribing system that is separate (a replica) to the canonical centralized Merkle tree. The hashgraph modification to this means that there’s not just one canonical tree, but many that a provably in step with each other and any of those could be used to verify an offline replica of the merkle tree.
Anyway, I have three examples of application where an architect would want choose correctly upfront between Merkle tree (plain), blockchain, and Hashgraph consensus variant of Merkle trees
Example 1: property ownership in the USA
Spoiler: Plain centralized Merkle tree implementations are perfectly viable, but you still might to a Hashgraph instead
Who owns which lot of land seems like it could be done in a Merkle tree (ignoring privacy aspects). “Who” could also be “non-persons” too. Specifically corporations and non-profits.
Say you are a real estate broker, and you are wanting to check whether a person standing in front of you actually owns a house that they claim they want to sell. You could check a hypothetical centralized federal Merkle tree for that:
(note: not a real domain or link).
That simple Apache-style web server could serve YAML like so:
zipCode: 28803 ownerNameHash: a6f9068958b2d9463e20fff45374f5cd taxpayerHash: 76115fedc9ff5e6655779dc2f59be6a0 mortgageInterest: none description: 6950.4 acre historic mansion, grounds, and tourist attraction ancilliary: # all optional... alsoKnownAs: Biltmore Estate nps-xref: https://npgallery.nps.gov/AssetDetail/NRIS/66000586 wikipedia: https://en.wikipedia.org/wiki/Biltmore_Estate
The owner name hash would be an MD5 function similar to:
echo -n William A V Cecil | md5sum
As could the SSN in the line below that.
To make this one record into a navigable a Merkle tree (over Apache, NGINX etc), SHA1 hashes would be at:
https://realityownership.gov/.sha1 https://realityownership.gov/NC.sha1 https://realityownership.gov/NC/Asheville.sha1 https://realityownership.gov/NC/Asheville/Lodge_St.sha1 https://realityownership.gov/NC/Asheville/Lodge_St/1.sha1
If something about this registration changed in the federal records, then all the sha1 files back to root would be changed in order, so that it was clear to subscribers.
Of course, our imaginary real estate broker does not have to subscribe to all changes. They just need to be able
to look up records so that they can verify the claimed owner in front of them, or they may subscribe to
because they want records for the whole city they do business in.
There are about 2,100 residence sales per weekday in the US and that’s not too much for a simple Merkle tree subscriber solution. Even a great number of subscribers who want all or part of the tree.
A ‘simple’ Apache or Nginx solution serving up the above directory structure from a filesystem that is reasonably fast would work even if the read load were 1000x the write load. So would serving out of S3, or via a CDN, or with load balancers in front of a cluster.
Of course, some sort of central agency has to pay for all that hosting for the benefit of all, but if they wanted to they could charge a tiny fee for each GET operation, but that would require an account system which makes the build out much more complicated, and maybe even the most expensive part of the whole proposition.
The key here is that a subscriber to the whole tree could (multiple times a day) catch up with the changes and be able to verify their own copy of the tree as correct. Specifically, authentic and not tampered with, as well as current.
Hashes vs SSNs and privacy in public
Of course the nefarious in society will attempt to pre-calculate hashes for people. ‘William A V Cecil’ in our case. And from that, you could have a dictionary of all Americans so that reverse engineering real names is easy. Even if you passed a law saying not to do that, it would be done anyway. So maybe a hash with more clashes - like CRC-16 when you want there to be some clashes so that nobody builds a dictionary, but now the realtor’s confidence about the person standing in front of them drops some :-(
The same entry, with CRC-8 hashes of key data (which are clearly going to cause clashes - good)
zipCode: 28803 ownerNameHash: 0xBCCF taxpayerHash: 0x5D58 mortgageInterest: none description: 6950.4 acre historic mansion, grounds, and tourist attraction ancilliary: # all optional... alsoKnownAs: Biltmore Estate nps-xref: https://npgallery.nps.gov/AssetDetail/NRIS/66000586 wikipedia: https://en.wikipedia.org/wiki/Biltmore_Estate
I imagine 50 Merkle tree servers/solutions for political reasons. It fascinates me that all states in the US are actually in competition with the Federal government and each other at the same time. State lawmakers would pass laws to ensure that the something that purports to be for a state would be hosted and managed within that state (meaning 50 solutions), While one federal server could maintain a single hash of the 50 hashes for each state, most likely nobody would bother dealing with that as a subscriber. They’d instead just listen on root hashes changing for the 50 states themselves.
Despite being political, that is sharding of sorts and it boosts the performance of the system (not a big requirement itself). Each state’s server could have several mirrors. And create 50 hashgraph shards, each with multiple full nodes. Again, you’d expect about the same speed for the servers and hashgraphs. And
Perfectly possible too, with additional benefits - the realators and lawyers doing the conveyancing could have their own hashgraph nodes and the process of house sale/purchase be decentralized (no gatekeeper), fair, robust (no single server being down or compromised is going to mess it up). You might want to go that way if you wanted more and more data in the record for the property. Year 2: add plot specifics; year 3 add zoning info and cross references to utilities.
Example 2: Credit Scores/Reports
Spoiler: Merkle trees with a hashgraph boost, instead of Blockchains.
Blockchain credit reports?
There is indeed, at least one proposal for a blockchain (not hashgraph) implementation of credit scores from Bloom. I posed one question after their announcement:
Posted on Sept 7th, 2017 but not answered. A few hours later (early on that Friday morning) I posted the same question on twitter:
Hey @bloomtoken how would a posting on the blockchain version of a credit score get expunged? - in the case of ID theft or simple mistake.— Paul Hammant (@paul_hammant) September 8, 2017
I think it is going to be a challenge to get Blockchain startups be open to questions. Or if they have those answers accessible/collated for others to see. In my case there’s probably an easy answer, but most blockchain startups are initially super focused on securing funding for their vision, and questions from people without checkbooks don’t grab much attention.
Hashgraph boosted Merkle tree credit reports?
There are more credit score ‘moves’ in a day than property sales. Maybe 1000x as much. That could be too much for a simple Apache/S3 Merkle tree solution. At least for the “I wanna subscribe to all changes from root” point of view. Does that suit blockchain more than Merkle trees, then? Maybe not but the platform needed to be able to manage that rate of change (remember all the parent sha1 files need to be changed whenever a leaf node changes) is much more serious than a simple file system presented by a simple web server (property ownership above). Maybe there’s no moment in a day when the changes to that Merkle tree reduce enough for subscribers to catch up and verify their own tree.
It could be that this is better as a hashgraph boosted Merkle tree implementation. That presents many new challenges, though.
Should the Merkle tree be public? If yes, the data has to be encrypted in public, and we know how to do that. Notwithstanding DeCSS and the illegal prime number DVDs have a content protection system that allows manufacturers of DVD players to read the encrypted content of the DVD. First, they access a file [#1] on the DVD that’s been encrypted with their public key, and use their private key ()hidden deep within the chips they have in the DVD player) to retrieve another secret key to allow them to decode the content. All other manufacturers get to the same secret key (unique for that movie/recording) via their own initial #1 file. And it’s more about chip makers really, with minor manufacturers being licensees of appropriately commercially controlled chips.
With DVDs, keys can’t be revoked after one use. But in a Merkle tree not burned into a substrate they can be revoked. The three consumer reporting agencies (CRAs) could cooperate on four files for one credit report:
https://cr-scores.transunion.com/CEA57AD6-equifax.key https://cr-scores.transunion.com/CEA57AD6-transunion.key https://cr-scores.transunion.com/CEA57AD6-experian.key https://cr-scores.transunion.com/CEA57AD6.rpt https://cr.equifax.com/CEA57AD6-equifax.key https://cr.equifax.com/CEA57AD6-transunion.key https://cr.equifax.com/CEA57AD6-experian.key https://cr.equifax.com/CEA57AD6.rpt https://scores.experian.com/CEA57AD6-equifax.key https://scores.experian.com/CEA57AD6-transunion.key https://scores.experian.com/CEA57AD6-experian.key https://scores.experian.com/CEA57AD6.rpt
(four files perfectly replicated between three CRAs)
The first three are same 2048 bit private key for the fourth (.rpt), but totally encrypted with public
keys pertinent to private keys that each has not shared (obviously). The three can casually read the contents
of the .rpt without notifying each other. In there is, say, William A V Cecil’s credit report. But if the
credit report changes (that number goes up or down because of an event), the directory
disappears from the tree and Mr Cecil’s credit report appears at some other random place in the tree. Using Merkle-tree
techniques, that can percolate from one of the three to the other two quite easily. If it was not obvious
a new unique private key is generated for the new .rpt file (the old one falls into nonuse, which amounts to
revocation). The three consumer reporting agencies keep totally private correlation of social security numbers
to that hash. They also update that table when changes arrive in the Merkle tree.
If some authorized entity wanted to do a pull on a credit report, then they could pay one of the three consumer reporting agencies for the data, and they’d get it through the usual channel, or be directed to the node in the tree and also be handed the contents of the key to .rpt. They’d have to pull the data within one minute because the credit reporting agencies would initiate an update to the credit report after that which makes it jump to somewhere else in the tree.
To be a Merkle tree those intermediate directories needs to exist too (with SHA1s):
https://scores.transunion.gov/.sha1 https://scores.transunion.gov/C/.sha1 https://scores.transunion.gov/C/E/.sha1 https://scores.transunion.gov/C/E/A/.sha1 https://scores.transunion.gov/C/E/A/5/.sha1 https://scores.transunion.gov/C/E/A/5/7/.sha1 https://scores.transunion.gov/C/E/A/5/7/A/.sha1 https://scores.transunion.gov/C/E/A/5/7/A/D/.sha1 https://scores.transunion.gov/C/E/A/5/7/A/D/6-equifax.key https://scores.transunion.gov/C/E/A/5/7/A/D/6-transunion.key https://scores.transunion.gov/C/E/A/5/7/A/D/6-experian.key https://scores.transunion.gov/C/E/A/5/7/A/D/6.rpt
Each report would have its own random number of course (hex above).
Where does a hashgraph come in to play (versus an ordinary Merkle tree in an apparent file system)? Well, the mutation of the tree and desired secrecy in public means that we cannot have a social security number in a directory or file name. We instead have to have something random. A SHA1 would be too long for that - 5050a26de967967f456bf0033430e72d40671adc. Specifically too long to be efficient in Merkle calculations. Eight hexadecimal digits would be about right (4bn permutations) and efficient enough. However, how to we turn a SSN into that eight-digit hexadecimal number? It’s not a constant function like SHA1 itself or CRC. It needs to be random, but it also needs to not clash with something else in the tree. Someone else in fact. Best that the three CRAs use a hashgraph to efficiently disseminate changes but also to allocate a random but non-clashing key for credit reports.
You could also claim that this does not need to be publicly accessible at all, and is a private concern of three companies who have APIs for other needing to pull credit reports, but you’d be spoiling the fun discussing the above. Seriously, there is a case to be made for the person in question receiving the key that’s otherwise encrypted and being able to look at their own credit report. Not only that, but being able to be assured that companies seeking the report will be given the same thing. The only thing we the designers of this would need to worry about is them passing the key to other, including the same company that would want to avoid the fees associated. Because, of course, the person in question having the key does not cause the record to be deleted and reappear somewhere else in the the tree with a new key (they can peek at it without affecting their credit report).
We should also note perhaps that there’s a bit of a reality stretch here - the three CRAs compete with each other and maintain separate scores, and would not rush to a system that could admit a fourth and a fifth at their level of operation.
Example 3: Multipart exchange trading.
Spoiler: A plain Merkle tree is inadequate here. Blockchains and Hashgraphs (sharded) are competing to be the solution alone.
Reconciliation (how much was traded and for what price precisely, etc) being potentially time consuming (and often manual) is a bone of contention in the industry today, and one of the reasons for lots of people wishing for something better.
Prime brokerages and the exchange in question having a provable trail of who bought what and when and from whom (etc) is highly desirable. A Blockchain could do it (as we can see from Bitcoin) but the volumes required and rapidity of it all make you wish for more efficient solutions.
On busy exchanges like the NYSE, there are a few orders too many for a plain centralized Merkle trees to be viable, so they are not that more efficient solution. A sharded one (somehow) with a Hashgraph consensus aspect is definitely viable. I’ll detail that here.
The naive Merkle tree design to the storage of trades feels like it could work:
https://10.2017.t.nyse.com/.sha1 https://10.2017.t.nyse.com/31/.sha1 https://10.2017.t.nyse.com/31/11/.sha1 https://10.2017.t.nyse.com/31/11/59/.sha1 https://10.2017.t.nyse.com/31/11/59/ss23tsk7djh.trade https://10.2017.t.nyse.com/31/11/59/ss23tsk7djh.sha1
Those were trades, and you could have corresponding accounts in a Merkle tree too (not shown). Accounts and trades have different privacy aspects which we’ll skip over here, but it is doable with an in-public with public key encryption in the same way as it is done in Blockchains
It’s also clear (or should be) that the trade data is eternal. At some point whole years can be shifted from active systems to entirely static URLs on more junior hardware, because the load on the servers has an inverse relationship with age. And by shift, I mean the URL remains the same, but there was an overnight cut-over to say Apache, Nginx (or better). Maybe even with load balancers initially, but given it’s all read ony after a certain point and read less and less, maybe not. Hence the 10.2017 domain names (MM.CCYY) to facilitate past months to shift to read only (no more trades in a month that’s finished).
Hashgraphs are a significant invention that suits a number of categories of application, and promises to eat into the pie that blockchains has thought its own for a while. You should research this stuff now, and operationalize within a year, if suitable. Possibly with partners, and a bunch of expense is about to removed from a number of industries and you don’t want to be left out.
Note: This blog entry has been a couple of months in the making. It was firstly plain ‘Merkle tree vs Blockchain’ with advice from Ross Pettit, but has pivoted to be ‘vs Hashgraph’ with advice from Leemon Baird the co-founder & CTO of swirlds and inventor of the of the entire process.