So I’ve asked a bunch of colleagues, and there’s no definitive Comp-Sci categorization for what I’m about to discuss. I wish there were, but there is not, so I’m going to provisionally title the two categories “Human-centric source formats” and “Machine-centric source formats”, and will correct the names when a Donald Knuth type intellect weighs in later.

Human-centric source formats

Examples are C, Java, Ruby, Python, JSON, YAML, S-Expressions.

They are primarily parsed by one or more of Yacc, Bison, flex, Antlr and similar technologies.

Of the seven I mentioned, the first four are Turing-complete procedural programming languages, and the last three are ‘good’ formats for declarative data payloads. Functional languages are also in this category, though a larger brain is often required.

Being source formats, they suit version control. They are also eminently diffable, and consequently merge well (though I think the bar could be lifted with diff semantics that are specialized to each).
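To make the "diffable" point concrete, here is a minimal sketch using only Python's standard library. The helper name diff_json and the sample payloads are made up for illustration; the idea is simply that a pretty-printed JSON payload changes one line at a time, which is what line-based diff and merge tools want.

```python
import difflib
import json

def diff_json(old, new):
    """Return unified-diff lines between two pretty-printed JSON documents."""
    # Sorting keys and indenting gives a stable, one-field-per-line layout.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "old.json", "new.json", lineterm="")
    )

# Hypothetical payloads: only the "size" field differs.
before = {"name": "widget", "size": 3}
after = {"name": "widget", "size": 4}

changes = diff_json(before, after)
print("\n".join(changes))
```

Only the "size" line shows up as removed and re-added; the rest of the payload appears as unchanged context. A format-aware diff could go further and report "size changed from 3 to 4", which is the sort of specialized diff semantics I mean.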

There is also a new alternative to YAML and JSON called “Tuple Markup”. See the project page. It is too early to say whether this technology has staying power or not.

Also note that Python and YAML are stricter about indentation, with white-space rules that govern the nesting of scope. I’ve always thought that this gives them an advantage in parsing speed.
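For the curious, Python's own tokenize module shows how that indentation is handled: changes in leading white-space are turned into explicit INDENT and DEDENT tokens before the parser proper sees anything. The sketch below only illustrates that mechanism and says nothing either way about speed; the function name layout_tokens and the sample source are mine.

```python
import io
import tokenize

# A tiny program with one nested block.
SOURCE = "if x:\n    y = 1\nz = 2\n"

def layout_tokens(source):
    """Return the names of the INDENT/DEDENT tokens emitted for `source`."""
    names = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.INDENT, tokenize.DEDENT):
            names.append(tokenize.tok_name[tok.type])
    return names

print(layout_tokens(SOURCE))  # prints ['INDENT', 'DEDENT']
```

So the block structure that C or Java expresses with braces arrives at the Python parser as a matched pair of tokens derived purely from white-space.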

Machine-centric source formats

Examples are the SGML derivatives XML and HTML.

They are typically NOT parsed by YACC, Bison/flex, Antlr (or similar). Instead, ‘SAX’ and ‘DOM’ are the APIs usually referred to for processing them programmatically. Actually that’s more XML, as HTML has some historical processing strategies that allow for less regularity and even incompleteness. In terms of parsing again, there is a chance that they are generally faster to parse than the human-centric ones above.
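As a rough illustration of the two styles, here is the same small XML document processed both ways with Python's standard library (xml.sax and xml.dom.minidom). The document, the ItemCounter class and the two helper functions are invented for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document used by both approaches.
DOC = "<order><item>tea</item><item>milk</item></order>"

class ItemCounter(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        if name == "item":
            self.count += 1

def count_items_sax(text):
    handler = ItemCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count

def item_texts_dom(text):
    """DOM style: build the whole tree in memory, then query it."""
    tree = minidom.parseString(text)
    return [node.firstChild.data for node in tree.getElementsByTagName("item")]

print(count_items_sax(DOC))  # prints 2
print(item_texts_dom(DOC))   # prints ['tea', 'milk']
```

SAX never holds the whole document, so it suits large inputs; DOM costs memory but lets you navigate freely once the tree is built.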

One characterizing aspect is the verbosity (particularly closing-elements) and the presence of angle-brackets. There is also a general suitability for declarative payloads. These three together often result in people getting lost trying to read complex documents. As such, historical attempts to make a general purpose Turing-complete XML-based language have failed. That said, for HTML there are six alternate technologies trying to add on discrete Turing-completeness. They are Angular, Knockout, Batman, Olives, FunnyFace and Dermis. The first three are the dominant ones, and are actually named AngularJS, Knockout.js and Batman.js, suggesting that JavaScript is a bigger factor than HTML.


Google’s Protobufs is an example of something that is not ‘source’ in its ultimate encoding. See their encoding page.
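To give a flavour of why that encoding is not ‘source’, here is a minimal sketch of the base-128 varint scheme that the encoding page describes, written in plain Python rather than with the protobuf library. The function names are my own, and it only covers non-negative integers in a varint field (wire type 0), nothing else from the format.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            # More bytes follow, so set the continuation bit.
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def encode_varint_field(field_number, value):
    """Encode one varint field: a tag (field number plus wire type 0), then the value."""
    tag = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)

print(encode_varint(300).hex())              # prints ac02
print(encode_varint_field(1, 150).hex())     # prints 089601
```

Three bytes carry "field 1 is 150", with no field name and nothing a person would care to read or diff, which is the opposite end of the spectrum from the formats above.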

Follow ups

I’m going to follow this entry up with something that talks more about suitability in UI markup.

Thanks to colleagues Ola Bini, Ben Butler Cole and Graham Brooks for helping with drafts of this.


January 12th, 2013