So I’ve asked a bunch of colleagues, and there seems to be no definitive Comp-Sci categorization for what I’m about to discuss. I wish there were, but there is not, so I’m going to title the two categories provisionally, and will correct the names if a Donald Knuth type intellect weighs in later: “Human-centric source formats” and “Machine-centric source formats”.

Human-centric source formats

Examples are C, Java, Ruby, Python, JSON, YAML, S-Expressions.

They are primarily parsed by one or more of Yacc, Bison, flex, Antlr and similar technologies.

Of the seven I mentioned, the first four are Turing-complete procedural programming languages, and the last three are ‘good’ formats for declarative data payloads. Functional languages are also in this category, though a larger brain is often required.

Being source formats, they suit version control. They are also eminently diffable, and consequently merge well (though I think the bar could be raised with diff semantics that are specialized to each).
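To make "eminently diffable" concrete, here is a minimal sketch using Python's standard difflib module. The two YAML snippets and the file labels are invented for illustration, and this is a plain line-based diff, not the format-specific diff semantics wished for above.

```python
import difflib

def unified_diff(before: str, after: str) -> list[str]:
    """Return the unified-diff lines between two versions of a text file."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="config.yaml (before)",   # hypothetical file labels
        tofile="config.yaml (after)",
        lineterm="",
    ))

# Two made-up revisions of a small YAML document.
BEFORE = "server:\n  host: example.org\n  port: 8080\n"
AFTER = "server:\n  host: example.org\n  port: 9090\n"

DIFF_LINES = unified_diff(BEFORE, AFTER)
print("\n".join(DIFF_LINES))
```

Because each value sits on its own line, the change shows up as one removed line and one added line, which is the property that lets version control tools merge such files reasonably well.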

There is also a new alternative to YAML and JSON called “Tuple Markup”. See the project page. It is too early to say whether this technology has staying power or not.

Also note that Python and YAML are stricter about indentation, and have white-space rules that govern the nesting of scope. I’ve always thought that this gives them an advantage in parsing speed.
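One way to see how Python treats indentation as structure is to look at what its own tokenizer emits. The sketch below uses the standard tokenize module; the three-line SOURCE snippet is invented for illustration. It only shows that white-space becomes explicit INDENT and DEDENT tokens, and says nothing either way about parsing speed.

```python
import io
import tokenize

def indentation_tokens(source: str) -> list[str]:
    """Return the names of the INDENT/DEDENT tokens found in the source."""
    names = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.INDENT, tokenize.DEDENT):
            names.append(tokenize.tok_name[tok.type])
    return names

# A made-up snippet with one nested block.
SOURCE = "if ready:\n    go()\nstop()\n"

print(indentation_tokens(SOURCE))  # prints ['INDENT', 'DEDENT']
```

The block structure that C or Java would mark with braces arrives at the parser as a pair of tokens derived purely from leading white-space.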

Machine-centric source formats

Examples are the SGML derivatives XML and HTML.

They are typically NOT parsed by Yacc, Bison/flex, Antlr (or similar). Instead, ‘SAX’ and ‘DOM’ are the APIs usually referred to for processing them programmatically. Actually that’s more XML, as HTML has some historical processing strategies that allow for less regularity and even incompleteness. As for parsing speed, there is a chance that they are generally faster to parse than the human-centric ones above.
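For readers who have not met the two styles, here is a minimal sketch of SAX and DOM processing using Python's standard library (xml.sax and xml.dom.minidom). The sample document, the ItemCollector class and the two helper functions are invented for illustration.

```python
import xml.sax
from xml.dom import minidom

# A made-up XML document for illustration.
XML = "<order><item>tea</item><item>milk</item></order>"

class ItemCollector(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the text."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_item:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._in_item = False

def items_via_sax(text: str) -> list[str]:
    handler = ItemCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.items

def items_via_dom(text: str) -> list[str]:
    """DOM style: build the whole tree in memory, then query it."""
    doc = minidom.parseString(text)
    return [node.firstChild.data for node in doc.getElementsByTagName("item")]

print(items_via_sax(XML))  # prints ['tea', 'milk']
print(items_via_dom(XML))  # prints ['tea', 'milk']
```

SAX keeps almost nothing in memory but pushes state tracking onto the caller, while DOM trades memory for the convenience of navigating a complete tree.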

One characterizing aspect is the verbosity (particularly closing-elements) and the presence of angle-brackets. There is also a general suitability for declarative payloads. These three together often result in people getting lost trying to read complex documents. As a result, historical attempts to make a general purpose Turing-complete XML-based language have failed. That said, for HTML there are six alternate technologies trying to add on discrete Turing-completeness. They are Angular, Knockout, Batman, Olives, FunnyFace and Dermis. The first three are the dominant ones, and are actually named AngularJS, Knockout.js and Batman.js, suggesting that JavaScript is a bigger factor than HTML.

Others.

Google’s Protobufs is an example of something that is not ‘source’ in its ultimate encoding. See their encoding page.
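To show why the encoded form is not ‘source’, here is a minimal hand-rolled sketch of the base-128 varint scheme that the Protobuf wire format uses for integer fields. The function names are invented for illustration and this is not the protobuf library's API; real code would use the generated classes.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)

def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: a key (field number, wire type 0) then the value."""
    key = (field_number << 3) | 0    # wire type 0 means varint
    return encode_varint(key) + encode_varint(value)

# Field number 1 holding the value 150.
print(encode_varint_field(1, 150).hex())  # prints 089601
```

Three opaque bytes carry what JSON would spell out as a readable name and number, which is compact for machines but not something a person would diff or edit by hand.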

Follow ups

I’m going to follow this entry up with something that talks more about suitability in UI markup.

Thanks to colleagues Ola Bini, Ben Butler Cole and Graham Brooks for helping with drafts of this.


Published

January 12th, 2013

Syndicated by DZone.com

Comments formerly in Disqus, but exported and mounted statically ...

Sat, 12 Jan 2013	Frank Carver
Hmm. I'm not sure I entirely agree with your categorizations here. First, I have written plenty of parsers, and in my experience JSON is a format more suitable for machines than humans. The mandatory quotes on all text should tell you that. Try writing JSON by hand and they will catch you out every time. Someone somewhere has probably written a JSON parser in yacc or antlr, but there is no need to: the JSON grammar is a classic recursive-descent design which can easily be parsed in a couple of hundred lines of basic Java (see https://github.com/efficacy... ). Second, you have chosen a very small subset of the range of languages and formats to base your decision on. In particular, the four programming languages you mention (C, Ruby, Java, Python) are actually very similar. Sure, Python uses indentation rather than semicolons, but that's about it. Where would old classics such as FORTRAN, COBOL, APL, or FORTH fit? What about PHP? Perl? SQL? PDF? I can see that you are taking a particular view here, but you might nonetheless be interested in a blog post I wrote a few days ago which attempts to categorize programming languages along a slightly different scale: http://raspberryalphaomega....
Sat, 12 Jan 2013	paul_hammant
Yes, you're right, I missed off SQL (human), PostScript & Forth (human, but very hard). And you're right that JSON is unforgiving in the event of a missed quote. Indeed, during my research I found specific Yacc/Antlr usages for XML too. So it's not even the genus of parser technology that divides them (though in previous blog entries I've cited Yacc-based forms). PHP (and ASP, JSP, ColdFusion) are left out, but perhaps they are closer to the AngularJS line of thought, which is an augmented HTML. Implicit within the more advanced "human centric" cases is the complexity that comes with terseness. Would Haskell be the most complex? Or perhaps Erlang? Or is that just inaccessibility born from unfamiliarity, and from not having done thousands of hours of practice? Maybe a Venn diagram or a Gartner quadrant is needed. I've toyed with the same blog posting for many years (and touched on the theme before), but there are not enough axes to describe what I think is the sweet spot for the message I'm ultimately trying to give, a description of which this blog entry is just part one. As a courtesy, I'll read your blog entry too (see you in email).

Paul Hammant's Blog: Categorizing Languages