Scraping a JIRA site

Sometimes you have to say goodbye to an online service. It’s often the data that you want, but only in a form that’s usable to you. When the service is a web-app, statically hosted HTML is my preference. We used to use cURL or Wget for that. The advent of JavaScript means that the result is often incomplete, as somebody pointed out on a blog or on Reddit recently. The reasons for that could include:

  • JavaScript in the saved page may later try to execute with no server attached, which is problematic.
  • Parts of the page could be missing after a simple file-based scrape.
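One rough way to see whether a saved page is exposed to either problem is to count the script tags it still carries. The following is a minimal Python sketch using only the standard library; `ScriptCounter` and `script_report` are hypothetical names made up for this illustration, not part of any tool mentioned here.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Collects <script> tags, split into external (src=...) and inline."""

    def __init__(self):
        super().__init__()
        self.external = []  # src values of external scripts
        self.inline = 0     # number of inline script blocks

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline += 1


def script_report(html_text):
    """Return which scripts a saved HTML document still depends on."""
    parser = ScriptCounter()
    parser.feed(html_text)
    parser.close()
    return {"external": parser.external, "inline": parser.inline}


if __name__ == "__main__":
    saved = ('<html><head><script src="/a.js"></script>'
             '<script>var x = 1;</script></head>'
             '<body><p>hi</p></body></html>')
    print(script_report(saved))
```

A page saved with Wget or cURL that reports many scripts is a candidate for the problems above; a page serialized after the scripts have run should not need them to render.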

What I’ve made is a Sikuli script that uses the MAFF plugin for Firefox to scrape JIRA pages after JavaScript has done its business. Meaning, it serializes the DOM back to HTML in a way that’s perfectly renderable without any need for JavaScript running in the resulting page whatsoever. Here is the GitHub project. The script is about 1,000 times slower than Wget or cURL for the same job, but that does not matter for my use case. The result of running the tool over PicoContainer’s 382 issues, scraped for posterity, is here, and for those too busy to click, looks like this:
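Before driving a browser over every issue, it helps to have the list of pages to visit and the file names to save them under. Here is a minimal Python sketch of that bookkeeping only, not of the Sikuli script itself. It assumes the usual JIRA issue address of the form `/browse/KEY-N`; the host `jira.example.invalid`, the project key `PICO`, and the function name `issue_targets` are placeholders chosen for illustration.

```python
def issue_targets(base_url, project_key, last_issue):
    """Build (url, filename) pairs for issues 1..last_issue.

    Assumes issue pages live at <base_url>/browse/<KEY>-<N>.
    """
    base = base_url.rstrip("/")
    targets = []
    for n in range(1, last_issue + 1):
        key = "%s-%d" % (project_key, n)
        targets.append((base + "/browse/" + key, key + ".html"))
    return targets


if __name__ == "__main__":
    # Placeholder host and project key; 382 is the issue count from the post.
    for url, filename in issue_targets("https://jira.example.invalid/", "PICO", 382)[:3]:
        print(url, "->", filename)
```

Each pair could then be handed to whatever does the slow part: opening the page in a real browser, waiting for JavaScript to finish, and saving the serialized DOM.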

Of course JIRA is fantastic and there are many reasons to keep on using it, or in an enterprise, keep paying for it. Sometimes though, the lesser capability of the simple tracker in GitHub is enough going forward, especially if your project is past the peak of its contribution. The script I made could stand to be improved - there are internationalization issues amongst other things (pull-requests welcome).

Codehaus Memories

Of course, all this is because Codehaus is shutting down. I had (or still have) a few projects at Codehaus: PicoContainer and Paranamer. I helped out with many others: JBehave, XStream, and QDox.

I’m sure Bob McWhirter, Ben Walding and famous Java-land hackers like Jason van Zyl and James Strachan had already discussed a portal like Codehaus for a month or two, but I became super-keen as soon as I heard about it in 2002. Lots of London-based ThoughtWorkers were excited too. Reasons?

  1. It was not Apache, and would not be constrained by The Apache Way (tm)
  2. It would use Subversion, even though it was only v0.6 or thereabouts at the time (CVS was all we had before that)
  3. It would use JIRA, because it was a billion times better than Bugzilla and anything else. Confluence, as a choice for project documentation, came soon after.

Ben and Bob turned out to be fantastic, and benevolent, overlords for the platform. A referral system worked well to bring new projects in. Those could be ‘from SourceForge’ if the project existed already, but projects were far more likely to have started on Codehaus. I can look back at Codehaus and state that I had zero complaints. Git was created later and quickly became a choice on the platform. Projects that wanted both Subversion and Git on Codehaus got both. A pragmatic “sure thing” attitude prevailed there - it was great.

Codehaus leaped into being before blogging really took off. I mean, before everyone decided to start their own blog. There was a collection of people with previous OSS experience all gathered on Codehaus who were active bloggers. For a couple of short years, those bloggers were a significant channel of information for at least the Java community.

Another factor that was prevalent for Codehaus people from the early days was a direct rejection of Sun’s J2EE designs. Perhaps that was mostly EJB, but the wish for “lightweight” was strong amongst Hausmates, and the willingness to make components that were the antithesis of the J2EE doctrine was strong too. Martin Fowler later blogged about Dependency Injection as a lightweight force, but Codehaus activities (often collaborative) to sidestep Sun’s ownership of Java were already in full force at that moment.

Being invited to be a Hausmate and meeting (mostly electronically) dozens of other hard-core OSS people was a privilege. I’ve many lasting friendships that started on Codehaus. Good things can come to an end too - thanks Ben and Bob.

March 20th, 2015