Scraping GitHub pull requests and their code review comments

GitHub stores its pull-request and code-review data in MySQL. I’d much prefer a Git representation for both (JSON, commits, audit trail, etc.), kinda the way GitHub Wiki pages are stored. That’s an aside though; this article is about storing code-review comments long term. The problem I’m trying to solve is the deletion of users, which causes their pull-request commentary to also get deleted. Sure, the commits make it back to origin/master (if the pull request is processed), but many things are left associated with the fork. If the user gets deleted, such info is gone forever :(

I want a permanent copy, so the interim answer is to scrape the data I fear losing, while it still exists.

Hence a scrape-pull-requests.sh bash script (for Mac and maybe Linux). GitHub’s portal is written in Ruby on Rails. It is extremely fast, which helps scraping generally. There’s not a lot of JavaScript, and that means that Wget is a viable extraction tool. Anyway, the script runs quickly and leaves a decent HTML interface for easy access later. I’ve tested it, but won’t leave up a scraped set of pull requests as our GH overlords might object on copyright grounds.

They can’t object for your own GitHub Enterprise instance, of course. GitHub could change the structure of their HTML, and the script might stop working so well. If that happens I’m happy to accept pull requests via the usual mechanism :)
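For a rough idea of what a Wget-based scrape of pull-request pages can look like, here is a minimal sketch. It is not the scrape-pull-requests.sh script itself: the function names, the OWNER/REPO arguments, the numeric range, and the output directory are all hypothetical, and it assumes each pull request lives at a URL of the form https://github.com/OWNER/REPO/pull/NUMBER.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: save the HTML page of each pull request in a numeric range.
# Owner, repo, range, and output directory are placeholders supplied by the caller.

# Build the URL of one pull-request page (assumed URL layout).
pr_url() {
  local owner="$1" repo="$2" number="$3"
  printf 'https://github.com/%s/%s/pull/%s\n' "$owner" "$repo" "$number"
}

# Fetch one pull-request page plus the assets needed to view it offline.
fetch_pr() {
  local owner="$1" repo="$2" number="$3" out_dir="$4"
  wget --page-requisites --convert-links --adjust-extension \
       --span-hosts --wait=1 --directory-prefix="$out_dir" \
       "$(pr_url "$owner" "$repo" "$number")"
}

# Usage: scrape_prs OWNER REPO FIRST LAST [OUT_DIR]
scrape_prs() {
  local owner="$1" repo="$2" n="$3" last="$4" out_dir="${5:-pull-requests}"
  while [ "$n" -le "$last" ]; do
    # A missing or private pull request should not abort the whole run.
    fetch_pr "$owner" "$repo" "$n" "$out_dir" || echo "PR $n: fetch failed, skipping" >&2
    n=$((n + 1))
  done
}
```

After sourcing the file, something like `scrape_prs some-owner some-repo 1 50 ./archive` would walk pull requests 1 to 50. The `--wait=1` pause is there to be polite to the server, and `--convert-links` rewrites links so the saved pages can be browsed locally.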


Published

June 27th, 2015


Paul Hammant's Blog: Scraping GitHub pull requests and their code review comments
