Paul Hammant's Blog: Scraping DZone Syndication Stats
Update (Sept 2015): scroll to the bottom of the page for better script.
DZone syndicate some of my blog entries. I used to keep a list by hand, but a manual process like that easily falls out of step with reality, so why not automate it?
The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:
@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
    navigate().to("http://dzone.com/users/paulhammant")
    // keep clicking the "More" button until it disappears
    def articleElements = []
    while (true) {
        def moreButton = By.xpath("//button[@class='more-button']")
        try {
            // the button may arrive in the page slowly; time out after 3 secs
            new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
            findElement(moreButton).click()
        } catch (TimeoutException e) {
            break
        } finally {
            articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
        }
    }
    def results = new Expando()
    results.from = null
    results.to = null
    List articles = new ArrayList()
    // turn each article element (and its child elements) into a POJO
    articleElements.each { link ->
        def divs = link.findElements(By.tagName("div"))
        def article = new Expando()
        divs.each { div ->
            // an article and the stats for it are sibling elements, not parent/child
            if (div.getAttribute("class").contains("stream-article")) {
                def a = div.findElement(By.tagName("a"))
                article.title = a.getText()
                article.url = a.getAttribute("href")
            }
            if (div.getAttribute("class").contains("activity-stats-group")) {
                def divs2 = div.findElements(By.tagName("div"))
                divs2.each { div2 ->
                    def text = div2.getText()
                    if (text.endsWith("VIEWS")) {
                        article.views = Integer.parseInt(text.replace("VIEWS", "").replace(",", "").trim())
                    }
                    if (text.endsWith("COMMENTS")) {
                        article.comments = Integer.parseInt(text.replace("COMMENTS", "").replace(",", "").trim())
                    }
                    if (text.startsWith("on ")) {
                        article.date = Date.parse("MMM dd, yyyy", text.substring(text.indexOf("|") + 1, text.indexOf(".")).trim())
                    }
                }
            }
        }
        articles.add(article)
        if (results.from == null || article.date < results.from) {
            results.from = article.date
        }
        if (results.to == null || article.date > results.to) {
            results.to = article.date
        }
    }
    results.articles = articles.reverse() // JSON diffs look better this way
    results.when = new Date()
    new File("DZone.json").withWriter { out ->
        out.write(new JsonBuilder(results).toPrettyString())
    }
    quit()
}
The JSON is consumed by AngularJS, and you can see it at https://paulhammant.com/dzone.html. Because the page is rendered with AngularJS, it won't be indexed by search crawlers at present. That's not a problem for me, really, as I don't want DZone's copies of my articles to rank higher than the originals. I note that DZone sometimes change the titles.
The JSON is committed to Github, and I can watch changes over time from the comfort of an armchair:
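With the JSON in Git, the history itself becomes the change log. A minimal sketch of that armchair-watching, assuming DZone.json is committed after each scrape (the file and revision names here are just examples):

```shell
# Assumes the current directory is a Git repo with DZone.json
# committed after each scrape run.
git log --oneline -- DZone.json   # one line per scrape that changed the stats
git diff HEAD~1 -- DZone.json     # view/comment deltas since the previous run
```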
Perhaps this is only interesting to people whose blogs are aggregated into DZone, and only until the DZone people provide a proper feed. Of course, they may have one already and I missed it.
Sept 2015 - new DZone site, new script
This one is faster, and uses bash with wget against an API rather than Groovy/Selenium:
#!/bin/bash
get_articles() {
    author=1008633
    fname=$(printf "%02d" "$1")
    wget -qO- "https://dzone.com/services/widget/article-listV2/list?author=$author&page=$1&portal=all&sort=newest" \
        | jq '.result.data.nodes' \
        | jq 'map(del(.authors) | del(.tags) | del(.saveStatus) | del(.acl) | del(.editUrl) | del(.imageUrl) | del(.articleContent) | del(.id))' \
        > "dz_$fname.json"
}
for i in {1..15}
do
    get_articles "$i"
done
# some of those pages are empty; delete 'em
find . -name "dz_*" -size -10c -delete
# make one big list, and prettify so that diffs stay small
cat dz_*.json | tr '\n' ' ' | sed 's/\] \[/,/g' | jshon -S > dzone.json
# delete the workfiles
find . -name "dz_*" -delete