Scraping DZone Syndication Stats
Update (Sept 2015): scroll to the bottom of the page for a better script.
DZone syndicate some of my blog entries. I used to keep a list of those by hand, but it easily gets out of step, and it is a manual process anyway, so why not automate it?
The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:
@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
    navigate().to("http://dzone.com/users/paulhammant")

    // keep clicking the "More" button until it disappears
    def articleElements = []
    while (true) {
        def moreButton = By.xpath("//button[@class='more-button']")
        try {
            // the button may arrive in the page slowly - time out after 3 secs
            new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
            findElement(moreButton).click()
        } catch (TimeoutException e) {
            break
        } finally {
            articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
        }
    }

    def results = new Expando()
    results.from = null
    results.to = null
    List articles = new ArrayList()

    // turn each article element (and its child elements) into a POJO
    articleElements.each { link ->
        def divs = link.findElements(By.tagName("div"))
        def article = new Expando()
        divs.each { div ->
            // the article and its stats are sibling elements, not parent/child
            if (div.getAttribute("class").contains("stream-article")) {
                def a = div.findElement(By.tagName("a"))
                article.title = a.getText()
                article.url = a.getAttribute("href")
            }
            if (div.getAttribute("class").contains("activity-stats-group")) {
                def divs2 = div.findElements(By.tagName("div"))
                divs2.each { div2 ->
                    def text = div2.getText()
                    if (text.endsWith("VIEWS")) {
                        article.views = Integer.parseInt(text.replace("VIEWS", "").replace(",", "").trim())
                    }
                    if (text.endsWith("COMMENTS")) {
                        article.comments = Integer.parseInt(text.replace("COMMENTS", "").replace(",", "").trim())
                    }
                    if (text.startsWith("on ")) {
                        article.date = Date.parse("MMM dd, yyyy", text.substring(text.indexOf("|") + 1, text.indexOf(".")).trim())
                    }
                }
            }
        }
        articles.add(article)
        if (results.from == null || article.date < results.from) {
            results.from = article.date
        }
        if (results.to == null || article.date > results.to) {
            results.to = article.date
        }
    }

    results.articles = articles.reverse() // oldest first, so JSON diffs stay small
    results.when = new Date()

    new File("DZone.json").withWriter { out ->
        out.write(new JsonBuilder(results).toPrettyString())
    }

    quit()
}
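For reference, the resulting DZone.json is shaped something like this (a sketch based on the fields the script gathers; the values are made up):

{
    "from": "2014-01-07T00:00:00+0000",
    "to": "2014-12-01T00:00:00+0000",
    "articles": [
        {
            "title": "Some Syndicated Article",
            "url": "http://dzone.com/articles/some-syndicated-article",
            "views": 1234,
            "comments": 2,
            "date": "2014-12-01T00:00:00+0000"
        }
    ],
    "when": "2014-12-16T09:00:00+0000"
}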
The JSON is consumed by AngularJS, and you can see it at https://paulhammant.com/dzone.html. Given I have used AngularJS, the page won’t be indexed by search crawlers presently. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes change the titles, I note.
The JSON is committed to GitHub, so I can watch changes over time from the comfort of an armchair.
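Something like the following is enough to eyeball that drift (a minimal sketch, assuming a checkout of the repository with DZone.json committed at its root):

# full history of the stats file, patch by patch
git log --follow -p -- DZone.json

# or just what changed since the last scrape was committed
git diff HEAD~1 -- DZone.json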
Perhaps this is interesting only to people whose blogs are aggregated into DZone, and only until the DZone people make a proper feed. Of course, they may have that already and I missed it.
Sept 2015 - new DZone site, new script
This one is faster, and uses bash and wget against an API rather than Groovy and Selenium:
#!/bin/bash

get_articles() {
  author=1008633
  fname=$(printf "%02d" $1)
  wget -qO- "https://dzone.com/services/widget/article-listV2/list?author=$author&page=$1&portal=all&sort=newest" \
    | jshon \
    | jq ".result.data.nodes" \
    | jq 'map(del(.authors) | del(.tags) | del(.saveStatus) | del(.acl) | del(.editUrl) | del(.imageUrl) | del(.articleContent) | del(.id))' \
    > "dz_$fname.json"
}

for i in {1..15}
do
  get_articles "$i"
done
# some of those pages are empty, delete 'em
find . -name "dz_*" -size -10c -delete
# make one big list, and prettify so diff is small.
cat dz_*.json | tr '\n' ' ' | sed 's/\] \[/,/g' | jshon -S > dzone.json
# delete workfiles
find . -name "dz_*" -delete
Comments formerly in Disqus, but exported and mounted statically ...
Tue, 16 Dec 2014 | Andrew Thorburn
While that's definitely one way of doing it, would it not have been simpler to see what the more button was doing? In this case, it loads the data from the URL http://dzone.com/profile/ac... (where the offset is the number displayed so far), and just done RESTful calls until it returns an empty JSON array?

Tue, 16 Dec 2014 | paul_hammant
Yup, that would have been better. I'd previously thought it was a JSONP response. Pairing pays off huh? :)