Update (Sept 2015): scroll to the bottom of the page for better script.

DZone syndicate some of my blog entries. I used to keep a list by hand, but it is easy to get out of step that way. That’s a manual process too, so why not automate it?

The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:

@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
 
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.WebDriver.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
  navigate().to("http://dzone.com/users/paulhammant")

  // keep clicking "More" button until it disappears

  def articleElements = []
  while (true) {
    def moreButton = By.xpath("//button[@class='more-button']")
    try {
      // button may arrive in page slowly. timeout after 3 secs
      new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
      findElement(moreButton).click()
    } catch(TimeoutException) {
      break
    } finally {
      articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
    }
  }

  def results = new Expando()
  results.from = null
  results.to = null
  List articles = new ArrayList()
  
  // Turn each article element (and child elements) into POJO
  
  articleElements.each() { link ->
    def divs = link.findElements(By.tagName("div"))
    def article = new Expando()

    divs.each() { div ->
      // article and stats for it are sibling elements, not parent/child.
      if (div.getAttribute("class").contains("stream-article")) {
        def a = div.findElement(By.tagName("a"))
        article.title = a.getText()        
        article.url = a.getAttribute("href")
      }
      if (div.getAttribute("class").contains("activity-stats-group")) {
        def divs2 = div.findElements(By.tagName("div"))
        divs2.each() { div2 ->
          def text = div2.getText()
          if (text.endsWith("VIEWS")) {
            article.views = Integer.parseInt(text.replace("VIEWS","").replace(",","").trim())    
          }
          if (text.endsWith("COMMENTS")) {
            article.comments = Integer.parseInt(text.replace("COMMENTS","").replace(",","").trim())
          }
          if (text.startsWith("on ")) {
            article.date = new Date().parse("MMM dd, yyyy", text.substring(text.indexOf("|")+1, text.indexOf(".")).trim())
          }
        }
      }
    }
    articles.add(article)
    if (results.from == null || article.date < results.from) {
       results.from = article.date
    }
    if (results.to == null || article.date > results.to) {
       results.to = article.date
    }
  }
  results.articles = articles.reverse() // JSON diffs look better.
  results.when = new Date()
  new File("DZone.json").withWriter { out ->
    out.write(new JsonBuilder(results).toPrettyString())
  }
  quit()
}

The JSON is consumed by AngularJS, and you can see it at https://paulhammant.com/dzone.html. Given I have used AngularJS, the page won’t be indexed by search crawlers presently. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes change the titles, I note.

The JSON is committed to Github, and I can watch changes over time from the comfort of an armchair:

Perhaps this is interesting only to people whose blogs are aggregated into DZone, and until the DZone people make a proper feed. Of course they may have that already, and I missed it.

Sept 2015 - new dzone site, new script

This one is faster now, and in bash with wget against an API rather than Groovy/Selenium:

#!/bin/bash

get_articles() {
  author=1008633
  fname=$(printf "%02d" $1)
    wget -qO- "https://dzone.com/services/widget/article-listV2/list?author=$author&page=$1&portal=all&sort=newest" | jshon | jq ".result.data.nodes" | jq 'map(del (.authors) | del (.tags) | del (.saveStatus) | del (.acl) | del (.editUrl) | del (.imageUrl) | del(.articleContent) | del(.id))' > "dz_$fname.json"
}

for i in {1..15}
do
    get_articles "$i"
done

# some of those are empty, delete 'em'
find . -name "dz_*" -size -10c -delete

# make one big list, and prettify so diff is small.
cat dz_*.json |  tr '\n' ' ' | sed 's/\] \[/,/g' | jshon -S > dzone.json

# delete workfiles
find . -name "dz_*" -delete

← Previous Archive Next →

Published

December 15^th, 2014

Reads:

Paul Hammant's Blog: Scraping DZone Syndication Stats

Sept 2015 - new dzone site, new script

Published