Scraping DZone Syndication Stats
Update (Sept 2015): scroll to the bottom of the page for a better script.
DZone syndicate some of my blog entries. I used to keep a list of those by hand, but it easily gets out of step, and it is a manual process anyway, so why not automate it?
The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:
@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
    navigate().to("http://dzone.com/users/paulhammant")

    // keep clicking the "More" button until it disappears
    def articleElements = []
    while (true) {
        def moreButton = By.xpath("//button[@class='more-button']")
        try {
            // the button may arrive in the page slowly - time out after 3 secs
            new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
            findElement(moreButton).click()
        } catch (TimeoutException e) {
            break
        } finally {
            articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
        }
    }

    def results = new Expando()
    results.from = null
    results.to = null
    List articles = new ArrayList()

    // turn each article element (and its child elements) into a POJO
    articleElements.each { link ->
        def divs = link.findElements(By.tagName("div"))
        def article = new Expando()
        divs.each { div ->
            // the article and its stats are sibling elements, not parent/child
            if (div.getAttribute("class").contains("stream-article")) {
                def a = div.findElement(By.tagName("a"))
                article.title = a.getText()
                article.url = a.getAttribute("href")
            }
            if (div.getAttribute("class").contains("activity-stats-group")) {
                def divs2 = div.findElements(By.tagName("div"))
                divs2.each { div2 ->
                    def text = div2.getText()
                    if (text.endsWith("VIEWS")) {
                        article.views = Integer.parseInt(text.replace("VIEWS", "").replace(",", "").trim())
                    }
                    if (text.endsWith("COMMENTS")) {
                        article.comments = Integer.parseInt(text.replace("COMMENTS", "").replace(",", "").trim())
                    }
                    if (text.startsWith("on ")) {
                        article.date = Date.parse("MMM dd, yyyy", text.substring(text.indexOf("|") + 1, text.indexOf(".")).trim())
                    }
                }
            }
        }
        articles.add(article)
        if (results.from == null || article.date < results.from) {
            results.from = article.date
        }
        if (results.to == null || article.date > results.to) {
            results.to = article.date
        }
    }

    results.articles = articles.reverse() // oldest first, so JSON diffs stay small
    results.when = new Date()

    new File("DZone.json").withWriter { out ->
        out.write(new JsonBuilder(results).toPrettyString())
    }

    quit()
}
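For reference, the resulting DZone.json is shaped something like this (a sketch based on the fields the script gathers; the values are made up):

{
    "from": "2014-01-07T00:00:00+0000",
    "to": "2014-12-01T00:00:00+0000",
    "articles": [
        {
            "title": "Some Syndicated Article",
            "url": "http://dzone.com/articles/some-syndicated-article",
            "views": 1234,
            "comments": 2,
            "date": "2014-12-01T00:00:00+0000"
        }
    ],
    "when": "2014-12-16T09:00:00+0000"
}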
The JSON is consumed by AngularJS, and you can see it at https://paulhammant.com/dzone.html. Given I have used AngularJS, the page won’t be indexed by search crawlers presently. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes change the titles, I note.
The JSON is committed to GitHub, so I can watch changes over time from the comfort of an armchair.
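Something like the following is enough to eyeball that drift (a minimal sketch, assuming a checkout of the repository with DZone.json committed at its root):

# full history of the stats file, patch by patch
git log --follow -p -- DZone.json

# or just what changed since the last scrape was committed
git diff HEAD~1 -- DZone.json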
Perhaps this is interesting only to people whose blogs are aggregated into DZone, and only until the DZone people make a proper feed. Of course, they may have that already and I missed it.
Sept 2015 - new DZone site, new script
This one is faster, and uses bash and wget against an API rather than Groovy and Selenium:
#!/bin/bash

get_articles() {
  author=1008633
  fname=$(printf "%02d" $1)
  wget -qO- "https://dzone.com/services/widget/article-listV2/list?author=$author&page=$1&portal=all&sort=newest" \
    | jshon \
    | jq ".result.data.nodes" \
    | jq 'map(del(.authors) | del(.tags) | del(.saveStatus) | del(.acl) | del(.editUrl) | del(.imageUrl) | del(.articleContent) | del(.id))' \
    > "dz_$fname.json"
}

for i in {1..15}
do
  get_articles "$i"
done
# some of those pages are empty, delete 'em
find . -name "dz_*" -size -10c -delete
# make one big list, and prettify so diff is small.
cat dz_*.json | tr '\n' ' ' | sed 's/\] \[/,/g' | jshon -S > dzone.json
# delete workfiles
find . -name "dz_*" -delete
Comments formerly in Disqus, but exported and mounted statically ...
Tue, 16 Dec 2014 | Andrew Thorburn
While that's definitely one way of doing it, would it not have been simpler to see what the more button was doing? In this case, it loads the data from the URL http://dzone.com/profile/ac... (where the offset is the number displayed so far), and just done RESTful calls until it returns an empty JSON array?

Tue, 16 Dec 2014 | paul_hammant
Yup, that would have been better. I'd previously thought it was a JSONP response. Pairing pays off huh? :)