Chris Umbel

HTML Parsing with Ruby and Nokogiri

Parsing HTML is a somewhat annoying task that programmers are handed from time to time. Activities such as screen-scraping have become rarer since the advent of RSS, but there's always content out there that you have to get at, leaving you no choice but to parse it out yourself.

One of the more elegant tools I've seen for this purpose is Nokogiri, a Ruby library that supports querying HTML content with both XPath and CSS selector syntax.

XPath

First I'll demonstrate how to parse some content out of a page with XPath. This code uses the Ruby documentation for the Bignum class as its input and extracts the method names.

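require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.ruby-doc.org/core/classes/Bignum.html'))

# print the text and absolute XPath of each method-name span
doc.xpath('//span[@class="method-name"]').each do |method_span|
  puts method_span.content
  puts method_span.path
  puts
end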

The code above iterates over a set of Node objects, one for every span element whose class attribute is exactly "method-name". For each node it prints the inner text and the absolute XPath via the "content" and "path" methods respectively. Below is a sample of the output:

power!
/html/body/div[3]/div/div[24]/div[1]/span[1]

big.quo(numeric) => float
/html/body/div[3]/div/div[25]/div[1]/a/span

quo
/html/body/div[3]/div/div[26]/div[1]/a/span[1]

rdiv
/html/body/div[3]/div/div[27]/div[1]/span[1]

big.remainder(numeric)    => number
/html/body/div[3]/div/div[28]/div[1]/a/span

rpower
/html/body/div[3]/div/div[29]/div[1]/a/span[1]

CSS

Nokogiri also supports querying with CSS selector syntax. The following example iterates over every link in the Bignum document used above that has an "onclick" attribute (these links display a JavaScript popup) and outputs its absolute CSS selector path and the text of the "onclick" attribute.

doc.css('a[onclick]').each do | popup_link |
  puts popup_link.css_path
  puts popup_link.attributes['onclick']
end

Practical

A real-life use of this library, and of HTML parsing in general, is Anemone, a web spidering framework for Ruby. Like most things in Ruby, it's programmer-friendly and delivers quite a bit of power without much work.

The following Anemone example uses Nokogiri under the covers to crawl all links on this site and print out the URLs of articles.

require 'anemone'
require 'open-uri'

# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
    puts "#{page.url} indexed."
  end
end
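
The regular expression does the filtering in the example above, so it can help to check it on its own before starting a crawl. This sketch assumes the pattern is tested against the full URL string; the sample URLs are hypothetical and only illustrate the three cases.

```ruby
# Same pattern as in the crawl above: "article/" followed by anything
# except a "?" up to the end of the string.
pattern = /article\/[^?]*$/

# Hypothetical URLs, not real pages.
urls = [
  'http://www.chrisumbel.com/article/ruby_nokogiri',
  'http://www.chrisumbel.com/article/ruby_nokogiri?page=2',
  'http://www.chrisumbel.com/about'
]

# Keep only the URLs the pattern accepts.
matches = urls.select { |url| url =~ pattern }

puts matches
```

Only the first URL is kept: the second contains a query string, which [^?]* refuses to cross, and the third is outside the article directory.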

Also, the Webrat DSL (often used together with the Cucumber acceptance testing framework) employs Nokogiri.

Conclusion

While the need for screen-scraping and HTML parsing has diminished over time, it still exists. It's nice to know that when we do have to do it, libraries like Nokogiri make the process simple.

Sun Jul 12 2009 11:07:11 GMT+0000 (UTC)

4 Comments
Not to be left out in the cold, HTML Agility Pack provides similar functionality for .NET developers.

http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=272
by PalmerEk on Mon Jul 13 2009 08:07:39 GMT+0000 (UTC)
Looks like good stuff!
by chrisumbel on Mon Jul 13 2009 17:07:34 GMT+0000 (UTC)
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.
by kollagen on Tue Dec 08 2009 12:12:27 GMT+0000 (UTC)
The aforementioned tool, SelectorGadget can be found at http://www.selectorgadget.com/.  It allows you to easily get a minimal CSS selector of any item on a page by clicking on it.  It's nice!
by chrisumbel on Wed Dec 09 2009 07:12:41 GMT+0000 (UTC)