Chris Umbel

HTML Parsing with Ruby and Nokogiri

Parsing HTML is a somewhat annoying task that programmers get handed from time to time. Activities such as screen-scraping have become rarer since the advent of RSS, but there's always content out there that leaves you no choice but to parse it out yourself.

One of the more elegant tools I've seen for this purpose is Nokogiri, a Ruby library that supports querying HTML content with both XPath and CSS selector syntax.

XPath

First I'll demonstrate how to pull some content out of a page with XPath. This code uses the Ruby documentation for the Bignum class as its input and extracts the method names.

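require 'nokogiri'
require 'open-uri'

# Fetch and parse the page. URI.open comes from open-uri
# (Kernel#open no longer accepts URLs as of Ruby 3.0).
doc = Nokogiri::HTML(URI.open('http://www.ruby-doc.org/core/classes/Bignum.html'))

# Visit every span whose class attribute is exactly "method-name"
doc.xpath('//span[@class="method-name"]').each do |method_span|
  puts method_span.content  # inner text
  puts method_span.path     # absolute XPath of the node
  puts
end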

The code above iterates over the set of Node objects representing every span tag whose class attribute is "method-name". For each one it prints the inner text and the absolute XPath via the "content" and "path" methods, respectively. Below is a sample of the output:

power!
/html/body/div[3]/div/div[24]/div[1]/span[1]

big.quo(numeric) => float
/html/body/div[3]/div/div[25]/div[1]/a/span

quo
/html/body/div[3]/div/div[26]/div[1]/a/span[1]

rdiv
/html/body/div[3]/div/div[27]/div[1]/span[1]

big.remainder(numeric)    => number
/html/body/div[3]/div/div[28]/div[1]/a/span

rpower
/html/body/div[3]/div/div[29]/div[1]/a/span[1]

CSS

Nokogiri also supports querying with CSS selector syntax. The following example iterates over every link that displays a JavaScript popup (that is, every link with an "onclick" attribute) in the Bignum document used above and outputs its absolute CSS selector path and the text of the "onclick" attribute.

doc.css('a[onclick]').each do | popup_link |
  puts popup_link.css_path
  puts popup_link.attributes['onclick']
end

Practical

A real-life use of this library, and of HTML parsing in general, is Anemone, a web spidering framework for Ruby. Like most things in Ruby, it's programmer-friendly and delivers quite a bit of power without much work.

The following Anemone example uses Nokogiri under the covers to crawl all links on this site and print out the URLs of articles.

require 'anemone'

# crawl this site, starting from the home page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
    puts "#{page.url} indexed."
  end
end

Also, the Webrat DSL (commonly used with the Cucumber acceptance testing framework for web applications) employs Nokogiri.

Conclusion

While the need for screen-scraping and HTML parsing has diminished over time, it still exists. It's nice to know that when we do have to do it, libraries like Nokogiri make the process simple.

Sun Jul 12 2009 11:07:11 GMT+0000 (UTC)

Comments
Not to be left out in the cold, HTML Agility Pack provides similar functionality for .NET developers.

http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=272
by PalmerEk on Mon Jul 13 2009 08:07:39 GMT+0000 (UTC)
Looks like good stuff!
by chrisumbel on Mon Jul 13 2009 17:07:34 GMT+0000 (UTC)
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.
by kollagen on Tue Dec 08 2009 12:12:27 GMT+0000 (UTC)
The aforementioned tool, SelectorGadget, can be found at http://www.selectorgadget.com/. It allows you to easily get a minimal CSS selector for any item on a page by clicking on it. It's nice!
by chrisumbel on Wed Dec 09 2009 07:12:41 GMT+0000 (UTC)