Parsing HTML is a common and somewhat tedious task that programmers face from time to time. Activities such as screen-scraping have become rarer since the advent of RSS, but there's always
content out there that leaves you no choice but to parse it out yourself.
One of the more elegant tools I've seen for this purpose is Nokogiri, a Ruby library that supports querying HTML content with both XPath and CSS selector syntax.
XPath
First I'll demonstrate how to parse some content out of a page via the XPath syntax. This code uses the Ruby documentation for the Bignum class as sample input and extracts the method names.
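require 'nokogiri'
require 'open-uri'

# fetch the page (URI.open is provided by open-uri) and parse it
doc = Nokogiri::HTML(URI.open('http://www.ruby-doc.org/core/classes/Bignum.html'))

# select every span whose class attribute is exactly "method-name"
doc.xpath('//span[@class="method-name"]').each do |method_span|
  puts method_span.content
  puts method_span.path
  puts
end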
The above code simply iterates through a set of Node objects that represent every span element whose class attribute is "method-name". It prints out the inner text and absolute XPath via the "content" and "path" methods respectively. Below is a sample of the output:
power!
/html/body/div[3]/div/div[24]/div[1]/span[1]

big.quo(numeric) => float
/html/body/div[3]/div/div[25]/div[1]/a/span

quo
/html/body/div[3]/div/div[26]/div[1]/a/span[1]

rdiv
/html/body/div[3]/div/div[27]/div[1]/span[1]

big.remainder(numeric) => number
/html/body/div[3]/div/div[28]/div[1]/a/span

rpower
/html/body/div[3]/div/div[29]/div[1]/a/span[1]
CSS
Nokogiri also supports querying by way of CSS selector syntax. The following example iterates over every link that displays a JavaScript popup in the Bignum document used above and outputs its absolute CSS selector path and the text of the "onclick" attribute.
doc.css('a[onclick]').each do |popup_link|
  puts popup_link.css_path
  puts popup_link.attributes['onclick']
end
Practical
A real-life use of this library, and of HTML parsing in general, is Anemone, a web-spidering framework for Ruby. Like most things in Ruby, it's programmer-friendly and delivers quite a bit of power without much work.
The following Anemone example uses Nokogiri under the covers to crawl all links on this site and print out the URLs of articles.
require 'anemone'
require 'open-uri'

# crawl this site, starting from the home page
Anemone.crawl("http://www.chrisumbel.com") do |anemone|
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do |page|
    puts "#{page.url} indexed."
  end
end
Also, Webrat, a DSL often used with the Cucumber framework for web acceptance testing, employs Nokogiri.
Conclusion
While the need for screen-scraping and HTML parsing has diminished over time, it still exists. It's nice to know that when we do have to do it, libraries like Nokogiri make the process simple.
Sun Jul 12 2009 11:07:11 GMT+0000 (UTC)
Not to be left out in the cold, HTML Agility Pack provides similar functionality for .NET developers. http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=272 by PalmerEk on Mon Jul 13 2009 08:07:39 GMT+0000 (UTC)
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point. by kollagen on Tue Dec 08 2009 12:12:27 GMT+0000 (UTC)
The aforementioned tool, SelectorGadget, can be found at http://www.selectorgadget.com/. It allows you to easily get a minimal CSS selector of any item on a page by clicking on it. It's nice! by chrisumbel on Wed Dec 09 2009 07:12:41 GMT+0000 (UTC)