Chris Umbel

HTML Parsing with Ruby and Nokogiri

Parsing HTML is a recurring and somewhat annoying task that programmers get handed from time to time. Activities such as screen-scraping have become rarer since the advent of RSS, but there's always content out there that leaves you no choice but to parse it out yourself.

One of the more elegant tools I've seen for this purpose is Nokogiri, a Ruby library that supports querying HTML content with both XPath and CSS selector syntax.


First I'll demonstrate how to parse some content out of a page via the XPath syntax. This code uses the Ruby documentation for the Bignum class as its parsing target and extracts the method names.

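require 'nokogiri'
require 'open-uri'

# fetch and parse the page (Ruby 3.0 and later need URI.open rather than open)
doc = Nokogiri::HTML(URI.open(''))

# visit every span whose class attribute is exactly "method-name"
doc.xpath('//span[@class="method-name"]').each do |method_span|
  puts method_span.content
  puts method_span.path
end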

The above code iterates through a set of Node objects representing every span tag whose class attribute is "method-name". It prints each node's inner text and absolute XPath via the "content" and "path" methods respectively. Below is a sample of the output:


big.quo(numeric) => float



big.remainder(numeric)    => number



Nokogiri also supports querying by way of CSS selector syntax. The following example iterates over every link that displays a JavaScript popup in the Bignum document used above and outputs its absolute CSS selector path and the text of its "onclick" attribute.

# select every anchor that has an onclick attribute
doc.css('a[onclick]').each do |popup_link|
  puts popup_link.css_path
  puts popup_link.attributes['onclick']
end


A real-life use of this library, and of HTML parsing in general, is Anemone, a web spidering framework for Ruby. Like most things in Ruby, it's programmer friendly and delivers quite a bit of power without much work.

The following Anemone example uses Nokogiri under the covers to crawl all links on this site and print out the URLs of articles.

require 'anemone'
require 'open-uri'

# crawl this page
Anemone.crawl("") do |anemone|
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do |page|
    puts "#{page.url} indexed."
  end
end
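To see which pages that pattern would select, the regular expression can be checked on its own with plain Ruby. The URLs below are hypothetical, and the sketch assumes the pattern is tested against each page's full URL string.

```ruby
# The same pattern passed to on_pages_like above.
ARTICLE_PATTERN = /article\/[^?]*$/

# Hypothetical URLs a crawl might encounter.
sample_urls = [
  'http://example.com/article/html-parsing',
  'http://example.com/article/html-parsing?page=2',
  'http://example.com/about'
]

# Keep only the URLs that contain "article/" with no query string after it.
article_urls = sample_urls.select { |url| url.match?(ARTICLE_PATTERN) }

puts article_urls
```

The [^?]*$ portion is what rejects the second URL: once a "?" appears after "article/", the match can no longer reach the end of the string.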

Also, the Webrat DSL (commonly used with the Cucumber acceptance testing framework for web applications) employs Nokogiri.


While the need for screen-scraping and HTML parsing has diminished over time, it still exists. It's nice to know that when we do have to do it, libraries like Nokogiri make the process simple.

Sun Jul 12 2009 11:07:11 GMT+0000 (UTC)

Comments
Not to be left out in the cold, HTML Agility Pack provides similar functionality for .NET developers.
by PalmerEk on Mon Jul 13 2009 08:07:39 GMT+0000 (UTC)
Looks like good stuff!
by chrisumbel on Mon Jul 13 2009 17:07:34 GMT+0000 (UTC)
I don't think I'll be using ScreenScraping any time soon in my apps, however SelectorGadget looks like a great tool which may come in handy for me at some point.
by kollagen on Tue Dec 08 2009 12:12:27 GMT+0000 (UTC)
The aforementioned tool, SelectorGadget can be found at  It allows you to easily get a minimal CSS selector of any item on a page by clicking on it.  It's nice!
by chrisumbel on Wed Dec 09 2009 07:12:41 GMT+0000 (UTC)