Chris Umbel

Full-Text Indexing in Ruby Using Ferret

Few things are more useful than a good full-text search. It's often the quickest way for users to drill down to the content they want. It's also easy for the Ruby programmer to implement thanks to Ferret, an Apache Lucene-inspired search engine library.

Building an Index

The first step to implementing a search is to get an index built. The following code illustrates creating an index with two documents in it.

require 'ferret'
include Ferret

# get or create an index on the filesystem
index = Index::Index.new(:path => './test.idx')

# store a document
index << {
  :title => 'A Cool Article',
  :content => 'Penguins are cool.'
}

# store another document
index << {
  :title => 'A Hot Article',
  :content => 'Volcanoes are hot'
}

Querying the Index

Now that the index is built, it's ready to be queried. The following code searches the index for documents with the word "hot" in the content field.

# search the index for the word hot in the content field
index.search_each('content: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

The search_each method yields the id and score of each matching document. Check out the output:

SCORE: 0.625	TITLE: A Hot Article

All fields can be searched at once by using an asterisk in place of the field name:

# search the index for the word hot on all fields
index.search_each('*: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
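To make the idea of a field-scoped lookup concrete, here is a toy inverted index in plain Ruby. It is only a conceptual sketch and not how Ferret works internally: the ToyIndex class and its search method are hypothetical names, and there is no scoring, no analyzer and no query parser. It simply maps each (field, term) pair to the ids of the documents that contain it, which is enough to mimic the two queries above.

```ruby
# Toy inverted index: an illustration only, NOT Ferret's implementation.
class ToyIndex
  def initialize
    @docs = []                                 # stored documents; position is the id
    @postings = Hash.new { |h, k| h[k] = [] }  # [field, term] => list of doc ids
  end

  # Store a document (a hash of field => string) and index each of its terms.
  def <<(doc)
    id = @docs.length
    @docs << doc
    doc.each do |field, text|
      # Naive tokenizer: lowercase, split on non-word characters.
      text.to_s.downcase.scan(/\w+/).uniq.each do |term|
        @postings[[field, term]] << id
      end
    end
    self
  end

  # Fetch a stored document by id.
  def [](id)
    @docs[id]
  end

  # Ids of documents containing term in the given field,
  # or in any field when field is :*.
  def search(field, term)
    term = term.downcase
    if field == :*
      @postings.select { |(_field, t), _ids| t == term }.values.flatten.uniq.sort
    else
      @postings.fetch([field, term], []).dup
    end
  end
end

toy = ToyIndex.new
toy << { :title => 'A Cool Article', :content => 'Penguins are cool.' }
toy << { :title => 'A Hot Article',  :content => 'Volcanoes are hot' }

# Field-scoped lookup, then a lookup across every field.
toy.search(:content, 'hot').each { |id| puts "TITLE: #{toy[id][:title]}" }
toy.search(:*, 'article').each   { |id| puts "TITLE: #{toy[id][:title]}" }
```

The first lookup prints only the hot article, while the second matches both documents because the word appears in each title.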

One of the more useful features, especially in a web scenario, is highlighting the matched words. Index's highlight method makes this trivial. Consider the following code, which wraps matching terms in strong tags.

# search the index for the word hot
index.search_each('content: "hot"') do | id, score |
  # put highlights into a copy of the content field
  # by way of <strong> HTML tags
  highlights = index.highlight('content: "hot"',
    id,
    :field => :content,
    :pre_tag => "<strong>",
    :post_tag => "</strong>")
  
  puts highlights
end

Producing:

Volcanoes are <strong>hot</strong>
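For intuition about what the :pre_tag and :post_tag options do, the same wrapping can be approximated in plain Ruby for a single term. This is a rough sketch, not Ferret's highlighter: toy_highlight is a hypothetical helper that does a case-insensitive whole-word match with a regular expression and knows nothing about the query syntax or the index.

```ruby
# Hypothetical helper, not part of Ferret: wrap every whole-word,
# case-insensitive occurrence of term in pre_tag/post_tag.
def toy_highlight(text, term, pre_tag = '<strong>', post_tag = '</strong>')
  pattern = /\b#{Regexp.escape(term)}\b/i
  text.gsub(pattern) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

puts toy_highlight('Volcanoes are hot', 'hot')
```

Because the match is anchored on word boundaries, a word such as "hotel" is left alone. Note that the sketch does not HTML-escape the text, so it should not be pointed at untrusted content as-is.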

It's also possible to use Ferret as a more general-purpose data store. The following code indexes a handful of companies and returns those that have a market cap over ten billion dollars and contain the word grocery in any field. Note that field values are indexed as strings, so numbers must be padded with leading zeros to a fixed width for range comparisons to order them correctly.

index << {
  :ticker => 'GOOG',
  :name => 'Google Inc',
  :market_cap => '183000000000',
  :description => 'indexes websites and generates revenue through advertising'
}

index << {
  :ticker => 'JNJ',
  :name => 'Johnson & Johnson',
  :market_cap => '173000000000',
  :description => 'makes drugs, healthcare products and equipment'
}

index << {
  :ticker => 'WFMI',
  :name => 'Whole Foods Market, Inc',
  :market_cap => '003000000000',
  :description => 'operates organic grocery stores'
}

index << {
  :ticker => 'KR',
  :name => 'The Kroger Co.',
  :market_cap => '014000000000',
  :description => 'operates grocery stores and other retail establishments'
}

# search the index
index.search_each('market_cap:(> 010000000000) AND *:(grocery)') do | id, score |
    puts "TICKER: #{index[id][:ticker]}"
end

Resulting in:

TICKER: KR
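The zero-padding is easy to get wrong by hand, so it can help to generate it. The sketch below is plain Ruby and independent of Ferret: pad_number and PAD_WIDTH are hypothetical names, and the width of 12 digits is an assumption chosen to match the values used above. It shows why unpadded strings compare incorrectly and how fixed-width padding makes string order agree with numeric order.

```ruby
# Assumption: 12 digits is wide enough for the largest value to be indexed.
PAD_WIDTH = 12

# Hypothetical helper: left-pad a non-negative integer with zeros to a fixed
# width so that comparing the strings agrees with comparing the numbers.
def pad_number(value, width = PAD_WIDTH)
  number = value.to_i
  raise ArgumentError, "#{value} is negative" if number < 0
  padded = number.to_s.rjust(width, '0')
  raise ArgumentError, "#{value} needs more than #{width} digits" if padded.length > width
  padded
end

# Unpadded strings compare character by character, so 3 billion
# wrongly appears larger than 14 billion.
puts '3000000000' > '14000000000'

caps = {
  'GOOG' => 183_000_000_000,
  'WFMI' => 3_000_000_000,
  'KR'   => 14_000_000_000
}

# With padding, plain string comparison gives the numerically correct answer.
threshold = pad_number(10_000_000_000)
over = caps.select { |_ticker, cap| pad_number(cap) > threshold }.keys
puts over.inspect
```

The same helper would be used on both sides: once when building the market_cap field for each document, and once when building the bound that goes into the query string.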

A Practical Example

That's all fine and good, but how would it be used in real life? To illustrate a popular use case, I'll implement a simple application that spiders this site using Anemone and stores its data in Ferret.

require 'anemone'
require 'ferret'
require 'open-uri'

include Ferret

index = Index::Index.new(:path => './chrisumbel.idx')

# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
      # store the page in the index
      index << {
        :url => page.url,
        :title => page.doc.at('title').text,
        :content => page.doc.css('div.content_piece').text
      }
      
      puts "#{page.url} indexed."
  end
end

# search the index for articles containing either ruby or python
# (boolean operators in Ferret's query language are upper-case)
index.search_each('content:(ruby OR python)') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

Integration with Rails

Naturally, this is the kind of thing that would be handy to use as part of your model in Rails. That's made trivial by the ActsAsFerret Rails plugin.

Just decorate your model accordingly and the fields you specify will be indexed in Ferret:

class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :content]
end

It can then be queried thusly from your controller:

class ArticlesController < ApplicationController
  def search
    # use the params accessor; the @params instance variable is no longer available in Rails 2.x
    @articles = Article.find_with_ferret(params[:search_string])
  end
end

Sat Nov 28 2009 16:00:00 GMT+0000 (UTC)
