Few things are more useful than a good full-text search. It's clearly the easiest way for users to actively drill down into the content they want. It's also quite easy for the Ruby programmer to implement thanks to Ferret, an Apache Lucene-inspired search engine library.
Building an Index
The first step to implementing a search is to get an index built. The following code illustrates creating an index with two documents in it.
require 'ferret'
include Ferret
# get or create an index on the filesystem
index = Index::Index.new(:path => './test.idx')
# store a document
index << {
  :title => 'A Cool Article',
  :content => 'Penguins are cool.'
}
# store another document
index << {
  :title => 'A Hot Article',
  :content => 'Volcanoes are hot'
}
Querying the Index
Now that the index is built, it's ready to be queried. The following code searches the index for documents with the word "hot" in the content field.
# search the index for the word hot in the content field
index.search_each('content: "hot"') do |id, score|
  puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
The search_each method yields the id of the matching documents and their scores. Check out the output:
SCORE: 0.625 TITLE: A Hot Article
All fields can also be searched at once by using an asterisk in place of the field name:
# search the index for the word hot on all fields
index.search_each('*: "hot"') do |id, score|
  puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
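Since the queries above are plain strings, an application will usually assemble them from user input. Here is a minimal sketch of that in plain Ruby. Note that field_query is a hypothetical helper and not part of Ferret, and that it only strips double quotes; it makes no attempt to handle the other special characters in Ferret's query syntax.

```ruby
# Hypothetical helper (not part of Ferret): builds a query string in the
# same 'field: "term"' shape used in the examples above.
def field_query(term, field = '*')
  # Drop double quotes so the term cannot close the quoted phrase early.
  safe_term = term.to_s.delete('"').strip
  %(#{field}: "#{safe_term}")
end

puts field_query('hot', :content)   # content: "hot"
puts field_query('hot')             # *: "hot"
```

The resulting string can be handed to index.search_each just like the literal queries above.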
One of the more useful features, especially in a web scenario, is highlighting the matched words. This is made trivial by Index's highlight method. Consider the following code, which wraps matching terms in strong tags.
# search the index for the word hot
index.search_each('content: "hot"') do |id, score|
  # put highlights into a copy of the content field
  # by way of <strong> HTML tags
  highlights = index.highlight('content: "hot"',
                               id,
                               :field => :content,
                               :pre_tag => "<strong>",
                               :post_tag => "</strong>")
  puts highlights
end
Producing:
Volcanoes are <strong>hot</strong>
It's also possible to use Ferret as a more general-purpose data store. The following code creates an index of companies and returns those with a market cap over ten billion dollars whose fields contain the word "grocery". Note that the index is built from strings, so numbers must be padded with leading zeros to a fixed width for range comparisons to order them correctly.
index << {
  :ticker => 'GOOG',
  :name => 'Google Inc',
  :market_cap => '183000000000',
  :description => 'indexes websites and generates revenue through advertising'
}
index << {
  :ticker => 'JNJ',
  :name => 'Johnson & Johnson',
  :market_cap => '173000000000',
  :description => 'makes drugs, healthcare products and equipment'
}
index << {
  :ticker => 'WFMI',
  :name => 'Whole Foods Market, Inc',
  :market_cap => '003000000000',
  :description => 'operates organic grocery stores'
}
index << {
  :ticker => 'KR',
  :name => 'The Kroger Co.',
  :market_cap => '014000000000',
  :description => 'operates grocery stores and other retail establishments'
}
# search the index
index.search_each('market_cap:(> 010000000000) AND *:(grocery)') do |id, score|
  puts "TICKER: #{index[id][:ticker]}"
end
Resulting in:
TICKER: KR
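To see why the padding matters, here is a small plain-Ruby sketch of the string-ordering problem. It does not touch Ferret at all; pad_number is a hypothetical helper, and the width of 12 is simply the one used by the market_cap values above.

```ruby
# Hypothetical helper: left-pad a non-negative integer with zeros
# so that every value has the same number of characters.
def pad_number(value, width = 12)
  value.to_s.rjust(width, '0')
end

caps = { 'WFMI' => 3_000_000_000, 'KR' => 14_000_000_000 }

# Unpadded strings compare character by character, so "3..." sorts
# after "1..." even though 3 billion is less than 14 billion.
puts '3000000000' > '14000000000'                        # true

# Padded to the same width, string order agrees with numeric order.
puts pad_number(caps['WFMI']) < pad_number(caps['KR'])   # true
puts pad_number(caps['WFMI'])                            # 003000000000
```

The width has to be at least as large as the longest value you expect to store, otherwise the ordering breaks again.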
A Practical Example
That's all fine and good, but how would it be used in real life? To illustrate a popular use case, I'll implement a simple application that spiders this site using Anemone and stores its data in Ferret.
require 'anemone'
require 'ferret'
require 'open-uri'
include Ferret
index = Index::Index.new(:path => './chrisumbel.idx')
# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do |anemone|
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do |page|
    # store the page in the index
    index << {
      :url => page.url,
      :title => page.doc.at('title').text,
      :content => page.doc.css('div.content_piece').text
    }
    puts "#{page.url} indexed."
  end
end
# search the index for articles with either ruby or python
index.search_each('content:(ruby OR python)') do |id, score|
puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
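The pattern passed to on_pages_like decides which pages end up in the index, so it's worth checking against a few URLs before starting a crawl. The sketch below does that in plain Ruby, on the assumption that the pattern is matched against the page's URL as a string. The URLs are made up for illustration.

```ruby
# The same pattern passed to on_pages_like in the crawler above.
ARTICLE_PATTERN = /article\/[^?]*$/

# Hypothetical URLs, invented for this example.
urls = [
  'http://www.chrisumbel.com/article/some_post',
  'http://www.chrisumbel.com/article/some_post?page=2',
  'http://www.chrisumbel.com/about'
]

urls.each do |url|
  # "article/" must be followed by non-"?" characters up to the end of the URL
  verdict = url.match?(ARTICLE_PATTERN) ? 'index' : 'skip'
  puts "#{verdict}\t#{url}"
end
```

Only the first URL is marked for indexing: the second carries a query string, which [^?]* refuses to cross, and the third is outside the article directory.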
Integration with Rails
Naturally, this is the kind of stuff that would be handy to use as part of your model in Rails. That's made trivial with the ActsAsFerret Rails plugin.
Just decorate your model accordingly and the fields you specify will be indexed in Ferret:
class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :content]
end
It can then be queried thusly from your controller:
class ArticlesController < ApplicationController
  def search
    # use the params accessor; the @params instance variable was removed from Rails
    @articles = Article.find_with_ferret(params['search_string'])
  end
end
Sat Nov 28 2009 16:00:00 GMT+0000 (UTC)