Few things are more useful than a good full-text search. It's clearly the easiest way for users to actively drill down into the content they want. It's also quite easy for the Ruby programmer to implement thanks to Ferret, an Apache Lucene-inspired search engine library.

Building an Index

The first step to implementing a search is to get an index built. The following code illustrates creating an index with two documents in it.

require 'ferret'
include Ferret

# get or create an index on the filesystem
index = Index::Index.new(:path => './test.idx')

# store a document
index << {
  :title => 'A Cool Article',
  :content => 'Penguins are cool.'
}

# store another document
index << {
  :title => 'A Hot Article',
  :content => 'Volcanoes are hot'
}

Querying the Index

Now that the index is built it's ready to be queried. The following code searches the index for documents with the word hot in the content field.

# search the index for the word hot in the content field
index.search_each('content: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

The search_each method yields the id and score of each matching document. Check out the output:

SCORE: 0.625	TITLE: A Hot Article

All fields can also be matched with an asterisk:

# search the index for the word hot on all fields
index.search_each('*: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

One of the more useful features, especially in a web scenario, is highlighting the matched words. This is made trivial by Index's highlight method. Consider the following code, which wraps matching terms in strong tags.

# search the index for the word hot
index.search_each('content: "hot"') do | id, score |
  # put highlights into a copy of the content field
  # by way of <strong> HTML tags
  highlights = index.highlight('content: "hot"',
    id,
    :field => :content,
    :pre_tag => "<strong>",
    :post_tag => "</strong>")
  
  puts highlights
end

Producing:

Volcanoes are <strong>hot</strong>

It's also possible to use Ferret as a more general purpose data store. The following code creates an index of companies and returns those with a market cap over ten billion dollars and the word grocery in any field. Note that the index is built from strings, so it's necessary to pad numbers with leading zeros to preserve their ordering.

index << {
  :ticker => 'GOOG',
  :name => 'Google Inc',
  :market_cap => '183000000000',
  :description => 'indexes websites and generates revenue through advertising'
}

index << {
  :ticker => 'JNJ',
  :name => 'Johnson & Johnson',
  :market_cap => '173000000000',
  :description => 'makes drugs, healthcare products and equipment'
}

index << {
  :ticker => 'WFMI',
  :name => 'Whole Foods Market, Inc',
  :market_cap => '003000000000',
  :description => 'operates organic grocery stores'
}

index << {
  :ticker => 'KR',
  :name => 'The Kroger Co.',
  :market_cap => '014000000000',
  :description => 'operates grocery stores and other retail establishments'
}

# search the index
index.search_each('market_cap:(> 010000000000) AND *:(grocery)') do | id, score |
    puts "TICKER: #{index[id][:ticker]}"
end

Resulting in:

TICKER: KR
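
The zero-padding point can be shown outside of Ferret. The following is a minimal Python sketch, not Ferret code; the pad helper and the 12-digit width are assumptions chosen to match the market_cap values above. It shows that unpadded numeric strings compare in the wrong order while fixed-width ones agree with numeric order.

```python
def pad(value, width=12):
    """Left-pad a non-negative integer with zeros to a fixed width
    (hypothetical helper; the width is an assumption)."""
    return str(value).zfill(width)

# Compared as plain strings, 3 billion sorts above 14 billion
# because '3' > '1' at the first character.
unpadded_wrong = '3000000000' > '14000000000'

# With a fixed width, string order agrees with numeric order.
padded_right = pad(3000000000) < pad(14000000000)

caps = [183000000000, 173000000000, 3000000000, 14000000000]
padded = [pad(c) for c in caps]
sorted_matches_numeric = sorted(padded) == [pad(c) for c in sorted(caps)]

print(unpadded_wrong, padded_right, sorted_matches_numeric)
```

The width has to be large enough for the biggest value you ever expect to store, since a longer number would break the ordering again.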

A Practical Example

That's all fine and good, but how would it be used in real life? To illustrate a popular use-case I'll implement a simple application that spiders this site using Anemone and stores its data in Ferret.

require 'anemone'
require 'ferret'
require 'open-uri'

include Ferret

index = Index::Index.new(:path => './chrisumbel.idx')

# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
      # store the page in the index
      index << {
        :url => page.url,
        :title => page.doc.at('title').text,
        :content => page.doc.css('div.content_piece').text
      }
      
      puts "#{page.url} indexed."
  end
end

# search the index for articles with either ruby or python
index.search_each('content:(ruby OR python)') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

Integration with Rails

Naturally this is the kind of stuff that would be handy to use as part of your model in Rails. That's made trivial with the ActsAsFerret Rails plugin.

Just decorate your model accordingly and the fields you specify will be indexed in Ferret:

class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :content]
end

It can then be queried thusly from your controller:

class ArticlesController < ApplicationController
  def search
    @articles = Article.find_with_ferret(params[:search_string])
  end
end
Created on 2009-11-28 16:00:00 UTC
 
I've recently deployed a django application on Google's AppEngine. I'm not sure how I've avoided it thus far but it seems to fit my needs relatively well. DataStore (AppEngine's data storage engine) really impressed me. The python API feels so much like django's ORM that there was practically zero learning curve for a chap like me.

One thing that disappointed me, however, was the state of the search facility. The built-in google.appengine.ext.search.SearchableModel suffers from many problems outlined all over the web, e.g. the need to create n indexes to handle n search terms, index creation failures, the inability to exclude properties and a lack of support for common search operations.

A brief web search for alternatives turned up a semi-commercial product and a few open source offerings but nothing that really piqued my interest.

So I figured, what the heck, I'll try to roll my own. It'll give me a chance to do some special tokenization, which will come in handy considering the corpus consists partly of source code. Even if I don't stick with it, it'll surely be a fun exercise.

Keep in mind that this post is mainly recounting my experience using a few surrogate examples along the way. I'm not necessarily sold on the approach yet myself.

The Plan

The method for building the indexes seemed straightforward enough. Tokenize the text of the fields I want to index, get the stems of the words, reduce the list to a unique set and store it in a StringListProperty. From there querying it will be a cakewalk, right?

After doing some home-brew tokenizing and stemming I really wasn't happy with the results (not that I expected to be). Sure, if I spent enough time with it and studied stemming algorithms I could have come up with something that wasn't too shabby but heck, I have sites to build!

That's when I remembered the Natural Language ToolKit for python. It greatly simplifies common text processing tasks like, you guessed it, stemming and tokenizing.

The Implementation

Now that I had a plan I had to put it in motion. The first step was getting hold of the Natural Language ToolKit, which can be downloaded here. I recommend installing from source because we'll need it later.

With NLTK installed I had to put it into my AppEngine project. It wasn't enough to just import the NLTK modules I wanted to use. I had to actually copy the code into my project to ensure that it got deployed to AppEngine along with my project. This is accomplished by copying the nltk directory, with only the subdirectories of the modules I needed (in this case stem and tokenize), into the application's root.

Then I had to replace __init__.py with a blank __init__.py in the nltk directory and its subdirectories. This was necessary to stop NLTK from doing the funky stuff it does upon initialization.

The Model

To continue I'll use a simple blog post model as an example-case. Its model definition follows.

from google.appengine.ext import db

class BlogPost(db.Model):
    title = db.StringProperty(multiline = False)
    content = db.TextProperty()
    created = db.DateTimeProperty(auto_now_add = True)
    # indexed_fields param specifies the other properties that will
    # be indexed
    words = FulltextIndexProperty(indexed_fields = ['title', 'content'])

Notice the words property of the FulltextIndexProperty type. FulltextIndexProperty is not built into AppEngine. I'll define that later. See how it specifies the names of the properties to include in the index via the indexed_fields parameter?

Enter NLTK

Now I'll perform some language processing in the model in a helper function.

import re

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer

def tokenize(text):
    """ break up some arbitrary text into tokens """
    # get rid of all punctuation
    rex = re.compile(r"[^\w\s]")
    text = rex.sub('', text)

    # create NLTK objects we'll need
    stemmer = PorterStemmer()
    tokenizer = WhitespaceTokenizer()

    # break text up into words
    tokens = tokenizer.tokenize(text)

    # get the stems of the words
    words = [stemmer.stem(token.lower()) for token in tokens]

    return words

The previous function uses the NLTK to tokenize and stem the words of supplied text. An example of stemming is converting the words "hashes", "hashing" and "hashed" to "hash". That way when a user searches for "hash", posts with the word "hashing" will be returned. That would also be a handy place to insert some custom tokenization.
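
To see why custom tokenization matters for source code, here is a minimal sketch of just the punctuation-stripping and splitting steps, with no NLTK and no stemming. The strip_and_split name is hypothetical, and str.split stands in for the WhitespaceTokenizer on the assumption that both simply split on whitespace.

```python
import re

# same pattern as the tokenize function above: drop anything that is
# neither a word character nor whitespace
PUNCTUATION = re.compile(r"[^\w\s]")

def strip_and_split(text):
    """Remove punctuation, lowercase and split on whitespace.
    Stemming is deliberately left out of this sketch."""
    return PUNCTUATION.sub('', text).lower().split()

plain = strip_and_split("Penguins are cool.")
code = strip_and_split("call index.search_each() now")

print(plain)
print(code)
```

Plain prose comes out fine, but index.search_each() collapses into the single token indexsearch_each because the dot and parentheses are removed rather than treated as separators. That's the sort of thing a custom tokenizer would need to handle.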

The Index

With that out of the way I'll define the FulltextIndexProperty type.

class FulltextIndexProperty(db.StringListProperty):
    """ Property that stores a full-text index of other textual
        properties of the model """
    def __init__(self, *args, **kwargs):
        self.indexed_fields = kwargs['indexed_fields']
        del kwargs['indexed_fields']
        super(FulltextIndexProperty, self).__init__(*args, **kwargs)
    
    def get_value_for_datastore(self, model_instance):
        """ persist a full-text index of the applicable properties of this instance """
        field_values = []
        
        # iterate all fields to include in the index
        for field_name in self.indexed_fields:
            # get the value of the property and tokenize it
            field_values += tokenize(str(getattr(model_instance, field_name)))

        # return a unique list of words
        return list(set(field_values))

The FulltextIndexProperty class overrides the get_value_for_datastore method, which will produce a list of unique stems of all included fields. This is the actual full-text index to be stored in DataStore. That would be a convenient place to include a feature such as ignored words or adjective expansion.

Because FulltextIndexProperty extends StringListProperty what's actually stored is a list of unique word stems of all properties included in the index.
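
The index-building step can be tried without AppEngine or NLTK. The following is a rough sketch under stated assumptions: simple_tokenize is a stand-in for the tokenize function (it does no stemming), build_index is a hypothetical helper mirroring get_value_for_datastore, and a plain dict stands in for the model instance.

```python
import re

def simple_tokenize(text):
    """Stand-in for tokenize(): strip punctuation, lowercase, split.
    No stemming is performed here."""
    return re.sub(r"[^\w\s]", '', text).lower().split()

def build_index(record, indexed_fields):
    """Collect the unique tokens of the named fields of a dict."""
    tokens = []
    for field_name in indexed_fields:
        tokens += simple_tokenize(str(record[field_name]))
    # sorted only so the result is deterministic; the property
    # itself just needs a list of unique words
    return sorted(set(tokens))

post = {
    'title': 'A Hot Article',
    'content': 'Volcanoes are hot. Very hot!',
    'created': '2009-11-22',
}

index_words = build_index(post, ['title', 'content'])
print(index_words)
```

Note that created never makes it into the index because it isn't named in the field list, and that the three occurrences of "hot" collapse into one entry.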

Querying

In a final piece of plumbing I'll add the following static method to the BlogPost class. Note that in production I'd probably wrap this up into a base class.

@staticmethod
def fulltext_search(fti_property_name, search_string):
    # use the same tokenization we used in indexing
    # to tokenize the search string
    query = tokenize(search_string)
    
    # create a GQL where clause with a condition for
    # each search term.
    gql = "where %(conditions)s" % {'conditions' :
        ''.join(["%(prop_name)s = '%(word)s' and " % {'word' : word,
            'prop_name' : fti_property_name} for word in query])[:-5]}
    
    # query datastore
    return BlogPost.gql(gql)
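
The where clause built above can be looked at in isolation. The following is a minimal, datastore-free sketch; build_where_clause is a hypothetical helper, not part of AppEngine, and it joins the conditions with ' and ' instead of trimming a trailing separator.

```python
def build_where_clause(prop_name, words):
    """Build one equality condition per word, joined with ' and '."""
    conditions = ["%s = '%s'" % (prop_name, word) for word in words]
    return "where " + " and ".join(conditions)

clause = build_where_clause('words', ['hash', 'python'])
print(clause)
```

As I understand it, an equality filter on a list property matches when any element of the list equals the value, so chaining the conditions with and means every search term has to be present in the index. An empty search string would produce a bare "where " here, so a real implementation would want to guard against that case.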

The View

Now that I have blog entries indexing themselves and a full-text search method in the model I'm ready to write a view to search them.

def search(request):
    search_string = request.GET.get('search_string')
    
    # query datastore
    posts = BlogPost.fulltext_search('words', search_string)
        
    return render_to_response(request, 'search_results.html', {
            'posts': posts,
            'search_string': search_string
        })

That's it! The value passed in for the search_string key of the query will be built into a GQL where clause to perform a fast full-text search. This system takes advantage of the StringListProperty which allows us to store the index directly in the entities.

Next Steps

This implementation is rather simplistic and not much better than the SearchableModel. All words are given the same weight (there are no term vectors), no consideration is given to word proximity or occurrence counts, and exact-phrase searches aren't handled. However, with a little creativity those features and many others could be handled, which would justify the effort.

Created on 2009-11-22 20:11:00 UTC
 
I sure was naive. When I launched a certain django-based site that accepted user comments (wonder which one that is?) a while back I thought I could block the comment spam myself without CAPTCHA. After a few months of traffic I started getting hammered with it and tried blocking IPs, keywords and patterns, all to no avail.

The trouble spot was a straightforward, regular old HTML form that accepted the comment input. I needed it to satisfy the site's wide browser requirements. My AJAX-jQuery-to-django-piston-service comment submissions were rarely the source of spam, but I needed my regular forms locked down as well.

I toyed with the idea of rolling my own CAPTCHA but I honestly have bigger fish to fry. Turns out that integrating reCAPTCHA with django was a cinch and solved my comment spam problems.

Here's how to do it.

Step #1: Get a reCAPTCHA Account

reCAPTCHA is a service and all the heavy lifting is done on reCAPTCHA's servers. Because of that you must sign up for an account to use the service here.

By default a key works on a single domain, but you can also create your key as "Global", allowing it to work on multiple sites.

Step #2: Install recaptcha-client

In order for django (or any other python code) to use the reCAPTCHA service you must install the recaptcha-client library. This is most easily accomplished with setuptools:

easy_install recaptcha-client

Alternatively you can install it directly from source by downloading it from: http://pypi.python.org/pypi/recaptcha-client

Step #3: Add reCAPTCHA to a Template

Now it's time for some actual web development. I'll start out by putting the familiar reCAPTCHA interface in a django template. Notice that I have it as part of a form named edit_form. There are also a couple of locations where you have to insert your public key, which you get when you sign up for a reCAPTCHA account.

<form  action="#" method="POST">
  <table>
    {{ edit_form }}
    <tr>
        <th>Are you human?</th>
        <td>
            <span class="validation_error">{{ captcha_response }}</span>
        
            <script type="text/javascript"
            src="http://api.recaptcha.net/challenge?k=[[ YOUR PUBLIC KEY ]]">
            </script>
            
            <noscript>
            <iframe src="http://api.recaptcha.net/noscript?k=[[ YOUR PUBLIC KEY ]]"
            height="300" width="500" frameborder="0"></iframe><br>
            <textarea name="recaptcha_challenge_field" rows="3" cols="40">
            </textarea>
            <input type="hidden" name="recaptcha_response_field" 
            value="manual_challenge">
            </noscript>
        </td>
    </tr>
    <tr>
        <th></th>
        <td><input type="submit" value="Save"/></td>
    </tr>
  </table>
</form>

Step #4: Handle reCAPTCHA Upon Form Submission

Now I'll set up a view to handle the template and form submission. This would all live in your application's views.py.

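# load Django's forms and rendering helpers
from django import forms
from django.shortcuts import render_to_response

# load the recaptcha module
from recaptcha.client import captcha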

# create the form to be submitted
class EditForm(forms.Form):
    data_field = forms.CharField()

def myview(request):
    if request.method == 'POST':
        edit_form = EditForm(request.POST)
        # talk to the reCAPTCHA service
        response = captcha.submit(
            request.POST.get('recaptcha_challenge_field'),
            request.POST.get('recaptcha_response_field'),
            '[[ MY PRIVATE KEY ]]',
            request.META['REMOTE_ADDR'],)
        
        # see if the user correctly entered CAPTCHA information
        # and handle it accordingly.
        if response.is_valid:
            captcha_response = "YOU ARE HUMAN: %(data)s" % {'data' :
                edit_form.data['data_field']}
        else:
            captcha_response = 'YOU MUST BE A ROBOT'
        
        return render_to_response('mytemplate.html', {
                'edit_form': edit_form,
                'captcha_response': captcha_response})
    else:
        edit_form = EditForm()
        return render_to_response('mytemplate.html', {'edit_form': edit_form})

If a user enters the CAPTCHA text correctly they get a message indicating that they are human. This is where you would put logic to do something like save a comment. If a user fails to enter the CAPTCHA text correctly they get a nasty error message telling them that they must be a robot. In that code path you'd want to assume the user either entered the text wrong and will retry or that there is no real user at all.

Conclusion

It's unfortunate that we have to deal with spam bots and other abuses of our hard work. Luckily services like reCAPTCHA make it relatively easy to defend against. And the benefits extend beyond just protecting our own web content. Every time a user uses reCAPTCHA they're actually helping to digitize books on the other end.

Created on 2009-11-21 15:11:00 UTC
 
Just a quick note. I've launched a new site called PhatGoCode.com containing various bits of sample code for the Go programming language. I essentially did this to address the lack of simple and concise samples that are available at this early stage.

At the time of writing, this site is barely off the ground but will be under vigorous construction in the coming weeks. By 2009-11-22 I hope to have a system in place to allow for community contributions.

Created on 2009-11-19 21:23:00 UTC
 
Well, I've been spending a little more time fiddling with Google's new Go programming language of late and again figured I'd share some more playing-around-code.

Edit 12/2/2009: Note that I've launched PhatGoCode.com, a site full of Go example code.

HTTP Operations and XML Processing

One of my favorite examples I tend to use in higher level languages is the retrieval of twitter statuses with only out-of-the-box libraries. I was surprised how simple this task turned out to be in Go, which is more of a systems language. The HTTP get was a one-liner and the resultant XML can be unmarshaled right into native structs.

package main

import (
       "http";
       "fmt";
       "xml";
)

/* these structs will house the unmarshalled response.
   they should be hierarchically shaped like the XML
   but can omit irrelevant data. */
type Status struct {
     Text string
}

type User struct {
     XMLName xml.Name;
     Status Status;
}

func main()  {
     /* perform an HTTP request for the twitter status */
     response, _, _ := http.Get("http://twitter.com/users/chrisumbel.xml");

     /* initialize the structure of the XML response */
     var user = User{xml.Name{"", "user"}, Status{""}};
     /* unmarshal the XML into our structures */
     xml.Unmarshal(response.Body, &user);

     fmt.Printf("status: %s", user.Status.Text);
}

Object Orientation

Go's approach to object orientation is interesting. Essentially you just tack methods onto structs as is illustrated below.

package main

import "fmt"

/* basic data structure upon which we'll define methods */
type employee struct {
     salary float;
}

/* a method which will add a specified percent to an
   employee's salary */
func (this *employee) giveRaise(pct float) {
     this.salary += this.salary * pct;
}

func main() {
     /* create an employee instance */
     var e = new(employee);
     e.salary = 100000;
     /* call our method */
     e.giveRaise(0.04);

     fmt.Printf("Employee now makes %f", e.salary);
}

Go doesn't have inheritance per se but does support contracts by way of interfaces. The following code illustrates how two different kinds of things (stock positions and automobiles) share a common operation (obtaining their value).

package main

import "fmt";

type stockPosition struct {
     ticker string;
     sharePrice float;
     count float;
}

/* method to determine the value of a stock position */
func (this stockPosition) getValue() float {
     return this.sharePrice * this.count;
}

type car struct {
     make string;
     model string;
     price float;
}

/* method to determine the value of a car */
func (this car) getValue() float {
     return this.price;
}

/* contract that defines different things that have value */
type valuable interface {
     getValue() float;
}

/* anything that satisfies the "valuable" interface is accepted */
func showValue(asset valuable) {
     fmt.Printf("Value of the asset is %f\n", asset.getValue());
}

func main() {
     var o valuable = stockPosition{ "GOOG", 577.20, 4 };
     showValue(o);
     o = car{ "BMW", "M3", 66500 };
     showValue(o);
}
Created on 2009-11-17 20:00:00 UTC
 