Chris Umbel

Home-Brewing a Full-Text Search in Google's AppEngine

I've recently deployed a django application on Google's AppEngine. I'm not sure how I've avoided it thus far, but it seems to fit my needs relatively well. DataStore (AppEngine's data storage engine) really impressed me. The python API feels so much like django's ORM that there was practically zero learning curve for a chap like me.

One thing that disappointed me, however, was the state of the search facility. The built-in google.appengine.ext.search.SearchableModel suffers from many problems outlined all over the web: the need to create n indexes to handle n search terms, index creation failures, the inability to exclude properties and a lack of support for common search operations.

A brief web search for alternatives turned up a semi-commercial product and a few open source offerings but nothing that really piqued my interest.

So I figured, what the heck, I'll try to roll my own. It'll give me a chance to do some special tokenization, which will come in handy considering the corpus is composed partly of source code. Even if I don't stick with it, it'll surely be a fun exercise.

Keep in mind that this post is mainly recounting my experience using a few surrogate examples along the way. I'm not necessarily sold on the approach yet myself.

The Plan

The method for building the indexes seemed straightforward enough: tokenize the text of the fields I want to index, get the stems of the words, reduce the list to a unique set and store it in a StringListProperty. From there querying it will be a cakewalk, right?

After doing some home-brew tokenizing and stemming I really wasn't happy with the results (not that I expected to be). Sure, if I spent enough time with it and studied stemming algorithms I could have come up with something that wasn't too shabby, but heck, I have sites to build!

That's when I remembered the Natural Language Toolkit (NLTK) for python. It greatly simplifies common text processing tasks like, you guessed it, stemming and tokenizing.

The Implementation

Now that I had a plan I had to put it in motion. The first step was getting hold of the Natural Language Toolkit, which can be downloaded here. I recommend installing from source because we'll need it later.

With NLTK installed I had to put it into my AppEngine project. It wasn't enough just to import the NLTK modules I wanted to use. I had to actually copy the code into my project to ensure that it got deployed to AppEngine along with my project. This is accomplished by copying the nltk directory and the subdirectories of only the modules I needed into the application's root (in this case the stem and tokenize subdirectories).

Then I had to replace __init__.py with a blank __init__.py in the nltk directory and its subdirectories. This was necessary to stop NLTK from doing the funky stuff it does upon initialization.
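A little script like the following automates both steps. It's just a sketch; the NLTK_SRC and APP_ROOT paths are assumptions, so point them at your own NLTK source checkout and project root.

import os
import shutil

# these paths are assumptions -- adjust them to your NLTK source
# checkout and your AppEngine project root
NLTK_SRC = os.path.expanduser('~/src/nltk/nltk')
APP_ROOT = '.'

# copy only the NLTK modules the index actually needs
for module in ('stem', 'tokenize'):
    shutil.copytree(os.path.join(NLTK_SRC, module),
                    os.path.join(APP_ROOT, 'nltk', module))

# blank out every __init__.py (including the package root) so NLTK
# skips the initialization work it normally does at import time
open(os.path.join(APP_ROOT, 'nltk', '__init__.py'), 'w').close()
for dirpath, dirnames, filenames in os.walk(os.path.join(APP_ROOT, 'nltk')):
    if '__init__.py' in filenames:
        open(os.path.join(dirpath, '__init__.py'), 'w').close()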

The Model

To continue I'll use a simple blog post model as an example case. Its model definition follows.

from google.appengine.ext import db

class BlogPost(db.Model):
    title = db.StringProperty(multiline = False)
    content = db.TextProperty()
    created = db.DateTimeProperty(auto_now_add = True)
    # indexed_fields param specifies the other properties that will
    # be indexed
    words = FulltextIndexProperty(indexed_fields = ['title', 'content'])

Notice the words property of the FulltextIndexProperty type. FulltextIndexProperty is not built into AppEngine; I'll define it later. See how it specifies the names of the properties to include in the index via the indexed_fields parameter?

Enter NLTK

Now I'll perform some language processing in the model in a helper function.

import re

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer

def tokenize(text):
    """ break up some arbitrary text into tokens """
    # get rid of all punctuation
    rex = re.compile(r"[^\w\s]")
    text = rex.sub('', text)

    # create NLTK objects we'll need
    stemmer = PorterStemmer()
    tokenizer = WhitespaceTokenizer()

    # break text up into words
    tokens = tokenizer.tokenize(text)

    # get the stems of the words
    words = [stemmer.stem(token.lower()) for token in tokens]

    return words

The previous function uses NLTK to tokenize and stem the words of the supplied text. An example of stemming is converting the words "hasher", "hashing" and "hashed" to "hash". That way, when a user searches for "hash", posts containing the word "hashing" will be returned. That would also be a handy place to insert some custom tokenization.
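To make that concrete, here's roughly what the function produces. The exact stems are up to the Porter stemmer, so treat this as a sketch.

# punctuation is stripped, the words are lowercased and
# each one is reduced to its stem
print tokenize('Hashing, hashed and hashes!')
# ['hash', 'hash', 'and', 'hash']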

The Index

With that out of the way I'll define the FulltextIndexProperty type.

class FulltextIndexProperty(db.StringListProperty):
    """ Property that stores a full-text index of other textual
        properties of the model """
    def __init__(self, *args, **kwargs):
        self.indexed_fields = kwargs.pop('indexed_fields')
        super(FulltextIndexProperty, self).__init__(*args, **kwargs)
    
    def get_value_for_datastore(self, model_instance):
        """ persist a full-text index of the applicable properties of this instance """
        field_values = []
        
        # iterate all fields to include in the index
        for field_name in self.indexed_fields:
            # get the value of the property and tokenize it
            field_values += tokenize(str(getattr(model_instance, field_name)))

        # return a unique list of words
        return list(set(field_values))

The FulltextIndexProperty class overrides the get_value_for_datastore method, which will produce a list of unique stems of all included fields. This is the actual full-text index to be stored in DataStore. That would be a convenient place to include a feature such as ignored words or adjective expansion.

Because FulltextIndexProperty extends StringListProperty what's actually stored is a list of unique word stems of all properties included in the index.
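A quick interactive sketch shows what would actually be persisted for a post. The exact stems and their order are up to the Porter stemmer and the set, respectively.

post = BlogPost(title = 'Hashing Strings',
                content = 'I hashed some strings.')

# get_value_for_datastore builds the index that put() will store;
# order isn't guaranteed since it comes from a set
print BlogPost.properties()['words'].get_value_for_datastore(post)
# e.g. ['hash', 'string', 'i', 'some']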

Querying

In a final piece of plumbing I'll add the following static method to the BlogPost class. Note that in production I'd probably wrap this up into a base class.

@staticmethod
def fulltext_search(fti_property_name, search_string):
    # use the same tokenization we used in indexing
    # to tokenize the search string
    words = tokenize(search_string)

    # create a GQL where clause with a condition for each
    # search term. tokenize has already stripped punctuation,
    # so the words are safe to embed directly
    gql = 'where ' + ' and '.join(
        ["%(prop_name)s = '%(word)s'" % {'word' : word,
            'prop_name' : fti_property_name} for word in words])

    # query datastore
    return BlogPost.gql(gql)
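For example, a search for "hashing python" stems down to 'hash' and 'python' and produces a query like the one in the comment below. A sketch of the round trip:

# the two search terms become two equality conditions against
# the words list; an entity matches when its list contains both
posts = BlogPost.fulltext_search('words', 'hashing python')
# executes: where words = 'hash' and words = 'python'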

The View

Now that I have blog entries indexing themselves and a full-text search method in the model I'm ready to write a view to search them.

def search(request):
    search_string = request.GET.get('search_string')
    
    # query datastore
    posts = BlogPost.fulltext_search('words', search_string)
        
    return render_to_response(request, 'search_results.html', {
            'posts': posts,
            'search_string': search_string
        })

That's it! The value passed in for the search_string key of the query string will be built into a GQL where clause to perform a fast full-text search. This system takes advantage of the StringListProperty, which allows us to store the index directly in the entities.
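The detail that makes this work is that an equality filter on a list property matches when any element of the list equals the operand, so each stem condition behaves like a lookup in an inverted index. A minimal illustration:

# matches every BlogPost whose words list contains the stem 'hash',
# no matter where in the list it appears
posts = BlogPost.gql("where words = 'hash'")
for post in posts:
    print post.title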

Next Steps

This implementation is rather simplistic and not much better than SearchableModel. All words are given the same weight (there are no term vectors), no consideration is given to word proximity or occurrence counts, and exact-phrase searches aren't handled. However, with a little creativity those features and many others could be handled, which would justify the effort.
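As a taste of what's possible, exact-phrase search could be approximated by using the stem index to narrow the candidates and then verifying the phrase in python. A rough sketch, assuming result sets are small enough to post-filter:

def phrase_search(phrase):
    # the stem index narrows the candidates cheaply...
    candidates = BlogPost.fulltext_search('words', phrase)

    # ...then the exact phrase is verified against the raw text
    return [post for post in candidates
            if phrase.lower() in post.content.lower()]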

Sun Nov 22 2009 20:11:00 GMT+0000 (UTC)
