Few things are more useful than a good full-text search. It's clearly the easiest way for users to actively drill down into the content they want. It's also quite easy for the Ruby programmer to implement thanks to Ferret, an Apache Lucene-inspired search engine library.
Building an Index
The first step to implementing a search is to get an index built. The following code illustrates creating an index with two documents in it.
require 'ferret'
include Ferret
# get or create an index on the filesystem
index = Index::Index.new(:path => './test.idx')
# store a document
index << {
  :title => 'A Cool Article',
  :content => 'Penguins are cool.'
}
# store another document
index << {
  :title => 'A Hot Article',
  :content => 'Volcanoes are hot'
}
Querying the Index
Now that the index is built it's ready to be queried. The following code searches the index for documents with the word hot in the content field.
# search the index for the word hot in the content field
index.search_each('content: "hot"') do |id, score|
  puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
The search_each method yields the id and score of each matching document. Check out the output:
SCORE: 0.625 TITLE: A Hot Article
All fields can also be matched at once by using an asterisk as the field name:
# search the index for the word hot on all fields
index.search_each('*: "hot"') do |id, score|
  puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
One of the more useful features, especially in a web scenario, is highlighting the matched words. This is made trivial by Index's highlight method. Consider the following code, which wraps matching terms in strong tags.
# search the index for the word hot
index.search_each('content: "hot"') do |id, score|
  # put highlights into a copy of the content field
  # by way of <strong> HTML tags
  highlights = index.highlight('content: "hot"',
                               id,
                               :field => :content,
                               :pre_tag => "<strong>",
                               :post_tag => "</strong>")
  puts highlights
end
Producing:
Volcanoes are <strong>hot</strong>
It's also possible to use Ferret as a more general-purpose data store. The following code creates an index of companies and returns those with a market cap over ten billion dollars and the word grocery in any field. Note that the index is built from strings, so it's necessary to pad numbers with leading zeros to a fixed width to preserve their ordering.
index << {
  :ticker => 'GOOG',
  :name => 'Google Inc',
  :market_cap => '183000000000',
  :description => 'indexes websites and generates revenue through advertising'
}
index << {
  :ticker => 'JNJ',
  :name => 'Johnson & Johnson',
  :market_cap => '173000000000',
  :description => 'makes drugs, healthcare products and equipment'
}
index << {
  :ticker => 'WFMI',
  :name => 'Whole Foods Market, Inc',
  :market_cap => '003000000000',
  :description => 'operates organic grocery stores'
}
index << {
  :ticker => 'KR',
  :name => 'The Kroger Co.',
  :market_cap => '014000000000',
  :description => 'operates grocery stores and other retail establishments'
}
# search the index
index.search_each('market_cap:(> 010000000000) AND *:(grocery)') do |id, score|
  puts "TICKER: #{index[id][:ticker]}"
end
Resulting in:
TICKER: KR
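The zero-padding note above is easy to check outside of Ferret. Here is a minimal Python sketch (not Ferret code) that compares the market caps as plain strings, the way a string-based index would. The pad helper and the fixed width of 12 digits are assumptions made for illustration.

```python
# Illustration only: why numbers stored as strings need zero-padding.
caps = {
    "GOOG": 183000000000,
    "JNJ": 173000000000,
    "WFMI": 3000000000,
    "KR": 14000000000,
}

WIDTH = 12  # assumed fixed width, wide enough for the largest value


def pad(n, width=WIDTH):
    """Render a non-negative integer as a fixed-width, zero-padded string."""
    return str(n).zfill(width)


# sort the tickers three ways: by raw string, by padded string, by number
unpadded = sorted(caps, key=lambda t: str(caps[t]))
padded = sorted(caps, key=lambda t: pad(caps[t]))
numeric = sorted(caps, key=lambda t: caps[t])

# a string comparison against a padded threshold behaves like a numeric one
threshold = pad(10000000000)
over_ten_billion = sorted(t for t in caps if pad(caps[t]) > threshold)

print(unpadded)
print(padded)
print(over_ten_billion)
```

Only the padded strings sort in the same order as the numbers, which is why the range query above compares against 010000000000 rather than 10000000000.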
A Practical Example
That's all fine and good, but how would it be used in real life? To illustrate a popular use case I'll implement a simple application that spiders this site using Anemone and stores its data in Ferret.
require 'anemone'
require 'ferret'
require 'open-uri'
include Ferret
index = Index::Index.new(:path => './chrisumbel.idx')
# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do |anemone|
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do |page|
    # store the page in the index
    index << {
      :url => page.url,
      :title => page.doc.at('title').text,
      :content => page.doc.css('div.content_piece').text
    }
    puts "#{page.url} indexed."
  end
end
# search the index for articles with either ruby or python
index.search_each('content:(ruby OR python)') do |id, score|
  puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end
Integration with Rails
Naturally this is the kind of stuff that would be handy to use as part of your model in Rails. That's made trivial with the ActsAsFerret Rails plugin.
Just decorate your model accordingly and the fields you specify will be indexed in Ferret:
class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :content]
end
It can then be queried thusly from your controller:
class ArticlesController < ApplicationController
  def search
    @articles = Article.find_with_ferret(params[:search_string])
  end
end

I've recently deployed a django application on Google's AppEngine. I'm not sure how I've avoided it thus far, but it seems to fit my needs relatively well. DataStore (AppEngine's data storage engine) really impressed me. The python API feels so much like django's ORM that there was practically zero learning curve for a chap like me.
One thing that disappointed me, however, was the state of the search facility. The built-in google.appengine.ext.search.SearchableModel suffers from many problems outlined all over the web, namely the need to create n indexes to handle n search terms, index creation failures, the inability to exclude properties and a lack of support for common search operations.
A brief web search for alternatives turned up a semi-commercial product and a few open source offerings but nothing that really piqued my interest.
So I figured, what the heck, I'll try to roll my own. It'll give me a chance to do some special tokenization, which will come in handy considering the corpus consists partly of source code. Even if I don't stick with it, it'll surely be a fun exercise.
Keep in mind that this post is mainly recounting my experience using a few surrogate examples along the way. I'm not necessarily sold on the approach yet myself.
The Plan
The method for building the indexes seemed straightforward enough: tokenize the text of the fields I want to index, get the stems of the words, reduce the list to a unique set and store it in a StringListProperty. From there querying it will be a cakewalk, right?
After doing some home-brew tokenizing and stemming I really wasn't happy with the results (not that I expected to be). Sure, if I spent enough time with it and studied stemming algorithms I could have come up with something that wasn't too shabby, but heck, I have sites to build!
That's when I remembered the Natural Language ToolKit for python. It greatly simplifies common text processing tasks like, you guessed it, stemming and tokenizing.
The Implementation
Now that I had a plan I had to put it in motion. The first step was getting hold of the Natural Language ToolKit, which can be downloaded from the NLTK project site. I recommend installing from source because we'll need the source later.
With NLTK installed I had to put it into my AppEngine project. It wasn't enough to just import the NLTK modules I wanted to use. I had to actually copy the code into my project to ensure that it got deployed to AppEngine along with my project. This is accomplished by copying the nltk directory, along with the subdirectories of only the modules I needed (in this case the stem and tokenize subdirectories), into the application's root.
Then I had to replace __init__.py with a blank __init__.py in the nltk directory and its subdirectories. This was necessary to stop NLTK from doing the funky stuff it does upon initialization.
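Blanking those files by hand gets tedious, so here is a small sketch of how it could be scripted. The blank_init_files name and the "nltk" path in the usage comment are assumptions; point it at wherever the copied package lives in your project, and only at the copy, since it overwrites files.

```python
import os


def blank_init_files(package_root):
    """Truncate every __init__.py under package_root to an empty file.

    Returns the sorted list of paths that were emptied.
    """
    emptied = []
    for dirpath, _dirnames, filenames in os.walk(package_root):
        if "__init__.py" in filenames:
            path = os.path.join(dirpath, "__init__.py")
            # opening in write mode and closing immediately empties the file
            open(path, "w").close()
            emptied.append(path)
    return sorted(emptied)


# usage, assuming the copied package sits in ./nltk:
# blank_init_files("nltk")
```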
The Model
To continue I'll use a simple blog post model as an example. Its model definition follows.
from google.appengine.ext import db
class BlogPost(db.Model):
    title = db.StringProperty(multiline = False)
    content = db.TextProperty()
    created = db.DateTimeProperty(auto_now_add = True)
    # indexed_fields param specifies the other properties that will
    # be indexed
    words = FulltextIndexProperty(indexed_fields = ['title', 'content'])
Notice the words property of the FulltextIndexProperty type. FulltextIndexProperty is not built into AppEngine; I'll define it later. See how it specifies the names of the properties to include in the index via the indexed_fields parameter?
Enter NLTK
Now I'll perform some language processing in the model in a helper function.
import re
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer
def tokenize(text):
    """ break up some arbitrary text into tokens """
    # get rid of all punctuation
    rex = re.compile(r"[^\w\s]")
    text = rex.sub('', text)
    # create NLTK objects we'll need
    stemmer = PorterStemmer()
    tokenizer = WhitespaceTokenizer()
    # break text up into words
    tokens = tokenizer.tokenize(text)
    # get the stems of the words
    words = [stemmer.stem(token.lower()) for token in tokens]
    return words
The previous function uses NLTK to tokenize and stem the words of the supplied text. An example of stemming is converting the words "hashing" and "hashed" to "hash". That way, when a user searches for "hash", posts with the word "hashing" will be returned. This function would also be a handy place to insert some custom tokenization.
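To see the idea without installing NLTK, here is a deliberately crude, standard-library-only sketch. It is not the Porter algorithm; naive_stem, naive_tokenize and the tiny suffix list are made up for illustration and will mangle plenty of real words.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # deliberately tiny, illustrative list


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def naive_tokenize(text):
    """Lowercase, drop punctuation, split on whitespace, then stem."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [naive_stem(token) for token in text.split()]


print(naive_tokenize("Hashing, hashed and hashes!"))
# → ['hash', 'hash', 'and', 'hash']
```

A real stemmer handles far more cases (and far more carefully), which is exactly why leaning on NLTK beats growing a suffix list by hand.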
The Index
With that out of the way I'll define the FulltextIndexProperty type.
class FulltextIndexProperty(db.StringListProperty):
    """ Property that stores a full-text index of other textual
    properties of the model """
    def __init__(self, *args, **kwargs):
        self.indexed_fields = kwargs['indexed_fields']
        del kwargs['indexed_fields']
        super(FulltextIndexProperty, self).__init__(*args, **kwargs)
    def get_value_for_datastore(self, model_instance):
        """ persist a full-text index of the applicable properties of this instance """
        field_values = []
        # iterate all fields to include in the index
        for field_name in self.indexed_fields:
            # get the value of the property and tokenize it
            field_values += tokenize(str(getattr(model_instance, field_name)))
        # return a unique list of words
        return list(set(field_values))
The FulltextIndexProperty class overrides the get_value_for_datastore method, which produces a list of the unique stems of all included fields. This is the actual full-text index to be stored in DataStore. That method would be a convenient place to include a feature such as ignored words or adjective expansion.
Because FulltextIndexProperty extends StringListProperty what's actually stored is a list of unique word stems of all properties included in the index.
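The shape of the stored value is easier to see away from DataStore. This sketch uses a plain dict in place of a model instance and a simple lowercase word split in place of the stemming tokenizer; build_index and simple_tokens are hypothetical names, and the result is sorted only to make the output predictable.

```python
import re


def simple_tokens(text):
    """Lowercase word split; a stand-in for the stemming tokenizer."""
    return re.findall(r"\w+", text.lower())


def build_index(record, indexed_fields, tokenizer=simple_tokens):
    """Collect the unique tokens of the named fields into a sorted list."""
    words = set()
    for field_name in indexed_fields:
        words.update(tokenizer(str(record.get(field_name, ""))))
    return sorted(words)


post = {"title": "Hot Article", "content": "Volcanoes are hot, hot, hot."}
index = build_index(post, ["title", "content"])
print(index)
# → ['are', 'article', 'hot', 'volcanoes']
```

However many times a word occurs, and in however many fields, it appears once in the index, so the list only says whether a word is present.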
Querying
In a final piece of plumbing I'll add the following static method to the BlogPost class. Note that in production I'd probably wrap this up into a base class.
    @staticmethod
    def fulltext_search(fti_property_name, search_string):
        # use the same tokenization we used in indexing
        # to tokenize the search string
        query = tokenize(search_string)
        # create a GQL where clause with a condition for
        # each search term.
        gql = "where " + " and ".join(
            "%s = '%s'" % (fti_property_name, word) for word in query)
        # query datastore
        return BlogPost.gql(gql)
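To see what fulltext_search hands to DataStore, here is the clause-building step pulled out into a standalone function. build_where_clause is a hypothetical name, and it assumes the terms have already been through the tokenizer, so they contain no quotes or other punctuation.

```python
def build_where_clause(prop_name, terms):
    """Join one equality condition per search term with 'and'."""
    conditions = ["%s = '%s'" % (prop_name, term) for term in terms]
    return "where " + " and ".join(conditions)


print(build_where_clause("words", ["hash", "tabl"]))
# → where words = 'hash' and words = 'tabl'
```

An equality filter on a list property matches when any element of the list equals the value, so chaining one condition per term returns only the posts whose index contains every term.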
The View
Now that I have blog entries indexing themselves and a full-text search method in the model I'm ready to write a view to search them.
def search(request):
    search_string = request.GET.get('search_string')
    # query datastore
    posts = BlogPost.fulltext_search('words', search_string)
    return render_to_response(request, 'search_results.html', {
        'posts': posts,
        'search_string': search_string
    })
That's it! The value passed in for the search_string key of the query will be built into a GQL where clause to perform a fast full-text search. This system takes advantage of the StringListProperty which allows us to store the index directly in the entities.
Next Steps
This implementation is rather simplistic and not much better than the SearchableModel. All words are given the same weight (there are no term vectors), no consideration is given to word proximity or occurrence counts, and exact-phrase searches aren't handled. However, with a little creativity those features and many others could be handled, which would justify the effort.

I sure was naive. When I launched a certain django-based site that accepted user comments (wonder which one that is?) a while back, I thought I could block the comment spam myself without CAPTCHA. After a few months of traffic I started getting hammered with it and tried blocking IPs, keywords and patterns, all to no avail.
The trouble spot was a straightforward, regular old HTML form that accepted the comment input. I needed it to satisfy the site's wide browser-support requirements. My AJAX-jQuery-to-django-piston-service comment submissions were rarely the source of spam, but I needed my regular forms locked down as well.
I toyed with the idea of rolling my own CAPTCHA but I honestly have bigger fish to fry. Turns out that integrating reCAPTCHA with django was a cinch and solved my comment spam problems.
Here's how to do it.
Step #1: Get a reCAPTCHA Account
reCAPTCHA is a service and all the heavy lifting is done on reCAPTCHA's servers. Because of that you must sign up for an account on the reCAPTCHA site to use the service.
By default a key works on a single domain, but you can also create your key as "Global", allowing it to work on multiple sites.
Step #2: Install recaptcha-client
In order for django (or any other python code) to use the reCAPTCHA service you must install the recaptcha-client library. This is most easily accomplished with setuptools:
easy_install recaptcha-client
Alternatively you can install it directly from source by downloading it from: http://pypi.python.org/pypi/recaptcha-client
Step #3: Add reCAPTCHA to a Template
Now it's time for some actual web development. I'll start out by putting the familiar reCAPTCHA interface in a django template. Notice that it sits in the same HTML form that renders edit_form. There are also a couple of locations where you have to insert your public key, which you get when you sign up for a reCAPTCHA account.
<form action="#" method="POST">
<table>
{{ edit_form }}
<tr>
<th>Are you human?</th>
<td>
<span class="validation_error">{{ captcha_response }}</span>
<script type="text/javascript"
src="http://api.recaptcha.net/challenge?k=[[ YOUR PUBLIC KEY ]]">
</script>
<noscript>
<iframe src="http://api.recaptcha.net/noscript?k=[[ YOUR PUBLIC KEY ]]"
height="300" width="500" frameborder="0"></iframe><br>
<textarea name="recaptcha_challenge_field" rows="3" cols="40">
</textarea>
<input type="hidden" name="recaptcha_response_field"
value="manual_challenge">
</noscript>
</td>
</tr>
<tr>
<th></th>
<td><input type="submit" value="Save"/></td>
</tr>
</table>
</form>
Step #4: Handle reCAPTCHA Upon Form Submission
Now I'll set up a view to handle the template and form submission. This would all live in your application's views.py.
# load django's form and rendering helpers and the recaptcha module
from django import forms
from django.shortcuts import render_to_response
from recaptcha.client import captcha
# create the form to be submitted
class EditForm(forms.Form):
    data_field = forms.CharField()
def myview(request):
    if request.method == 'POST':
        edit_form = EditForm(request.POST)
        # talk to the reCAPTCHA service
        response = captcha.submit(
            request.POST.get('recaptcha_challenge_field'),
            request.POST.get('recaptcha_response_field'),
            '[[ MY PRIVATE KEY ]]',
            request.META['REMOTE_ADDR'],)
        # see if the user correctly entered CAPTCHA information
        # and handle it accordingly.
        if response.is_valid:
            captcha_response = "YOU ARE HUMAN: %(data)s" % {'data' :
                edit_form.data['data_field']}
        else:
            captcha_response = 'YOU MUST BE A ROBOT'
        return render_to_response('mytemplate.html', {
            'edit_form': edit_form,
            'captcha_response': captcha_response})
    else:
        edit_form = EditForm()
        return render_to_response('mytemplate.html', {'edit_form': edit_form})
If a user enters the CAPTCHA text correctly they get a message indicating that they are human. This is where you would put logic to do something like save a comment. If a user fails to enter the CAPTCHA text correctly they get a nasty error message telling them that they must be a robot. In that code path you'd want to assume the user either entered the text wrong and will retry or that there is no real user at all.
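As a rough sketch of where the real work would go, the decision can be pulled out of the view into a small function that saves the comment only when the check passed. Everything here is hypothetical: handle_comment, the save callback and the messages are placeholders, and captcha_ok stands in for the is_valid flag that comes back from the reCAPTCHA service.

```python
def handle_comment(form_data, captcha_ok, save):
    """Save the comment only when the CAPTCHA check passed.

    Returns the message to show next to the CAPTCHA widget.
    """
    if not captcha_ok:
        # wrong answer or a bot: save nothing and let the form be resubmitted
        return "Please try the CAPTCHA again."
    save(form_data["data_field"])
    return "Thanks, your comment was saved."


saved = []  # stands in for whatever actually stores comments
print(handle_comment({"data_field": "Nice post"}, True, saved.append))
print(handle_comment({"data_field": "spam"}, False, saved.append))
print(saved)
# → ['Nice post']
```

Keeping the decision apart from the request handling also makes it easy to exercise without talking to the reCAPTCHA servers.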
Conclusion
It's unfortunate that we have to deal with spam bots and other abuses of our hard work. Luckily, services like reCAPTCHA make it relatively easy to defend against them. And the benefits extend beyond just protecting our own web content. Every time a user uses reCAPTCHA they're actually helping to digitize books on the other end.

At the time of writing, this site is barely off the ground but will be under vigorous construction in the coming weeks. By 2009-11-22 I hope to have a system in place to allow for community contributions.

Well, I've been spending a little more time fiddling with Google's new Go programming language of late and again figured I'd share some more playing-around code.
Edit 12/2/2009: Note that I've launched PhatGoCode.com, a site full of Go example code.
HTTP Operations and XML Processing
One of my favorite examples to use in higher-level languages is the retrieval of twitter statuses with only out-of-the-box libraries. I was surprised how simple this task ended up being in Go, which is more of a systems language. The HTTP get was a one-liner and the resultant XML can be unmarshaled right into native structs.
package main
import (
"http";
"fmt";
"xml";
)
/* these structs will house the unmarshalled response.
they should be hierarchically shaped like the XML
but can omit irrelevant data. */
type Status struct {
Text string
}
type User struct {
XMLName xml.Name;
Status Status;
}
func main() {
/* perform an HTTP request for the twitter status */
response, _, _ := http.Get("http://twitter.com/users/chrisumbel.xml");
/* initialize the structure of the XML response */
var user = User{xml.Name{"", "user"}, Status{""}};
/* unmarshal the XML into our structures */
xml.Unmarshal(response.Body, &user);
fmt.Printf("status: %s", user.Status.Text);
}
Object Orientation
Go's approach to object orientation is interesting. Essentially you just tack methods onto structs as is illustrated below.
package main
import "fmt"
/* basic data structure upon which we'll define methods */
type employee struct {
salary float;
}
/* a method which will add a specified percent to an
employee's salary */
func (this *employee) giveRaise(pct float) {
this.salary += this.salary * pct;
}
func main() {
/* create an employee instance */
var e = new(employee);
e.salary = 100000;
/* call our method */
e.giveRaise(0.04);
fmt.Printf("Employee now makes %f", e.salary);
}
Go doesn't have inheritance per se but does support contracts by way of interfaces. The following code illustrates how two different kinds of things (stock positions and automobiles) share a common operation (obtaining their value).
package main
import "fmt";
type stockPosition struct {
ticker string;
sharePrice float;
count float;
}
/* method to determine the value of a stock position */
func (this stockPosition) getValue() float {
return this.sharePrice * this.count;
}
type car struct {
make string;
model string;
price float;
}
/* method to determine the value of a car */
func (this car) getValue() float {
return this.price;
}
/* contract that defines different things that have value */
type valuable interface {
getValue() float;
}
/* anything that satisfies the "valuable" interface is accepted */
func showValue(asset valuable) {
fmt.Printf("Value of the asset is %f\n", asset.getValue());
}
func main() {
var o valuable = stockPosition{ "GOOG", 577.20, 4 };
showValue(o);
o = car{ "BMW", "M3", 66500 };
showValue(o);
}
