Chris Umbel

Google Wave Robots in Java

When Google Wave was first announced I was pretty excited. The concept seemed perfect: broad like Twitter but rich like email, brief like instant messenger but collaborative like a message board.

Things have been somewhat slow going in beta thus far. But hey, it's still beta. If Google refines it a bit and Wave catches on (what actually does catch on these days seems to be a crap-shoot) it has the potential to provide tons of value.

One of the possibilities I find particularly interesting is the use of robots: automated programs that are participants in the conversation. No, there's nothing underhanded about it; a robust robot API is provided for that very purpose.

Shortly after getting development sandbox access I had to get to work on one. While I'm going to keep the features of the actual bot I'm writing close to the vest for now, I'll at least share an example I used while learning.

Platform

Google Wave robots must exist on Google's AppEngine, at least for now (this restriction will ultimately go away). That limits your language choice to either Python or Java while using the AppEngine SDK. When using Java you also have to include the json.jar and jsonrpc.jar libraries in your /war/WEB-INF/lib/, both of which can be found here.

I got started developing for Wave with Java. I'm not exactly sure how that happened considering how I love me some Python. Nonetheless I dusted off my Java cap and got to work. It's been a while, be patient with me, please.

Handling

From a Java point of view Wave robots are simply servlets that process events. What kind of events? Anything from a new participant entering a wave (a conversation) to a blip (the basic atom of a wave) being started or completed. What's important, however, is that you declare up front which events you plan on handling. That's accomplished by creating a /war/_wave/capabilities.xml file similar to what follows.

<?xml version="1.0" encoding="utf-8"?>
<w:robot xmlns:w="http://wave.google.com/extensions/robots/1.0">
  <w:capabilities>
    <w:capability name="BLIP_SUBMITTED" content="true" />
  </w:capabilities>
  <w:version>1</w:version>
</w:robot>

That example specifies that the servlet will be called whenever a blip is submitted. Note that if you want to change which events are handled in this file you must increment the version tag in order for your changes to take effect.

Servlet

I might as well hit you straight up with it. Essentially you have to subclass com.google.wave.api.AbstractRobotServlet and override processEvents. It's within processEvents that you'll perform your magic.

In the case of this example I'll read out the text of the previously submitted blip (the one that fired this event) and try to find stock ticker symbols by way of the pattern "ticker:", e.g. "ticker:GOOG". If I think I've found any I'll use Google's Finance REST API to look up their prices and write them back as blips.

import com.google.wave.api.*;
import java.net.*;
import java.io.*;
import org.json.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class StockPriceBotServlet extends AbstractRobotServlet {
    private static final long serialVersionUID = 1L;

    @Override
    public void processEvents(RobotMessageBundle bundle) {
        Wavelet wavelet = bundle.getWavelet();
        String ticker;
        
        for (Event e: bundle.getEvents()) {
            if (e.getType() == EventType.BLIP_SUBMITTED) {
                /* grab the text of the blip that fired this event */
                String userBlipText = e.getBlip().getDocument().getText();
                
                /* search for the trigger to act */
                Matcher matcher = Pattern.compile("(ticker\\:)(\\w*)").matcher(userBlipText);
                
                /* iterate all matches */
                while (matcher.find()) {
                    /* add a blip to the wave */
                    Blip blip = wavelet.appendBlip();
                    TextView textView = blip.getDocument();
                    ticker = matcher.group(2);
                        
                    try {
                        /* connect to google */
                        URL url = new URL(String.format("http://www.google.com/finance/info?client=ig&q=%s", ticker));
                        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
                        String inputLine;
                        StringBuilder sb = new StringBuilder();
    
                        /* read response from google into a StringBuilder */
                        while ((inputLine = reader.readLine()) != null)
                            sb.append(inputLine);
    
                        /* get rid of wrapper google adds */
                        sb.delete(0, 4);	
                        /* parse response into a JSON object */
                        JSONObject o = new JSONObject(sb.toString());
                        /* pull out the property named "l" and send it back to wave */
                        textView.append(String.format("%s: %s", ticker, o.getString("l")));
    
                        reader.close();			        
                    } catch(Exception ex) {
                        /* swallow lookup/parsing errors so one bad ticker doesn't break the blip */
                    }
                }
            }
        }
    }
}

Deployment

Before you deploy you must set up your /war/WEB-INF/web.xml like so:

<web-app>
    <servlet>
        <servlet-name>StockPriceBot</servlet-name>
        <servlet-class>StockPriceBotServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>StockPriceBot</servlet-name>
        <url-pattern>/_wave/robot/jsonrpc</url-pattern>
    </servlet-mapping>
</web-app>

Now you must deploy the bot to AppEngine as you would any Java AppEngine project.
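If you're not deploying through an IDE plugin, the AppEngine Java SDK's appcfg tool can push the app from the command line. A minimal sketch, assuming the SDK's bin directory is on your PATH and you're sitting in the project root (use appcfg.sh instead of appcfg.cmd on non-Windows machines):

appcfg.cmd update war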

Use

Now it's time to actually use this contraption. It's rather straightforward. Just invite your-app-id@appspot.com to your wave, where your-app-id is the appspot application ID of the code you just deployed. From there just have a conversation with the bot as follows:

[Image: example wave output]

Next Steps

Wave Eliza, perhaps?

Mon Dec 07 2009 23:12:00 GMT+0000 (UTC)


Employing Solr/Lucene with SQL Server for Full-Text Searching

I've been fiddling with Lucene a good bit of late and have been quite impressed. It's more than just a "blazing fast" full-text indexing system, especially when implemented via Solr. With Solr it becomes an incredibly scalable, full-featured and extensible search engine platform.

I had always assumed that the Lucene stack wasn't for me. For the most part I store my data either in SQL Server or MySQL, both of which have perfectly adequate full-text search capability. It turns out that I could have saved myself a few headaches and saved my employer some money by adopting Solr and not writing my own faceting, caching, etc.

Naturally, Lucene/Solr isn't for everyone. If you just have a few hundred-thousand rows of text that you want to perform some basic searches on under light load then you're probably better off using the full-text search facility within your RDBMS.

However, if you need to scale out widely, perform faceted searches or use some advanced/custom search techniques then it's probably worth looking into Solr, even if you're already deployed under an RDBMS with full-text support.

In this article I'll outline the *VERY* basics of getting Solr up and running using SQL Server as a data source. While I'm actually doing this in production under Linux I'm going to tailor my instructions to Windows here to appeal to the average SQL Server DBA. I'll also employ the AdventureWorks sample database for demonstrative purposes.

Note that you'll have to have TCP/IP enabled in your SQL Server instance. Named pipes, VIA and shared memory won't cut it.

Step 1: Download and install Java

Solr and Lucene are written in Java so a Java Runtime is a prerequisite. It can be downloaded here.

After installation make sure to set the JRE_HOME environment variable to your Java install directory, e.g. C:\Program Files\Java\jre6
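One way to do that on Vista/7 is from a command prompt (adjust the path to your actual install directory; you'll need to open a new prompt afterward for the change to be picked up):

setx JRE_HOME "C:\Program Files\Java\jre6"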

Step 2: Download Tomcat

Solr requires a servlet container. I recommend Tomcat which can be downloaded here. Then extract it to C:\tomcat6 (Note that I'm going to hang this all right off C:\ to keep the tutorial simple).

Step 3: Download Solr

This whole thing's about Solr, right? You can pick it up here. Extract the contents to a temporary location.

Step 4: Move Solr into Tomcat

Copy:

  • apache-solr-1.4.0\example\solr to c:\tomcat6
  • apache-solr-1.4.0\dist\apache-solr-1.4.0.war to c:\tomcat6\webapps\solr.war

Congratulations! Solr is essentially operational now, or would be upon starting Tomcat. It'd just be devoid of data.

Step 5: Download and install a SQL Server JDBC driver

In order for Java to talk to SQL Server we'll have to supply a JDBC driver. There are many available but I used Microsoft's which can be downloaded here. Note that there's also a unix version available.

Now create a C:\tomcat6\solr\lib folder. Copy the file sqljdbc4.jar out of the archive downloaded above into it.

Step 6: Configure the import

Create a C:\tomcat6\solr\conf\data-config.xml file and put the following content in it, modifying it to the details of your configuration, naturally. This file defines what data we're going to import (the SQL statement), how we're going to get it (the JDBC driver class) and where from (the connection string and authentication information). The resultant columns are then mapped to fields in Lucene.

<dataConfig>
  <dataSource type="JdbcDataSource"
	    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
	    url="jdbc:sqlserver://localhost\INSTANCENAME;databaseName=AdventureWorks" 
	    user="TESTUSER"
	    password="TESTUSER"/>
  <document name="productreviews">
    <entity name="review" query="
        SELECT ProductReviewID, ProductID, EmailAddress, Comments
        FROM Production.ProductReview">
	    
      <field column="ProductReviewID" name="id"/>
      <field column="ProductID" name="product_id"/>
      <field column="EmailAddress" name="email"/>
      <field column="Comments" name="comments"/>
    </entity>
  </document>
</dataConfig>

Step 7: Tell Solr about our import

Add the following requesthandler to C:\tomcat6\solr\conf\solrconfig.xml:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">C:\tomcat6\solr\conf\data-config.xml</str>
  </lst>
</requestHandler>

This essentially allows Solr to run the data import we defined above whenever the /dataimport URL is visited.

Step 8: Configure schema

Ensure the fields are set up correctly in C:\tomcat6\solr\conf\schema.xml. There will be plenty of example fields, copy fields, dynamic fields and a default search field in there to start with. Just get rid of them.

 <fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="comments" type="text" indexed="true" stored="true"/>
  <field name="email" type="string" indexed="true" stored="true"/>
  <field name="product_id" type="int" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
 </fields>

 <copyField source="comments" dest="text"/>
 <copyField source="email" dest="text"/>
  
 <defaultSearchField>text</defaultSearchField>

Solr schemas offer quite a bit of power that I won't go into in this article: dynamic fields, copy fields, compression... Needless to say it's worth reading up on, which you can do here.

Step 9: Start Tomcat

OK! We're finally configured well enough for an import. All we have to do is start up Tomcat. Make sure you're in Tomcat's directory as the quick-and-dirty configuration I showed you here requires it in order to find the Solr webapp.

c:\tomcat6>.\bin\startup.bat

If you'd like to move the Solr webapp elsewhere on the filesystem, remove the requirement for starting in Tomcat's directory or perform an advanced configuration please see the Solr with Apache Tomcat article in the Solr Wiki. Pay special attention to the section labeled, "Installing Solr instances under Tomcat" where they show you how to create contexts.
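For reference, the approach described there boils down to dropping a small context fragment into Tomcat's conf\Catalina\localhost directory that points Tomcat at the war and at the Solr home. A rough sketch, with the file name and paths assumed to match this tutorial's layout:

<Context docBase="C:\tomcat6\webapps\solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="C:\tomcat6\solr" override="true"/>
</Context>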

Step 10: Import

Now visit http://localhost:8080/solr/dataimport?command=full-import with a web browser. That'll trigger the import. Because we're just importing a small amount of test data the process will be nearly instantaneous.
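For larger imports that aren't instantaneous, the DataImportHandler also responds to a status command at the same URL, which reports how many documents have been fetched and indexed so far:

http://localhost:8080/solr/dataimport?command=status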

Step 11: Observe your results

That's it! You can verify your work by issuing a RESTful query against Solr like http://localhost:8080/solr/select/?q=heavy&version=2.2&start=0&rows=10&indent=on, which searches the index for all reviews with the word heavy in the comments.

Pitfalls

There are a number of reasons a data import could fail, most likely due to a problem with the configuration of data-config.xml. To see for sure what's going on you'll have to look in C:\tomcat6\solr\logs\catalina.*.

If you happen to find that your import is failing due to the system running out of memory, however, there's an easy, SQL Server-specific fix. Add responseBuffering=adaptive and selectMethod=cursor to the url attribute of the dataSource node in data-config.xml. That stops the JDBC driver from trying to load the entire result set into memory before reads can occur.
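With those two parameters appended, the dataSource element from Step 6 ends up looking something like this (same instance name, database and credentials as before):

<dataSource type="JdbcDataSource"
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
    url="jdbc:sqlserver://localhost\INSTANCENAME;databaseName=AdventureWorks;responseBuffering=adaptive;selectMethod=cursor"
    user="TESTUSER"
    password="TESTUSER"/>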

Next Steps

So we've gone from zero to a functioning Solr instance rather quickly. Not too shabby! However, we've only queried Solr through REST. Libraries like solrnet are handy for wrapping objects around the data in .NET. For example:

/* review domain object */
public class Review
{
    /*  attribute decorations tell solrnet how to map
        the properties to Solr fields. */
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("product_id")]
    public string ProductID { get; set; }

    [SolrField("email")]
    public string EmailAddress { get; set; }

    [SolrField("comments")]
    public string Text { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        /* create a session */
        Startup.Init<Review>("http://localhost:8080/solr");
        ISolrOperations<Review> solr =
                  ServiceLocator.Current.GetInstance<ISolrOperations<Review>>();
        /* issue a lucene query */
        ICollection<Review> results = solr.Query("comments:heavy");

        foreach (Review r in results)
        {
            Console.WriteLine(r.Id);
        }
    }
}

Resulting in:

2
4

If you're totally new to Solr it's worth checking out the wiki. It outlines handy features such as replication, faceting and distribution.
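Faceting, for instance, is just a couple of extra parameters on the select URL we used earlier. A query along these lines (using the product_id field from our schema) should return per-product counts of matching reviews alongside the normal results:

http://localhost:8080/solr/select/?q=heavy&facet=true&facet.field=product_id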

Sat Dec 05 2009 23:12:00 GMT+0000 (UTC)


Full-Text Indexing in Ruby Using Ferret

Few things are more useful than a good full-text search. It's clearly the easiest way for users to actively drill down into the content they want. It's also quite easy for the Ruby programmer to implement thanks to Ferret, an Apache Lucene-inspired search engine library.

Building an Index

The first step to implementing a search is to get an index built. The following code illustrates creating an index with two documents in it.

require 'ferret'
include Ferret

# get or create an index on the filesystem
index = Index::Index.new(:path => './test.idx')

# store a document
index << {
  :title => 'A Cool Article',
  :content => 'Penguins are cool.'
}

# store another document
index << {
  :title => 'A Hot Article',
  :content => 'Volcanoes are hot'
}

Querying the Index

Now that the index is built it's ready to be queried. The following code searches the index for documents with the word hot in the content field.

# search the index for the word hot in the content field
index.search_each('content: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

The search_each method yields the id of the matching documents and their scores. Check out the output:

SCORE: 0.625	TITLE: A Hot Article

All fields can also be matched with an asterisk:

# search the index for the word hot on all fields
index.search_each('*: "hot"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

One of the more useful features, especially in a web scenario, is highlighting the matched words. This is made trivial by Index's highlight method. Consider the following code, which wraps matching terms in strong tags.

# search the index for the word hot
index.search_each('content: "hot"') do | id, score |
  # put highlights into a copy of the content field
  # by way of <strong> HTML tags
  highlights = index.highlight('content: "hot"',
    id,
    :field => :content,
    :pre_tag => "<strong>",
    :post_tag => "</strong>")
  
  puts highlights
end

Producing:

Volcanoes are <strong>hot</strong>

It's also possible to use Ferret as a more general-purpose data store. The following code adds companies to the index and returns those with a market cap over ten billion dollars and the word grocery in any field. Note that the indexes are built off strings, so it's necessary to pad numbers with leading zeros to preserve ordering.

index << {
  :ticker => 'GOOG',
  :name => 'Google Inc',
  :market_cap => '183000000000',
  :description => 'indexes websites and generates revenue through advertising'
}

index << {
  :ticker => 'JNJ',
  :name => 'Johnson & Johnson',
  :market_cap => '173000000000',
  :description => 'makes drugs, healthcare products and equipment'
}

index << {
  :ticker => 'WFMI',
  :name => 'Whole Foods Market, Inc',
  :market_cap => '003000000000',
  :description => 'operates organic grocery stores'
}

index << {
  :ticker => 'KR',
  :name => 'The Kroger Co.',
  :market_cap => '014000000000',
  :description => 'operates grocery stores and other retail establishments'
}

# search the index
index.search_each('market_cap:(> 010000000000) AND *:(grocery)') do | id, score |
    puts "TICKER: #{index[id][:ticker]}"
end

Resulting in:

TICKER: KR

A Practical Example

That's all fine and good, but how would it be used in real life? To illustrate a popular use-case I'll implement a simple application that spiders this site using Anemone and stores its data in Ferret.

require 'anemone'
require 'ferret'
require 'open-uri'

include Ferret

index = Index::Index.new(:path => './chrisumbel.idx')

# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
      # store the page in the index
      index << {
        :url => page.url,
        :title => page.doc.at('title').text,
        :content => page.doc.css('div.content_piece').text
      }
      
      puts "#{page.url} indexed."
  end
end

# search the index for articles with either ruby or python
index.search_each('content: "ruby" or "python"') do | id, score |
    puts "SCORE: #{score}\tTITLE: #{index[id][:title]}"
end

Integration with Rails

Naturally this is the kind of stuff that would be handy to use as part of your model in Rails. That's made trivial with the ActsAsFerret rails plugin.

Just decorate your model accordingly and the fields you specify will be indexed in Ferret:

class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :content]
end

It can then be queried thusly from your controller:

class ArticlesController < ApplicationController
  def search
    @articles = Article.find_with_ferret(@params['search_string'])
  end
end

Sat Nov 28 2009 16:00:00 GMT+0000 (UTC)


Home-Brewing a Full-Text Search in Google's AppEngine

I've recently deployed a django application on Google's AppEngine. I'm not sure how I've avoided it thus far, but it seems to fit my needs relatively well. DataStore (AppEngine's data storage engine) really impressed me. The python API feels so much like django's ORM that there was practically zero learning curve for a chap like me.

One thing that disappointed me, however, was the state of the search facility. The built-in google.appengine.ext.search.SearchableModel suffers from many problems outlined all over the web: the need to create n indexes to handle n search terms, index creation failures, the inability to exclude properties and a lack of support for common search operations.

A brief web search for alternatives turned up a semi-commercial product and a few open source offerings but nothing that really piqued my interest.

So I figured, what the heck, I'll try to roll my own. It'll give me a chance to do some special tokenization, which will come in handy considering the corpus is partly made up of source code. Even if I don't stick with it, it'll surely be a fun exercise.

Keep in mind that this post is mainly recounting my experience using a few surrogate examples along the way. I'm not necessarily sold on the approach yet myself.

The Plan

The method for building the indexes seemed straightforward enough. Tokenize the text of the fields I want to index, get the stems of the words, reduce the list to a unique set and store it in a StringListProperty. From there querying it will be a cakewalk, right?

After doing some home-brew tokenizing and stemming I really wasn't happy with the results (not that I expected to be). Sure, if I spent enough time with it and studied stemming algorithms I could have come up with something that wasn't too shabby, but heck, I have sites to build!

That's when I remembered the Natural Language ToolKit for python. It greatly simplifies common text processing tasks like, you guessed it, stemming and tokenizing.

The Implementation

Now that I had a plan I had to put it in motion. The first step was getting a hold of the Natural Language ToolKit, which can be downloaded here. I recommend installing from source because we'll need the source files later.

With NLTK installed I had to put it into my AppEngine project. It wasn't enough just to import the NLTK modules I wanted to use. I had to actually copy the code into my project to ensure that it got deployed to AppEngine along with my project. This is accomplished by copying the nltk directory into the application's root, keeping only the subdirectories for the modules I needed (in this case the stem and tokenize subdirectories).

Then I had to replace the __init__.py files in the nltk directory and its subdirectories with blank ones. This was necessary to stop NLTK from doing the funky stuff it does upon initialization.

The Model

To continue I'll use a simple blog post model as an example-case. Its model definition follows.

class BlogPost(db.Model):
    title = db.StringProperty(multiline = False)
    content = db.TextProperty()
    created = db.DateTimeProperty(auto_now_add = True)
    # indexed_fields param specifies the other properties that will
    # be indexed
    words = FulltextIndexProperty(indexed_fields = ['title', 'content'])

Notice the words property of the FulltextIndexProperty type. FulltextIndexProperty is not built into AppEngine; I'll define it later. See how it specifies the names of the properties to include in the index via the indexed_fields parameter?

Enter NLTK

Now I'll perform some language processing in the model in a helper function.

import re

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer

def tokenize(text):
    """ break up some arbitrary text into tokens """
    # get rid of all punctuation
    rex = re.compile(r"[^\w\s]")
    text = rex.sub('', text)

    # create NLTK objects we'll need
    stemmer = PorterStemmer()
    tokenizer = WhitespaceTokenizer()

    # break text up into words
    tokens = tokenizer.tokenize(text)

    # get the stems of the words
    words = [stemmer.stem(token.lower()) for token in tokens]

    return words

The previous function uses NLTK to tokenize and stem the words of the supplied text. An example of stemming is converting the words "hasher", "hashing" and "hashed" to "hash". That way, when a user searches for "hash", posts with the word "hashing" will be returned. That would also be a handy place to insert some custom tokenization.
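As a quick sanity check, running the helper on a sample string should produce something along these lines (the exact stems are up to the Porter stemmer, so treat the output as approximate):

print tokenize("Hashing the hashed passwords")
# roughly: ['hash', 'the', 'hash', 'password']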

The Index

With that out of the way I'll define the FulltextIndexProperty type.

class FulltextIndexProperty(db.StringListProperty):
    """ Property that stores a full-text index of other textual
        properties of the model """
    def __init__(self, *args, **kwargs):
        self.indexed_fields = kwargs['indexed_fields']
        del kwargs['indexed_fields']
        super(FulltextIndexProperty, self).__init__(*args, **kwargs)
    
    def get_value_for_datastore(self, model_instance):
        """ persist a full-text index applicable properties of this instance """
        field_values = []
        
        # iterate all fields to include in the index
        for field_name in self.indexed_fields:
            # get the value of the property and tokenize it
            field_values += tokenize(str(getattr(model_instance, field_name)))

        # return a unique list of words
        return list(set(field_values))

The FulltextIndexProperty class overrides the get_value_for_datastore method, which produces a list of unique stems of all included fields. This is the actual full-text index to be stored in DataStore. That would be a convenient place to include a feature such as ignored words or adjective expansion.

Because FulltextIndexProperty extends StringListProperty what's actually stored is a list of unique word stems of all properties included in the index.

Querying

In a final piece of plumbing I'll add the following static method to the BlogPost class. Note that in production I'd probably wrap this up into a base class.

@staticmethod
def fulltext_search(fti_property_name, search_string):        
    # use the same tokenization we used in indexing
    # to tokenize the search string
    query = tokenize(search_string)
    
    # create a GQL where clause with a condition for
    # each search term.
    gql = "where %(conditions)s" % {'conditions' :
        ''.join(["%(prop_name)s = '%(word)s' and " % {'word' : word,
            'prop_name' : fti_property_name} for word in query])[:-5]}
    
    # query datastore
    return BlogPost.gql(gql)

The View

Now that I have blog entries indexing themselves and a full-text search method in the model I'm ready to write a view to search them.

def search(request):
    search_string = request.GET.get('search_string')
    
    # query datastore
    posts = BlogPost.fulltext_search('words', search_string)
        
    return render_to_response(request, 'search_results.html', {
            'posts': posts,
            'search_string': search_string
        })

That's it! The value passed in for the search_string key of the query will be built into a GQL where clause to perform a fast full-text search. This system takes advantage of the StringListProperty which allows us to store the index directly in the entities.
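To make that concrete, a search for something like "hashing passwords" gets tokenized back into stems and ends up as a GQL clause roughly like the following (assuming the Porter stemmer reduces both terms as shown in the sanity check above):

where words = 'hash' and words = 'password'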

Next Steps

This implementation is rather simplistic and not much better than the SearchableModel. All words are given the same weight (there are no term vectors), no consideration is given to word proximity or occurrence counts, and exact-phrase searches aren't handled. However, with a little creativity those features and many others could be handled, which would justify the effort.

Sun Nov 22 2009 20:11:00 GMT+0000 (UTC)
