Chris Umbel

Quick Notes on ScriptTransformers in Solr DataImportHandlers

One thing that's impressed me with Solr is the flexibility of the Data Import Handlers (DIHs). When I was new to Solr there were several times I thought for sure I'd have to write my own extension of DataImportHandler. Every time that's happened I've been wrong. A transformer or something handled my needs. Sometimes it's wonderful to be wrong! Especially when it means less code I have to write myself!

One of the aspects of DIHs that provides such great flexibility is transformers like RegexTransformer and TemplateTransformer. In this post, however, I'm going to *quickly* cover the ScriptTransformer, which allows you to employ your own custom JavaScript code in the processing of imports.

Prerequisites

Obviously you'll need a functional Solr instance. Also, ScriptTransformers require Java 6 for its built-in JavaScript (Rhino) scripting support. I'll also assume you have an understanding of how dynamicFields work.

Objective

At the office I've recently used a ScriptTransformer to build the field names of dynamicFields, and I'm going to do the same in this article. The actual use case I dealt with was very esoteric and, honestly, a bit proprietary, so I'll substitute an example data scenario here.

Basically I'll import data about students' grades for various courses from different institutions. In the resultant Solr index I'll create a dynamicField for every course to allow easy sorting of students by their grades in the courses they took.

Consider the following MySQL schema and data, and try to think beyond this sample: hundreds of schools, thousands of courses and, well, a ton of students.

create table schools (
  id int auto_increment primary key,
  name varchar(255)  
);

insert into schools (name) values ('Pitt');
insert into schools (name) values ('Penn State');

create table students (
  id int auto_increment primary key,
  first_name varchar(255),
  last_name varchar(255),
  current_school_id int references schools(id)
);

insert into students (first_name, last_name, current_school_id) values 
('John', 'Doe', 1);
insert into students (first_name, last_name, current_school_id) values 
('Bill', 'Miller', 1);
insert into students (first_name, last_name, current_school_id) values 
('Jane', 'Dow', 2);
insert into students (first_name, last_name, current_school_id) values 
('Dennis', 'Itchison', 2);

create table courses (
  id int auto_increment primary key,
  school_id int references schools(id),
  course_number varchar(10),
  name varchar(255)
);

insert into courses (school_id, course_number, name) values
(1, 'CS1501', 'Algorithm Implementations');
insert into courses (school_id, course_number, name) values
(1, 'CS1541', 'Introduction to Computer Architecture');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC465', 'Data Structures and Algorithms');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC473', 'Operating Systems');

create table grades (
  id int auto_increment primary key,
  value FLOAT,
  course_id int references courses(id),
  student_id int references students (id)
);

insert into grades (value, course_id, student_id) values (4.0, 1, 1);
insert into grades (value, course_id, student_id) values (2.5, 2, 1);
insert into grades (value, course_id, student_id) values (3.0, 3, 1);
insert into grades (value, course_id, student_id) values (3.0, 1, 2);
insert into grades (value, course_id, student_id) values (3.5, 2, 2);
insert into grades (value, course_id, student_id) values (3.5, 3, 3);
insert into grades (value, course_id, student_id) values (2.5, 4, 3);
insert into grades (value, course_id, student_id) values (3.0, 3, 4);
insert into grades (value, course_id, student_id) values (2.0, 4, 4);

Keep in mind that the idea here is that there would be far too many courses to conceivably have a sparse-style column per course if we were denormalizing a list of students. A student can also have taken courses at several of the institutions, regardless of where they're enrolled now.
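
To make that pivot concrete, Bill Miller (student 2 above) should come out the other end as a single Solr document shaped roughly like this, with one grade_<course_number> field per grade row:

<doc>
  <str name="first_name">Bill</str>
  <str name="last_name">Miller</str>
  <int name="id">2</int>
  <float name="grade_CS1501">3.0</float>
  <float name="grade_CS1541">3.5</float>
</doc>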

Solr Schema

The data above will be transformed into the following Solr schema:
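
Your field types may differ, so treat this as a sketch of the relevant schema.xml entries rather than a complete file (the type names are assumptions); the important part is the grade_* dynamicField that the import will populate with one field per course:

<!-- schema.xml excerpt (sketch) -->
<fields>
  <field name="id" type="int" indexed="true" stored="true" required="true"/>
  <field name="first_name" type="string" indexed="true" stored="true"/>
  <field name="last_name" type="string" indexed="true" stored="true"/>

  <!-- one grade_<course_number> field per course a student has a grade for,
       e.g. grade_CS1541, built on the fly during the import -->
  <dynamicField name="grade_*" type="float" indexed="true" stored="true"/>
</fields>

<uniqueKey>id</uniqueKey>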

DIH Configuration

In order to facilitate the transformation of the data into the schema defined above I'll employ the following DIH configuration:
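
The connection settings, database name and SQL below are placeholders, so treat this data-config.xml as a sketch; the parts that matter are the script block defining pivotGrades and the script:pivotGrades transformer on the grade sub-entity:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/school" user="user" password="pass"/>

  <script><![CDATA[
    function pivotGrades(row) {
      // pivot each grade row into a dynamicField named after the course,
      // e.g. grade_CS1541 = 3.5, which lands on the parent student document
      var courseNumber = row.get('course_number');
      row.put('grade_' + courseNumber, row.get('value'));
      return row;
    }
  ]]></script>

  <document>
    <entity name="student" query="select id, first_name, last_name from students">
      <field column="id" name="id"/>
      <field column="first_name" name="first_name"/>
      <field column="last_name" name="last_name"/>

      <entity name="grade" transformer="script:pivotGrades"
              query="select c.course_number, g.value
                     from grades g inner join courses c on c.id = g.course_id
                     where g.student_id = '${student.id}'"/>
    </entity>
  </document>
</dataConfig>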

See the script tag? That's where I've defined a pivotGrades JavaScript function to turn the data from the grade sub-entity on its side into dynamicFields. In the real world you might expect to see some more intense text manipulation there to warrant the ScriptTransformer, I s'pect.

Querying

All the work above was done specifically so I can easily and concisely sort students by their grades in specific courses. Here's the money:

http://localhost:8080/solr/students/select/?q=*:*&version=2.2&sort=grade_CS1541%20desc

Resulting in:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="sort">grade_CS1541 desc</str>
      <str name="indent">on</str>
      <str name="q">*:*</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="4" start="0">
    <doc>
      <str name="first_name">Bill</str>
      <float name="grade_CS1501">3.0</float>
      <float name="grade_CS1541">3.5</float>
      <int name="id">2</int>
      <str name="last_name">Miller</str>
    </doc>
    <doc>
      <str name="first_name">John</str>
      <float name="grade_CMPSC465">3.0</float>
      <float name="grade_CS1541">2.5</float>
      <int name="id">1</int>
      <str name="last_name">Doe</str>
    </doc>
    <doc>
      <str name="first_name">Jane</str>
      <float name="grade_CMPSC465">3.5</float>
      <float name="grade_CMPSC473">2.5</float>
      <int name="id">3</int>
      <str name="last_name">Dow</str>
    </doc>
    <doc>
      <str name="first_name">Dennis</str>
      <float name="grade_CMPSC465">3.0</float>
      <float name="grade_CMPSC473">2.0</float>
      <int name="id">4</int>
      <str name="last_name">Itchison</str>
    </doc>
  </result>
</response>

Sat Mar 20 2010 22:03:00 GMT+0000 (UTC)


Solrnet, a Solr Client Library for .Net

One of the strengths of Solr is its ease of consumption by other platforms due to its REST API and response writers, which include XML, JSON, native Ruby and native Python code.

If you're trying to consume a Solr service from .Net you could easily use a WebClient, parse the results with .Net's System.Xml namespace and perhaps even build an object wrapper on top of it. Luckily that work has already been done in the solrnet library.
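
Just to illustrate the do-it-yourself approach solrnet saves you from, a hand-rolled query might look something like this (the URL and the title field are placeholders for whatever your instance and schema actually use):

// the do-it-yourself approach: query Solr over HTTP and walk the XML by hand
using System;
using System.Net;
using System.Xml;

class ManualSolrQuery {
    static void Main() {
        // placeholder URL; adjust host, port and core for your install
        string url = "http://localhost:8080/solr/select/?q=*:*";

        using (WebClient client = new WebClient()) {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(client.DownloadString(url));

            // each <doc> element in the response is one matching document
            foreach (XmlNode node in doc.SelectNodes("//result/doc")) {
                Console.WriteLine(node.SelectSingleNode("str[@name='title']").InnerText);
            }
        }
    }
}

It works, but you end up maintaining URL building, XML parsing and type conversion yourself, which is exactly the plumbing solrnet wraps up.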

In this post I'll outline the fundamentals of solrnet usage.

Prerequisites

This article assumes you have a .Net development environment such as Visual Studio and a functional Solr install in a servlet container. I'll also assume that you understand how to configure Solr's schema. If that's not the case please consult the official Solr wiki.

Sample Schema

For demonstrative purposes I'll assume the following field declarations in schema.xml.
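
The exact types are up to you, so the following is only a sketch of the relevant entries; the field names line up with the C# attributes used below, id is the unique key and a catch-all text field serves as the default search field:

<!-- schema.xml excerpt (sketch) -->
<fields>
  <field name="id" type="int" indexed="true" stored="true" required="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="content" type="text" indexed="true" stored="true"/>
  <field name="tag" type="string" indexed="true" stored="true" multiValued="true"/>

  <!-- catch-all field for general full-text queries -->
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
<copyField source="tag" dest="text"/>

<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>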


Project Setup

With the basic system in place it's now time to download solrnet from its project site on Google Code, then add references to SolrNet.dll and Microsoft.Practices.ServiceLocation.dll (included with SolrNet) to a project.

Model

Now let's write some bloody code! Consider the following class declaration, which defines the documents we'll be working with. In this case it's an article, much like a blog post, with a key, title, textual content and a list of tags.

Notice the SolrUniqueKey and SolrField attributes decorating the properties. They facilitate the mapping of the properties to fields in Solr.

using System;
using System.Collections.Generic;
using SolrNet;
using SolrNet.Attributes;
using SolrNet.Commands.Parameters;
using Microsoft.Practices.ServiceLocation;
  
class Article {
    [SolrUniqueKey("id")]
    public int ID { get; set; }

    [SolrField("title")]
    public string Title { get; set; }

    [SolrField("content")]
    public string Content { get; set; }

    [SolrField("tag")]
    public List<string> Tags { get; set; }
}

Writing Data

With the model defined we can now connect to Solr and save some articles. The following code locates our Solr instance (running locally on port 8080 in my case), creates some documents and commits them to the index.

  
// find the service
Startup.Init<Article>("http://localhost:8080/solr");
ISolrOperations<Article> solr =
    ServiceLocator.Current.GetInstance<ISolrOperations<Article>>();

// make some articles
solr.Add(new Article() {
    ID = 1,
    Title = "my laptop",
    Content = "my laptop is a portable power station",
    Tags = new List<string>() { "laptop", "computer", "device" }
});

solr.Add(new Article() {
    ID = 2,
    Title = "my iphone",
    Content = "my iphone consumes power",
    Tags = new List<string>() { "phone", "apple", "device" }
});

solr.Add(new Article() {
    ID = 3,
    Title = "your blackberry",
    Content = "your blackberry has an alt key",
    Tags = new List<string>() { "phone", "rim", "device" }
});

// commit to the index
solr.Commit();

Basic Querying

Of course the primary purpose of Solr is performing search queries. Consider the following example, which does a general full-text search on the word "power" and a tag search for "phone":

  
// fulltext "power" search
Console.WriteLine("POWER ARTICLES:");
ISolrQueryResults<Article> powerArticles = solr.Query(new SolrQuery("power"));

foreach (Article article in powerArticles) {
    Console.WriteLine(string.Format("{0}: {1}", article.ID, article.Title));
}

Console.WriteLine();

// tag search for "phone"
Console.WriteLine("PHONE TAGGED ARTICLES:");
ISolrQueryResults<Article> phoneTaggedArticles = solr.Query(new SolrQuery("tag:phone"));

foreach (Article article in phoneTaggedArticles) {
    Console.WriteLine(string.Format("{0}: {1}", article.ID, article.Title));
}

which produces the following output:
POWER ARTICLES:
1: my laptop
2: my iphone

PHONE TAGGED ARTICLES:
2: my iphone
3: your blackberry

Faceting

One of my personal favorite features of Solr is faceting, which enables aggregate counts to be returned along with query results. Faceting is well supported in solrnet.

The following example displays counts per tag of articles matching the "device" tag:

Console.WriteLine("DEVICE TAGGED ARTICLES:");

ISolrQueryResults<Article> articles = solr.Query(new SolrQuery("tag:device"),
    new QueryOptions() {
        Facet = new FacetParameters {
            // ask solr for facets
            Queries = new[] { new SolrFacetFieldQuery("tag") }
        }
    });

foreach (Article article in articles) {
    Console.WriteLine(string.Format("{0}: {1}", article.ID, article.Title));
}

Console.WriteLine("\nTAG COUNTS:");

foreach (var facet in articles.FacetFields["tag"]) {
    Console.WriteLine("{0}: {1}", facet.Key, facet.Value);
}

with the following output:

DEVICE TAGGED ARTICLES:
1: my laptop
2: my iphone
3: your blackberry

TAG COUNTS:
device: 3
phone: 2
apple: 1
computer: 1
laptop: 1
rim: 1

Wrapping up

Solrnet makes consumption of a Solr service easy, but I've only covered the basic concepts here. Other features of Solr such as spell checking and match highlighting are also handled. The solrnet wiki will tell you more.

Mon Mar 08 2010 23:03:00 GMT+0000 (UTC)


Monitoring Solr with LucidGaze

As a professional DBA I'm always interested in monitoring systems. I have to know what's going on with my systems. Even in a world with automatic scaling strategies and automatic tuning, humans have to be in the loop. Let's face it, sometimes automatic things don't work. Worse yet, they sometimes automatically do what you said, not what you meant. :)

As I've been moving more of my read-only data out of traditional relational databases and into highly scalable, simpler document-oriented systems like Solr, I've been keeping an eye out for monitoring tools. Specifically for Solr I found LucidGaze by Lucid Imagination, a Lucene/Solr specialty shop.

What is LucidGaze?

LucidGaze is essentially the combination of a Solr request handler and a web application that sits in your servlet container. The request handler gathers and stores information about other handlers in the Solr instance and the web app provides the administrator with visualization capabilities.

It's downloadable free of charge from Lucid's website.

Installation

The readme enclosed with LucidGaze does a fine job of getting you up and running so I won't provide step-by-step instructions here. Suffice it to say that it's just copying some jars into your Solr install and servlet container, then configuring the request handler in solrconfig.xml.
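
The registration itself is just the usual requestHandler declaration in solrconfig.xml; the name and class below are placeholders, since the real ones come from the LucidGaze readme:

<!-- solrconfig.xml: placeholder name and class; use the values from the readme -->
<requestHandler name="/gaze" class="com.example.GazeRequestHandler"/>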

Configuration

After a service restart, visit the Gaze web application at a URL like http://localhost:8080/gaze/index.html. From there you can configure it to provide visualization for the Solr instance you deployed LucidGaze into by entering that instance's URL and selecting a retention period.

You can then select which handlers in the Solr instance you want to monitor.

Then sit back, relax and let LucidGaze collect metrics.

Visualization

With the configuration out of the way and some metrics collected you can view graphs of the activity in the web application.

  • Overview - the main entry point is an overview of all selected handlers with graphs in requests/sec and milliseconds/request.

  • Drilldown - after clicking on a graph a detail graph is displayed that allows you to adjust the time dimension.

A few information screens are also provided.

  • System Info - a comprehensive system info dialog is available

  • Index Info - the index info dialog displays statistics about the index and its schema

Conclusion

As I lean more and more on Solr it's becoming increasingly important that I understand the load on a given system at a moment's notice. LucidGaze certainly seems to help me along the way to that goal.

Note that Lucid Imagination also has a version of LucidGaze for plain old Lucene, as well as a certified Solr distribution named LucidWorks which includes LucidGaze.

Sun Feb 21 2010 21:02:00 GMT+0000 (UTC)


Haiku, an Open Source Continuation of BeOS

* Edit 6/4/2010: Alpha 2 is now available for download at http://www.haiku-os.org/

Back in the '90s and early 2000s I played around with BeOS, an alternative operating system developed by Be Inc. The intention of BeOS was to be a desktop operating system that specialized in multimedia and competed with Microsoft Windows and Mac OS. Be Inc. even attempted, unsuccessfully, to sell itself to Apple as the replacement for the classic Mac OS.

BeOS proper is now long defunct, but the effort lives on in an operating system called Haiku, an open source continuation of BeOS.

I've been meaning to make some time to play with Haiku (which is currently in alpha) for a few months now, but family, other projects and work have gotten in the way. Thanks to some snow... Well, a whole lot of snow, I've had a chance to get my feet wet with it. I'd now like to provide some resources for those who are interested in doing the same, and I definitely suggest that you do if you have an interest in alternative operating systems.

What is Haiku?

As I stated above Haiku is an open source continuation of the BeOS effort, currently in the alpha phase. Right now only the x86 architecture is supported but future x64 support is likely.

One of the core design principles behind BeOS and Haiku is to keep the system simple. The belief of the designers is that operating systems like Unix and Windows have had layer after layer added on over time as new needs arose. This layering resulted in inconsistency and complexity. BeOS and Haiku avoided this complexity by starting from scratch with modern needs in mind.

While it's not a Unix (and sort of takes exception to Unix), it is POSIX compliant, has a Bash shell and has ports of many typical open source Unix-y programs. If you're familiar with Unix you'll feel at home on BeOS/Haiku. If you're not, it should still be easy enough to operate thanks to Haiku's user-friendly GUI.

Haiku Bash Shell

Installation

Installation is pretty straightforward and entirely GUI based. For most people it will be something like:

  1. Select your destination disk
  2. Click "Setup partitions..." and make a BeOS partition
  3. Click "Install" and watch the status bar eagerly
  4. Click the "Write Boot Sector to <Disk Name>" button

Haiku Install

Software

No matter how stable, fast, or easy to use an operating system is, it's only as useful as the software it runs. Luckily plenty of apps exist for or have been ported to BeOS. Developers are also porting new apps to Haiku all the time.

Right out of the box you'll notice familiar open source applications included, such as the nano text editor and the gcc C compiler. You'll also have necessities like a paint program (WonderBrush), a media player (MediaPlayer), a PDF viewer (BePDF), a web browser (BeZilla), and others.

BeZilla

Thus far I've relied on three main sites for additional software for Haiku:

  • BeBits - A general BeOS software repository. Some apps aren't compatible with Haiku specifically but it's still a great resource due to its variety.
  • Haiku Ports - A repository of open source software that's been ported to Haiku.
  • HaikuWare - A vast Haiku-specific software site.

In general there's a little bit of something for everyone if you do some looking around: productivity, multimedia, software development and even games like Wolfenstein 3d.

Wolfenstein

Hardware Support

It's hard for me to speak with any authority about supported hardware, but drivers do indeed seem to exist for most popular hardware like video cards from NVIDIA and ATI. BeBits and HaikuWare are great resources for audio, video and other device drivers, but I suspect there will be gaps for more obscure devices.

Development

Most programmers will be pleased to know that Haiku ships with such necessities as Python and gcc. A myriad of other languages like BASIC, Ruby and Eiffel are also available from the sites outlined above.

The most fundamental way to write Haiku applications, however, is to use the actual BeOS API from C++. While I'm not all that familiar with the API myself, I managed to rearrange the HelloWorld sample from the official BeOS R5 sample code into a concise example here. It simply displays a new window with the classic "Hello, World!" text.

Note that the code is arranged into a single unit for ease of posting. It's by no means intended to exhibit proper style.

#include <Application.h>
#include <Window.h>
#include <StringView.h>

// a BStringView that displays its text in a large bold font
class HelloWorldView : public BStringView {
public:
    HelloWorldView(BRect rect, const char *name, const char *text)
        : BStringView(rect, name, text) {
        SetFont(be_bold_font);
        SetFontSize(24);
    }
};

// a non-resizable window that holds the view and quits the app when closed
class HelloWorldWin : public BWindow {
public:
    HelloWorldWin(BRect frame)
        : BWindow(frame, "Hello", B_TITLED_WINDOW,
                  B_NOT_RESIZABLE | B_NOT_ZOOMABLE) {
        BRect rect(Bounds());
        HelloWorldView *view =
            new HelloWorldView(rect, "HelloWorldView", "Hello, World!");
        AddChild(view);
    }

    virtual bool QuitRequested() {
        be_app->PostMessage(B_QUIT_REQUESTED);
        return true;
    }
};

class HelloWorldApp : public BApplication {
public:
    HelloWorldApp()
        : BApplication("application/x-vnd.Be-HelloWorld") {
        BRect rect;
        rect.Set(100, 80, 260, 120);

        HelloWorldWin *wnd = new HelloWorldWin(rect);
        wnd->Show();
    }
};

int main(int argc, char **argv) {
    HelloWorldApp app;
    app.Run();
    return 0;
}

Assuming the code was placed in a source file named hello_world.cpp it could be compiled with:

~> g++ -lbe hello_world.cpp

Conclusion

Haiku is only in alpha. Pretty darn young. Be forewarned that any software beyond what's bundled may require some effort to get to work. Still, it's easy enough to use for a non-hacker, but powerful enough for a hacker.

It seems to support most of what people need to do and a good bit of what they want to do. It's definitely worth a look.

Wed Feb 10 2010 13:02:00 GMT+0000 (UTC)
