Chris Umbel

Quick Notes on ScriptTransformers in Solr DataImportHandlers

One thing that's impressed me with Solr is the flexibility of the Data Import Handlers (DIHs). When I was new to Solr there were several times I thought for sure I'd have to write my own extension of DataImportHandler. Every time that's happened I've been wrong. A transformer or something handled my needs. Sometimes it's wonderful to be wrong! Especially when it means less code I have to write myself!

One of the aspects of DIH's that provide such great flexibility is transformers like RegexTransformer and TemplateTransformer. In this post, however, I'm going to *quickly* cover the ScriptTransformer wich allows you to employ your own custom JavaScript code in the processing of imports.

Prerequisites

Obviously you'll need a functional Solr instance. Also, ScriptTransformers require Java 6 due to JavaScript support. I'll also assume you have an understanding of how dynamicFields work.

Objective

At the office I've recently used a ScriptTransformer to build the field names of dynamicFields and I'm going to do the same in this article. The actual use-case I dealt with was very esoteric and honestly a bit proprietary so I'll substitute an example data scenario here.

Basically I'll import data about students grades for various courses from different institutions. In the resultant Solr index I'll provide a dynamicField for every course to provide easy sorting of students by their grades in the courses they took.

Consider the following MySQL schema and data and try to think beyond this sample data. Think about hundreds of schools, thousands of courses and, well, a ton of students.

create table schools (
  id int auto_increment primary key,
  name varchar(255)  
);

insert into schools (name) values ('Pitt');
insert into schools (name) values ('Penn State');

create table students (
  id int auto_increment primary key,
  first_name varchar(255),
  last_name varchar(255),
  current_school_id int references schools(id)
);

insert into students (first_name, last_name, current_school_id) values 
('John', 'Doe', 1);
insert into students (first_name, last_name, current_school_id) values 
('Bill', 'Miller', 1);
insert into students (first_name, last_name, current_school_id) values 
('Jane', 'Dow', 2);
insert into students (first_name, last_name, current_school_id) values 
('Dennis', 'Itchison', 2);

create table courses (
  id int auto_increment primary key,
  school_id int references schools(id),
  course_number varchar(10),
  name varchar(255)
);

insert into courses (school_id, course_number, name) values
(1, 'CS1501', 'Algorithm Implementations');
insert into courses (school_id, course_number, name) values
(1, 'CS1541', 'Introduction to Computer Architecture');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC465', 'Data Structures and Algorithm');
insert into courses (school_id, course_number, name) values
(2, 'CMPSC473', 'Operating Systems');

create table grades (
  id int auto_increment primary key,
  value FLOAT,
  course_id int references courses(id),
  student_id int references students (id)
);

insert into grades (value, course_id, student_id) values (4.0, 1, 1);
insert into grades (value, course_id, student_id) values (2.5, 2, 1);
insert into grades (value, course_id, student_id) values (3.0, 3, 1);
insert into grades (value, course_id, student_id) values (3.0, 1, 2);
insert into grades (value, course_id, student_id) values (3.5, 2, 2);
insert into grades (value, course_id, student_id) values (3.5, 3, 3);
insert into grades (value, course_id, student_id) values (2.5, 4, 3);
insert into grades (value, course_id, student_id) values (3.0, 3, 4);
insert into grades (value, course_id, student_id) values (2.0, 4, 4);

Keep in mind that an idea here is that there would be far too many courses to conceivably have a sparse-style column per course if we were denormalizing a list of students. A student can also have taken courses at several of the institutions despite where they're enrolled now.

Solr Schema

The data above will be transformed into the following Solr schema:


  
  
  

  

DIH Configuration

In order to facilitate the transformation of the data into the schema defined above I'll employ the following DIH configuration:


  

  
  
    
      
      
      

      
      
    
  

See the script tag? That's where I've defined a pivotGrades javascript function to turn the data from grade sub-entity on its side into dynamicFields. In the real world you might expect to see some more intense text manipulation here to warrant the ScriptTransformation I s'pect.

Querying

All the work I've done above was done specifically so I can easily and concisely sort students by their grades in specific courses. Here's the money:

http://localhost:8080/solr/students/select/?q=*:*&version=2.2&sort=grade_CS1541%20desc

Resulting in:

 
 
 
 0 
 1 
  
  grade_CS1541 desc 
  on 
  *:* 
  2.2 
  
 
 
  
  Bill 
  3.0 
  3.5 
  2 
  Miller 
  
  
  John 
  3.0 
  2.5 
  1 
  Doe 
  
  
  Jane 
  3.5 
  2.5 
  3 
  Dow 
  
  
  Dennis 
  3.0 
  2.0 
  4 
  Itchison 
  
 
 

Sat Mar 20 2010 22:03:00 GMT+0000 (UTC)

Follow Chris
RSS Feed
Twitter
Facebook
CodePlex
github
LinkedIn
Google