I've generally not had a reason to work with the GObject type system despite appreciating its fruits through GNOME for years. Then the other day I ran across a language called Vala which intrigued me enough to start hacking away.

Vala's claim to fame is that it simplifies GObject development by exposing it in a C#/Java-like language. Unlike C# and Java, Vala is translated to C and then compiled to a native binary. Presumably this leads to performant execution and a tight memory footprint compared to CLI and Java bytecode.

The GObject type system and Vala are new to me so I'm in no position to kick knowledge, but I'll share some of what I've written early in my learning process.

Example 1 Hello World:

using GLib;

public class HelloWorld : Object {
       public void run() {
              stdout.printf("Hello World\n");
       }

       public static int main(string[] args) {
              HelloWorld hellower = new HelloWorld();
              hellower.run();
              return 0;
       }
}

producing the output:

Hello World

Example 2 Getting Twitter Status XML:

using GLib;

public class Twitter {
  static int main (string[] args) {
    /* get the username from the command line */
    if (args.length < 2) {
      stderr.printf("Usage: %s USERNAME\n", args[0]);
      return 1;
    }
    string username = args[1];

    /* format the URL to use the username as the filename */
    string url = "http://twitter.com/users/%s.xml".printf(username);

    stdout.printf("Getting status for %s\n", username);

    /* create an HTTP session to twitter */
    Soup.SessionAsync session = new Soup.SessionAsync();
    Soup.Message message = new Soup.Message ("GET", url);

    /* send the HTTP request */
    session.send_message(message);

    /* output the XML to stdout; never pass the response itself as the format string */
    stdout.printf("%s", (string) message.response_body.data);

    return 0;
  }
}

This example retrieves a Twitter user's status via the REST API and outputs the XML response to the console. The Soup library is employed for the HTTP communication.

Example 3 XML parsing and XPath queries:

/* parse the xml into a document object */
Xml.Doc* status_doc = Parser.parse_memory(
  message.response_body.data,
  (int)message.response_body.length);

/* create the basic plumbing for XPath */
XPathContext* xpath = new XPathContext(status_doc);

/* execute an xpath query */
XPathObject* result = xpath->eval_expression("/user/status/text");

/* slap the result in a string */
string status = result->nodesetval->item(0)->get_content();

stdout.printf("%s\n", status);

The above code could be grafted into example 2 (requires slapping in a "using Xml;" directive) and would actually pull the status out of the XML response.

Conclusion

Vala is a young language, but an interesting one. It certainly seems like it could make native GNOME development a bit more accessible to C#/Java developers.

A few things that justify its use over C++ that I haven't covered here are its support for "modern" features such as assisted memory management, the foreach construct, and exception handling.

Created on 2009-07-16 17:36:47 UTC
 
Parsing HTML is a somewhat annoying task that programmers get handed every so often. Activities such as screen-scraping have become rarer since the advent of RSS, but there's always content out there that leaves you no choice but to parse it out yourself.

One of the more elegant tools I've seen for this purpose is Nokogiri, a Ruby library that supports querying HTML content with both XPath and CSS selector syntax.

XPath

First I'll demonstrate how to parse some content out of a page via the XPath syntax. This code uses the Ruby documentation for the Bignum class as a parsing medium and essentially extracts the method names.

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0 and later, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.ruby-doc.org/core/classes/Bignum.html'))

doc.xpath('//span[@class="method-name"]').each do | method_span |
  puts method_span.content
  puts method_span.path
  puts
end

The above code simply iterates through a set of Node objects that represent every span tag with the CSS class "method-name" applied. It prints out the inner text and absolute XPath via the "content" and "path" properties respectively. Below is a sample of the output:

power!
/html/body/div[3]/div/div[24]/div[1]/span[1]

big.quo(numeric) => float
/html/body/div[3]/div/div[25]/div[1]/a/span

quo
/html/body/div[3]/div/div[26]/div[1]/a/span[1]

rdiv
/html/body/div[3]/div/div[27]/div[1]/span[1]

big.remainder(numeric)    => number
/html/body/div[3]/div/div[28]/div[1]/a/span

rpower
/html/body/div[3]/div/div[29]/div[1]/a/span[1]
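
If you want to play with the same XPath expression without hitting the live page, here is a minimal sketch. It swaps in REXML (bundled with most Ruby installs) for Nokogiri so that it runs with no extra gems. The SAMPLE_HTML fragment and the method_names helper are made up for illustration; the real page's markup is messier and would need Nokogiri's more forgiving HTML parser.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a fragment of the Bignum page. The markup is
# invented and kept well-formed, because REXML only parses strict XML.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <div><span class="method-name">quo</span></div>
      <div><span class="other">ignored</span></div>
      <div><a><span class="method-name">rdiv</span></a></div>
    </body>
  </html>
HTML

# Return the text of every span whose class attribute is "method-name",
# using the same XPath expression as the Nokogiri example.
def method_names(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//span[@class="method-name"]').map(&:text)
end

puts method_names(SAMPLE_HTML)
```

The span with the class "other" is skipped because the predicate in square brackets only keeps spans whose class attribute matches exactly.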

CSS

Nokogiri also supports querying by way of CSS selector syntax. The following example iterates over every link that displays a JavaScript popup in the Bignum document used above and outputs its absolute CSS selector path and the text of the "onclick" attribute.

doc.css('a[onclick]').each do | popup_link |
  puts popup_link.css_path
  puts popup_link.attributes['onclick']
end

Practical

A real-life use of this library, and of HTML parsing in general, is Anemone, a web spidering framework for Ruby. Like most things in Ruby, it's programmer friendly and delivers quite a bit of power without much work.

The following Anemone example uses Nokogiri under the covers to crawl all links on this site and print out the URLs of articles.

require 'anemone'
require 'open-uri'

# crawl this page
Anemone.crawl("http://www.chrisumbel.com") do | anemone |
  # only process pages in the article directory
  anemone.on_pages_like(/article\/[^?]*$/) do | page |
    puts "#{page.url} indexed."
  end
end
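
The regular expression handed to on_pages_like is what decides which pages get reported, so it's worth checking on its own. The sketch below applies the same pattern to a few URLs; the URLs and the article_url? helper are hypothetical and exist only to show what the pattern accepts and rejects.

```ruby
# The same pattern passed to on_pages_like in the crawl above.
ARTICLE_PATTERN = /article\/[^?]*$/

# Hypothetical URLs, made up for illustration.
SAMPLE_URLS = [
  'http://www.chrisumbel.com/article/vala_gobject',
  'http://www.chrisumbel.com/article/vala_gobject?page=2',
  'http://www.chrisumbel.com/tag/ruby'
].freeze

# True when the URL contains "article/" followed by no query string.
def article_url?(url)
  url.match?(ARTICLE_PATTERN)
end

SAMPLE_URLS.each do |url|
  puts "#{url} -> #{article_url?(url)}"
end
```

Only the first URL matches: the second is rejected because [^?]* refuses to cross the "?" that starts the query string, and the third never contains "article/" at all.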

Also, the Webrat DSL (often used with the Cucumber acceptance testing framework) employs Nokogiri.

Conclusion

While the need for screen-scraping and HTML parsing has diminished over time, it still exists. It's nice to know that when we do have to do it, the process is made simple by libraries like Nokogiri.

Created on 2009-07-12 11:07:11 UTC
 
One of the most important factors in getting optimal performance out of Amazon's SimpleDB is keeping the total number of requests to a minimum and making the most out of the ones you make. At one time this was tricky from a write perspective because only a single item could be updated in a PUT operation.

Last spring Amazon eased the pain a little by allowing us to batch PUT operations into a single command.

In order to demonstrate the use of this feature and analyze its performance I'll use C# and Amazon's .Net SimpleDB library.

Single PUTs

To establish a baseline I'm going to run a test with some sample data (Sun Micro's stock data from 1/3/05 to 7/9/2009 obtained from Yahoo! finance) and write it into a "StockPrices" domain one item at a time. This data is in CSV form and contains 1137 rows.

I conducted three trials locally and three trials on a small EC2 instance.

AmazonSimpleDB service = new AmazonSimpleDBClient("ENTER YOUR KEY HERE", 
"ENTER YOUR SECRET KEY HERE");

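/* ticker symbol for the rows being loaded; assumed here to match the CSV file name */
string ticker = "JAVA";

using (StreamReader stockStreamReader = new StreamReader(File.OpenRead(@"JAVA.csv")))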
{
    PutAttributesRequest putRequest;
  
    /* read column names */
    stockStreamReader.ReadLine();
      
    while (!stockStreamReader.EndOfStream)
    {
        putRequest = new PutAttributesRequest();
        putRequest.DomainName = "StockPrices";

        string line = stockStreamReader.ReadLine();
        string[] tokens = line.Split(',');

        putRequest.ItemName = string.Format("{0}_{1}", tokens[0], ticker);

        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Ticker", Value = ticker });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Date", Value = tokens[0] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Open", Value = tokens[1] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "High", Value = tokens[2] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Low", Value = tokens[3] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Close", Value = tokens[4] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "Volume", Value = tokens[5] });
        putRequest.Attribute.Add(new ReplaceableAttribute() 
            { Name = "AdjustedClose", Value = tokens[6] });

        service.PutAttributes(putRequest);
    }  
}

Resulting in the following times (in seconds):

         Local   EC2
Trial 1  54      34
Trial 2  49      32
Trial 3  57      32
Avg      53.3    32.6

As you can see, there was a significant improvement simply from executing the code on Amazon's equipment (minimizing connection latency), but even without something to compare it to, it's hard to argue that the performance wasn't poor.

Batched PUTs

Then I conducted a similar test using batched PUT operations as follows:

AmazonSimpleDB service = new AmazonSimpleDBClient("ENTER YOUR KEY HERE", 
"ENTER YOUR SECRET KEY HERE");

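/* ticker symbol for the rows being loaded; assumed here to match the CSV file name */
string ticker = "JAVA";

using (StreamReader stockStreamReader = new StreamReader(File.OpenRead(@"JAVA.csv")))
{
  BatchPutAttributesRequest batchPutRequest = new BatchPutAttributesRequest();
  batchPutRequest.DomainName = "StockPrices";

  /* read column names */
  stockStreamReader.ReadLine();

  while (!stockStreamReader.EndOfStream)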
  {
    ReplaceableItem item = new ReplaceableItem();

    string line = stockStreamReader.ReadLine();
    string[] tokens = line.Split(',');

    item.ItemName = string.Format("{0}_{1}", tokens[0], ticker);

    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Ticker", Value = ticker });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Date", Value = tokens[0] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Open", Value = tokens[1] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "High", Value = tokens[2] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Low", Value = tokens[3] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Close", Value = tokens[4] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "Volume", Value = tokens[5] });
    item.Attribute.Add(new ReplaceableAttribute() 
      { Name = "AdjustedClose", Value = tokens[6] });

    batchPutRequest.Item.Add(item);

    /* Amazon limits batches to 25 items */
    if (batchPutRequest.Item.Count == 25)
    {
      service.BatchPutAttributes(batchPutRequest);
      batchPutRequest = new BatchPutAttributesRequest();
      batchPutRequest.DomainName = "StockPrices";
    }
  }

  /* send any that remain */
  if (batchPutRequest.Item.Count > 0)
    service.BatchPutAttributes(batchPutRequest);
}

resulting in:

         Local   EC2
Trial 1  7       6
Trial 2  6       6
Trial 3  6       6
Avg      6.3     6

This was a marked improvement that effectively nullified the advantage the single writes got on EC2.
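
To put a number on "marked improvement", here is a small Ruby sketch that divides the single-PUT averages by the batched averages. The figures are the averages reported in the two sets of trials above; the AVERAGES hash and the speedup helper are illustrative names, not part of any library.

```ruby
# Average times in seconds, copied from the single and batched trials above.
AVERAGES = {
  local: { single: 53.3, batched: 6.3 },
  ec2:   { single: 32.6, batched: 6.0 }
}.freeze

# How many times faster the batched run was than the single-PUT run,
# rounded to one decimal place.
def speedup(times)
  (times[:single] / times[:batched]).round(1)
end

AVERAGES.each do |location, times|
  puts "#{location}: about #{speedup(times)}x faster with batching"
end
```

With only three trials per configuration these ratios are rough, but they show that batching helped more from the local machine, where each request pays a higher latency cost.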

Conclusion

The overall comparison of the two approaches is as follows:

Type     Local   EC2
Single   53.3    32.6
Batched  6.3     6

Batched PUT operations offer a clear performance benefit regardless of where they're executed from. You just have to keep in mind you're limited to batch sizes of 25 items and 1 MB per request.
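
The 25-item limit is really just a chunking problem. As a language-neutral illustration, here is a minimal Ruby sketch that groups parsed CSV lines into batches of at most 25. It does not talk to SimpleDB; build_item, batches_for and the sample lines are hypothetical and only mirror the item naming used in the C# listings.

```ruby
BATCH_LIMIT = 25 # per-request item limit mentioned above

# Turn one CSV line into a plain hash standing in for a SimpleDB item.
# The "date_ticker" item name follows the C# listings.
def build_item(line, ticker)
  tokens = line.split(',')
  { name: "#{tokens[0]}_#{ticker}",
    attributes: { 'Ticker' => ticker, 'Date' => tokens[0], 'Close' => tokens[4] } }
end

# Group the items into arrays no larger than BATCH_LIMIT; the final
# batch simply holds whatever is left over.
def batches_for(lines, ticker)
  lines.map { |line| build_item(line, ticker) }.each_slice(BATCH_LIMIT).to_a
end

# Made-up rows in the same column order as the CSV described above.
sample_lines = (1..60).map { |i| "day#{i},1.0,2.0,0.5,1.5,100,1.5" }

batches = batches_for(sample_lines, 'JAVA')
puts batches.map(&:size).inspect
```

Each inner array would become one batch request. Applied to the 1137 rows used in these tests, the same grouping yields 46 requests instead of 1137, which is where the time savings come from.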

Created on 2009-07-10 12:29:15 UTC
 
In my last post I discussed the background-job and remoting systems of PowerShell 2.0. While I find those features interesting personally, there are three more with broader appeal that I'd like to discuss: the Out-GridView cmdlet, ScriptCmdlets and the PowerShell Integrated Scripting Environment (ISE).

Note that PowerShell 2.0 is in CTP 3 at the time of writing and everything is subject to change.

Out-GridView

The Out-GridView cmdlet gives you a lightweight sortable/searchable grid that you can easily pipe collections into, as demonstrated below.

ls | Out-GridView

ScriptCmdlets

Previously the only way to develop your own Cmdlets was to resort to one of the higher level .Net languages such as VB.Net and C#. This restriction has been removed with the introduction of ScriptCmdlets.

Consider the following ScriptCmdlet that retrieves a user's Twitter status:

Cmdlet Get-UsersStatus
{
    # definition of the Cmdlet's parameters
     param([string] $username)
    
    $url = ("http://twitter.com/users/{0}.xml" -f $username)
    
    $webClient =  New-Object Net.WebClient
    $responseDoc = New-Object Xml.XmlDocument
    
    $responseDoc.LoadXml([Text.Encoding]::ASCII.GetString(
      $webClient.DownloadData($url)))
      
    $responseDoc.SelectSingleNode("/user/status/text")   
}

This can now be executed cmdlet-style:

Get-UsersStatus "chrisumbel"

With a minor change to our param definition we can now accept input from the pipeline:

param([ValueFromPipeline][string] $username)

So our cmdlet can be executed as follows:

"chrisumbel", "wimbledon" | Get-UsersStatus

which gets the status of all the users piped into it, chrisumbel and wimbledon in this case.

An interesting thing to consider is that ScriptCmdlets can override other Cmdlets, including those that are built-in.

PowerShell ISE

A weakness of PowerShell up to this point is that you typically had to resort to a third-party solution to get a rich development experience. While the PowerShell ISE is certainly not a replacement for some of the fancy commercial offerings, it's far more helpful than Notepad. It grants the scripter syntax highlighting, one-click script running and a tabbed environment for working with multiple scripts.

Check out the following screenshot which contains the code from our Twitter status example:

Conclusion

These features go a long way to making a powerful shell environment even more powerful, not to mention quite a bit more friendly. Here's looking forward to the official release!

Created on 2009-07-05 17:07:21 UTC
 
An interesting feature released with the PowerShell 2 CTP3 is the ability to run background jobs consisting of arbitrary PowerShell code. In order to use this functionality you must download and install the PowerShell 2 CTP3 and the WinRM (Windows Remote Management) 2.0 CTP.

Keep in mind that installing the CTP requires uninstalling previous versions of PowerShell. Depending on the PowerShell and operating system versions involved the procedure can vary. Google/Bing is your friend.

Once PowerShell is up and running, you have to enable remoting by invoking the aptly named Enable-PSRemoting cmdlet:

Enable-PSRemoting -force

Now the meat. Consider the following code:

# copy bigfile.txt in a background job (jobs run in a separate PowerShell process)
$job = start-job -scriptBlock { cp bigfile.txt bigfilecopy.txt }

# here's where we'd perform some other logic while our file copies

# wait for job to finish
wait-job $job

Note that you can also retrieve the status of jobs and the return values with the Get-Job and Receive-Job cmdlets respectively.

More interesting still is the ability to execute jobs remotely on other systems running PowerShell and WinRM. This can be demonstrated with the Invoke-Command cmdlet coupled with the -AsJob parameter as follows:

Invoke-Command -ComputerName Comp1, Comp2 -ScriptBlock { cp bigfile.txt bigfilecopy.txt } -AsJob

The -ComputerName parameter's values of Comp1 and Comp2 indicate that our script will execute on two remote machines named Comp1 and Comp2.

In conclusion, PowerShell 2 introduces some interesting features for asynchronous and remote operations. This opens many doors for administrators and developers alike with minimal code.

Created on 2009-06-30 19:50:15 UTC
 