Thursday, September 5, 2013

Clade Taxonomy with Solr 4.4 without getting burnt

I have recently had the opportunity to do some work with the Apache Solr search platform. It is a phenomenal tool and I am still discovering much of its power. I needed to apply a taxonomy to documents and then search using that taxonomy. I understand that, via the underlying Lucene search engine, the newest versions of Solr have faceted search and taxonomy support available, but there is also a nice open source tool called Clade which looked like it would fit the bill for working out some proof-of-concept ideas I had.

The current version of Clade only supports Solr 3.6 out of the box. For my solution, however, I was interested in using the multicore support of the latest Solr version. Unfortunately, in addition to multicore support, there have been a number of other changes since Solr 3.6. On the plus side, the Solr UI has improved substantially since 3.6, so digging into the created index is now easier. Clade is also written largely in Python, a language which I dive into only every other blue moon. This all led to a bit of digging, and as I could not find any documentation on how to get Clade working with the latest Solr version, here are the steps I took:

  1. Updated the schema config so that the uniqueKey definition for doc_id included the multiValued="false" attribute 
  2. Changed the solr_url setting to conform to the multicore standard (e.g. solr_url = 'http://localhost:8983/solr/collection1/' for the default example install of Solr) 
  3. Similarly, when copying the conf files over, make sure that you put them into the proper core's folder (i.e. solr/collection1/conf, NOT solr/conf) 
  4. Replaced the included sunburnt with the latest sunburnt (I just renamed the included sunburnt folder to sunburntOLD and copied the new sunburnt folder in). I did this for good measure: so much of Solr had changed that I wanted the sunburnt I was using to interface with it to be as current as possible. 
  5. In the Clade lib/ the sunburnt query in get_docs_for_category is executed with the fields limited to just score=True. This causes a KeyError because doc_id and title are not returned. There is a note in the code to “FIX” it. This may have worked against older versions of Solr and sunburnt, perhaps because the limiting was not working properly in those versions. Now, however, you need to change the line to: results = query.field_limit(["doc_id", "title"], score=True).paginate(rows=10).execute() (see the Sunburnt docs). That brings back all three expected fields to create the desired tuple for return.
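
To illustrate why the field limit matters, here is a minimal standalone sketch. It does not call sunburnt or Solr. The docs_to_tuples helper and the sample dictionaries are hypothetical stand-ins for the tuple-building code in get_docs_for_category, and I am assuming each result document behaves like a dict keyed by field name.

```python
# Hypothetical stand-in for the tuple-building code in get_docs_for_category.
# Assumption: each result document behaves like a dict keyed by field name.

def docs_to_tuples(docs):
    """Build (doc_id, title, score) tuples from result documents."""
    return [(doc["doc_id"], doc["title"], doc["score"]) for doc in docs]


# Assumed shape of a result when the field list is limited to the score only.
score_only = [{"score": 1.25}]

# Assumed shape of a result with field_limit(["doc_id", "title"], score=True).
all_three = [{"doc_id": "doc-1", "title": "Example title", "score": 1.25}]

try:
    docs_to_tuples(score_only)
    score_only_failed = False
except KeyError:
    # doc_id is missing from the document, so the lookup raises KeyError.
    score_only_failed = True

tuples = docs_to_tuples(all_three)
print(score_only_failed, tuples)
```

With only the score returned, the lookup of doc_id fails; once all three fields are requested, the tuple can be built.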

Doing the above got Clade working with Solr 4.4 on my Mac the same way I had it working with Solr 3.6 on my Linux machine. 
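
The multicore layout from steps 2 and 3 can be sketched roughly as follows. The core_url and core_conf_dir helpers are hypothetical names for illustration only and are not part of Clade or sunburnt; the core name collection1 is the one from the default example install mentioned above.

```python
import posixpath


def core_url(base_url, core):
    """Return the URL for one Solr core, with a trailing slash."""
    return base_url.rstrip("/") + "/" + core + "/"


def core_conf_dir(solr_home, core):
    """Return the conf folder for one core (not the top-level solr/conf)."""
    return posixpath.join(solr_home, core, "conf")


# Values for the default example install, using the collection1 core.
solr_url = core_url("http://localhost:8983/solr", "collection1")
conf_dir = core_conf_dir("solr", "collection1")
print(solr_url)
print(conf_dir)
```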

One other thing I have noticed with Clade: re-running the classify script on the same data appears to create duplicates in the index. This is not the expected indexing behavior, so I will have to look more closely at what Clade is doing in that script. Fortunately, just blowing away the Solr data directory clears it out, but that feels a bit heavy-handed.
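
A possibly lighter option than removing the data directory is to send Solr a delete-by-query for all documents through the core's update handler. The sketch below only builds the request, using the Python 3 standard library rather than Clade or sunburnt. The build_delete_all_request name is hypothetical, and the URL assumes the collection1 core from the default example install. It does not address why the duplicates appear in the first place.

```python
import urllib.request


def build_delete_all_request(solr_url):
    """Build (but do not send) a request that deletes every document in a core."""
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "text/xml"},
    )


# Assumption: the core URL from the default example install.
request = build_delete_all_request("http://localhost:8983/solr/collection1/")
print(request.get_method(), request.full_url)

# To actually clear the index, send the request:
# urllib.request.urlopen(request)
```

This wipes the whole core, so it is only suitable for a proof-of-concept index that can be rebuilt from scratch.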

I hope this helps someone. Now to start wiring up my Solr instance via Spring Data...