Apache Solr is a popular full-text search engine with RESTful interface, which makes it perfect search engine with most type of web sites. However, the quality of search results is dependent on language filters, with a good lemmatizer being the most essential.
That's why I've created a Solr module, which uses JSIs LemmaGen lemmatizer for Solr index building and search queries. The source to the lemmatizer and module is available on Bitbucket slovene_lemmatizer repository.
Installing files
1. Download lemmatizer files
Compile the files yourself of grab pre-compiled binaries from downloads section of the repository:
2. Install lemmatizer native library
Move libLemmatizer.so
to a path in your application servers java.library.path
. You can either copy it to a default search directory (e.g. /usr/local/lib
) or pass -Djava.library.path=/path/to/libLemmatizer.so
to your application server startup script.
For Tomcat 6 on Debian/Ubuntu edit /etc/default/tomcat6 and upadte JAVA_OPTS to:
JAVA_OPTS="-Djava.awt.headless=true -Xmx128m -XX:+UseConcMarkSweepGC -Djava.library.path=/path/to/folder/with/liblemmatizer/so"
3. Install lemmatizer Solr module
Copy sl_lemmatizer-1.0.jar
to your cores lib directory. E.g. if your core is located in /var/solr/core
, create a lib
folder next to conf
and data
folders of the core and copy the lemmatizer there.
4. Copy lemmatizer dictionary to folder accessible by Tomcat
Copy .bin file for your selected language from dictionaries folder to a folder readable by Tomcat. I usually put it into Solrs Catalina home directory. Dictionary file for slovenian language is named lem-me-sl.bin
.
Configuring Solr
Lemmatizer is a Solr filter, so you use it by passing a <filter>
parameter into index and query <analyzer>
parts in schema.xml
. The lemmatizer requires path to dictionary file as a parameter:
<filter class="si.virag.solr.RdrLemmatizerFactory" dictionary="/path/to/lem-me-sl.bin" />
Full example
The full Solr configuration example with schema.xml is available on my GitHub in solr_example repository. More information can also be found on the slides for my Solr talk in Kiberpipa.
Contributions to the library or examples welcome :)