Turkish analysis components for Apache Lucene/Solr

The use of open source software is gaining momentum in Turkey, and the number of Turkish users on the Apache Lucene/Solr (and other Apache project) mailing lists is growing. This project uses publicly available Turkish NLP tools to build Apache Lucene/Solr plugins. I created it in order to promote and support open source. Stock Lucene/Solr ships SnowballPorterFilter(Factory) for the Turkish language. However, this stemmer performs poorly and produces odd collisions. For example, altın, alim, alın, altan, and alıntı are all reduced to the same stem. In other words, they are treated as if they were the same word even though they have completely different meanings. I will post some other harmful collisions here.

How to enable this plugin? Quick way :new: :purple_heart:

If you do not want to build this library and patch Solr yourself, you can avoid the hassle by downloading my solr-7.3.0.tgz build from https://www.dropbox.com/s/yygdvwoe4cc7d46/solr-7.3.0.tgz (a link to my Dropbox account). The plugin is already enabled in this distribution. All you need to do is download it and run bin/solr start. It has a core named zemberek that is activated by default. Go to the admin/analysis page, select the text_tr type, and enter some Turkish text. When you press the analyze button, you will see the Zemberek stem filter at work.

If you are a Docker user, please use the Dockerfile and override the Solr download location, e.g.: docker build -t mine --build-arg SOLR_DOWNLOAD_SERVER=https://www.dropbox.com/s/yygdvwoe4cc7d46/solr-7.3.0.tgz .

To get the most out of this library quickly, without going into much detail, please do one of the following:

TurkishAnalyzer for Solr Users

If you are a Solr user, please use the following field type definition for Turkish.

<!-- Turkish -->
<dynamicField name="*_txt_tr" type="text_tr"  indexed="true"  stored="true"/>
<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ApostropheFilterFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory"/>
  </analyzer>
</fieldType>

TurkishAnalyzer for Lucene Users

If you are a Lucene user, please use the following custom analyzer declaration to create an analyzer for Turkish.

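  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.custom.CustomAnalyzer;
  import org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory;

  // Standard tokenizer, then apostrophe stripping, Turkish lowercasing, and Zemberek3 stemming
  Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("apostrophe")
                .addTokenFilter("turkishlowercase")
                .addTokenFilter(Zemberek3StemFilterFactory.class)
                .build();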

How to obtain necessary JAR files?

To obtain the JAR files required to activate the Turkish Analysis plugin, run the Maven command mvn clean package dependency:copy-dependencies. It copies the required JAR files to the target/lib directory. You also need to manually copy target/TurkishAnalysis-*.jar to the lib directory.

Currently we have five custom TokenFilters. To load the plugins, place the specified JAR files (along with TurkishAnalysis-*.jar, which can be created by executing the mvn package command) in a lib directory in the Solr Home directory. This directory does not exist in the distribution, so you need to create it the first time. The lib directory sits next to the solr.xml file.
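
The copy step described above can be scripted. The following is a minimal sketch in Java, assuming the Maven build output lives in a target directory and that the Solr Home path is supplied by the caller. The class name SolrLibInstaller and its methods are hypothetical and are not part of this project.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of this project: copies the plugin JAR and
// its dependencies into <solrHome>/lib, creating that directory if needed.
final class SolrLibInstaller {

    static List<Path> install(Path targetDir, Path solrHome) throws IOException {
        Path libDir = solrHome.resolve("lib");
        Files.createDirectories(libDir); // the lib directory is absent in a stock distribution

        List<Path> copied = new ArrayList<>();
        // Dependencies written by dependency:copy-dependencies (target/lib/*.jar).
        copyMatching(targetDir.resolve("lib"), "*.jar", libDir, copied);
        // The plugin itself (target/TurkishAnalysis-*.jar).
        copyMatching(targetDir, "TurkishAnalysis-*.jar", libDir, copied);
        return copied;
    }

    private static void copyMatching(Path from, String glob, Path to, List<Path> copied)
            throws IOException {
        if (!Files.isDirectory(from)) {
            return; // nothing to copy, e.g. the build has not been run yet
        }
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(from, glob)) {
            for (Path jar : jars) {
                Path dest = to.resolve(jar.getFileName());
                Files.copy(jar, dest, StandardCopyOption.REPLACE_EXISTING);
                copied.add(dest);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SolrLibInstaller <project target dir> <solr home>");
            return;
        }
        for (Path p : install(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println("copied " + p);
        }
    }
}
```

Restart Solr after copying so that the new JAR files are picked up.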

TurkishDeASCIIfyFilter(Factory)

Translation of Emacs Turkish mode from Lisp into Java. This filter is intended to be used to allow diacritics-insensitive search for Turkish.

Arguments:

preserveOriginal: (true/false) If true, the original token is preserved. The default is false.

Example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="org.apache.lucene.analysis.tr.TurkishDeASCIIfyFilterFactory" preserveOriginal="false"/>
</analyzer>
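
To see why de-ASCIIfication is not trivial, consider the reverse direction. Stripping Turkish diacritics is a simple, lossy character mapping, so distinct words such as açı (angle) and acı (bitter) end up with the same ASCII form, and restoring the diacritics requires context. The sketch below only illustrates the ASCII folding that the filter has to undo. It is not the filter's algorithm, and the class name TurkishAsciiFolding is hypothetical. Turkish letters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
// Hypothetical illustration, not part of this project: folds Turkish letters
// to their closest ASCII letters, which is the step a deASCIIfier must reverse.
final class TurkishAsciiFolding {

    // c-cedilla, g-breve, dotless i / dotted I, o-umlaut, s-cedilla, u-umlaut (lower, upper)
    private static final String TURKISH =
        "\u00E7\u00C7\u011F\u011E\u0131\u0130\u00F6\u00D6\u015F\u015E\u00FC\u00DC";
    private static final String ASCII = "cCgGiIoOsSuU";

    static String asciify(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int idx = TURKISH.indexOf(c);
            out.append(idx >= 0 ? ASCII.charAt(idx) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two different words collapse to the same ASCII string.
        System.out.println(asciify("a\u00E7\u0131")); // prints "aci"
        System.out.println(asciify("ac\u0131"));      // prints "aci"
    }
}
```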

Zemberek3StemFilter(Factory)

Turkish Stemmer based on Zemberek3.

JARs: zemberek-morphology-0.11.1.jar zemberek-core-0.11.1.jar

Arguments:

strategy: Strategy to choose one of the multiple stem forms by selecting either longest or shortest stem. Valid values are maxLength (the default) or minLength.
dictionary: Zemberek3's dictionary (*.dict) files, which can be downloaded from here and modified if required. You may want to add new dictionary items, especially for product search, because product titles and descriptions are usually not pure Turkish. You may well be familiar with product titles such as Amigalar için oyun, iPadler için çanta, and so on. If you want to handle such non-Turkish product names inflected with Turkish suffixes, the most elegant way is to modify the dictionaries. See the example that adds tweetlemek as a verb to the dictionary, so that tweetledim, tweetlemişler, etc. are recognized and stemmed correctly.
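
The strategy argument is easiest to picture as a choice among candidate stems. The following sketch shows only that selection rule, assuming the morphological analysis has already produced a list of candidates. It is not the plugin's actual implementation. The class name StemSelection is hypothetical, and the candidate stems in main are made-up inputs for illustration rather than real Zemberek output.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the maxLength / minLength selection rule.
final class StemSelection {

    // Constant names mirror the values accepted by the strategy argument.
    enum Strategy { maxLength, minLength }

    static String select(List<String> candidates, Strategy strategy) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate stems");
        }
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        return strategy == Strategy.maxLength
            ? Collections.max(candidates, byLength)   // longest candidate
            : Collections.min(candidates, byLength);  // shortest candidate
    }

    public static void main(String[] args) {
        // Assumed candidates for an inflected form such as "kalemlikler".
        List<String> candidates = Arrays.asList("kalem", "kalemlik");
        System.out.println(select(candidates, Strategy.maxLength)); // prints "kalemlik"
        System.out.println(select(candidates, Strategy.minLength)); // prints "kalem"
    }
}
```

Roughly speaking, a longer stem keeps more of the word's meaning and so favors precision, while a shorter stem conflates more word forms and so favors recall.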

Example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
</analyzer>

If you are happy with the standard dictionaries shipped with Zemberek3, or you don't intend to alter them, you may prefer to use the no-args directive.

  <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory"/>

Zemberek2StemFilter(Factory)

Turkish Stemmer based on Zemberek2.

JARs: zemberek-cekirdek-2.1.3.jar zemberek-tr-2.1.3.jar

Arguments:

strategy: Strategy to choose one of the multiple stem forms. Valid values are maxLength (the default), minLength, maxMorpheme, minMorpheme, frequency, or first.

Example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="org.apache.lucene.analysis.tr.Zemberek2StemFilterFactory" strategy="minMorpheme"/>
</analyzer>

Zemberek2DeASCIIfyFilter(Factory)

Turkish DeASCIIfier based on Zemberek2.

JARs: zemberek-cekirdek-2.1.3.jar zemberek-tr-2.1.3.jar

Arguments: None

Example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="org.apache.lucene.analysis.tr.Zemberek2DeASCIIfyFilterFactory"/>   
</analyzer>

TRMorphStemFilter(Factory)

Turkish Stemmer based on TRmorph. This one is not production ready yet. It requires an operating-system-specific foma executable. I couldn't find an elegant way to port foma to Java, so I am using the "executing shell commands in Java to call flookup" workaround advised in the FAQ (http://code.google.com/p/foma/wiki/FAQ). If you know something better, please let me know.

Arguments:

lookup: Absolute path of the OS-specific foma lookup executable (flookup).
fst: Absolute path of the stem.fst file.

Example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="org.apache.lucene.analysis.tr.TRMorphStemFilterFactory" lookup="/Applications/foma/flookup" fst="/Volumes/datadisk/Desktop/TRmorph-master/stem.fst" />
</analyzer>
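
The shell-out workaround described in this section can be sketched with ProcessBuilder. This is a minimal illustration and not the filter's actual code. It assumes that flookup is invoked with the path of the .fst file as its argument, reads one word per line from standard input, and writes its results to standard output. The class name FlookupRunner is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: runs an external lookup command, feeds it one word on
// stdin, and returns the non-empty lines it prints on stdout.
final class FlookupRunner {

    static List<String> lookup(List<String> command, String word)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();

        // Send the word, then close stdin so the process sees end of input.
        try (Writer stdin = new OutputStreamWriter(
                process.getOutputStream(), StandardCharsets.UTF_8)) {
            stdin.write(word);
            stdin.write('\n');
        }

        List<String> lines = new ArrayList<>();
        try (BufferedReader stdout = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = stdout.readLine()) != null) {
                if (!line.isEmpty()) {
                    lines.add(line);
                }
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: FlookupRunner <flookup path> <stem.fst path> <word>");
            return;
        }
        for (String line : lookup(Arrays.asList(args[0], args[1]), args[2])) {
            System.out.println(line);
        }
    }
}
```

Starting a new process for every word is expensive, so a filter built on this idea would probably need to keep a single process alive or cache results.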

I will post benchmark results of different field types (different stemmers) designed for different use-cases.

Dependencies

JRE 1.8 or above
Apache Maven 3.0.3 or above
Apache Lucene (Solr) 6.2.1 or above

Author

Please feel free to contact Ahmet Arslan at iorixxx at yahoo dot com if you have any questions, comments or contributions.

Citation Policy

If you use this library for research purposes, please use the following citation:

@article{
  author = "Ahmet Arslan",
  title = "DeASCIIfication approach to handle diacritics in Turkish information retrieval",
  journal = "Information Processing & Management",
  volume = "52",
  number = "2",
  pages = "326 - 339",
  year = "2016",
  doi = "http://dx.doi.org/10.1016/j.ipm.2015.08.004",
  url = "http://www.sciencedirect.com/science/article/pii/S0306457315001053"
}