Data files of the German decompounder for Apache Lucene / Apache Solr / Elasticsearch

This project was started to offer German decompounding out of the box for users of Apache Lucene, Apache Solr, and Elasticsearch. The tricky part is the license of the data files, so be careful when packaging them: Apache Lucene is an Apache-2.0 licensed project, so the data files cannot be shipped together with its distribution.

For decompounding German words, the recommended approach is to use Lucene's hyphenation-based compound token filter together with two data files: the hyphenation grammar de_DR.xml and the dictionary dictionary-de.txt.

The dictionary file dictionary-de.txt is developed here and was created based on the fabulous data by Björn Jacke: https://www.j3e.de/ispell/igerman98/

I used his large, high-quality dictionary to create a dictionary file that contains only the parts of German compounds. The dictionary is therefore not large: it contains about 14,500 tokens that are commonly used to form compounds. It does not contain the compounds themselves, only the parts used to create them. The dictionary was lowercased and the umlauts restored to their UTF-8 representation. A few hypothetical example entries are shown below.
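
For illustration only, each dictionary line is a plain lowercase word part, one per line. The specific entries below are made-up examples of that format, not a verbatim excerpt from the file:

haus
schlüssel
straße
wasser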

Keep in mind: the files provided here are for the new German orthography (since 1998)!

Apache Solr example

Here is a config example for Apache Solr. To use it, put the two data files into the lang subfolder of the core's config directory. After that, you can add the following field type definition to your Solr schema:

<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="lang/de_DR.xml"
      dictionary="lang/dictionary-de.txt" onlyLongestMatch="true" minSubwordSize="4"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>

Important: Use the analyzer for both indexing and searching!
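
To actually index and search with this field type, reference it from a field definition in the schema. The field name title below is just a hypothetical example:

<field name="title" type="text_de" indexed="true" stored="true"/>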

Elasticsearch example

Here is a config example for Elasticsearch. To use it, put the two data files into the ${ES_HOME}/config/analysis directory of your Elasticsearch node and add the following settings to your index. After that, you can use the german_decompound analyzer in your mapping:

"settings": {
  "analysis": {
     "filter": {
        "german_decompounder": {
           "type": "hyphenation_decompounder",
           "word_list_path": "analysis/dictionary-de.txt",
           "hyphenation_patterns_path": "analysis/de_DR.xml",
           "only_longest_match": true,
           "min_subword_size": 4
        },
        "german_stemmer": {
           "type": "stemmer",
           "language": "light_german"
        }
     },
     "analyzer": {
        "german_decompound": {
           "type": "custom",
           "tokenizer": "standard",
           "filter": [
              "lowercase",
              "german_decompounder",
              "german_normalization",
              "german_stemmer"
           ]
        }
     }
  }
}

Important: Use the analyzer for both indexing and searching!
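
As a usage sketch, assign the analyzer to a text field in the index mapping. The field name title is a hypothetical example, and a typeless mapping (Elasticsearch 7 or later) is assumed:

"mappings": {
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "german_decompound"
    }
  }
}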

Lucene API example

Here is the same analysis chain built as a CustomAnalyzer with the plain Apache Lucene API:

// Lucene analysis-common imports (package names as of Lucene 8/9):
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.de.GermanLightStemFilterFactory;
import org.apache.lucene.analysis.de.GermanNormalizationFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;

// The two data files are resolved relative to the base directory; build() throws IOException.
Analyzer analyzer = CustomAnalyzer.builder(Paths.get("/path/to/german-decompounder"))
       .withTokenizer(StandardTokenizerFactory.NAME)
       .addTokenFilter(LowerCaseFilterFactory.NAME)
       .addTokenFilter(HyphenationCompoundWordTokenFilterFactory.NAME,
                       "hyphenator", "de_DR.xml",
                       "dictionary", "dictionary-de.txt",
                       "onlyLongestMatch", "true",
                       "minSubwordSize", "4")
       .addTokenFilter(GermanNormalizationFilterFactory.NAME)
       .addTokenFilter(GermanLightStemFilterFactory.NAME)
       .build();

Important: Use the analyzer for both indexing and searching!
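
As a minimal usage sketch, here is how to consume the analyzer's token stream. The field name and sample text are arbitrary examples; the compound filter keeps the original token and injects the decompounded parts at the same position, so expect both the full compound and its parts in the output:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Prints one token per line; run this inside a method that declares IOException.
try (TokenStream ts = analyzer.tokenStream("content", "Blumensträuße")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term);
  }
  ts.end();
}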

Help Out!

If you have suggestions for improving the German dictionary, please send a pull request. Thanks! Be sure to send only "plain words", no compounds!

License

See NOTICE.txt for more information!