KD: Keyphrase Digger

Keyphrase Digger (KD) is a rule-based system for keyphrase extraction. It is a Java re-implementation of the KX tool (Pianta and Tonelli, 2010) with a new architecture and new features. KD combines statistical measures with linguistic information given by PoS patterns to identify and extract weighted keyphrases from texts.

Main Features:

Introduction

This document describes how to run the KD keyphrase extraction tool and how to use its API in your code. The tool uses both statistical measures and linguistic information to detect a weighted list of n-grams representing the most important concepts of a text.

Requirements:

Java 1.8 or later is required.

Clone and compile

1. git clone https://github.com/dhfbk/KD.git
2. cd KD
3. mvn package

The main runnable jar is in KD-Runner/target/KD.jar

The library with all dependencies is in KD-Lib/target/KD-Lib-jar-with-dependencies.jar

If you want to install KD in your .m2 folder, type: mvn install

Input files:

There are 2 possible input formats depending on the language to process:

RAW TEXT: available only for English and with the -us (use Stanford) option.

CONLL format (i.e. tab separated): available for both English and Italian and for all tagsets. The format must include at least 3 columns: token, PoS, and lemma. It's possible to specify the column position through the column_configuration parameter; see the help for more information.
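To make the tab-separated layout concrete, here is a minimal sketch of reading such rows in Java. It is not part of the KD library: the class ConllSketch and its Row type are hypothetical, the column order token, PoS, lemma is assumed (matching TOKEN_POS_LEMMA), the sample tags are only illustrative, and skipping blank lines is an assumption about how the file is laid out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of the KD library.
class ConllSketch {

    // One row of the tab-separated input: token, PoS and lemma.
    static final class Row {
        final String token;
        final String pos;
        final String lemma;

        Row(String token, String pos, String lemma) {
            this.token = token;
            this.pos = pos;
            this.lemma = lemma;
        }
    }

    // Parses lines laid out as token<TAB>PoS<TAB>lemma; blank lines are skipped (assumption).
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                throw new IllegalArgumentException("Expected at least 3 columns: " + line);
            }
            rows.add(new Row(cols[0], cols[1], cols[2]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative sample data, not taken from the KD test files.
        List<String> sample = Arrays.asList("The\tDT\tthe", "dogs\tNNS\tdog", "", "bark\tVBP\tbark");
        for (Row r : parse(sample)) {
            System.out.println(r.token + " / " + r.pos + " / " + r.lemma);
        }
    }
}
```

A check like this can help spot rows with fewer than three columns before the file is passed to KD.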

How to Run:

Open a command-line shell, go to the KD folder (the folder containing KD.jar) and run:

java -jar KD.jar -lang ENGLISH -p WEAK -us -v -n 50 -m 6 <Folder or File to be processed>

Hints:

Drag the folder containing the data onto the command-line shell to obtain the correct, "wrapped" path to the files.

Run the tool with the -STDOUT option to check the output directly in the console without writing a new file.

Parameter description used in the example:

-lang ENGLISH is the main language of the file
-p gives a boost to more specific key-concepts (i.e. multi-token expressions). You can change the value of this option to get more or fewer multi-token expressions: NO | WEAK | MEDIUM | STRONG
-us uses the Stanford tokenizer, lemmatizer and PoS tagger (included in the tool) - only for English
-n is the number of concepts/key-phrases to be extracted; in the example above it is set to 50
-m is the maximum length of the multi-token expressions to be extracted

For more information on the parameters, run:

java -jar KD.jar -h
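If the runnable jar has to be launched from another Java program, the command line shown above can be assembled as a list of arguments. The following is only a sketch: KdCommandSketch and buildCommand are hypothetical names, the input path "data/" is a placeholder, and only the flags described in this section are used (the -v flag from the example is left out because it is not described here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of KD: it only assembles the argument list.
class KdCommandSketch {

    static List<String> buildCommand(String jarPath, String language, String preference,
                                     int numberOfConcepts, int maxLength, String input) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", jarPath));
        cmd.addAll(Arrays.asList("-lang", language));                      // main language of the file
        cmd.addAll(Arrays.asList("-p", preference));                       // NO | WEAK | MEDIUM | STRONG
        cmd.add("-us");                                                    // Stanford tools, English only
        cmd.addAll(Arrays.asList("-n", String.valueOf(numberOfConcepts))); // number of key-phrases
        cmd.addAll(Arrays.asList("-m", String.valueOf(maxLength)));        // max multi-token length
        cmd.add(input);                                                    // folder or file to process
        return cmd;
    }

    public static void main(String[] args) {
        // "data/" is a placeholder path.
        List<String> cmd = buildCommand("KD.jar", "ENGLISH", "WEAK", 50, 6, "data/");
        System.out.println(String.join(" ", cmd));
        // To actually launch the tool, the list could be passed to ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list avoids quoting problems with paths that contain spaces.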

Configuration and tuning:

Configuration files are in txt format and are located in the following folder: ~/.kd/languages/{Language}/configuration_files

Please do not change the folder hierarchy! This folder contains all the files used by the tool to improve performance and to obtain better results. The file names are self-explanatory and the file formats are easy to understand and use.

If you use the tool in your code, remember to call KD_loader to update the serialized data file, e.g.: KD_loader.run_the_updater(lang, configuration.languagePackPath);

How to set up a new language:

The tool provides an easy way to set up a new language.

You can use either the command line jar with the -nl option or the createNewEmptyLanguage method of the KD_core class in the Java library.

Both methods create a complete, empty folder hierarchy inside the languages folder of the tool (by default in ~/.kd).

To set up a new language, only the files inside the configuration_files folder need to be modified.

The hierarchy of a KD language looks like this:

|- <language name>.map.p  -- auto-generated by the tool 
|- <language name>.map  -- auto-generated by the tool
|- config.properties -- auto-generated by the tool
|- configuration_files -- folder containing configuration files
	|- capitalization_pos.txt
	|- idf_lang.txt
	|- keyconcept-no.txt
	|- keyconcept-yes.txt
	|- lemma-no.txt
	|- lemmalist.txt
	|- pos-no.txt
	|- properNounPosList.txt
	|- stoplist.txt
	|- synonyms.txt
	|- tagset
		|- CUSTOM
			|- patterns.txt
		|- STANFORD
			|- patterns.txt
		|- TEXTPRO
			|- patterns.txt
		|- TREETAGGER
			|- patterns.txt
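Since the folder hierarchy must stay intact, it can be useful to verify it before running the tool. The sketch below is not part of KD: LanguageLayoutSketch and missingEntries are hypothetical names, the file names are copied from the hierarchy above, and the default folder ~/.kd/languages/ENGLISH is an assumption about how the {Language} folder is named.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of KD: reports entries missing from a language folder.
class LanguageLayoutSketch {

    // File names copied from the hierarchy above.
    static final List<String> CONFIG_FILES = Arrays.asList(
            "capitalization_pos.txt", "idf_lang.txt", "keyconcept-no.txt", "keyconcept-yes.txt",
            "lemma-no.txt", "lemmalist.txt", "pos-no.txt", "properNounPosList.txt",
            "stoplist.txt", "synonyms.txt");

    static final List<String> TAGSETS = Arrays.asList("CUSTOM", "STANFORD", "TEXTPRO", "TREETAGGER");

    // Returns the relative paths that are missing under the given language folder.
    static List<String> missingEntries(Path languageDir) {
        Path configDir = languageDir.resolve("configuration_files");
        List<String> missing = new ArrayList<>();
        for (String name : CONFIG_FILES) {
            if (!Files.isRegularFile(configDir.resolve(name))) {
                missing.add("configuration_files/" + name);
            }
        }
        for (String tagset : TAGSETS) {
            Path patterns = configDir.resolve("tagset").resolve(tagset).resolve("patterns.txt");
            if (!Files.isRegularFile(patterns)) {
                missing.add("configuration_files/tagset/" + tagset + "/patterns.txt");
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default location; pass another folder as the first argument.
        Path dir = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("user.home"), ".kd", "languages", "ENGLISH");
        List<String> missing = missingEntries(dir);
        System.out.println(missing.isEmpty() ? "Layout looks complete" : "Missing: " + missing);
    }
}
```

The check only looks at file presence; it does not validate the content of any configuration file.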

The most important file is patterns.txt.

In this file the user can define the extraction rules used by the tool according to the following part-of-speech tagsets: TEXTPRO, STANFORD, TREETAGGER and CUSTOM.

The CUSTOM folder can contain patterns referring to a user-defined tagset.

Initially the file is empty and the user has to fill it in. The syntax for defining the rules is specified in each patterns.txt file.

Below is the description of the other configuration files:

How to use in your code:

Below is an example of code integration:

import java.util.LinkedList;
import eu.fbk.dh.kd.lib.KD_configuration;
import eu.fbk.dh.kd.lib.KD_core;
import eu.fbk.dh.kd.lib.KD_core.Language;
import eu.fbk.dh.kd.lib.KD_keyconcept;
import eu.fbk.dh.kd.lib.KD_loader;

public class Main {

    public static void main(String[] args) {
        String languagePackPath = args[0]; //taken from command line
        String pathToFIle = args[1]; //taken from command line

        Language lang = Language.ITALIAN; //Specify language
        KD_configuration configuration = new KD_configuration(); //Creates a new instance of KD_Configuration object

        // Configuration Setup
        configuration.numberOfConcepts = 20;
        configuration.max_keyword_length = 4;
        configuration.local_frequency_threshold = 2;
        configuration.prefer_specific_concept = KD_configuration.Prefer_Specific_Concept.MEDIUM;
        configuration.skip_proper_noun = false;
        configuration.skip_keyword_with_proper_noun = false;
        configuration.rerank_by_position = false;
        configuration.lemmatization = KD_configuration.Lemmatize.NONE;
        configuration.column_configuration = KD_configuration.ColumExtraction.TOKEN_POS_LEMMA;
        configuration.only_multiword = false;
        configuration.tagset = KD_configuration.Tagset.TEXTPRO;

        configuration.languagePackPath = languagePackPath;//Overrides the default path with the new one taken from the command line parameter

        KD_loader.run_the_updater(lang, configuration.languagePackPath); //Updates the configuration file if something is changed

        KD_core kd_core = new KD_core(KD_core.Threads.TWO);//Create an instance of the KD core

        LinkedList<KD_keyconcept> concept_list = kd_core.extractExpressions(lang, configuration, pathToFIle, null);
        for (KD_keyconcept k : concept_list) { //loop over the extracted key_phrases and print the results
            System.out.println(k.getString() + "\t" + k.getSysnonyms() + "\t" + k.score + "\t" + k.frequency);
        }
    }
}

How to compile and run the example:

Compile Example.java:

1. git clone https://github.com/dhfbk/KD.git
2. cd KD
3. mvn package
4. cd examples_and_test
5. javac -cp ../KD-Lib/target/KD-Lib-jar-with-dependencies.jar:. Example.java  

Run with: java -cp ../KD-Lib/target/KD-Lib-jar-with-dependencies.jar:. Example test_files/eng_treetagger_tagged_file.txt

How to run KD on test files:

1. git clone https://github.com/dhfbk/KD.git
2. cd KD
3. mvn package
4. cd examples_and_test

Run KD on an English pre-tagged file:

5. java -jar ../KD-Runner/target/KD.jar -lang ENGLISH -c TOKEN_POS_LEMMA -ts TREETAGGER -p MEDIUM -STDOUT -l 20 test_files/eng_treetagger_tagged_file.txt

Run KD on a raw English file using the Stanford PoS tagger:

5. java -jar ../KD-Runner/target/KD.jar -lang ENGLISH -c TOKEN_POS_LEMMA -us -p MEDIUM -STDOUT -n 20 test_files/eng_raw_text.txt

Run KD on an Italian pre-tagged file:

5. java -jar ../KD-Runner/target/KD.jar -lang ITALIAN -c TOKEN_POS_LEMMA -p MEDIUM -STDOUT -n 20 test_files/ita_textpro_tagged_text.txt

Support

This software is provided as is. For new versions and updates, please check the project web page: KD Key-Phrases Digger at DH FBK

License:

Keyphrase Digger (KD_Lib) is released under Apache License 2.0.

If you want to use KD-Runner and KD-StanfordAnnotator, you have to apply GPLv3 or later, because of the Stanford CoreNLP license.

Acknowledgment:

The French patterns have been kindly provided by Tien-Duc Cao (Inria Saclay) and Xavier Tannier (Sorbonne Université).

Reference:

Moretti, G., Sprugnoli, R., Tonelli, S. "Digging in the Dirt: Extracting Keyphrases from Texts with KD". In Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy.