Projects for Natural Language Processing (NLP)

POS taggers and lemmatizers for English, German, Dutch, Spanish, Italian and French

For Dutch, English, French, German, Italian and Spanish we adopted the existing POS taggers from the OpenNLP tools together with the POS models provided by the OpenNLP community. On top of these models we implemented a wrapper that takes a plain text document as input and outputs a plain text file containing the tokens (words), their POS tags and their lemmas. To obtain the lemmas we follow two strategies: Helsinki Finite-State Transducer Technology (HFST) and word-lemma dictionaries obtained from Wiktionary.

For each token (word), the HFST approach returns all possible morphological analyses, which makes it difficult to read off the lemma directly. We therefore implemented rules that use the POS tag information to extract the correct lemma. For example, for the word "computers" HFST returns the following options:

compute[V]+ER[V/N]+N+PL+GEN

computer[N]+N+PL+GEN

Since we know from the OpenNLP POS tagger that "computers" is a noun, we can use that information to extract the correct lemma, "computer", from the HFST output.
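The selection step can be illustrated with a minimal Python sketch. It is not the wrapper's actual rule set: the mapping from POS tags to HFST category labels and the function name pick_lemma are assumptions made for illustration, and the sketch only looks at the category of the root in each analysis.

```python
import re

# Hypothetical mapping from POS tag prefixes to HFST category labels.
POS_TO_HFST = {"NN": "N", "VB": "V", "JJ": "A"}


def pick_lemma(analyses, pos_tag):
    """Return the root of the first analysis whose root category matches pos_tag."""
    wanted = next(
        (cat for prefix, cat in POS_TO_HFST.items() if pos_tag.startswith(prefix)),
        None,
    )
    for analysis in analyses:
        # The root is the text before the first "[", its category is in the brackets.
        match = re.match(r"([^\[\]+]+)\[([^\]]+)\]", analysis)
        if match and match.group(2) == wanted:
            return match.group(1)
    return None  # no analysis agrees with the POS tag


analyses = ["compute[V]+ER[V/N]+N+PL+GEN", "computer[N]+N+PL+GEN"]
print(pick_lemma(analyses, "NN"))  # computer
```

When no analysis matches the POS tag the function returns None, which is the point at which a dictionary look-up could take over.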

The word-lemma dictionaries are used when HFST does not return any suggestion, i.e. when it does not know the word. In this case we look the word up in the word-lemma dictionary and, if there is a match, return the lemma from the dictionary. However, the same word form can have different lemmas depending on its morphological category, so the dictionary look-up has to be controlled in the same way as in the HFST case. For example, the German word "arbeiten" means "to work" as a verb and "the works" as a noun, and the lemma differs accordingly: "arbeiten" for the verb and "arbeit" for the noun. To disambiguate, we again use the OpenNLP POS taggers to obtain the POS type of each word and then look the word up in the Wiktionary dictionary that contains only words of that POS type. For example, if the word is a noun we consult the dictionary that contains only nouns; Wiktionary provides this POS information.
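A rough Python sketch of this fallback is given here. The dictionary contents, the POS class names "noun" and "verb", the lowercasing of the word and the function names are all assumptions for illustration; the real dictionaries are derived from Wiktionary and delivered with the wrapper.

```python
# Hypothetical per-POS dictionaries with a single entry each.
DICTIONARIES = {
    "noun": {"arbeiten": "arbeit"},
    "verb": {"arbeiten": "arbeiten"},
}


def lookup_lemma(word, pos_class):
    """Look the word up only in the dictionary for its POS class."""
    # Lowercasing before the look-up is an assumption of this sketch.
    return DICTIONARIES.get(pos_class, {}).get(word.lower())


def lemmatize(word, pos_class, hfst_lemma=None):
    """Prefer the HFST lemma; fall back to the dictionary when HFST has none."""
    if hfst_lemma is not None:
        return hfst_lemma
    return lookup_lemma(word, pos_class)


print(lemmatize("arbeiten", "noun"))  # arbeit
print(lemmatize("arbeiten", "verb"))  # arbeiten
```

Keeping one dictionary per POS class means the same word form can map to different lemmas without any extra disambiguation logic in the look-up itself.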

We compared the output of our wrapper to that of TreeTagger. For each language we took 200 random documents from the comparable corpora collected using the Wikipedia tool and ran both our wrapper and TreeTagger on them. We then counted how often the two tools produced the same lemma and how often the lemmas differed. The following table shows the results.

Language   Same     Different   Agreement
EN         31386    654         0.97
ES         12664    242         0.98
IT         30823    207         0.99
FR         20941    291         0.98
NL         5890     79          0.98
DE         21613    273         0.98

The table shows that the wrapper performs similarly to TreeTagger. Unlike TreeTagger, which is free only for research purposes, the wrapper is free to use for both commercial and research purposes.
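The comparison itself amounts to counting matching and differing lemmas. The following Python sketch shows one way to do this; it assumes that both tools produce the same tokenization so that their lemma lists can be compared position by position, which the description above does not state, and the function name lemma_agreement is hypothetical.

```python
def lemma_agreement(wrapper_lemmas, treetagger_lemmas):
    """Count equal and differing lemmas and return (same, different, ratio)."""
    if len(wrapper_lemmas) != len(treetagger_lemmas):
        # Assumption of this sketch: both outputs are aligned token by token.
        raise ValueError("lemma lists must have the same length")
    same = sum(a == b for a, b in zip(wrapper_lemmas, treetagger_lemmas))
    different = len(wrapper_lemmas) - same
    total = same + different
    ratio = same / total if total else 0.0
    return same, different, ratio


same, different, ratio = lemma_agreement(
    ["computer", "house", "work"],
    ["computer", "house", "works"],
)
print(same, different)  # 2 1
```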

Input data format

The tool expects the following inputs:

java -jar POSTaggerALanguage.jar inputFile outputFile/Folder language resourcesFolder isList(0|1)


OR

java -jar POSTaggerSpanishLanguage.jar inputFile outputFile/Folder language resourcesFolder isList(0|1)

Note:

POSTaggerALanguage.jar: This tool covers English, German, Dutch, Italian and French.
POSTaggerSpanishLanguage.jar: This tool covers only Spanish.

inputFile: the plain text file to be processed. If the isList argument is set to "1", this file is treated as a list of files to be processed.

outputFile/Folder: where to save the results. If isList is "1", this argument is treated as the folder in which all processed files are saved.

language: the language of the input, e.g. "en".

resourcesFolder: the folder that contains all resources required by the wrapper, such as the POS models and the lemma models. These resources are delivered with the wrapper.

isList(0|1): "0" treats inputFile as a single file to be processed; "1" treats it as a list of files.
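For batch use, the command line above can be assembled from a script. The Python sketch below only builds the argument list in the documented order; the helper names and the file and folder names in the example are hypothetical, and run_tagger requires Java and the jar to be available.

```python
import subprocess


def build_command(jar, input_path, output_path, language, resources_folder, is_list=False):
    """Build the argument list in the order the tool expects."""
    return [
        "java", "-jar", jar,
        input_path,
        output_path,
        language,
        resources_folder,
        "1" if is_list else "0",
    ]


def run_tagger(*args, **kwargs):
    """Run the tagger and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(*args, **kwargs), check=True)


# Hypothetical file and folder names, for illustration only.
command = build_command("POSTaggerALanguage.jar", "input.txt", "output.txt", "en", "resources")
print(" ".join(command))
```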

Output data format

The output of the POS tagger and lemmatizer wrapper contains, for each token, the token itself, its POS tag and its lemma, as shown below:

computers NN computer

houses NN house
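Output in this form is straightforward to read back into a program. The Python sketch below assumes that each token is on its own line and that the three fields are separated by whitespace, as in the example above; the exact separator used by the wrapper is not stated, and the names TaggedToken and parse_output are hypothetical.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_output(text):
    """Parse 'token POS lemma' lines, skipping blank or malformed lines."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        tokens.append(TaggedToken(*fields))
    return tokens


sample = "computers NN computer\n\nhouses NN house\n"
for entry in parse_output(sample):
    print(entry.token, entry.pos, entry.lemma)
```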

Download

POS Tagger and Lemmatizer for English, Dutch, French, German and Italian

POS Tagger and Lemmatizer for Spanish