germalemma

GermaLemma – Lemmatizer for German language text.

Markus Konrad <markus.konrad@wzb.eu>, WZB, May 2017

To use GermaLemma, you need to download the TIGER corpus from the University of Stuttgart: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html. The corpus is free to use for non-commercial purposes.

GermaLemma is designed to work with a corpus that uses the STTS tagset: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html

Then, you should convert the corpus into pickle format for faster loading by running:

python germalemma.py tiger_release_[…].conll09

This will place a lemmata.pickle file in the “data” directory which is then automatically loaded when you use GermaLemma like this:

from germalemma import GermaLemma
lemmatizer = GermaLemma()

Module Contents

Classes

GermaLemma(self,**kwargs) Lemmatizer for German language text main class.
class GermaLemma(**kwargs)

Lemmatizer for German language text main class.

__init__(**kwargs)

Initialize the GermaLemma lemmatizer. By default, it loads the lemmatizer data from ‘data/lemmata.pickle’. Alternatively, pass a lemmata dictionary directly via lemmata, load a corpus in CONLL09 format via tiger_corpus, or load pickled lemmatizer data via pickle. Set use_pattern_module to True to force use of the pattern.de module, or to False to disable it. By default, pattern.de is used if it is installed.
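The three-way behaviour of use_pattern_module (force on, force off, or auto-detect) can be illustrated with a small sketch. This is not GermaLemma's actual code: the function name resolve_use_pattern is hypothetical, and raising ImportError when the module is forced on but missing is an assumption about reasonable behaviour.

```python
import importlib.util


def resolve_use_pattern(use_pattern_module=None):
    """Decide whether an optional 'pattern' package should be used.

    None  -> use it only if it is installed (auto-detect)
    True  -> require it (assumption: fail loudly if it is missing)
    False -> never use it
    """
    installed = importlib.util.find_spec("pattern") is not None
    if use_pattern_module is None:
        return installed
    if use_pattern_module and not installed:
        raise ImportError("pattern was requested but is not installed")
    return bool(use_pattern_module)


# Explicitly disabling never depends on what is installed.
disabled = resolve_use_pattern(False)
# Auto-detection yields True or False depending on the environment.
auto = resolve_use_pattern(None)
```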

find_lemma(w, pos_tag)

Find a lemma for word w that has the Part-of-Speech tag pos_tag. pos_tag should be a valid STTS tagset tag (see http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html) or a simplified form:

- ‘N’ for nouns
- ‘V’ for verbs
- ‘ADJ’ for adjectives
- ‘ADV’ for adverbs

All other tags raise a ValueError(“Unsupported POS tag”). Return the lemma or, if no lemma was found, return w.
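One way to reduce a full STTS tag to the four simplified forms is prefix matching. The sketch below assumes that approach. The function name simplify_pos_tag is hypothetical, and the library may map tags differently.

```python
def simplify_pos_tag(pos_tag):
    """Map an STTS tag (or an already simplified tag) to N, V, ADJ or ADV.

    Assumption: noun tags (NN, NE) start with "N", verb tags (VVFIN,
    VAINF, ...) start with "V", adjective tags (ADJA, ADJD) start with
    "ADJ", and the adverb tag is "ADV".
    """
    if pos_tag.startswith(("N", "V")):
        return pos_tag[0]
    if pos_tag.startswith(("ADJ", "ADV")):
        return pos_tag[:3]
    raise ValueError("Unsupported POS tag")


simplified = [simplify_pos_tag(t) for t in ("NN", "VVFIN", "ADJA", "ADV", "N")]
print(simplified)
```

Tags outside these four groups, such as ‘ART’ for articles, fall through to the ValueError.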

Lemmata dictionary lookup for word w with POS tag pos. Return lemma if found, else None.
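A dictionary lookup with a None fallback can be sketched as follows. The nested layout (POS tag, then word, then lemma) is an assumption made for illustration, as are the name dict_lookup and the sample entries.

```python
# Hypothetical lemmata layout: {pos: {word: lemma}}
LEMMATA = {
    "N": {"Häuser": "Haus"},
    "V": {"ging": "gehen"},
}


def dict_lookup(lemmata, w, pos):
    """Return the lemma stored for word w under POS tag pos, else None."""
    return lemmata.get(pos, {}).get(w)


found = dict_lookup(LEMMATA, "Häuser", "N")
missing = dict_lookup(LEMMATA, "Häuser", "V")
```

Returning None rather than w lets the caller tell "not in the dictionary" apart from "lemma equals the word" and try another strategy.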

_adj_lemma(w)

Try to lemmatize the adjective w using common German adjective suffixes. Return the possibly lemmatized adjective.
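Suffix-based adjective lemmatization can be illustrated with a deliberately naive sketch. The suffix list, the minimum stem length and the name strip_adj_suffix are all assumptions chosen for the example. They are not taken from GermaLemma's implementation.

```python
# Hypothetical list of inflectional endings, longer suffixes first.
ADJ_SUFFIXES = ("em", "en", "er", "es", "e")


def strip_adj_suffix(w, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in ADJ_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)]
    return w


stripped = [strip_adj_suffix(a) for a in ("schnelles", "kleinen", "gut")]
print(stripped)
```

A rule this simple over-strips adjectives whose base form already ends in one of the suffixes (it turns "teuer" into "teu"), which is why the method is described as returning a possibly lemmatized adjective.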

_composita_lemma(w)

Try to split a word w that is possibly a compound (Kompositum). Return the lemma if found, else return None.
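One possible approach to compound words is to look for the longest tail of the word that has a known noun lemma and to replace only that tail. The sketch below assumes this approach. The function name compound_lemma, the min_part_len parameter and the one-entry noun dictionary are hypothetical.

```python
def compound_lemma(w, noun_lemmata, min_part_len=3):
    """Lemmatize the last part of a compound; return None if no part is known.

    Splits are tried from left to right, so the longest tail is tested first.
    """
    for start in range(min_part_len, len(w) - min_part_len + 1):
        head, tail = w[:start], w[start:]
        # German nouns are capitalized, so look the tail up in that form.
        lemma = noun_lemmata.get(tail.capitalize())
        if lemma is not None:
            return head + lemma.lower()
    return None


NOUNS = {"Türen": "Tür"}  # illustrative entry only

compound = compound_lemma("Haustüren", NOUNS)
unknown = compound_lemma("Blumen", NOUNS)
```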

_lemma_via_patternlib(w, pos)

Try to find a lemma for word w that has the Part-of-Speech tag pos by using the pattern.de module’s functions. Return the lemma, or w if lemmatization was not possible with pattern.de.

load_corpus_lemmata(corpus_file)
add_to_lemmata_dicts(lemmata_lower, token, lemma, pos)
save_to_pickle(pickle_file)
load_from_pickle(pickle_file)
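The corpus-to-pickle workflow behind these functions can be sketched in a few lines. This is an illustration under stated assumptions, not the module's code: it assumes tab-separated CoNLL-2009 rows with the word form in column 2, the lemma in column 3 and the POS tag in column 5, and it stores lemmata as {pos: {token: lemma}}. The function name parse_conll09_lines and the sample rows are made up for the example.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path

# Made-up rows in a CoNLL-2009-like layout: ID, FORM, LEMMA, PLEMMA, POS, ...
SAMPLE = [
    "1\tDie\tder\t_\tART\t_",
    "2\tHäuser\tHaus\t_\tNN\t_",
    "",  # blank line between sentences
    "1\tging\tgehen\t_\tVVFIN\t_",
]


def parse_conll09_lines(lines):
    """Collect {pos: {token: lemma}} from tab-separated corpus lines."""
    lemmata = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        token, lemma, pos = parts[1], parts[2], parts[4]
        lemmata[pos][token] = lemma
    return dict(lemmata)


lemmata = parse_conll09_lines(SAMPLE)

# Round trip through a pickle file, as a stand-in for data/lemmata.pickle.
with tempfile.TemporaryDirectory() as tmp:
    pickle_file = Path(tmp) / "lemmata.pickle"
    with open(pickle_file, "wb") as f:
        pickle.dump(lemmata, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_file, "rb") as f:
        restored = pickle.load(f)
```

Pickling the parsed dictionary avoids re-parsing the full corpus file on every start, which is the faster loading the module description refers to.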