Click here to download Lingua Toolkit, a natural language kit containing two tools: a dependency parser and a thesaurus generator, both implemented in Perl. The kit also provides Tree-Tagger, a POS tagger engine. The computational requirements are:
This package contains two NLP tools: a multilingual parser (MultiLingua) and a thesaurus generator (AutoThesaurus). It takes a plain text file as input and produces a thesaurus in which each word is assigned its top-N most similar words. It works on 5 languages: Galician (gl), Spanish (es), English (en), Portuguese (pt), and French (fr).
This installs the whole kit: MultiLingua and AutoThesaurus. The installation also includes the POS tagger Tree-Tagger.
If you wish, you can install separately only one of the two tools: either MultiLingua or AutoThesaurus.
./lingua.sh <tagger> <lang> <input_file> [TOP]
tagger = freeling, treetagger
lang = gl, es, en, pt, fr
TOP = 1..N
This script takes up to 4 arguments: the name of a POS tagger (either treetagger or freeling), the abbreviation of a language (en, es, gl, pt, or fr), the input file, and, optionally, the number of top-N similar words to select for each word. For instance:
./lingua.sh treetagger gl input_file.txt 5
The last argument, TOP, is optional. Its default value is 10.
Note: if Freeling has not been installed, do not use the 'freeling' tagger.
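The argument rules above can be sketched as a small shell function. This is only an illustration, not part of the toolkit: the name build_lingua_cmd is hypothetical, and the function merely checks the tagger and language values listed above, applies the documented default TOP of 10, and prints the command line it would run.

```shell
# Hypothetical helper (not shipped with the toolkit): validates the arguments
# described above and prints the resulting lingua.sh command line.
build_lingua_cmd() {
    tagger=${1:-}
    lang=${2:-}
    input=${3:-}
    top=${4:-10}   # TOP is optional; 10 is the documented default

    case "$tagger" in
        treetagger|freeling) ;;
        *) echo "unknown tagger: $tagger" >&2; return 1 ;;
    esac

    case "$lang" in
        gl|es|en|pt|fr) ;;
        *) echo "unsupported language: $lang" >&2; return 1 ;;
    esac

    if [ -z "$input" ]; then
        echo "missing input file" >&2
        return 1
    fi

    echo "./lingua.sh $tagger $lang $input $top"
}
```

Calling build_lingua_cmd treetagger gl input_file.txt prints the command with TOP set to 10; passing a fourth argument overrides that value.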
The input file is just plain text. The file encoding must be ISO-8859-1.
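If your text is stored in another encoding, it has to be converted first. A minimal sketch, assuming the source file is UTF-8 and that the standard iconv utility is available on your system; the function name to_latin1 and the file names are hypothetical:

```shell
# Hypothetical helper: convert a UTF-8 text file to ISO-8859-1.
# $1 = UTF-8 input file, $2 = ISO-8859-1 output file
to_latin1() {
    # iconv exits with an error if a character has no ISO-8859-1 equivalent
    iconv -f UTF-8 -t ISO-8859-1 "$1" > "$2"
}
```

For example, to_latin1 input_utf8.txt input_file.txt writes a converted copy that can then be passed to lingua.sh. Characters outside ISO-8859-1 make the conversion fail, so the text may need cleaning beforehand.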
Lingua Toolkit computes 11 different similarity measures, using a parsing method to build the co-occurrence file. The 11 measures are the following:
baseline, diceBin, diceMin, jaccard, cosineBin, cosine, cityblock, euclidean, js, lin, jaccardMax
This way, you can see and compare the results in order to select the best measure for the specific task of computing word similarity. In our previous experiments, the best measures turned out to be diceMin, jaccardMax, diceBin, jaccardBin, and cosineBin.
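To give an intuition for what set-based measures of this kind compare, here is a minimal sketch of the textbook binary Dice and Jaccard coefficients over two sets of contexts. It is an assumption that the toolkit's diceBin and jaccard measures follow these textbook definitions; the function name binary_similarity and the one-context-per-line file format are hypothetical and are not the toolkit's own co-occurrence format.

```shell
# Textbook binary Dice and Jaccard between two words, each described by a
# file listing its contexts, one per line (hypothetical format).
# Assumes both files are non-empty.
binary_similarity() {
    awk '
        NR == FNR { a[$0] = 1; next }   # contexts of the first word
        { b[$0] = 1 }                   # contexts of the second word
        END {
            for (c in a) { na++; if (c in b) common++ }
            for (c in b) nb++
            # Dice = 2|A∩B| / (|A| + |B|),  Jaccard = |A∩B| / |A∪B|
            printf "dice=%.3f jaccard=%.3f\n",
                2 * common / (na + nb), common / (na + nb - common)
        }
    ' "$1" "$2"
}
```

Both coefficients only look at which contexts two words share, ignoring how often each context occurs. Weighted measures use the frequencies or association scores as well.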