MultiLingua

Software download

Click here to download Lingua Toolkit, a natural language processing kit containing two tools, a dependency parser and a thesaurus generator, both implemented in Perl. The kit also includes Tree-Tagger, a POS tagger. The requirements are:

A Linux distribution
A Perl interpreter in /usr/bin
(Optional) The Freeling POS tagger

Description

This package contains two NLP tools: a multilingual dependency parser (MultiLingua) and a thesaurus generator (AutoThesaurus). It takes a plain text file as input and produces a thesaurus in which each word is assigned its top-N most similar words. It supports 5 languages, listed here with the POS taggers available for each:

English (treetagger, freeling)
Spanish (treetagger, freeling)
Galician (treetagger, freeling)
French (treetagger)
Portuguese (treetagger)

How to install

(1) tar xzvf LinguaToolkit.tgz
(2) sh install-lingua.sh

These commands install the whole kit: MultiLingua and AutoThesaurus. The installation also includes the Tree-Tagger POS tagger.

You can also install just one of the two tools separately, either MultiLingua or AutoThesaurus.

How to use

./lingua.sh <tagger> <lang> <input_file> [TOP]

tagger = freeling, treetagger

lang = gl, es, en, pt, fr

TOP = 1..N

This script takes three required arguments and one optional argument: the name of a POS tagger (either treetagger or freeling), the abbreviation of a language (en, es, gl, pt, or fr), the input file, and, optionally, the number of most similar words (top-N) to select for each word. For instance:

./lingua.sh treetagger gl input_file.txt 5

The last argument, TOP, is optional. Its default value is 10.

Note: if Freeling has not been installed, do not pass 'freeling' as the tagger argument.
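
The argument rules above can be checked before the script is launched. The following is a minimal Python sketch, not part of the toolkit: the helper name build_command and the SUPPORTED table are hypothetical, and the tagger/language pairs are copied from the language list in the Description section.

```python
# Hypothetical helper, not shipped with the toolkit.
# Tagger/language pairs are copied from the language list in this document.
SUPPORTED = {
    "treetagger": {"en", "es", "gl", "fr", "pt"},
    "freeling": {"en", "es", "gl"},
}


def build_command(tagger, lang, input_file, top=None):
    """Return the argument list for ./lingua.sh, or raise ValueError."""
    if tagger not in SUPPORTED:
        raise ValueError(f"unknown tagger: {tagger!r}")
    if lang not in SUPPORTED[tagger]:
        raise ValueError(f"{tagger} does not support language {lang!r}")
    command = ["./lingua.sh", tagger, lang, input_file]
    # TOP is optional; when omitted, the script applies its own default (10).
    if top is not None:
        if int(top) < 1:
            raise ValueError("TOP must be a positive integer")
        command.append(str(int(top)))
    return command
```

The returned list can be passed to subprocess.run(..., check=True) from the directory that contains lingua.sh. The check only covers the combinations listed in this document; it cannot tell whether Freeling is actually installed.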

Input File

The input file is just plain text. The file encoding must be ISO-8859-1.
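
If your text is stored in another encoding, it has to be converted first. Here is a minimal Python sketch, assuming the source file is UTF-8; the function name to_latin1 is hypothetical and not part of the toolkit.

```python
def to_latin1(src_path, dst_path, src_encoding="utf-8"):
    """Re-encode a text file as ISO-8859-1.

    Raises UnicodeEncodeError if the text contains a character that
    ISO-8859-1 cannot represent.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="iso-8859-1", newline="") as dst:
        dst.write(text)
```

Failing on unrepresentable characters is a deliberate choice in this sketch: silently replacing them would change the words the tagger sees.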

Output Files

parsed.txt: A file containing the syntactic dependencies extracted from the input file.
cooccur.txt: A file with the cooccurrences between lemmas and lexico-syntactic contexts. This is the input file of AutoThesaurus.
11 __gz output files containing the thesaurus information. Each file is generated with a specific similarity coefficient. The best coefficients are diceMin and jaccardMax.

Similarity Measures

LinguaToolkit computes 11 different similarity measures over a cooccurrences file that is built by dependency parsing. The 11 measures are the following:

baseline, diceBin, diceMin, jaccard, cosineBin, cosine, cityblock, euclidean, js, lin, jaccardMax

This way, you can inspect and compare the results in order to select the best measure for the specific task of computing word similarity. In our previous experiments, the best measures turned out to be diceMin, jaccardMax, diceBin, jaccardBin, and cosineBin.
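
To give an idea of what such coefficients compute, here is a minimal Python sketch of five of them. It uses definitions that are common in the distributional similarity literature; these are assumptions and may differ in detail from the toolkit's Perl implementation. Each word is represented as a dictionary mapping lexico-syntactic contexts to non-negative weights, and the context labels in the example are hypothetical.

```python
import math


def dice_bin(a, b):
    """Binary Dice: contexts are either shared or not; weights are ignored."""
    total = len(a) + len(b)
    return 2 * len(a.keys() & b.keys()) / total if total else 0.0


def jaccard_bin(a, b):
    """Binary Jaccard: shared contexts over all contexts of either word."""
    union = len(a.keys() | b.keys())
    return len(a.keys() & b.keys()) / union if union else 0.0


def cosine_bin(a, b):
    """Binary cosine: shared contexts over the geometric mean of set sizes."""
    denom = math.sqrt(len(a) * len(b))
    return len(a.keys() & b.keys()) / denom if denom else 0.0


def dice_min(a, b):
    """Weighted Dice: twice the summed minimum weight of shared contexts."""
    total = sum(a.values()) + sum(b.values())
    shared = sum(min(a[c], b[c]) for c in a.keys() & b.keys())
    return 2 * shared / total if total else 0.0


def jaccard_max(a, b):
    """Weighted Jaccard: summed minimum weights over summed maximum weights."""
    contexts = a.keys() | b.keys()
    num = sum(min(a.get(c, 0), b.get(c, 0)) for c in contexts)
    den = sum(max(a.get(c, 0), b.get(c, 0)) for c in contexts)
    return num / den if den else 0.0


# Hypothetical context vectors for two words.
apple = {"obj_of:eat": 2.0, "mod:red": 1.0}
pear = {"obj_of:eat": 1.0, "mod:green": 1.0}
print(dice_bin(apple, pear), dice_min(apple, pear), jaccard_max(apple, pear))
```

The binary variants only ask whether two words share a context, while diceMin and jaccardMax also take the context weights into account.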