Pipeline architecture

A DepPattern parser is a Perl script. Its input is the output of a PoS tagger (either Treetagger or Freeling) after it has been converted into a common format shared by both taggers.

To analyse an English text stored in the input file 'mytext.txt', we need the following components:

a Perl script containing the DepPattern parser (for instance, 'parser-en').
the command required to run a PoS tagger, for instance 'tree-tagger-english', which uses the English parameters trained with Treetagger.
the script 'AdapterTreetagger-en.perl', which converts the output of 'tree-tagger-english' into a new format that can be read by 'parser-en'.

In fact, the following command:

dp.sh  -a treetagger en mytext.txt parser-en > mytext.dep

generates the following pipeline:

cat  mytext.txt | tree-tagger-english | scripts/AdapterTreetagger-en.perl | parser-en.perl -a > mytext.dep

So, to analyse plain text, we need to organise three processes in a pipeline, i.e., a chain of processing elements arranged so that the output of each element is the input of the next.
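
The three stages above can be sketched as a small shell helper. This is only an illustration, not part of DepPattern: the function name 'print_pipeline' and its argument order are hypothetical, and the helper merely prints the pipeline (a dry run) instead of executing it, so it can be tried without Treetagger installed. The script and command names are the ones quoted in the text.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DepPattern): print the three-stage
# pipeline described above instead of running it.
#   $1  tagger command, e.g. tree-tagger-english
#   $2  language code used in the adapter script name, e.g. en
#   $3  input text file, e.g. mytext.txt
#   $4  parser name without extension, e.g. parser-en
print_pipeline() {
    tagger_cmd=$1
    lang=$2
    input=$3
    parser=$4
    # Stage 1: tagger; stage 2: adapter; stage 3: parser.
    # The output file name is derived by replacing .txt with .dep.
    printf 'cat %s | %s | scripts/AdapterTreetagger-%s.perl | %s.perl -a > %s\n' \
        "$input" "$tagger_cmd" "$lang" "$parser" "${input%.txt}.dep"
}

print_pipeline tree-tagger-english en mytext.txt parser-en
```

Each stage reads standard input and writes standard output, which is why the tagger or the adapter can be swapped without touching the parser.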

When no parser is available, we can generate one from a DepPattern grammar (e.g., 'user_grammar.txt'). The following command:

dp.sh  -a treetagger en mytext.txt parser-en user_grammar.txt > mytext.dep

generates the following pipeline:

ruby compi-beta.rb user_grammar.txt parser-en
cat  mytext.txt | tree-tagger-english | scripts/AdapterTreetagger-en.perl | parser-en.perl -a > mytext.dep
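
The difference between the two invocations (with and without a grammar file) can be summarised in a short dry-run sketch. It is an assumption about how the steps are sequenced, not the actual content of 'dp.sh'; the function name 'plan_commands' is hypothetical, and the English Treetagger setup is hard-coded to match the example in the text.

```shell
#!/bin/sh
# Hypothetical sketch (not the real dp.sh): print the commands that would be
# run for an English Treetagger analysis.
#   $1  input text file, e.g. mytext.txt
#   $2  parser name without extension, e.g. parser-en
#   $3  optional DepPattern grammar, e.g. user_grammar.txt
plan_commands() {
    input=$1
    parser=$2
    grammar=${3:-}
    # When a grammar is supplied, the parser is compiled from it first.
    if [ -n "$grammar" ]; then
        printf 'ruby compi-beta.rb %s %s\n' "$grammar" "$parser"
    fi
    # The analysis pipeline itself is the same in both cases.
    printf 'cat %s | tree-tagger-english | scripts/AdapterTreetagger-en.perl | %s.perl -a > %s\n' \
        "$input" "$parser" "${input%.txt}.dep"
}

plan_commands mytext.txt parser-en user_grammar.txt
```

Called without the third argument, the sketch prints only the analysis pipeline, which corresponds to the case where the parser already exists.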

The grammar compiler 'compi-beta.rb' was developed in Ruby by Isaac González. To build well-formed DepPattern grammars, see the corresponding tutorial in 'doc'.

Pablo Gamallo 2009-10-02