Tagset

The tagset depends on the tool used to tag the input text. For instance, the tagset of the English Tree-Tagger differs from that of the Spanish Tree-Tagger, which in turn differs from that of the Spanish Freeling, and so on. However, the parser takes as input the output of a tool that converts the main PoS tags of all those taggers into a shared list of tags. The shared list is the following: ADJ (adjective), ADV (adverb), NOUN (noun), PRP (preposition), CARD (cardinal number), CONJ (conjunction), DT (determiner), PRO (pronoun), VERB (verb), I (interjection), and 25 more tags for punctuation marks. In addition, some PoS tags belong to only one tagger. For instance, the English Tree-Tagger also has specific tags such as POS (possessive 's), PCLE (particle), and EX (existential 'there').

The configuration file where tag names are declared is called tagset.conf. Each line contains two columns. The second column contains the tag names actually used by the system: both the PoS tags shared by all PoS taggers and those specific to a single PoS tagger. The first column contains the names the user chooses for writing the grammar. The user is free to choose any name. All regular PoS tags are written in upper-case letters. For example:

ADJECTIVE ADJ
ADVERB ADV
PREP PRP
C CONJ
NUMBER CARD
DET DT
NOUN NOUN
PRON PRO
V VERB
INT I
POS POS
PCLE PCLE
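
To make the two-column layout concrete, here is a minimal Python sketch that reads such lines into a mapping from the user-chosen name (first column) to the system tag (second column). It is only an illustration, not the parser's own loader: the function name parse_tagset and the decision to skip lines without two columns are assumptions.

```python
def parse_tagset(text):
    """Map each user-chosen name (column 1) to its system tag or pattern (column 2).

    Hypothetical helper: the real loader used by the parser may behave differently.
    """
    mapping = {}
    for line in text.splitlines():
        # Split on the first run of whitespace only, so that a second
        # column containing spaces (e.g. a regular expression) stays intact.
        fields = line.strip().split(None, 1)
        if len(fields) != 2:
            continue  # assumption: blank or one-column lines are ignored
        user_name, system_tag = fields
        mapping[user_name] = system_tag
    return mapping


# The example declarations from this section.
EXAMPLE = """\
ADJECTIVE ADJ
ADVERB ADV
PREP PRP
C CONJ
NUMBER CARD
DET DT
NOUN NOUN
PRON PRO
V VERB
INT I
POS POS
PCLE PCLE
"""

tagset = parse_tagset(EXAMPLE)
print(tagset["PREP"])  # the system tag behind the user-chosen name PREP
```

With this mapping, a grammar written with the name PREP would be read as referring to the system tag PRP.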


It is also possible to create shortcuts using regular expressions, such as:

X [A-Z]+
NOTVERB [^ V][^ E]+
PUNCT F[a-z]+


Variable X stands for any tag name, NOTVERB for any tag except those containing the string VE (like VERB), and PUNCT for all tags containing the string F followed by one or more lower-case letters (i.e., punctuation marks). To define more specific shortcuts, we can also use the disjunction operator “|”:

NOMINAL PRON|NOUN
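
One possible reading of these shortcuts can be sketched in Python as follows. The sketch assumes that each right-hand side is a regular expression matched against a whole system tag, and that a name appearing in a disjunction refers to an entry declared in the first column (as PRON does above). The functions resolve and matches are hypothetical, the punctuation tag Fc is only an assumed example of a tag of the form F plus lower-case letters, and the parser's actual expansion mechanism may differ.

```python
import re

# Declarations taken from the examples in this section.
DEFINITIONS = {
    "NOUN": "NOUN",
    "PRON": "PRO",
    "V": "VERB",
    "X": "[A-Z]+",
    "PUNCT": "F[a-z]+",
    "NOMINAL": "PRON|NOUN",
}


def resolve(name, definitions):
    """Expand a user-chosen name into a regex over system tags (assumed semantics)."""
    alternatives = []
    for part in definitions[name].split("|"):
        if part in definitions and part != name:
            # The part names another declared entry: use that entry's pattern.
            alternatives.append(resolve(part, definitions))
        else:
            # Otherwise the part is taken as a system tag or regex as written.
            alternatives.append(part)
    return "(?:" + "|".join(alternatives) + ")"


def matches(name, system_tag, definitions=DEFINITIONS):
    """Return True if the system tag is covered by the user-chosen name."""
    return re.fullmatch(resolve(name, definitions), system_tag) is not None


print(resolve("NOMINAL", DEFINITIONS))
print(matches("NOMINAL", "PRO"), matches("NOMINAL", "VERB"))
print(matches("PUNCT", "Fc"), matches("X", "ADJ"))
```

Under these assumptions NOMINAL covers the system tags PRO and NOUN but not VERB. The sketch splits on every “|”, so it would not handle a regular expression that itself contains that character inside a group.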



