USC, Projecto Extralex

CorpusPedia

This software automatically downloads the Wikipedia database dumps (XML) in five languages: English, Portuguese, Spanish, French, and Galician. It then builds a more elaborate XML corpus for each Wikipedia language edition, containing the following fields: title, plain_text, text in wiki format (the original XML), category, links to other articles, related articles, and links to the same article in other languages (interlanguage links).
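To illustrate how a corpus with these fields might be consumed, here is a minimal Python sketch. The element and attribute names (`corpus`, `article`, `links`, `translations`, `lang`) and the sample record are assumptions made up for this example; the README and the generated files define the actual layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record: the tag names below are illustrative
# assumptions, not the documented CorpusPedia schema.
SAMPLE = """<corpus lang="gl">
  <article>
    <title>Galicia</title>
    <plain_text>Galicia is a region in north-western Spain.</plain_text>
    <category>Geography</category>
    <links>Spain</links>
    <links>Santiago de Compostela</links>
    <translations lang="pt">Galiza</translations>
    <translations lang="en">Galicia (Spain)</translations>
  </article>
</corpus>"""


def read_articles(xml_text):
    """Yield one dict per <article> element found in the XML string."""
    root = ET.fromstring(xml_text)
    for art in root.iter("article"):
        yield {
            "title": art.findtext("title", default=""),
            "plain_text": art.findtext("plain_text", default=""),
            "categories": [c.text for c in art.findall("category")],
            "links": [l.text for l in art.findall("links")],
            # Map language code -> title of the same article in that language.
            "interlanguage": {
                t.get("lang"): t.text for t in art.findall("translations")
            },
        }


articles = list(read_articles(SAMPLE))
```

For a full-size corpus file, `ET.iterparse` would probably be a better fit than `ET.fromstring`, since it avoids loading the whole document into memory at once.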

This is the alpha version. To try it, follow the instructions in the README file. If you find problems or bugs, please contact us by e-mail.

Download CorpusPedia (Alpha version)