A new world in Floresta Sintá(c)tica – the Portuguese treebank
Abstract
Floresta Sintá(c)tica is a publicly available treebank for Portuguese, created as a collaboration between Linguateca and the VISL project. It consists of Brazilian and European Portuguese texts automatically annotated by the parser PALAVRAS (Bick, 2000) and manually revised. In this paper, we present two new corpora, Selva (composed of literary, scientific and transcribed spoken texts, partially revised) and Amazonia (a huge corpus of 3.8 million words, unrevised), and a user-friendly web-based corpus tool, Milhafre. We also describe how we balance (a) our users, who may have different linguistic backgrounds; (b) the need for a grammar that is rich and complex enough to process real language (our corpora); and (c) the absence of a consensual syntactic model.
Key words: Portuguese treebank, annotated corpus, revised corpus, user-friendly corpus tool.
License
I grant the journal Calidoscópio the first publication of my article, licensed under Creative Commons Attribution license (which allows sharing of work, recognition of authorship and initial publication in this journal).
I confirm that my article is not being submitted to another publication and has not been published in its entirety in another journal. I take full responsibility for its originality and will also assume responsibility for any charges arising from third-party claims concerning the authorship of the article.
I also agree that the manuscript will be submitted according to the journal’s publication rules described above.