A new world in Floresta Sintá(c)tica – the Portuguese treebank

Authors

  • Claudia Freitas
  • Paulo Rocha
  • Eckhard Bick

Abstract

Floresta Sintá(c)tica is a publicly available treebank for Portuguese, created as a collaboration between Linguateca and the VISL project. It consists of Brazilian and European Portuguese texts automatically annotated by the parser PALAVRAS (Bick, 2000) and manually revised. In this paper, we present two new corpora, Selva (composed of literary, scientific and transcribed spoken texts, partially revised) and Amazonia (a huge corpus of 3.8 million words, unrevised), and a user-friendly web-based corpus tool, Milhafre. We also describe how we balance (a) our users, who may have different linguistic backgrounds; (b) the need for a grammar that is rich and complex enough to process real language (our corpora); and (c) the absence of a consensual syntactic model.

Key words: Portuguese treebank, annotated corpus, revised corpus, user-friendly corpus tool.

Published

2021-05-27

How to Cite

Freitas, C., Rocha, P., & Bick, E. (2021). A new world in Floresta Sintá(c)tica – the Portuguese treebank. Calidoscópio, 6(3), 142–148. Retrieved from https://www.revistas.unisinos.br/index.php/calidoscopio/article/view/5256