Seminar on Advances in Spoken Corpora

Salón de Graos, Facultade de Filoloxía,
Santiago de Compostela, 24-25 September 2015


Thursday, 24 September
9:00 - 9:30 Welcome and introduction
9:30 - 11:30 Eckhard Bick (University of Southern Denmark)
The Grammatical Annotation of Speech Corpora: Techniques and Perspectives
This talk discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, tv-news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic markers for speechlikeness ("orality") in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features, and used across the text-speech divide. Special topics include phonetic variation, emoticons and syntactic features. For ordinary, transcribed speech corpora we propose a system of two-level annotation, where overlaps, retractions and phonetic variation are maintained as meta-tagging, while allowing conventional annotation of an orthographically normalized textual layer. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, the modified "oral" CG parsers perform reasonably close to their written language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93-95% for syntactic function.
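As a rough illustration of segmenting unpunctuated transcripts at prosodic breaks, the following Python sketch splits a token stream into utterances. It is only a preprocessing analogy, not the Constraint Grammar rules used in PALAVRAS or EngGram; the break markers `//` and `/` and the sample tokens are assumptions made for this example.

```python
# Hypothetical break markers; actual transcription conventions vary by corpus.
TERMINAL_BREAK = "//"       # assumed marker for an utterance-final prosodic break
NON_TERMINAL_BREAK = "/"    # assumed marker for an utterance-internal break


def segment_by_prosodic_breaks(tokens):
    """Split tokens into utterances at terminal breaks.

    Non-terminal break markers are dropped from the output.
    """
    utterances, current = [], []
    for tok in tokens:
        if tok == TERMINAL_BREAK:
            if current:                  # close the utterance collected so far
                utterances.append(current)
                current = []
        elif tok != NON_TERMINAL_BREAK:  # keep ordinary word tokens only
            current.append(tok)
    if current:                          # trailing material without a final break
        utterances.append(current)
    return utterances


# Invented sample, not taken from any corpus.
tokens = "eu acho / que sim // mas não sei //".split()
utterances = segment_by_prosodic_breaks(tokens)
print(utterances)
```

In a two-level setup of the kind described, the dropped markers would be retained as meta-tags rather than discarded; the sketch omits that for brevity.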
11:30 - 12:00 Coffee break
12:00 - 13:00 Nelleke Oostdijk (Radboud University)
Experiences from the Spoken Dutch Corpus
The Spoken Dutch Corpus, compiled between 1998 and 2003, is a corpus of contemporary (standard) Dutch spoken by adult native speakers from the Netherlands and Flanders. The corpus comprises samples of various types of speech, ranging from informal everyday face-to-face conversations to radio and television broadcasts. For each sample, the audio recording is available along with various transcriptions and annotations.

In my presentation I shall go into the design considerations and the experience gained in constructing the corpus. I will also discuss the corpus in the light of more recent developments such as CLARIN, which aims to implement a common language resources and technology infrastructure.
13:00 - 15:30 Lunch break
15:30 - 17:30 Inês Duarte & Ana Isabel Mata (Universidade de Lisboa)
Exploring European Portuguese Spontaneous Speech: Prosodic, Syntactic and Pragmatic Annotation Guidelines across Domains and Corpora
Studies on prosody-syntax-discourse interface relations based on naturally occurring speech are gaining growing interest. Corpora annotated with all these levels of linguistic information are not very common (e.g., Calhoun et al., 2010 and references therein).

The COPAS project team selected a balanced corpus of European Portuguese (with respect to discourse types and subjects' gender and age), isolated utterances illustrating different contrast and parallel structures, provided annotation guidelines and applied them to the subset of utterances referred to hereafter as the COPAS corpus.

The COPAS corpus includes (i) a subset of the CPE-FACES corpus (Mata 1999; Mata et al., 2014), 16h of recorded spontaneous and prepared unscripted speech collected in high schools, 3 teachers and 25 teenage students (from both genders), all speakers of Standard European Portuguese (Lisbon); (ii) a subset of the CORAL corpus (Viana et al. 1998; Trancoso et al. 1998), 9h of spoken dialogue, following the main guidelines of the HCRC Map Task Corpus – 64 dialogues between 32 young-adult speakers (from both genders), Lisbon region.

The annotation guidelines were defined by a multidisciplinary team. After the forced alignment of data (phone, syllable, word), four sets of annotation tiers were associated with the speech signal: (i) an orthographic tier, enriched with punctuation marks, disfluencies, and paralinguistic events; (ii) two prosodic tiers (following the ToBI framework), one for tones and the other for break indices; (iii) three syntactic tiers, for construction type, for construction position in the structure and for syntactic function; and (iv) a discourse tier, for the discourse function of the target constituents. The manual annotation was performed using Praat (Boersma & Weenink, 2013). A database was built, with all these tiers time-aligned with the target structures.
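To make the idea of time-aligned tiers concrete, here is a minimal Python sketch that stores each tier as a list of labelled intervals and collects the labels overlapping a target structure. It does not reproduce the COPAS database or Praat's file format; the tier names, labels, times and the `labels_overlapping` helper are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str


def labels_overlapping(tier, start, end):
    """Return labels of intervals that overlap the span [start, end)."""
    return [iv.label for iv in tier if iv.start < end and iv.end > start]


# Hypothetical tiers and labels, not taken from the COPAS corpus.
tiers = {
    "words": [
        Interval(0.00, 0.35, "esse"),
        Interval(0.35, 0.90, "livro"),
        Interval(0.90, 1.20, "já"),
        Interval(1.20, 1.60, "li"),
    ],
    "tones": [Interval(0.35, 0.90, "L+H*")],
    "construction": [Interval(0.00, 0.90, "topic")],
}

# Time span of one target structure (assumed values).
target = (0.00, 0.90)
aligned = {name: labels_overlapping(tier, *target) for name, tier in tiers.items()}
print(aligned)
```

Keying every tier to the same time axis is what allows syntactic, prosodic and discourse labels to be queried together for a single constituent.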

In this presentation, we will focus on the multi-level annotation process of left periphery structures. In fact, in the development of the project, the analysis of this subset of target structures, coded for syntactic features, discourse function, prosodic prominence and phrasing, in a time-aligned way, provided fundamental training and testing material for further application in other target structures (namely, clefts and right periphery constituents). Besides, there was time to thoroughly review the annotation and to explore the correlations shown by the statistical analysis.

As we will try to show, the first exploration of the results obtained for the left periphery strengthens our belief that speech corpora with multi-level annotation are a valuable resource for looking into grammar module relations in language use from an integrated viewpoint.


  • Boersma, P., & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.56, retrieved 15 September 2013 from http://www.praat.org
  • Calhoun, S., Carletta, J., Brenier, J. M., Mayo, N., Jurafsky, D., Steedman, M., & Beaver, S. (2010). The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources & Evaluation (2010) 44, pp. 387–419.
  • Duarte, I., et al. (2013). Left Periphery: the (mainly) syntactic part of the annotation. “First Workshop of COPAS”, Lisbon, May.
  • Mata, A. I. (1999). Para o Estudo da Entoação em Fala Espontânea e Preparada no Português Europeu: Metodologia, Resultados e Implicações Didácticas. PhD Thesis, University of Lisbon.
  • Mata, A. I., et al. (2014). "Teenage and adult speech in school context: building and processing a corpus of European Portuguese". In Proceedings of 9th International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland: 3914-3919.
  • Mata, A. I., et al. (2014). "Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora". In Proceedings of 9th International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland: 1188-1193.
  • Mata, A. I. (2015). "Prosodic Cues to Topic Status in Spontaneous Speech". Workshop “Prosody-syntax-semantics interfaces in Portuguese: exploring spontaneous speech corpora”, Lisbon, July.
  • Trancoso, I., et al. (1998). "Corpus de diálogo CORAL". In PROPOR’98, Porto Alegre, Brazil.
  • Viana, M. C., et al. (1998). "Apresentação do Projecto CORAL - Corpus de Diálogo Etiquetado". In Workshop I de Linguística Computacional, Lisboa, Portugal.
  • Viana, C., Frota, S., Falé, I., Fernandes, F., Mascarenhas, I., Mata, A. I., Moniz, H. & Vigário, M. (2007). "Towards a P_ToBI". PAPI2007. Workshop on the Transcription of Intonation in Ibero-Romance. University of Minho, Portugal.
17:30 - 17:45 Break
17:45 - 18:45 Mario Barcala & Victoria Vázquez (Universidade de Santiago)
The ESLORA Corpus of Spoken Spanish: Design, Compilation and Search Engine
ESLORA is a corpus of Spanish made up of semi-directed interviews and spontaneous conversations recorded in Galicia between 2007 and 2015. The initial design had a two-fold objective: to register the use of a variety of Spanish which to date has been scarcely documented and to study the resulting effects of using different techniques for eliciting speech on the registered samples. Every step of the construction of the corpus has compelled us to deal with unforeseen difficulties, which in turn have made us reflect on a range of theoretical and practical aspects involved in the compilation of spoken corpora.

After presenting the main features of the ESLORA corpus, we will discuss some critical questions arising from the coding of “non-standard” varieties in a bilingual context, as is the case here in Galicia. The analysis of the distribution of past verb forms will then be introduced as an example of how useful this kind of corpus may be in the research of linguistic variation and change. Next, we will explain the processing stages that all the documents of the corpus must follow before being included in the search engine and we will also describe its current search capabilities. Finally, we will run some examples to show the usefulness of the search engine.
Friday, 25 September
9:30 - 11:30 Gaëtanelle Gilquin (Université Catholique de Louvain)
Developing and Exploiting Spoken Learner Corpora: Challenges and Opportunities
The first learner corpora that were compiled consisted of written data produced by foreign/second language learners, e.g. the Longman Learners' Corpus or the International Corpus of Learner English. More recently, spoken learner corpora have become available too, thus opening up new possibilities for the analysis of interlanguage. In this talk, I will focus on spoken learner corpora, and how they can be developed and exploited for research and teaching purposes. I will describe some of the challenges that spoken learner corpora can present, starting with the compilation of such corpora, which is a time-consuming and painstaking process, and ending with their use in the language classroom, which should only be done cautiously and with proper guidance. I will also highlight the many opportunities that are offered by spoken learner corpora for the description of interlanguage as well as the learning and teaching of languages. This will involve a review of some of the ways in which spoken learner corpora can be further developed (e.g. through annotation) and the possible applications they can lead to. These considerations will be illustrated through concrete examples from the Louvain International Database of Spoken English Interlanguage (LINDSEI) and other spoken learner corpora.
11:30 - 12:00 Coffee break
12:00 - 13:00 Mario Cal Varela & Xabier Fernández Polo (Universidade de Santiago)
Investigating Spoken Academic English with Corpus Tools
In this presentation we intend to reflect retrospectively on our research trajectory in the study of English as an academic lingua franca (ELF). A survey carried out at our own university brought to our attention the difficulties experienced by non-native speaker researchers when they presented their findings at international conferences. With the aim of exploring the nature of these difficulties in depth, we compiled a corpus of presentations at international conferences. In the talk we shall discuss various issues regarding the process of compilation and preparation of these data, present some major results of our analysis and reflect on the limitations of this experience, specifically those derived from the human and financial resources available. Lately, we have become involved in a large-scale project, involving a number of European and American universities, that seeks to compile a corpus of computer-mediated conversations by ELF speakers in academic contexts. Although some of the hurdles encountered in the previous project do not arise in this new one, some of the issues inherent to the collection and preparation of spoken data remain of central concern for all the participants and for any researcher involved in the compilation of oral corpora.
13:00 - 15:30 Lunch break
15:30 - 17:30 Brian Clancy (University of Limerick)
Using Spoken Corpora to Investigate Pragmatic Variation
One of the major contributions to current linguistic knowledge derived from corpora has been the insight that spoken corpora have afforded into the nuances and particulars of inter- and intra-varietal variation. This workshop will focus on the potential of spoken corpora for unearthing linguistic patterns that characterise pragmatic variation at both of these levels. Comparing spoken corpora affords insights into not only the lexico-grammatical features present, but also into the nature of different pragmatic systems (e.g. Barron and Schneider, 2005; Schneider and Barron, 2008). A highly iterative corpus pragmatic approach will be taken in order to focus on similarities and differences amongst varieties of English and also amongst different contexts within these varieties in which pragmatic forms and functions interact.

The workshop will primarily utilise the Limerick Corpus of Irish English (LCIE), a one-million-word corpus of spoken Irish English collected from a number of different context-types in the Republic of Ireland. Data taken from LCIE will be complemented by insights from the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).

The starting point for any corpus analysis is, in the main, the frequency list and this corpus method, in addition to others such as keyword and concordance, will be thoroughly introduced, exemplified and systematically employed in order to illustrate the benefits of even the most basic corpus investigative procedures. From a context-specific point of view, the workshop will focus on data taken from intimate discourse – the spoken language of couples, families and close friends in private, non-professional settings. Individual phenomena such as pronouns, pragmatic markers, vocatives and taboo language will be examined, as will features of conversational organisation such as turn-taking, in order to demonstrate the nature of pragmatic variation within intimate discourse when compared to data from other context-types such as the workplace or the classroom. Questioning the ways in which linguistic items are used, particularly if they occur in differing proportions in different corpora, can provide insights, both intuited and unexpected, about language use in context and can empirically bring to light the varietal and contextual nuances of different pragmatic phenomena.
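The basic procedures mentioned above (frequency list, normalised frequency for cross-corpus comparison, and concordance) can be sketched in a few lines of Python. This is a toy illustration only: the sample sentence is invented rather than drawn from LCIE, BNC or COCA, and the tokeniser and function names are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Word types with their counts, most frequent first."""
    return Counter(tokens).most_common()


def per_million(count, corpus_size):
    """Normalise a raw count so corpora of different sizes can be compared."""
    return count / corpus_size * 1_000_000


def concordance(tokens, node, width=2):
    """Key-word-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# Invented sample text, not taken from any of the corpora named above.
sample = "Ah sure you know yourself. You know it is grand, like."
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
print(round(per_million(tokens.count("know"), len(tokens))))
for line in concordance(tokens, "know"):
    print(line)
```

Normalising to occurrences per million words is what makes counts from a one-million-word corpus comparable with counts from much larger reference corpora.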


  • Barron, A., & Schneider, K., eds. (2005). The Pragmatics of Irish English. Berlin: Mouton de Gruyter.
  • Schneider, K., & Barron, A., eds. (2008). Variational Pragmatics: A focus on regional varieties in pluricentric languages. Amsterdam: John Benjamins.
17:30 - 18:00 Final summary and closing


Attendance at the seminar is free of charge. However, since seating is limited, those interested must register through an online form. Registration opens on 9 September and will remain open until 22 September.


Seminar organized by the research groups Gramática do español (project O estudo da lingua oral: explotación e análise de corpus - ESLORA2, ref. FFI2014-52287-P) and SPERTUS (project Modern Spoken English: The Interpersonal Component in Informal and Academic Contexts, ref. FFI2012-31450).