Integrated Tools for Exploitation of a Spontaneous Speech Corpus
of Spanish. (Antonio Moreno-Sandoval, José M. Guirao)
- Antonio Moreno-Sandoval
- antonio.msandoval@uam.es
- Universidad Autónoma de Madrid
- Madrid
- Spain
- José M. Guirao
- jmguirao@ugr.es
- Universidad de Granada
- Granada
- Spain
This communication presents a query system for a corpus of spontaneous speech,
specifically the Spanish C-ORAL-ROM corpus (Cresti & Moneglia eds. 2005;
Moreno et al. 2005). The corpus consists of 181 transcribed sessions in
different registers and communicative situations. With over 42 hours of recorded
data and almost 500 speakers, the corpus contains 312,000 tokens (words) and
21,000 different types.
The system can currently be accessed through a web page, although it will also be
available as an independent application. The system consists of three main
components:
- A concordancer of text and sound: the system searches all the texts for words
or multi-word expressions and retrieves every “utterance” in which the search
string appears, along with the original sound fragment (in mp3). This way the
user can hear the original source, not only its transcription. (Figure 1)
- A morphological analyser of Spanish, based on a broad-coverage lexicon,
which provides all the possible analyses for a given word form. (Figure 2)
- A Part-of-Speech tagger for sentences in Spanish, which provides the
surface syntactic analysis for the sequence. (Figure 3)
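The first component can be pictured with a minimal sketch. It is not the system's actual implementation: the `Utterance` record, its field names, the audio offsets, and the sample sentences are all hypothetical and serve only to illustrate how a search for a word or multi-word expression could return whole utterances together with a pointer to the matching sound fragment.

```python
import re
from dataclasses import dataclass


@dataclass
class Utterance:
    session: str   # hypothetical session identifier
    text: str      # orthographic transcription of the utterance
    start: float   # hypothetical offset of the sound fragment, in seconds
    end: float


def concordance(utterances, query):
    """Return every utterance that contains the word or multi-word expression."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return []
    # Match whole words only, ignoring case; any run of whitespace may
    # separate the words of a multi-word expression.
    pattern = re.compile(
        r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)", re.IGNORECASE
    )
    return [u for u in utterances if pattern.search(u.text)]


# Invented sample data, not taken from the corpus.
corpus = [
    Utterance("demo01", "al fin y al cabo es lo mismo", 0.0, 2.1),
    Utterance("demo01", "puso un sobre sobre la mesa", 2.1, 4.0),
    Utterance("demo02", "al final no vino", 0.0, 1.5),
]

hits = concordance(corpus, "al fin y al cabo")
for u in hits:
    print(f"{u.session} [{u.start:.1f}-{u.end:.1f}s] {u.text}")
```

Returning the whole utterance rather than a fixed-width window of characters mirrors the description above, where the retrieval unit is the utterance and each hit carries enough information to play back the corresponding audio.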
The tool extends the possibilities of the original John Benjamins version
published in DVD format. In particular, some examples of its application to
teaching and learning Spanish as a second language, as well as to describing
properties of spoken Spanish, will be given.
Figure 1 shows the concordances for the multi-word expression al fin y al cabo (after all).
Figure 2 displays all possible analyses for the word sobre (about, envelope, to be left over).
Figure 3 shows the PoS analysis for the sentence "John put an envelope on the table".
References
- Cresti, Emanuela, and Massimo Moneglia, eds. 2005. C-ORAL-ROM Integrated
Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins.
- Moreno, Antonio, Guillermo de la Madrid, Manuel Alcántara, et al. 2005.
'The Spanish corpus'. In C-ORAL-ROM Integrated Reference Corpora for Spoken
Romance Languages, 135-161. Amsterdam: John Benjamins.