OLAC Record: CORAL Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-S0367

Metadata

Title: CORAL Corpus

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2014-07-11

Date Issued (W3CDTF): 2014-07-11

Date Modified (W3CDTF): 2014-07-11

Description: The CORAL corpus was collected in the framework of a national project sponsored by the PRAXIS XXI program, by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of this project is the collection of a spoken dialogue corpus in European Portuguese, with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic.

- Linguistic Contents:
56 dialogues about a predetermined subject: maps. One of the participants (giver) has a map with some landmarks and a route drawn between them; the other (follower) also has landmarks, but no route, and consequently must reconstruct it. In order to elicit conversation, there are small differences between the two maps: one of the landmarks is duplicated in one map and single in the other; some landmarks are only present in one of the maps; and some have slightly different names in the two maps (e.g. curvas perigosas vs. troço sinuoso). In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena:
o Sequences with /l/ favouring or not its velarization (e.g. sala malva, sal amargo)
o Sequences with /s/ in word final position followed by another coronal fricative (e.g. barcos salva-vidas)
o Sequences of plosives formed across word boundaries (e.g. clube de tiro)
o Sequences of obstruents formed within and across word boundaries (e.g. bairros degradados)
The last three items were designed to allow a more comprehensive study of consonant clusters formed within and across word boundaries and should, therefore, be jointly investigated.

- Number and Type of Speakers:
The original 32 speakers were divided into 8 quartets and, in each quartet, organized to take part in 8 dialogues. The available database contains 7 quartets, corresponding to 28 speakers. Given the reduced number of speakers, they were chosen to achieve an adequate balance of sexes, but were restricted in terms of age (undergraduate or graduate students) and accent (Lisbon area). Speakers were chosen in pairs who know each other, so that half of the conversations take place between "friends" and half between people who do not know each other.

- Data Collection:
The recordings take place in a soundproof room, with no visual contact between the speakers. They wear close-talking microphones and the recordings are made in stereo directly to DAT and later down-sampled to 16 kHz per channel. Recording levels are adjusted beforehand, and no monitoring is done once the dialogues start.

- Annotation:
Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.
Four files per dialogue are provided:
a) two RAW files: the audio files
b) two TRS files: containing the manual transcriptions.
The TRS format is an XML-based format that standard transcription software such as Transcriber can open. Annotations in the TRS files are at word level. They are fine-grained transcriptions that include disfluencies. The characters in the text files are encoded in ISO-8859-1 (Latin1).
The corpus consists of 112 TRS and corresponding WAV files, and contains about 57K word tokens. The disk size is about 1.5 MB for the TRS files and 1.2 GB for the WAV files.
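
As an illustration of the TRS files described above, the following minimal Python sketch reads Transcriber-style XML encoded in ISO-8859-1 and counts whitespace-separated tokens per speaker turn. The element and attribute names (`Trans`, `Turn`, `Sync`, `speaker`) follow the general Transcriber format and are an assumption about how the CORAL files are laid out. The `SAMPLE` document is made up for this example and is not taken from the corpus, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def turn_texts(trs_bytes):
    """Yield (speaker, text) for each <Turn> in a Transcriber-style document."""
    # Passing bytes lets the parser honour the encoding named in the
    # XML declaration (ISO-8859-1 according to the record).
    root = ET.fromstring(trs_bytes)
    for turn in root.iter("Turn"):
        # itertext() also collects the text that follows inline tags
        # such as <Sync/>; split/join normalises the whitespace.
        text = " ".join("".join(turn.itertext()).split())
        yield turn.get("speaker", ""), text


def count_tokens(trs_bytes):
    """Count whitespace-separated tokens over all turns."""
    return sum(len(text.split()) for _, text in turn_texts(trs_bytes))


# Invented sample, NOT corpus data; it only reuses two landmark names
# quoted in the description.
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<Trans><Episode><Section type="report" startTime="0" endTime="4">\n'
    '<Turn speaker="spk1" startTime="0" endTime="2"><Sync time="0"/>\n'
    'segue pelo troço sinuoso\n</Turn>\n'
    '<Turn speaker="spk2" startTime="2" endTime="4"><Sync time="2"/>\n'
    'curvas perigosas\n</Turn>\n'
    '</Section></Episode></Trans>'
).encode("iso-8859-1")

if __name__ == "__main__":
    for speaker, text in turn_texts(SAMPLE):
        print(speaker, text)
    print(count_tokens(SAMPLE))
```

A whitespace token count of this kind will not necessarily reproduce the "about 57K word tokens" figure, since the record does not say how disfluencies and other markup were counted.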

Identifier: ELRA-S0367

ISLRN: 499-311-025-331-2

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0367/

Language: Portuguese

Language (ISO639): por

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0367

DateStamp: 2014-07-11

GetRecord: OAI-PMH request for simple DC format
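
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with the `verb`, `identifier` and `metadataPrefix` arguments defined by OAI-PMH 2.0. The base URL below is a placeholder, because this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. `oai_dc` is the standard prefix for simple Dublin Core; the prefix `olac` for the OLAC format is an assumption.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (no network access here)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"

url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-S0367")

if __name__ == "__main__":
    print(url)
```

Passing `metadata_prefix="olac"` would request the OLAC rendering instead, assuming the repository advertises that prefix.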

Search Info
Citation: n.a. 2014. ELRA (European Language Resources Association).
Terms: area_Europe country_PT dcmi_Sound iso639_por olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0367
Up-to-date as of: Wed Jul 15 7:04:12 EDT 2026
