OLAC Record: The CINTIL Corpus – International Corpus of Portuguese

OLAC Record
oai:catalogue.elra.info:ELRA-W0050

Metadata

Title: The CINTIL Corpus – International Corpus of Portuguese

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2009-06-09

Date Issued (W3CDTF): 2009-06-09

Date Modified (W3CDTF): 2009-06-09

Description: CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, open-class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).

The corpus is built on raw textual materials of several types, of which 30% are spoken materials. This spoken subcorpus includes materials from several registers (ranging from formal to informal) and several communicative situations (e.g. phone calls, media broadcasts, conversations, monologues, formal exposition, etc.). The CINTIL corpus comprises the transcriptions of spoken texts but does not include the sound files with the recorded interviews. The remaining subcorpus is composed of written texts from several genres: newspapers, books, magazines, journals and miscellaneous (proceedings, dissertations, pamphlets, etc.).

A detailed overview of the corpus composition is presented below:

• Written = 689,124 tokens:
  o News: 58.7% - 404,690 tokens
  o Fiction: 29% - 200,194 tokens
  o Other: 12.2% - 84,240 tokens
• Spoken = 502,622 tokens:
  o Informal/Private: 43.2% - 217,604 tokens
  o Informal/Public: 9.5% - 48,221 tokens
  o Informal/Phone: 0.4% - 2,287 tokens
  o Formal/Natural: 19.3% - 97,499 tokens
  o Formal/Media: 17.6% - 88,727 tokens
  o Formal/Phone: 9.6% - 48,284 tokens
• Total = 1,191,746 tokens

Linguistic information:

The corpus associates raw text with linguistic information of different kinds and at different levels of sophistication. This information is encoded in the usual format of tags, checked for accuracy by trained linguists, and covers four levels of information:

• Segmentation: The boundaries of each sentence are tagged and every token is delimited by blanks. Contractions are expanded, clitics in enclisis and mesoclisis are detached into autonomous tokens, and punctuation marks are associated with explicit information about the blanks surrounding them in the raw version. Multi-word expressions from some POS classes (e.g. Conjunctions, Prepositions, etc.) are identified as forming a lexical unit.
• POS: By means of POS tags, each token is associated with an indication of its morpho-syntactic category.
• Inflection: Information concerning inflectional morphology: every inflected token is associated with the corresponding lemma, and with explicit information encoding its values for Mood, Tense, Person and Number if it belongs to a verbal class, or Number and Gender if it is a nominal. Nominals also include information about their degree, namely superlative for Adjectives, and diminutive for both Adjectives and Nouns.
• Multiword Lexical Units (MWU) for Named Entity Recognition (NER): Delimitation and classification of multi-word expressions for Named Entities, following the usual IOB tagging schema for NER and the typical classes of Number, Date, Person, Location, etc.

The annotation manual is provided together with the corpus.

The corpus can be browsed online: http://cintil.ul.pt/
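The composition figures in the description can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the token counts are copied from the overview in the description, while the variable and function names are made up for this example. It recomputes the subcorpus totals and each category's share of its own subcorpus. The listed percentages appear to be truncated, not rounded, to one decimal place, so the helper truncates as well.

```python
# Token counts copied from the corpus composition overview in the description.
WRITTEN = {"News": 404_690, "Fiction": 200_194, "Other": 84_240}
SPOKEN = {
    "Informal/Private": 217_604,
    "Informal/Public": 48_221,
    "Informal/Phone": 2_287,
    "Formal/Natural": 97_499,
    "Formal/Media": 88_727,
    "Formal/Phone": 48_284,
}


def tenths_of_percent(count, total):
    """Share of `total` in tenths of a percent, truncated (integer arithmetic)."""
    return count * 1000 // total


written_total = sum(WRITTEN.values())
spoken_total = sum(SPOKEN.values())
corpus_total = written_total + spoken_total

for name, part, total in (
    ("Written", WRITTEN, written_total),
    ("Spoken", SPOKEN, spoken_total),
):
    print(f"{name} = {total:,} tokens")
    for label, count in part.items():
        # Each share is relative to its own subcorpus, not to the whole corpus.
        share = tenths_of_percent(count, total) / 10
        print(f"  {label}: {share}% - {count:,} tokens")
print(f"Total = {corpus_total:,} tokens")
```

The shares are computed within each subcorpus, which is how the percentages in the overview add up to roughly 100% for the written and the spoken part separately.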

Identifier: ELRA-W0050

ISLRN: 176-775-844-396-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0050/

Language: Portuguese

Language (ISO639): por

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0050

DateStamp: 2009-06-09

GetRecord: OAI-PMH request for simple DC format
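The GetRecord entries in this record refer to OAI-PMH requests. As a rough sketch, such a request URL could be assembled as shown below. The base URL is a placeholder and not the actual endpoint of this archive, and the function name is hypothetical. The `verb`, `identifier` and `metadataPrefix` arguments come from the OAI-PMH protocol, and `oai_dc` is the prefix it defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# The identifier is the OaiIdentifier given in this record.
url = get_record_url("oai:catalogue.elra.info:ELRA-W0050")
print(url)
```

Only the URL is constructed here; no request is sent. A different metadata format would be requested by passing another `metadata_prefix` that the repository supports.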

Search Info
Citation: n.a. 2009. ELRA (European Language Resources Association).
Terms: area_Europe country_PT dcmi_Text iso639_por olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0050
Up-to-date as of: Wed Jul 15 7:05:28 EDT 2026
