OLAC Record: Speechtera Pronunciation Dictionary

OLAC Record
oai:catalogue.elra.info:ELRA-S0408

Metadata

Title: Speechtera Pronunciation Dictionary

Access Rights: Rights available for: nonCommercialUse, commercialUse

Coverage: Brazil

Date Available (W3CDTF): 2020-02-10

Date Issued (W3CDTF): 2020-02-10

Description: The SpeechTera Pronunciation Dictionary is a machine-readable pronunciation dictionary for Brazilian Portuguese comprising 737,347 entries. Its entries were designed primarily for speech technologies, such as automatic speech recognition systems and speech synthesizers. However, it may also be used by linguists, speech therapists, lexicographers, students of Brazilian Portuguese as a second language, and anyone interested in the sound structure of Brazilian Portuguese.

Its phonetic transcription is based on 13 linguistic varieties spoken in Brazil: São Paulo (capital city), the countryside of São Paulo State, Rio de Janeiro (RJ), Brasília (Federal District), Belo Horizonte (MG), Curitiba (PR), Manaus (AM), Porto Alegre (RS), Salvador (BA), Goiânia (GO), Belém (PA), Vitória (ES) and Cuiabá (MT). The transcription was generated using an in-house grapheme-to-phoneme converter, and its output was then manually revised by Brazilian linguists.

The SpeechTera Pronunciation Dictionary contains the pronunciations of the frequent word forms found in the transcription data of SpeechTera's speech and text database (literary, newspaper, movies, miscellaneous). Each of the thirteen dialects comprises 56,719 entries (13 × 56,719 = 737,347 in total), including:
- 44,396 entries covering common nouns, adjectives, verbs, adverbs, articles, pronouns, numbers, prepositions and conjunctions;
- 8,074 proper nouns (including person names, family names, cities, streets, companies and brand names);
- 1,400 acronyms;
- 1,994 heterophonic homographs;
- 26 unstressed words (clitics);
- 92 prefixes constituted by the middle vowels "e" and "o";
- 40 common nouns with metaphonic plurals;
- 698 foreign words frequently used in Brazil.

The phone set for each of the 13 varieties of Brazilian Portuguese was derived individually from the literature, following best practices for automatic speech processing. Detailed information about the phone set can be found in the handbook for corpora annotation, written by SpeechTera's team of experts and provided with the dictionary. The dictionary maps words to their pronunciations in the ARPAbet phoneme set; a mapping between ARPAbet, the International Phonetic Alphabet (IPA) and the Speech Assessment Methods Phonetic Alphabet (SAMPA) is also provided to help users understand the phonetic symbols used in the transcriptions. Syllables carry a lexical stress marker, for example, "abacaxi aa bb aa kk aa1 sh iy".

The dictionary was created semi-automatically using an in-house grapheme-to-phoneme converter. In the first step, initial pronunciations were generated automatically for all word forms appearing in the SpeechTera Pronunciation Dictionary transcriptions. After the automatic creation process, the dictionary was manually cross-checked by linguists who are native speakers, correcting potential errors of the automatic pronunciation generation process.
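
The entry format quoted in the description can be read with a few lines of code. The following is a minimal sketch, not the distributor's tooling: it assumes each entry is a single line of whitespace-separated tokens, with the word first and ARPAbet-style phones after it, and that a trailing digit on a phone (as in "aa1") is the lexical stress marker the description mentions. The names `Entry` and `parse_entry` are hypothetical; the handbook shipped with the dictionary is the authority on the actual file layout.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    word: str            # orthographic form, e.g. "abacaxi"
    phones: list[str]    # phone symbols with any trailing digit removed
    marked: list[int]    # positions of phones that carried a digit marker


def parse_entry(line: str) -> Entry:
    """Split one dictionary line into a word and its phone sequence.

    Assumption: tokens are whitespace-separated and the first token is
    the word; this layout is inferred from the single example in the
    record description.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"expected a word followed by phones: {line!r}")
    word, raw_phones = tokens[0], tokens[1:]
    # Record which phones end in a digit, then strip the digits.
    marked = [i for i, p in enumerate(raw_phones) if p[-1].isdigit()]
    phones = [p.rstrip("0123456789") for p in raw_phones]
    return Entry(word=word, phones=phones, marked=marked)


entry = parse_entry("abacaxi aa bb aa kk aa1 sh iy")
print(entry.word, entry.phones, entry.marked)
```

Keeping the marker positions separate from the phone symbols makes it straightforward to look the bare symbols up in the ARPAbet, IPA and SAMPA mapping provided with the dictionary.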

Identifier: ELRA-S0408

ISLRN: 645-563-102-594-8

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0408/

Language: Portuguese

Language (ISO639): por

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): lexicon

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0408

DateStamp: 2020-02-10

GetRecord: OAI-PMH request for simple DC format
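
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this listing. As a rough sketch of what such a request looks like, the snippet below builds a GetRecord URL from the record's OAI identifier. The `verb`, `identifier` and `metadataPrefix` parameters come from the OAI-PMH protocol, and `oai_dc` is its standard prefix for simple Dublin Core. The base URL is a placeholder, not the real endpoint, and the `olac` prefix and the helper name `get_record_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Placeholder only: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-S0408"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core format (standard OAI-PMH prefix).
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
# OLAC format; the prefix name "olac" is assumed here.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
print(dc_url)
print(olac_url)
```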

Search Info
Citation: n.a. 2020. ELRA (European Language Resources Association).
Terms: area_Europe country_PT dcmi_Text iso639_por olac_lexicon

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0408
Up-to-date as of: Wed Jul 15 7:04:38 EDT 2026
