OLAC Record: GlobalPhone Hausa

OLAC Record
oai:catalogue.elra.info:ELRA-S0347

Metadata

Title: GlobalPhone Hausa

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2012-10-29

Date Issued (W3CDTF): 2012-10-29

Date Modified (W3CDTF): 2017-06-26

Description: The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed (unshortened).
Hausa is a Chadic language; together with the Semitic and Cushitic languages, the Chadic languages belong to the Afroasiatic language family. With over 25 million speakers, Hausa is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover the main Hausa newspaper sources. Hausa’s modern official orthography is a Latin-based alphabet called Boko, which was imposed in the 1930s by the British colonial administration. It consists of 22 characters of the English alphabet plus five special characters. The collection is based on five main newspapers written in Boko. After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings.
Native speakers of Hausa were asked to read prompted sentences from newspaper articles. The entire collection took place in five different locations in Cameroon. In total, the corpus contains 7,895 utterances spoken by 33 male and 69 female speakers aged 16 to 60 years. The speech data contains a variety of accents: Maroua, Douala, Yaoundé, Bafoussam, Ngaoundéré, and Nigeria. The accents are documented in the speaker information files. All speech data was recorded with a headset microphone in different environmental conditions, with some slightly noisy parts. The data is sampled at 16 kHz with a resolution of 16 bits and stored in PCM encoding. The division of the Hausa GlobalPhone database into the training, development, and evaluation sets is listed in the table below.
Set          Male  Female  #utterances  #tokens  Duration
Training       24      58        5,863      40k  6 hrs 36 min
Development     4       6        1,021       6k  1 hr 02 min
Evaluation      5       5        1,011       6k  1 hr 06 min
Total          33      69        7,895      52k  8 hrs 44 min
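The figures in the table can be cross-checked with a few lines of Python. The sketch below is illustrative only and not part of the ELRA record: it sums the three splits, compares them with the reported totals, and estimates the size of the uncompressed audio from the stated 16 kHz, 16-bit PCM format. The names (SPLITS, column_sums, pcm_bytes) are hypothetical, mono audio is assumed from the general GlobalPhone description, and the size ignores any file headers and the rounding of durations to whole minutes.

```python
# Split statistics copied from the table: (male, female, utterances, minutes).
SPLITS = {
    "Training": (24, 58, 5863, 6 * 60 + 36),
    "Development": (4, 6, 1021, 1 * 60 + 2),
    "Evaluation": (5, 5, 1011, 1 * 60 + 6),
}
REPORTED_TOTAL = (33, 69, 7895, 8 * 60 + 44)

SAMPLE_RATE_HZ = 16_000   # stated in the description
BYTES_PER_SAMPLE = 2      # 16-bit resolution
CHANNELS = 1              # assumption: mono, as for GlobalPhone in general


def column_sums(splits):
    """Sum each column (speakers, utterances, minutes) across all splits."""
    return tuple(sum(column) for column in zip(*splits.values()))


def pcm_bytes(minutes):
    """Approximate size in bytes of raw PCM audio of the given duration."""
    return minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


totals = column_sums(SPLITS)
print("Totals match the table:", totals == REPORTED_TOTAL)
print("Approx. uncompressed size (bytes):", pcm_bytes(totals[3]))
print("Mean utterance length (s):", round(totals[3] * 60 / totals[2], 2))
```

Under these assumptions the splits add up to the reported totals, and the uncompressed audio comes to roughly 1 GB.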

Identifier: ELRA-S0347

ISLRN: 727-452-225-740-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0347/

Language: Hausa

Language (ISO639): hau

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0347

DateStamp: 2012-10-29

GetRecord: OAI-PMH request for simple DC format
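The GetRecord entries above refer to OAI-PMH requests. As a rough sketch of what such a request looks like, the Python snippet below builds a GetRecord URL from the standard OAI-PMH query parameters (verb, identifier, metadataPrefix). The base URL is a placeholder, since this record does not state the repository endpoint, and get_record_url is a hypothetical helper. The prefix oai_dc is the standard OAI-PMH prefix for simple Dublin Core; using olac as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint: replace with the real OAI-PMH base URL of the archive.
BASE_URL = "https://example.org/oai"
OAI_ID = "oai:catalogue.elra.info:ELRA-S0347"

print(get_record_url(BASE_URL, OAI_ID, "oai_dc"))  # simple DC format
print(get_record_url(BASE_URL, OAI_ID, "olac"))    # assumed OLAC prefix
```

The identifier is percent-encoded by urlencode, so the colons in the OAI identifier appear as %3A in the resulting URL.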

Search Info
Citation: n.a. 2012. ELRA (European Language Resources Association).
Terms: area_Africa country_NG dcmi_Sound iso639_hau olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0347
Up-to-date as of: Wed Jul 15 7:06:21 EDT 2026