OLAC Record: NetDC Arabic BNSC (Broadcast News Speech Corpus)

OLAC Record
oai:catalogue.elra.info:ELRA-S0157

Metadata

Title: NetDC Arabic BNSC (Broadcast News Speech Corpus)

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2007-02-08

Date Issued (W3CDTF): 2007-02-08

Date Modified (W3CDTF): 2017-06-01

Description: The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was carried out in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period between November 2001 and January 2002 (37 news broadcasts, including 32 from the 5.55 pm news and 5 from the 10.55 pm news, with about 90 distinct speakers identified). The language is Standard Arabic from the Middle East region. The database is stored on 1 DVD-ROM. The database was validated by SPEX, the Netherlands, to assess its compliance with NetDC specifications. Recordings were made through a Sangean ATS 909 radio receiver connected to a desktop PC. Encoding is 16 kHz, 16 bits, single channel. Format is raw PCM (.wav) with header information. The corpus was segmented, labelled and transcribed manually using the “Transcriber” software, developed by DGA (Délégation Générale pour l'Armement, France) and LDC (Linguistic Data Consortium, USA) (with an additional patch for Arabic). The transcriptions were done in Arabic characters and the software automatically generated the transliterations. Transcriptions include speaker turns, topics, and channel information. Each speech file (extension .wav) has an accompanying ASCII SAM label file with recording information (extension .sam), and an accompanying file with the transcription in XML format (extension .trs) and channel information. A phonetic lexicon in Arabic SAMPA has also been included.
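
The encoding figures in the description (16 kHz, 16 bits, single channel, ca. 22.5 hours) are enough to estimate the size of the audio payload and to sanity-check an individual file. The following is a minimal Python sketch under stated assumptions, not part of the corpus distribution: the function names `pcm_bytes` and `matches_spec` are hypothetical, header and label-file overhead is ignored, and the .wav files are assumed to be readable by Python's standard `wave` module.

```python
import wave

# Figures taken from the record's description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz
SAMPLE_WIDTH_BYTES = 2    # 16 bits
CHANNELS = 1              # single channel


def pcm_bytes(hours: float) -> int:
    """Approximate size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def matches_spec(path: str) -> bool:
    """Return True if a WAV file's header matches the documented encoding.

    Assumption: the file carries a standard WAV header that `wave` can parse.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == SAMPLE_RATE_HZ
            and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == CHANNELS
        )


# ca. 22.5 hours -> 81,000 s * 16,000 samples/s * 2 bytes
print(pcm_bytes(22.5))  # 2592000000
```

Roughly 2.6 GB of audio is consistent with the statement that the database is stored on 1 DVD-ROM, since a single-layer DVD holds about 4.7 GB.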

Identifier: ELRA-S0157

ISLRN: 663-177-513-755-1

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0157/

Language: Arabic

Language (ISO639): ara

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0157

DateStamp: 2007-02-08

GetRecord: OAI-PMH request for simple DC format
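
The two GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple Dublin Core format. As a rough illustration, such a request URL could be assembled as shown below. This sketch makes assumptions: the record does not state the repository's OAI-PMH base URL, so `BASE_URL` is a hypothetical placeholder, and `get_record_url` is an illustrative helper name. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, with `olac` and `oai_dc` as the usual prefixes for the two formats.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"

# OAI identifier as listed in the record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-S0157"


def get_record_url(identifier: str, metadata_prefix: str, base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and one metadata format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# One request per metadata format mentioned in the record.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

Replace the placeholder base URL with the endpoint actually advertised by the archive before issuing the request.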

Search Info
Citation: n.a. 2007. ELRA (European Language Resources Association).
Terms: dcmi_Sound iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0157
Up-to-date as of: Wed Oct 1 0:54:03 EDT 2025
