OLAC Record

Title:Arabic Broadcast News Transcripts
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Maamouri, Mohamed, David Graff, and Christopher Cieri. Arabic Broadcast News Transcripts LDC2006T20. Web Download. Philadelphia: Linguistic Data Consortium, 2006
Contributor:Maamouri, Mohamed
Graff, David
Cieri, Christopher
Date (W3CDTF):2006
Date Issued (W3CDTF):2006-12-19
Description:*Introduction* This data set consists of eight text files containing transcripts for Voice of America satellite radio news broadcasts in Arabic. The broadcasts were recorded by the Linguistic Data Consortium at transmission time between June 2000 and January 2001. Six broadcasts are 60 minutes long, and two broadcasts are 120 minutes long. The file names indicate the date (YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. LDC released the corresponding speech as Arabic Broadcast News Speech (LDC2006S46). Both corpora are also available as a single corpus from ELRA as NetDC Arabic BNSC (Broadcast News Speech Corpus) (ELRA-S0157). *Data* The files are encoded entirely in ASCII; Buckwalter transliteration is used to render the Arabic text content. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with the first character of the line being an open angle bracket. The lines of transcription text (i.e. the speech and annotation content between the time-stamp tags) all begin with a single space character and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh", etc.). *Samples* For an example of the data contained in this corpus, please examine this screenshot of the transcription.
Extent:Corpus size: 3584 KB
ISBN: 1-58563-420-4
ISLRN: 476-762-568-967-9
Language:Standard Arabic
Language (ISO639):arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2006T20
Rights Holder:Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania
Type (DCMI):Text
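
The line layout given in the Description (tags starting with an open angle bracket, token lines starting with a single space, annotations wrapped in "(%" and ")") can be read with a short line classifier. The following is a minimal sketch in Python, not part of the corpus distribution: the function name `classify_lines` and the sample tag and tokens are hypothetical, and the actual tag names and attributes are not specified in this record.

```python
import re
from typing import Iterable, Iterator, Tuple

# An annotation token is a single word enclosed by "(%" and ")".
ANNOTATION = re.compile(r"^\(%(.+)\)$")


def classify_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) pairs, where kind is 'tag', 'annotation' or 'token'.

    Assumes the layout described in the record: one tag or one token per line.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Test the first character BEFORE removing the leading space:
        # Buckwalter transliteration itself uses "<" and ">" as letters,
        # so a token line such as " <lY" must not be mistaken for a tag.
        if line.startswith("<"):
            yield ("tag", line)
        elif line.startswith(" "):
            token = line[1:]
            match = ANNOTATION.match(token)
            if match:
                yield ("annotation", match.group(1))
            elif token:
                yield ("token", token)
        # Lines matching neither pattern (e.g. blank lines) are skipped.


if __name__ == "__main__":
    # Hypothetical sample lines; the tag syntax here is invented for illustration.
    sample = ["<time 1.0>\n", " (%music)\n", " <lY\n", " .\n"]
    for kind, value in classify_lines(sample):
        print(kind, value)
```

Deciding on the first character of the raw line, rather than on the stripped token, is what keeps Buckwalter letters such as "<" from being confused with markup.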


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2006T20
DateStamp:  2015-02-18
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Maamouri, Mohamed; Graff, David; Cieri, Christopher. 2006. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb

Up-to-date as of: Thu Feb 19 0:21:52 EST 2015