OLAC Record: Turkish Broadcast News Speech and Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2012S06

Metadata

Title: Turkish Broadcast News Speech and Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Saraçlar, Murat. Turkish Broadcast News Speech and Transcripts LDC2012S06. Web Download. Philadelphia: Linguistic Data Consortium, 2012

Contributor: Saraçlar, Murat

Date (W3CDTF): 2012

Date Issued (W3CDTF): 2012-05-16

Description: *Introduction* Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey, and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed to facilitate research in Turkish automatic speech recognition and its applications, such as speech retrieval. The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed. Speech recognition and retrieval experiments using the larger corpus are described in the following journal article: Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak, and Murat Saraclar, Turkish Broadcast News Transcription and Retrieval, IEEE Transactions on Audio, Speech and Language Processing, 17(5):874-883, July 2009. For more information please visit http://busim.ee.boun.edu.tr/~speech or contact the principal investigator, Murat Saraçlar.
*Data* The data was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. The segmentation occurred in two steps: an initial automatic segmentation, followed by manual correction and annotation that included information such as background conditions and speaker boundaries. The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data. The manual segmentations and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
*Samples* Please follow the links below for samples: * Audio * Transcript
*Sponsorship* Funding for this corpus collection effort came from TUBITAK Project 105E102 and Boğaziçi University Research Fund Project 05HA202.
*Updates* None at this time.
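
Because the transcriptions are distributed in ISO-8859-9 (Latin5), a tool that assumes Latin-1 or UTF-8 will misread the Turkish letters ğ, ı, ş, İ, Ş and Ğ. The following is a minimal sketch of converting transcript bytes to UTF-8 with Python's standard codecs. The function name and the sample bytes are illustrative assumptions and are not part of the corpus documentation.

```python
def latin5_to_utf8(raw: bytes) -> bytes:
    """Decode ISO-8859-9 (Latin5) bytes and re-encode them as UTF-8."""
    # "iso8859_9" is the Python codec name for ISO-8859-9; "latin5" is an alias.
    return raw.decode("iso8859_9").encode("utf-8")


# Illustrative input: 0xF0 is "ğ" and 0xE7 is "ç" in ISO-8859-9.
sample = b"Bo\xf0azi\xe7i"
print(latin5_to_utf8(sample).decode("utf-8"))

# Decoding the same bytes as Latin-1 would yield "ð" instead of "ğ",
# which is why the codec must be named explicitly.
print(sample.decode("latin-1"))
```

For whole files, the same effect can be had by opening the transcript with `open(path, encoding="iso8859_9")` and writing it back out with `encoding="utf-8"`.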

Extent: Corpus size: 14566923 KB

Format: Sampling Rate: 16000

Format: Sampling Format: pcm
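
As a rough plausibility check, the listed extent can be compared with the size implied by the duration and sampling rate. The sketch below assumes 16-bit mono PCM and 1 KB = 1024 bytes. None of these three assumptions is stated in the record, and the estimate ignores transcripts, documentation and file headers.

```python
HOURS = 130               # "approximately 130 hours" from the Description
SAMPLE_RATE_HZ = 16_000   # Format: Sampling Rate
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM (bit depth is not stated)
CHANNELS = 1              # assumption: mono
BYTES_PER_KB = 1024       # assumption: binary kilobytes


def estimated_audio_kb(hours: int) -> int:
    """Estimate uncompressed audio size in KB for the given duration."""
    total_bytes = hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return total_bytes // BYTES_PER_KB


listed_kb = 14_566_923    # Extent: Corpus size
estimate_kb = estimated_audio_kb(HOURS)
print(estimate_kb, round(listed_kb / estimate_kb, 3))
```

Under these assumptions the estimate comes to 14,625,000 KB, within about half a percent of the listed 14,566,923 KB, so the extent is consistent with roughly 130 hours of uncompressed audio.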

Identifier: LDC2012S06

https://catalog.ldc.upenn.edu/LDC2012S06

ISBN: 1-58563-614-2

ISLRN: 831-432-792-126-2

DOI: 10.35111/zxev-1k65

Language: Turkish

Language (ISO639): tur

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2012S06

Rights Holder: Portions © 2012 Murat Saraçlar, Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2012S06

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
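
The GetRecord entries above correspond to OAI-PMH requests that differ only in their metadata prefix. The sketch below shows how such a request URL is assembled from the standard `verb`, `identifier` and `metadataPrefix` parameters. The base URL is a hypothetical placeholder, since the record does not give the repository endpoint. `oai_dc` is the prefix the protocol defines for simple Dublin Core, while `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2012S06"
print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```

The colons in the identifier are percent-encoded as `%3A` by `urlencode`, which keeps the query string valid.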

Search Info
Citation: Saraçlar, Murat. 2012. Linguistic Data Consortium.
Terms: area_Asia country_TR dcmi_Sound iso639_tur olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2012S06
Up-to-date as of: Wed Oct 29 7:01:20 EDT 2025
