OLAC Record: USC-SFI MALACH Interviews and Transcripts Czech

OLAC Record
oai:www.ldc.upenn.edu:LDC2014S04

Metadata

Title: USC-SFI MALACH Interviews and Transcripts Czech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Psutka, Josef V., et al. USC-SFI MALACH Interviews and Transcripts Czech LDC2014S04. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: Psutka, Josef V.

Psutka, Josef

Radová, Vlasta

Ircing, Pavel

Matoušek, Jindřich

Müller, Luděk

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-03-16

Description: *Introduction*

USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler’s List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah’s Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants. The Foundation’s Visual History Archive holds nearly 55,000 video testimonies in 43 languages, representing 65 countries; it is the largest archive of its kind in the world. In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed the USC Shoah Foundation Institute for Visual History and Education.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives. The focus was advancing the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak.

USC-SFI MALACH Interviews and Transcripts Czech was developed for the Czech speech recognition experiments. LDC has also released USC-SFI MALACH Interviews and Transcripts English (LDC2012S05).

*Data*

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy (e.g., airplane overflights, wind noise, background conversations and highway noise). Original interviews were recorded on Sony Beta SP tapes, then digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound files in this release are single-channel, FLAC-compressed PCM WAV files with a sampling frequency of 16 kHz.

Approximately 570 of the interviews collected by USC-SFI are in Czech, averaging approximately 2.25 hours each. The interview sessions in this release are divided into a training set (400 interviews) and a test set (20 interviews). The first fifteen minutes of the second tape from each training interview (approximately 30 total minutes of speech) were transcribed in .trs format using Transcriber 1.5.1. The test interviews were transcribed completely. Thus the corpus consists of 229 hours of speech (186 hours of training material plus 43 hours of test data) with 143 hours transcribed (100 hours of training material plus 43 hours of test data).

Certain interviews include speech from family members in addition to that of the subject and the interviewer. Accordingly, the corpus contains speech from more than 420 speakers, who are more or less equally distributed between males and females.

*Samples*

Please view this audio sample and transcript.

*Updates*

None at this time.
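The figures quoted in the Data section can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the constant and function names are illustrative and not part of the corpus documentation, and the per-interview calculation assumes that exactly fifteen minutes were transcribed for every training interview, as the description states.

```python
# Figures quoted in the Description field of this record.
TRAIN_INTERVIEWS = 400
TEST_INTERVIEWS = 20
TRAIN_HOURS = 186               # hours of training audio
TEST_HOURS = 43                 # hours of test audio (fully transcribed)
TRAIN_TRANSCRIBED_HOURS = 100   # hours of transcribed training audio
TRANSCRIBED_MINUTES_PER_TRAIN_INTERVIEW = 15


def corpus_totals():
    """Recompute the totals stated in the description (illustrative helper)."""
    return {
        "interviews": TRAIN_INTERVIEWS + TEST_INTERVIEWS,
        "hours": TRAIN_HOURS + TEST_HOURS,
        "transcribed_hours": TRAIN_TRANSCRIBED_HOURS + TEST_HOURS,
        # Transcribed training hours implied by 15 minutes per interview.
        "implied_train_transcribed_hours": (
            TRAIN_INTERVIEWS * TRANSCRIBED_MINUTES_PER_TRAIN_INTERVIEW / 60
        ),
    }


if __name__ == "__main__":
    for name, value in corpus_totals().items():
        print(f"{name}: {value}")
```

The recomputed totals agree with the 420 interviews, 229 hours of speech and 143 transcribed hours stated above, and fifteen minutes from each of 400 training interviews accounts for the 100 transcribed training hours.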

Extent: Corpus size: 14521216 KB

Format: Sampling Rate: 16000

Sampling Format: pcm

Identifier: LDC2014S04

https://catalog.ldc.upenn.edu/LDC2014S04

ISBN: 1-58563-672-X

ISLRN: 310-213-848-753-5

DOI: 10.35111/v2nt-7j09

Language: Czech

Language (ISO639): ces

License: USC-SFI MALACH Interviews and Transcripts Czech For-Profit: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-interviews-and-transcripts-czech-for-profit.pdf

USC-SFI MALACH Interviews and Transcripts Czech Non-Member: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-czech-nonmember.pdf

USC-SFI MALACH Interviews and Transcripts Czech Not-for-Profit: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-czech-nfp.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2014 USC Shoah Foundation Institute, © 2014 Trustees of the University of Pennsylvania
This USC SFI Malach Data is from the archive of the University of Southern California Shoah Foundation Institute for Visual History and Education

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014S04

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
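
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is put together, the code below builds one from the OaiIdentifier. The base URL is a placeholder, since the record does not show the actual endpoint; "oai_dc" is the simple Dublin Core prefix defined by OAI-PMH, while the "olac" prefix is assumed here to be what the repository uses for OLAC format. The function name is illustrative.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2014S04"


def build_getrecord_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL (illustrative helper).

    GetRecord takes three arguments: the verb itself, the record
    identifier, and the metadata format to return.
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "oai_dc" is the simple DC format; "olac" is an assumed prefix.
    for prefix in ("oai_dc", "olac"):
        print(build_getrecord_url(BASE_URL, OAI_IDENTIFIER, prefix))
```

The helper only constructs the URL. Fetching and parsing the XML response is left out because the endpoint is unknown.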

Search Info
Citation: Psutka, Josef V.; Psutka, Josef; Radová, Vlasta; Ircing, Pavel; Matoušek, Jindřich; Müller, Luděk. 2014. Linguistic Data Consortium.
Terms: area_Europe country_CZ dcmi_Sound dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014S04
Up-to-date as of: Wed Oct 29 7:01:26 EDT 2025
