OLAC Record: N4 NATO Native and Non-Native Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2006S13

Metadata

Title: N4 NATO Native and Non-Native Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Grieco, John, et al. N4 NATO Native and Non-Native Speech LDC2006S13. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Grieco, John

Benarousse, Laurent

Geoffrois, Edouard

Series, Robert

Steeneken, Herman

Stumpf, Hans

Swail, Carl

Thiel, Dieter

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-04-17

Description: *Introduction* N4 NATO Native and Non-Native Speech was developed by the NATO research group on Speech and Language Technology and contains approximately 9.5 hours of recorded multilingual speech and associated transcripts. The corpus was created to provide a military-oriented database for multilingual and non-native speech processing studies.

Speech technology is covering an increasing number of languages, and systems are becoming more robust with regard to speech variability such as speaking style and accents. However, for real applications, especially in a multilingual and multinational context, further robustness to regional and even non-native accents is necessary. Among the numerous corpora available for speech research, few have specifically addressed this issue. The NATO Speech and Language Technology group therefore decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries.

*Data* The database was collected in four countries (Germany, the Netherlands, the United Kingdom, and Canada) during naval communication training sessions in 2000-2002. For each country, the main part of the recordings consists of a NATO naval procedure in English, in which a typical utterance sounds like "This is alpha, whiskey, roger. I make two seven zero six hostile, two seven zero six. Out." In addition, each speaker read a text, "The North Wind and the Sun," in English and in his or her native language.

The audio material was recorded on DAT and downsampled to 16 kHz, 16 bit. All the audio files have been manually transcribed and annotated with speakers' identities using the tool Transcriber. Navy procedure recordings and text readings have been stored in different files. The first digit in the filename indicates the type of speech.

Among speech segments, the durations of Navy procedure recordings range from 1.3 to 2.3 hours per country, for a total of 7.5 hours. The durations of the native-language text readings range from 1.5 to 22.9 minutes, for a total of approximately one hour.

Duration (hours)    CA     GE     NL     UK     All
Signal            5.30   3.20   5.00   6.30   19.80
Silence           3.00   0.56   2.00   4.70   10.26
Speech            2.30   2.64   3.00   1.60    9.54
Navy proc         2.00   1.90   2.30   1.30    7.50
Read text         0.30   0.74   0.70   0.30    2.04
Non-native        0.27   0.37   0.32   0.00    0.96
Native            0.03   0.37   0.38   0.30    1.08

The database contains the following information about each speaker: gender, age, weight, length, possible speaking or hearing disorders, education level, living area, accent, second language, and the year English was learned (for non-native speakers). The speaker accents vary widely from country to country. There were 115 speakers in total, with an average age of 22.6 years. Nineteen women participated, accounting for about 17% of the speakers.

              CA      GE      NL      UK      All
#Speakers     22      51      31      11      115
#Women         5       0       9       5       19
Age        22-35   17-23   17-61   19-62    17-62
Age mean    28.3    20.1      21    27.5     22.6

*Samples* For an example of the data in this corpus, please listen to this audio sample (SPH) and view this transcript sample (TRS).

*Updates* None at this time.
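The duration figures in the description can be cross-checked mechanically. The sketch below is illustrative only and is not part of the corpus tooling: it copies the per-country values (in hours) from the description and verifies that each row sums to its "All" column. It also assumes that Signal splits into Silence plus Speech, Speech into Navy proc plus Read text, and Read text into Non-native plus Native. That nesting is inferred from the numbers and is not stated in the record. The names `durations`, `row_total_ok`, and `split_ok` are hypothetical.

```python
# Durations in hours, copied from the description (columns: CA, GE, NL, UK, All).
durations = {
    "Signal":     [5.30, 3.20, 5.00, 6.30, 19.80],
    "Silence":    [3.00, 0.56, 2.00, 4.70, 10.26],
    "Speech":     [2.30, 2.64, 3.00, 1.60, 9.54],
    "Navy proc":  [2.00, 1.90, 2.30, 1.30, 7.50],
    "Read text":  [0.30, 0.74, 0.70, 0.30, 2.04],
    "Non-native": [0.27, 0.37, 0.32, 0.00, 0.96],
    "Native":     [0.03, 0.37, 0.38, 0.30, 1.08],
}

TOLERANCE = 0.005  # half of the last printed digit


def close(a, b):
    """Compare two durations, allowing for rounding in the printed table."""
    return abs(a - b) <= TOLERANCE


def row_total_ok(row):
    """Check that the four country values add up to the 'All' column."""
    *countries, total = row
    return close(sum(countries), total)


def split_ok(parent, first, second):
    """Check that a parent row equals the sum of two child rows, column by column."""
    return all(
        close(p, a + b)
        for p, a, b in zip(durations[parent], durations[first], durations[second])
    )


totals_ok = all(row_total_ok(row) for row in durations.values())
hierarchy_ok = (
    split_ok("Signal", "Silence", "Speech")
    and split_ok("Speech", "Navy proc", "Read text")
    and split_ok("Read text", "Non-native", "Native")
)

print("row totals consistent:", totals_ok)
print("assumed nesting consistent:", hierarchy_ok)
```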

Extent: Corpus size: 2243952 KB

Format: Sampling Rate: 16000

Sampling Format: pcm
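As a rough plausibility check, the listed corpus size can be compared with the size of raw audio implied by the format fields. This is a sketch under stated assumptions, not a statement about the actual file layout: it takes the 16 kHz rate from the Format field, the 16-bit sample width and the 19.80 hours of total signal from the Description, and assumes single-channel audio and 1 KB = 1024 bytes. File headers and transcripts are ignored.

```python
SAMPLE_RATE_HZ = 16_000     # from the Format field
BYTES_PER_SAMPLE = 2        # 16-bit samples, per the Description
CHANNELS = 1                # assumption: mono recordings
SIGNAL_HOURS = 19.80        # total signal duration, per the Description
LISTED_SIZE_KB = 2_243_952  # from the Extent field


def pcm_bytes(hours, rate_hz, bytes_per_sample, channels):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return round(hours * 3600 * rate_hz * bytes_per_sample * channels)


estimated_bytes = pcm_bytes(SIGNAL_HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
estimated_kb = estimated_bytes / 1024  # assumption: KB means 1024 bytes
ratio = estimated_kb / LISTED_SIZE_KB

print(f"estimated raw audio: {estimated_kb:,.0f} KB")
print(f"listed corpus size:  {LISTED_SIZE_KB:,} KB")
print(f"estimate / listed:   {ratio:.3f}")
```

Under these assumptions the estimate comes out within about one percent of the listed size, which is consistent with the corpus consisting mostly of uncompressed audio.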

Identifier: LDC2006S13

https://catalog.ldc.upenn.edu/LDC2006S13

ISBN: 1-58563-344-5

ISLRN: 632-458-830-271-0

DOI: 10.35111/rz1e-1575

Language: Dutch

English

German

Language (ISO639): nld

eng

deu

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

N4 NATO Native and Non-Native Speech Agreement: https://catalog.ldc.upenn.edu/license/n4-nato-native-and-non-native-speech.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006S13

Rights Holder: Portions © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006S13

DateStamp: 2021-07-02

GetRecord: OAI-PMH request for simple DC format
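The two GetRecord entries correspond to OAI-PMH requests that differ only in their `metadataPrefix` argument. The sketch below shows how such a request URL is assembled from the identifier in this record. The base URL is a placeholder, since the repository endpoint is not given here. `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core, while `olac` as the prefix for the OLAC format is an assumption to confirm with a ListMetadataFormats request. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2006S13"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")  # simple DC
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")  # assumed OLAC prefix

print(dc_url)
print(olac_url)
```

`urlencode` percent-encodes the colons in the identifier, so the identifier can be passed as it appears in the OaiIdentifier field.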

Search Info
Citation: Grieco, John; Benarousse, Laurent; Geoffrois, Edouard; Series, Robert; Steeneken, Herman; Stumpf, Hans; Swail, Carl; Thiel, Dieter. 2006. Linguistic Data Consortium.
Terms: area_Europe country_DE country_GB country_NL dcmi_Sound iso639_deu iso639_eng iso639_nld olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006S13
Up-to-date as of: Wed Oct 29 7:00:55 EDT 2025
