OLAC Record: The EMILLE/CIIL Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-W0037

Metadata

Title: The EMILLE/CIIL Corpus

Access Rights: Rights available for: nonCommercialUse

Date Available (W3CDTF): 2004-09-15

Date Issued (W3CDTF): 2004-09-15

Date Modified (W3CDTF): 2009-03-06

Description: The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, containing written and (for some languages) spoken data, one for each of fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts of speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. ‘Developing Asian language corpora: standards and practice’ in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038, The EMILLE Lancaster Corpus.

Identifier: ELRA-W0037

ISLRN: 039-846-040-604-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0037/

Language: Tamil

Sinhala; Sinhalese

Urdu

Panjabi; Punjabi

English

Kannada

Telugu

Marathi

Bengali

Kashmiri

Gujarati

Malayalam

Assamese

Oriya (macrolanguage)

Hindi

Language (ISO639): tam

sin

urd

pan

eng

kan

tel

mar

ben

kas

guj

mal

asm

ori

hin

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0037

DateStamp: 2004-09-15

GetRecord: OAI-PMH request for simple DC format
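
The two GetRecord links above correspond to standard OAI-PMH requests that differ only in their metadataPrefix. The sketch below shows how such request URLs could be assembled. It is a minimal illustration under stated assumptions: the base URL is a hypothetical placeholder (this record does not state the repository endpoint), the helper name get_record_url is invented for the example, and "olac" is assumed to be the prefix the repository uses for the OLAC format; "oai_dc" is the prefix OAI-PMH defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0037"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (illustrative helper)."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ':'.
    return f"{base_url}?{urlencode(params)}"


# Request for the OLAC format (prefix assumed to be "olac").
olac_url = get_record_url(OAI_IDENTIFIER, "olac")

# Request for the simple DC format.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing BASE_URL with the repository's actual endpoint would be needed before either URL could be fetched.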

Search Info
Citation: n.a. 2004. ELRA (European Language Resources Association).
Terms: area_Asia area_Europe country_BD country_GB country_IN country_LK country_PK dcmi_Text iso639_asm iso639_ben iso639_eng iso639_guj iso639_hin iso639_kan iso639_kas iso639_mal iso639_mar iso639_ori iso639_pan iso639_sin iso639_tam iso639_tel iso639_urd olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0037
Up-to-date as of: Wed Jul 15 7:04:38 EDT 2026
