OLAC Record
oai:catalogue.elra.info:ELRA-W0038

Metadata
Title:The EMILLE Lancaster Corpus
Abstract:The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
Access Rights:Rights available for: Commercial Use
Date Available (W3CDTF):2004-09-15
Date Issued (W3CDTF):2004-03-17
Date Modified (W3CDTF):2009-03-06
Description:Written Corpora
The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora. There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu. The EMILLE monolingual corpora contain approximately 58,880,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. "Developing Asian language corpora: standards and practice" in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.
This database is available only for commercial use. For research use by academic organisations, a more complete set of the EMILLE Lancaster Corpus is available under the reference ELRA-W0037 (The EMILLE/CIIL Corpus).
Identifier:ELRA-W0038
http://catalog.elra.info/product_info.php?products_id=714
Language:Bengali
Gujarati
Hindi
Panjabi, Punjabi
Sinhala; Sinhalese
Tamil
Urdu
Language (ISO639):ben
guj
hin
pan
sin
tam
urd
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0038
DateStamp:  2004-09-15
GetRecord:  OAI-PMH request for simple DC format
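
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a minimal sketch of how such a request URL might be assembled, the Python below uses the standard `verb`, `identifier` and `metadataPrefix` arguments of OAI-PMH. The base URL is a placeholder, since this record does not state the archive's actual endpoint, and the `olac` prefix is an assumption about how the repository names its OLAC format; `get_record_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the record does not give the
# archive's real OAI-PMH base URL, so substitute the correct one.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0038"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "oai_dc" is the simple Dublin Core prefix every OAI-PMH repository supports.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

# "olac" is assumed here to be the prefix for the OLAC format.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

Note that `urlencode` percent-encodes the colons in the identifier, which is the safe form for a query string. Fetching the resulting URL would return an XML response, provided the endpoint and prefix match the real repository.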

Search Info

Citation: n.a. 2004. ELRA (European Language Resources Association).
Terms: area_Asia country_BD country_IN country_LK country_PK dcmi_Text iso639_ben iso639_guj iso639_hin iso639_pan iso639_sin iso639_tam iso639_urd olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0038
Up-to-date as of: Mon Feb 27 0:30:30 EST 2017