OLAC Record: Tham Khasi annotated corpus

OLAC Record
oai:catalogue.elra.info:ELRA-W0321

Metadata

Title: Tham Khasi annotated corpus

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2022-03-09

Date Issued (W3CDTF): 2022-03-09

Description: The Tham Khasi annotated corpus is a corpus of Khasi, an Austro-Asiatic language. It comprises Khasi sentences extracted from textbooks prescribed for students at the secondary, higher secondary, graduation, and post-graduation levels in the year 2015-2016. In the corpus, words are separated by spaces and each sentence ends with an end-of-sentence marker such as a period (.), a question mark (?), or an exclamation mark (!). The sentences are manually tagged for parts of speech using the BIS (Bureau of Indian Standards) tagset, which is the standard annotation scheme prescribed for Indian languages. The corpus contains 83,312 words, 4,386 sentences, and 5,465 word types, amounting to 94,651 tokens (including punctuation). The corpus is provided as a single file in text format.
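
The layout described above (space-separated tokens, sentences closed by an end-of-sentence marker) suggests how the reported counts could be recomputed. The following is a minimal sketch, not the publisher's tooling. It assumes that punctuation marks appear as standalone tokens, that only ".", "?" and "!" end a sentence, and that a "word" is any token containing a letter or digit. The record does not say how POS tags are attached to words, so the sketch runs on untagged text. The function name corpus_stats and the sample string are hypothetical; the sample uses placeholder tokens, not real Khasi.

```python
# Assumed end-of-sentence markers, taken from the record's description.
END_MARKERS = {".", "?", "!"}


def corpus_stats(text):
    """Count tokens, words, sentences and word types in space-separated text."""
    tokens = text.split()

    # Group tokens into sentences, closing a sentence at each end marker.
    sentences = []
    current = []
    for tok in tokens:
        current.append(tok)
        if tok in END_MARKERS:
            sentences.append(current)
            current = []
    if current:  # trailing material without an end marker
        sentences.append(current)

    # Assumption: a word is any token with at least one alphanumeric character.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]

    return {
        "tokens": len(tokens),
        "words": len(words),
        "sentences": len(sentences),
        "types": len({w.lower() for w in words}),
    }


# Placeholder tokens standing in for corpus text (not real Khasi).
sample = "w1 w2 w3 . w4 w2 w5 ? w1 w2 w3 w6 !"
stats = corpus_stats(sample)
print(stats)
```

On the real file, the tag notation would have to be stripped first, and the counts would match the published figures only if the same definitions of word and type were used.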

Identifier: ELRA-W0321

ISLRN: 926-738-235-188-8

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0321/

Language: Khasi

Language (ISO639): kha

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0321

DateStamp: 2022-03-09

GetRecord: OAI-PMH request for simple DC format
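
The GetRecord entries in this record refer to OAI-PMH requests. As an illustration, the sketch below builds such a request URL from the standard OAI-PMH query parameters (verb, identifier, metadataPrefix). The base URL is a placeholder, since the record does not state the archive's OAI-PMH endpoint. The function name get_record_url is hypothetical, and "olac" as the metadata prefix for the OLAC format is an assumption based on common OLAC practice; "oai_dc" is the prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0321"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC request, and an OLAC-format request (prefix assumed to be "olac").
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
print(dc_url)
print(olac_url)
```

The response to either request is an XML document, so any client would also need an XML parser to read the returned fields.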

Search Info
Citation: n.a. 2022. ELRA (European Language Resources Association).
Terms: area_Asia country_IN dcmi_Text iso639_kha olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0321
Up-to-date as of: Wed Oct 1 0:57:33 EDT 2025
