OLAC Record: Training and test data for Arabizi detection and transliteration

OLAC Record
oai:catalogue.elra.info:ELRA-W0126

Metadata

Title: Training and test data for Arabizi detection and transliteration

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2018-06-06

Date Issued (W3CDTF): 2018-06-06

Description: The dataset is composed of two distinct resources: 1) A collection of mixed English and Arabizi text intended to train and test a system for the automatic detection of code-switching in mixed English and Arabizi texts. The training part of the corpus contains 522 tweets composed of 5,207 tokens (3,307 English tokens, 1,203 Arabizi tokens and 697 other tokens). Tokens are manually labelled as English (“e”), Arabizi (“a”), or other (“o”). The testing part contains 475 tweets composed of 3,533 tokens (803 English tokens, 1,965 Arabizi tokens and 765 other tokens). 2) A set of 3,452 Arabizi tokens manually transliterated into Arabic, and a set of 127 Arabizi tweets containing 1,385 words, also manually transliterated into Arabic. This second resource is intended to train and test a system that performs Arabizi-to-Arabic transliteration.
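
The token counts in the description can be checked for internal consistency and turned into per-label shares. The following is a minimal sketch in Python, not part of the ELRA distribution: the figures are copied from the description, while the names `SPLITS` and `label_shares` are hypothetical and chosen only for illustration.

```python
# Token counts as reported in the record's description.
# Keys "e", "a", "o" follow the label scheme: English, Arabizi, other.
SPLITS = {
    "train": {"tweets": 522, "total": 5207, "e": 3307, "a": 1203, "o": 697},
    "test": {"tweets": 475, "total": 3533, "e": 803, "a": 1965, "o": 765},
}


def label_shares(split):
    """Return the fraction of tokens carrying each label in one split."""
    counts = {label: split[label] for label in ("e", "a", "o")}
    # The per-label counts should add up to the reported token total.
    assert sum(counts.values()) == split["total"]
    return {label: n / split["total"] for label, n in counts.items()}


for name, split in SPLITS.items():
    shares = label_shares(split)
    print(name, {label: f"{share:.1%}" for label, share in shares.items()})
```

The per-label counts sum to the reported totals in both parts. The two parts have different label distributions: English tokens are the majority in the training part (about 64%), while Arabizi tokens are the majority in the testing part (about 56%).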

Identifier: ELRA-W0126

ISLRN: 986-364-744-303-9

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0126/

Language: English

Language: Arabic

Language (ISO639): eng

Language (ISO639): ara

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0126

DateStamp: 2018-06-06

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: n.a. 2018. ELRA (European Language Resources Association).
Terms: area_Europe country_GB dcmi_Text iso639_ara iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0126
Up-to-date as of: Wed Oct 1 0:56:25 EDT 2025
