OLAC Record: English-Urdu Religious Parallel Corpus

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2582

Metadata

Title: English-Urdu Religious Parallel Corpus

Bibliographic Citation: http://hdl.handle.net/11234/1-2582

Creator: Jawaid, Bushra

Creator: Zeman, Daniel

Date (W3CDTF): 2018-01-05T15:38:19Z

Date Available: 2018-01-05T15:38:19Z

Description: The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with statistical machine translation. Our modifications of the crawled data include, but are not limited to, the following: 1- Manually corrected sentence alignment of the corpora. 2- Our data split (training-development-test), so that our published experiments can be reproduced. 3- Tokenization (optional, but needed to reproduce our experiments). 4- Normalization (optional), e.g. of European vs. Urdu numerals and European vs. Urdu punctuation, and removal of Urdu diacritics.
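
The normalization step named in the description can be sketched roughly as follows. This is only an illustration, not the authors' script: the function name normalize_urdu, the choice to map Urdu forms to their European counterparts (rather than the reverse), and the exact character sets covered are all assumptions.

```python
import re

# Assumption: Urdu numerals are the Extended Arabic-Indic digits
# U+06F0..U+06F9, mapped here to ASCII 0-9.
DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}

# Assumption: a small set of Urdu punctuation marks mapped to
# European equivalents (full stop, comma, question mark, semicolon).
PUNCT_MAP = {
    0x06D4: ".",  # ARABIC FULL STOP
    0x060C: ",",  # ARABIC COMMA
    0x061F: "?",  # ARABIC QUESTION MARK
    0x061B: ";",  # ARABIC SEMICOLON
}

# Assumption: "diacritics" means the combining marks U+064B..U+0652
# plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_urdu(text: str) -> str:
    """Strip diacritics, then map Urdu digits and punctuation to European ones."""
    text = DIACRITICS.sub("", text)
    return text.translate({**DIGIT_MAP, **PUNCT_MAP})
```

Because the description marks normalization as optional, a step like this would be applied only when reproducing experiments that relied on normalized text.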

Identifier (URI): http://hdl.handle.net/11234/1-2582

Language: English

Language: Urdu

Language (ISO639): eng

Language (ISO639): urd

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Rights: http://creativecommons.org/licenses/by-nc-sa/4.0/

Subject: parallel corpus

Subject: religious text

Subject: machine translation

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-2582

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
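
For readers unfamiliar with the GetRecord entries above, the following sketch shows how an OAI-PMH GetRecord request URL is typically assembled from the record's OAI identifier. The base URL here is a placeholder, not the archive's actual endpoint, and the helper name get_record_url is hypothetical. The verb, identifier, and metadataPrefix parameters come from the OAI-PMH protocol; "oai_dc" is the standard prefix for simple Dublin Core, and "olac" is assumed here for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder only: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai/request"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord request URL for one record in one metadata format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Identifier taken from this record; the prefixes are described above.
dc_url = get_record_url(BASE_URL, "oai:lindat.mff.cuni.cz:11234/1-2582")
olac_url = get_record_url(BASE_URL, "oai:lindat.mff.cuni.cz:11234/1-2582", "olac")
```

The returned string can be passed to any HTTP client; the response is an XML document containing the record in the requested format.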

Search Info
Citation: Jawaid, Bushra; Zeman, Daniel. 2018. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia area_Europe country_GB country_PK dcmi_Text iso639_eng iso639_urd olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2582
Up-to-date as of: Mon Jun 16 1:05:14 EDT 2025
