OLAC Language Extension

Date issued:	2008-02-22
Status of document:	Recommendation. This document embodies an OLAC consensus concerning best current practice.
This version:	http://www.language-archives.org/REC/language-20080222.html
Latest version:	http://www.language-archives.org/REC/language.html
Previous version:	http://www.language-archives.org/REC/language-20071114.html
Abstract:	This document specifies the metadata extension used by OLAC to uniquely identify languages. It uses the three-letter identifiers of ISO 639-3 as its controlled vocabulary.
Editors:	Gary Simons, SIL International (mailto:gary_simons@sil.org); Steven Bird, University of Melbourne and University of Pennsylvania (mailto:sb@csse.unimelb.edu.au)
Changes since previous version:	The name of the extension was changed back to olac:language. The vocabulary was expanded to include valid codes from all parts of ISO 639, and the document now describes the normalization process and the validation service performed by OLAC.

Copyright © 2008 Gary Simons (SIL International) and Steven Bird (University of Melbourne and University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Introduction
The olac:language extension
Looking up the language for a code
Looking up the code for a language

References

1. Introduction

Identifying the specific languages involved is an important dimension of language resource description. However, using the character-string representation of language names as identifiers is problematic for several reasons:

Different languages (in different parts of the world) may have the same name.
The same language may have different names in different languages.
The same language may have different names in the various countries where it is spoken.
Within the same country, the preferred name for a language may change over time.
In the early history of discovering languages (before their names were standardized), different people referred to the same language by different names.
For languages having non-Roman orthographies, the language name may have several possible romanizations.

These facts taken together mean that identifying languages by name will not work. Rather, what is needed is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier. For a deeper discussion of these issues see [CS2000] and [Simons2000].

The information technology community has a well-established standard for language identification, namely, ISO 639. Part 1 of the standard specifies two-letter codes for identifying about 180 of the world's major languages; Part 2 specifies three-letter codes for identifying approximately 400 languages [ISO639-2]. These code sets in turn form the core of the standard followed by the Internet Engineering Task Force (IETF), namely, RFC 3066 [RFC3066]. This is the standard used for language identification in the xml:lang attribute of XML [XML-lang]. The Dublin Core Metadata Initiative [DCMT] defines encoding schemes dcterms:ISO639-2 and dcterms:RFC3066 for use in its Language element.

The above standards fall short of the coverage required by the language archiving community since they focus on the major languages of the world that are most frequently represented in the total body of the world's literature. A new Part 3 of the standard [ISO639-3], adopted in 2007, has the purpose of defining three-letter identifiers for all known human languages. With over 7,500 codes, it attempts to provide a comprehensive enumeration of languages, including living, extinct, ancient, and constructed languages, whether major or minor. This is the standard that forms the basis for the OLAC recommendation on language identification in language resource description. The Dublin Core Metadata Initiative [DCMT] also recognizes dcterms:ISO639-3 as an encoding scheme for use in its Language element.

2. The olac:language extension

The complete vocabulary of language identifiers recommended for use by OLAC consists of all active codes for individual languages from any part of ISO 639. In the case of codes from Part 1 or Part 2, the OLAC harvester will normalize these to the equivalent Part 3 code before storing the record. The equivalencies are shown in the code tables at the web site of the [ISO639-3] Registration Authority. Every two-letter code from Part 1 has a three-letter equivalent in Part 3. Part 2 codes for individual languages are identical to Part 3 codes except in the case of around 20 languages for which Part 2 has both a "bibliographic" code and a "terminological" code. Part 3 matches the terminological set; thus any bibliographic code is normalized to its terminological equivalent. The OLAC harvester also normalizes language codes supplied in upper case to their lower case equivalents.

Following the extension mechanism defined in [OLAC-Metadata], a language identifier is expressed as the value of the olac:code attribute and the extension olac:language is named in the xsi:type attribute. A language identifier may be used with the <dc:language> element to identify a language that a resource is written or spoken in. Thus the following six elements are recognized as equivalent ways of specifying a document written in the German language:

<dc:language xsi:type="olac:language" olac:code="de"/>
<dc:language xsi:type="olac:language" olac:code="deu"/>
<dc:language xsi:type="olac:language" olac:code="ger"/>
<dc:language xsi:type="olac:language" olac:code="DE"/>
<dc:language xsi:type="olac:language" olac:code="DEU"/>
<dc:language xsi:type="olac:language" olac:code="GER"/>

All of the above are normalized by the OLAC harvester to the equivalent lower-case Part 3 identifier:

<dc:language xsi:type="olac:language" olac:code="deu"/>
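The normalization described above can be sketched as follows. This is a minimal illustration, not the OLAC harvester's actual code: the function name normalize_code and the two lookup tables are hypothetical, and the tables hold only a small illustrative subset of the equivalences published in the ISO 639-3 code tables.

```python
# Illustrative subset only; the full equivalences are published in the
# ISO 639-3 code tables. Both tables are assumptions made for this sketch.
PART1_TO_PART3 = {"de": "deu", "fr": "fra", "en": "eng"}
PART2_BIBLIOGRAPHIC_TO_PART3 = {"ger": "deu", "fre": "fra"}


def normalize_code(code):
    """Return the lower-case Part 3 identifier for a Part 1, 2, or 3 code."""
    lowered = code.lower()
    if len(lowered) == 2:
        # Every Part 1 code has a Part 3 equivalent, but this sketch
        # only knows the few listed above.
        if lowered not in PART1_TO_PART3:
            raise KeyError(f"two-letter code {lowered!r} is not in the illustrative table")
        return PART1_TO_PART3[lowered]
    # A bibliographic Part 2 code maps to its terminological equivalent;
    # any other three-letter code is already in Part 3 form.
    return PART2_BIBLIOGRAPHIC_TO_PART3.get(lowered, lowered)


# The six German examples above all collapse to one identifier.
print({normalize_code(c) for c in ["de", "deu", "ger", "DE", "DEU", "GER"]})
# prints {'deu'}
```

A table-driven lookup is used here because the equivalences are a fixed, published mapping rather than something that can be computed from the code itself.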

Similarly, a language identifier may be used with the <dc:subject> element to identify a language that a resource is about. In both the Language element and the Subject element, free text in the element content may be used to identify the specific variety of the language. For instance, the following indicates that a resource is about the Lau language of Solomon Islands and, specifically, the Suafa dialect:

<dc:subject xsi:type="olac:language" olac:code="llu">Suafa dialect</dc:subject>
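As a rough illustration of how a consumer of such metadata might read the code and the free-text variety from an element, the sketch below parses the example with Python's standard library. The function name read_language is hypothetical, and the namespace URIs are assumptions: they are the ones commonly declared in OLAC 1.1 records and are written out here only so that the fragment parses on its own. The comparison against the literal string "olac:language" further assumes that the record binds the olac prefix to the OLAC namespace.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for this sketch.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OLAC = "http://www.language-archives.org/OLAC/1.1/"

RECORD = (
    f'<dc:subject xmlns:dc="{DC}" xmlns:xsi="{XSI}" xmlns:olac="{OLAC}" '
    'xsi:type="olac:language" olac:code="llu">Suafa dialect</dc:subject>'
)


def read_language(element):
    """Return (code, free text) for an olac:language element, else None."""
    if element.get(f"{{{XSI}}}type") != "olac:language":
        return None
    code = element.get(f"{{{OLAC}}}code")
    variety = (element.text or "").strip()
    return code, variety


print(read_language(ET.fromstring(RECORD)))
# prints ('llu', 'Suafa dialect')
```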

The formal definition of the vocabulary is in the following XML schema which conforms to the conventions for an OLAC metadata extension as defined in [OLAC-Metadata]:

http://www.language-archives.org/OLAC/1.1/olac-language.xsd

The schema requires only that the code value be a string of two or three letters, either upper case or lower case. Validity of the actual identifiers used in metadata is not tested as a precondition for harvesting, but is tested periodically by a service on the OLAC site. The latter approach is needed because individual identifiers may be retired from active use by the ISO 639-3 Registration Authority; the validation service thus alerts participating archives to identifiers in their metadata records that are no longer active and relays the instructions given by the ISO 639-3/RA as to the appropriate remedy. In addition to flagging retired identifiers, the validation service also flags identifiers that are undefined or for local use as errors, and those that are collective or for macrolanguages as less than best practice. See [ISO639-Scope] for the definition of local use, collective, and macrolanguage.
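The shape constraint stated above (two or three letters, either case) can be approximated with a regular expression. This is a sketch of the constraint as described in this section, not a transcription of the schema itself; the name passes_schema_shape is hypothetical.

```python
import re

# Two or three ASCII letters, upper or lower case, as described above.
CODE_SHAPE = re.compile(r"[A-Za-z]{2,3}")


def passes_schema_shape(code):
    """Check only the shape of a code, not whether it is an active identifier."""
    return CODE_SHAPE.fullmatch(code) is not None


for candidate in ["de", "DEU", "llu", "d", "deut", "d3u"]:
    print(candidate, passes_schema_shape(candidate))
```

A code that passes this check may still be flagged later by the validation service, for example if it has been retired or is reserved for local use.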

3. Looking up the language for a code

Given a particular three-letter code, the ISO 639-3 web site provides a means of looking up documentation for what the code represents. The three letters are appended to the documentation page URL as the value of the id parameter, as in:

http://www.sil.org/iso639-3/documentation.asp?id=abc
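Constructing the lookup URL can be sketched as below, using the documentation page address and the id parameter given above. The function name documentation_url is hypothetical, and the three-letter check is an added convenience rather than something the web site is known to require.

```python
import re
from urllib.parse import urlencode

# Documentation page address as given in this section.
DOCUMENTATION_PAGE = "http://www.sil.org/iso639-3/documentation.asp"


def documentation_url(code):
    """Build the documentation URL for a three-letter code."""
    lowered = code.lower()
    if re.fullmatch(r"[a-z]{3}", lowered) is None:
        raise ValueError(f"expected a three-letter code, got {code!r}")
    return f"{DOCUMENTATION_PAGE}?{urlencode({'id': lowered})}"


print(documentation_url("abc"))
# prints http://www.sil.org/iso639-3/documentation.asp?id=abc
```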

In addition to the basic information available on this page, more detailed information is available through a link to the corresponding page on the Ethnologue web site [Ethnologue] for a living or recently extinct language, or on the Linguist List website for an ancient [LL-Ancient] or constructed [LL-Constructed] language.

4. Looking up the code for a language

The easiest way to find the code for a living or recently extinct language is to type the name of a language or dialect or region into the Ethnologue's site search form at:

http://www.ethnologue.com/site_search.asp

When you know the country the language is spoken in, another approach is to use the country index to find a listing of all the languages in that particular country and then to browse the list:

http://www.ethnologue.com/country_index.asp

Linguist List offers a language search page based on name and country information downloaded from [Ethnologue-Codes] plus similar information they have compiled regarding ancient and constructed languages:

http://www.linguistlist.org/forms/langs/find-a-language-or-family.html

See [ISO639-3-Changes] for a description of the process by which changes are made to the international standard. Follow this process if you believe that something is missing or in error. Note that the standard reserves codes qaa through qtz for local use. That is, those codes will never be assigned as language identifiers. Thus, when users feel that a needed code is missing from the code set, they may use a local use code in their own database as a temporary measure until the outcome of a change request is known. Note, however, that local use codes are undefined for information interchange and should not be submitted to the OLAC harvester.
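An archive that wishes to screen out local use codes before exposing its metadata for harvesting could apply a check along the following lines. This is a minimal sketch based only on the qaa through qtz range stated above; the name is_local_use is hypothetical.

```python
import re


def is_local_use(code):
    """Return True if the code falls in the reserved local use range qaa-qtz."""
    lowered = code.lower()
    if re.fullmatch(r"[a-z]{3}", lowered) is None:
        return False
    # Plain string comparison works because the range is contiguous in
    # alphabetical order over three lower-case ASCII letters.
    return "qaa" <= lowered <= "qtz"


for candidate in ["qaa", "qtz", "QAB", "qua", "deu"]:
    print(candidate, is_local_use(candidate))
```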

The standard receives a major update each year and minor updates as needed. These changes take time to propagate, so Ethnologue and Linguist List are not always up to date. Where the reference sites differ, typically because of their differing update schedules, [ISO639-3] is the authority.

References

[CS2000]	Constable, Peter, and Gary Simons. 2000. Language identification and IT: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers, 2000-001. Dallas: SIL International. <http://www.sil.org/silewp/2000/001/SILEWP2000-001.html>
[DCMT]	DCMI Metadata Terms. <http://dublincore.org/documents/2008/01/14/dcmi-terms/>
[Ethnologue]	Ethnologue: Languages of the World. <http://www.ethnologue.com/>
[Ethnologue-Codes]	Three-letter Codes for Identifying Languages. <http://www.ethnologue.com/codes/>
[ISO639-2]	Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code. <http://lcweb.loc.gov/standards/iso639-2/langhome.html>
[ISO639-3]	Codes for the Representation of Names of Languages-Part 3: Alpha-3 code for comprehensive coverage of languages. <http://www.sil.org/iso639-3/>
[ISO639-3-Changes]	ISO 639-3 Change Management. <http://www.sil.org/iso639-3/changes.asp>
[ISO639-Scope]	Scope of Denotation for Language Identifiers. <http://www.sil.org/iso639-3/scope.asp>
[LL-Ancient]	Linguist List Codes for Ancient and Extinct Languages. <http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/GetListOfAncientLgs.cfm?RequestTimeout=200>
[LL-Constructed]	Linguist List Codes for Constructed Languages. <http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/GetListOfConstructedLgs.cfm?RequestTimeout=200>
[OLAC-Metadata]	OLAC Metadata. <http://www.language-archives.org/OLAC/metadata.html>
[RFC3066]	Tags for the Identification of Languages. <http://www.ietf.org/rfc/rfc3066.txt>
[Simons2000]	Simons, Gary. 2000. Language identification in metadata descriptions of language archive holdings. Proceedings of Workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA. <http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/simons.htm>
[XML-lang]	Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation 16 August 2006. Section 2.12, Language Identification. <http://www.w3.org/TR/REC-xml#sec-lang-tag>
