|Status of document:||Proposed Recommendation. This document is in the midst of open review by the community.|
This document specifies the metadata extension used by OLAC to uniquely identify languages. It uses the three-letter identifiers of ISO 639-3 as its controlled vocabulary.
Steven Bird, University of Melbourne and University of Pennsylvania (mailto:email@example.com)
|Changes since previous version:||
A major rework of the original 2003 recommendation. That recommendation was based on an assembly of identifiers from multiple sources: ISO 639-1, the SIL language identifiers (prefixed by "x-sil-"), and the Linguist List identifiers (prefixed by "x-ll-"). With the release of ISO 639-3, which weaves the codes from these sources into a single coherent code set, the OLAC recommendation is simplified to use that standard alone.
Copyright © 2007 Gary Simons (SIL International) and Steven Bird (University of Melbourne and University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.
Identifying the languages involved is an important dimension of language resource description. However, using the character-string representation of language names as identifiers is problematic for several reasons:
Different languages (in different parts of the world) may have the same name.
The same language may have different names in different languages.
The same language may have different names in the various countries where it is spoken.
Within the same country, the preferred name for a language may change over time.
In the early history of discovering languages (before their names were standardized), different people referred to the same language by different names.
For languages having non-Roman orthographies, the language name may have several possible romanizations.
These facts taken together mean that identifying languages by name will not work. Rather, what is needed is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier. For a deeper discussion of these issues see [CS2000] and [Simons2000].
The information technology community has a well-established standard for language identification, namely, ISO 639. Part 1 of the standard specifies two-letter codes for identifying about 160 of the world's major languages; part 2 specifies three-letter codes for identifying about 380 languages [ISO639-2]. These code sets in turn form the core of the standard followed by the Internet Engineering Task Force (IETF), namely, RFC 3066 [RFC3066]. This is the standard used for language identification in the xml:lang attribute of XML [XML-lang]. The Dublin Core Metadata Initiative [DCMT] defines encoding schemes dcterms:ISO639-2 and dcterms:RFC3066 for use in its Language element.
The above standards fall short of the coverage required by the language archiving community since they focus on the major languages of the world that are most frequently represented in the total body of the world's literature. A new part 3 of the standard [ISO639-3], adopted in 2007, has the purpose of defining three-letter identifiers for all known human languages. With over 7,500 codes, it attempts to provide a comprehensive enumeration of languages, including living, extinct, ancient, and constructed languages, whether major or minor. This is the standard that forms the basis for the OLAC recommendation on language identification in language resource description.
The complete vocabulary of language identifiers recommended for use by OLAC consists of all active codes from ISO 639-3 as documented at its official web site [ISO639-3]. This includes both individual language identifiers and macrolanguage identifiers [ISO639-Macro]. At any time the valid code set is the complete set of codes (both active and retired) that is downloadable from [ISO639-3]; the use of retired codes is deprecated, but not disallowed.
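The distinction between active, retired, and unrecognized codes can be sketched in Python as follows. This is a minimal illustration, not part of the recommendation: the function name `code_status`, its return labels, and the tiny sample sets are hypothetical, and a real check would load both sets from the code tables downloadable from [ISO639-3].

```python
import re

# Shape of an ISO 639-3 identifier: exactly three lowercase ASCII letters.
CODE_SHAPE = re.compile(r"[a-z]{3}")


def code_status(code, active, retired):
    """Classify a code as 'active', 'retired' (deprecated), or 'invalid'.

    `active` and `retired` are sets supplied by the caller; the function
    name and the three labels are hypothetical, chosen for this sketch.
    """
    if not CODE_SHAPE.fullmatch(code):
        return "invalid"
    if code in active:
        return "active"
    if code in retired:
        return "retired"  # still valid, but its use is deprecated
    return "invalid"


# Illustrative stand-ins only, not the real code tables.
sample_active = {"jpn", "llu"}
sample_retired = {"mol"}  # treated as retired here purely for illustration

for candidate in ("jpn", "mol", "JPN"):
    print(candidate, code_status(candidate, sample_active, sample_retired))
```

Returning a label rather than a boolean lets a caller accept a retired code while still warning that its use is deprecated.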
Following the extension mechanism defined in [OLAC-Metadata], a language identifier is expressed as the value of the olac:code attribute and the extension olac:ISO639-3 is named in the xsi:type attribute. A language identifier may be used with the <dc:language> element to identify a language that a resource is written or spoken in. Similarly, a language identifier may be used with the <dc:subject> element to identify a language that a resource is about. In both cases, free text in the element content may be used to identify the specific variety of the language. For instance, the following indicates that a resource is written in Japanese:
<dc:language xsi:type="olac:ISO639-3" olac:code="jpn"/>
The following indicates that a resource is about the Lau language of Solomon Islands, and specifically the Suafa dialect:
<dc:subject xsi:type="olac:ISO639-3" olac:code="llu">Suafa dialect</dc:subject>
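Elements of this form can also be generated programmatically. The Python sketch below builds both examples with the standard library's `xml.etree.ElementTree`. The helper name `language_element` is hypothetical, and the namespace URIs (in particular the OLAC one) are assumptions that should be checked against the OLAC metadata schema actually in use.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions; verify them against the schema in use.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OLAC = "http://www.language-archives.org/OLAC/1.1/"

# Register prefixes so the serialized output uses dc:, xsi:, and olac:.
for prefix, uri in (("dc", DC), ("xsi", XSI), ("olac", OLAC)):
    ET.register_namespace(prefix, uri)


def language_element(tag, code, variety=None):
    """Build a dc:language or dc:subject element carrying an ISO 639-3 code."""
    if tag not in ("language", "subject"):
        raise ValueError("tag must be 'language' or 'subject'")
    elem = ET.Element(f"{{{DC}}}{tag}")
    # The xsi:type value is a prefixed name, so it relies on the 'olac'
    # prefix registered above being the one used when serializing.
    elem.set(f"{{{XSI}}}type", "olac:ISO639-3")
    elem.set(f"{{{OLAC}}}code", code)
    if variety:
        elem.text = variety  # free text naming the specific variety
    return elem


written_in = language_element("language", "jpn")
about = language_element("subject", "llu", "Suafa dialect")
print(ET.tostring(written_in, encoding="unicode"))
print(ET.tostring(about, encoding="unicode"))
```

Each element serialized on its own carries its namespace declarations; inside a full metadata record those declarations would normally sit on an enclosing element instead.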
The formal definition of the vocabulary is given in the following XML schema, which conforms to the conventions for an OLAC metadata extension as defined in [OLAC-Metadata]:
That schema in turn includes another schema that contains only the simple type that enumerates all the recognized code values, namely,
The latter schema provides a complete list of all language identifiers recognized by OLAC. Each enumerated value provides both a code (in the value attribute) and an associated language name (in the <annotation> element) that can be used for display purposes.
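As an illustration of how the enumeration might be used for display purposes, the Python sketch below reads code and name pairs out of a schema. The embedded excerpt is hypothetical: the type name and the use of `<xs:documentation>` inside `<xs:annotation>` are assumptions about the layout, so the sketch simply gathers whatever text the annotation contains.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical excerpt; the real schema's type name and annotation
# layout may differ from what is shown here.
SAMPLE_SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="ISO639-3-code">
    <xs:restriction base="xs:string">
      <xs:enumeration value="jpn">
        <xs:annotation><xs:documentation>Japanese</xs:documentation></xs:annotation>
      </xs:enumeration>
      <xs:enumeration value="llu">
        <xs:annotation><xs:documentation>Lau</xs:documentation></xs:annotation>
      </xs:enumeration>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def code_names(schema_text):
    """Map each enumerated code to the display name in its annotation."""
    root = ET.fromstring(schema_text)
    names = {}
    for enum in root.iter(f"{XS}enumeration"):
        annotation = enum.find(f"{XS}annotation")
        # Join all text under <annotation>, whichever child element holds it.
        label = ""
        if annotation is not None:
            label = "".join(annotation.itertext()).strip()
        names[enum.get("value")] = label
    return names


print(code_names(SAMPLE_SCHEMA))
```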
Given a particular three-letter code, the ISO 639-3 web site provides a means of looking up documentation for what the code represents. The three letters are appended to the documentation page URL as the value of the id parameter, as in:
In addition to the basic information available on this page, more detailed information is available through a link to the corresponding page on the Ethnologue web site [Ethnologue] for a living or recently extinct language, or on the Linguist List website for an ancient [LL-Ancient] or constructed [LL-Constructed] language.
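The lookup by id parameter described above can be sketched in Python as follows. The function name `documentation_url` is hypothetical, and the base URL in the usage line is a placeholder rather than the actual address of the ISO 639-3 documentation page, which must be supplied by the caller.

```python
from urllib.parse import urlencode


def documentation_url(base_url, code):
    """Append a three-letter code to base_url as the value of the id parameter."""
    well_formed = (
        len(code) == 3 and code.isascii() and code.isalpha() and code.islower()
    )
    if not well_formed:
        raise ValueError(f"not a three-letter lowercase code: {code!r}")
    return f"{base_url}?{urlencode({'id': code})}"


# Placeholder base URL, used only to show the shape of the result.
print(documentation_url("https://example.org/documentation", "llu"))
```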
The easiest way to find the code for a living or recently extinct language is to type the name of a language, dialect, or region into the Ethnologue's site search form at:
When you know the country the language is spoken in, another approach is to use the country index to find a listing of all the languages in that particular country and then to browse the list:
Linguist List offers a language search page based on name and country information downloaded from [Ethnologue-Codes] plus similar information they have compiled regarding ancient and constructed languages:
See [ISO639-3-Changes] for a description of the process by which changes are made to the international standard. Follow this process if you believe that something is missing or in error. Note that the standard reserves codes qaa through qtz for local use. That is, those codes will never be assigned as language identifiers. Thus, when users feel that a needed code is missing from the code set, they may use a local use code in their own database as a temporary measure until the outcome of a change request is known. Note, however, that local use codes are undefined for information interchange and should not be submitted to the OLAC harvester.
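A repository might screen its records for local use codes before exposing them for harvesting. The Python sketch below shows one way to test for the reserved range qaa through qtz; the function names are hypothetical and the check is an illustration rather than part of any OLAC tool.

```python
def is_local_use(code):
    """Return True if code lies in the reserved local-use range qaa..qtz."""
    well_formed = (
        len(code) == 3 and code.isascii() and code.isalpha() and code.islower()
    )
    # Lexicographic comparison of three lowercase letters matches code order.
    return well_formed and "qaa" <= code <= "qtz"


def local_use_codes(codes):
    """List the codes that should be resolved before submission."""
    return [code for code in codes if is_local_use(code)]


print(local_use_codes(["jpn", "qaa", "llu", "qtz", "qua"]))
```

Codes flagged this way would stay in the local database until the corresponding change request has been decided.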
|[CS2000]||Constable, Peter, and Gary Simons. 2000. Language
identification and IT: Addressing problems of linguistic diversity on a global
scale. SIL Electronic Working Papers, 2000-001. Dallas: SIL
|[Ethnologue]||Ethnologue: Languages of the
|[Ethnologue-Codes]||Three-letter Codes for Identifying
|[ISO639-2]||Codes for the Representation of Names of Languages-Part 2:
|[ISO639-3]||Codes for the Representation of Names of Languages-Part 3:
Alpha-3 code for comprehensive coverage of
|[ISO639-3-Changes]||ISO 639-3 Change Management.
|[ISO639-Macro]||"Macrolanguages," in Scope of Denotation for Language
|[LL-Ancient]||Linguist List Codes for Ancient and Extinct
|[LL-Constructed]||Linguist List Codes for Constructed
|[RFC3066]||Tags for the Identification of Languages.
|[Simons2000]||Simons, Gary. 2000. Language identification in metadata
descriptions of language archive holdings. Proceedings of Workshop on Web-Based
Language Documentation and Description, 12-15 December 2000, Philadelphia, USA.
|[XML-lang]||Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C
Recommendation 16 August 2006. Section 2.12, Language