|Status of document:||Proposed Recommendation. This document is in the midst of open review by the community.|
This document specifies the controlled vocabulary of language identifiers used by OLAC.
Steven Bird, University of Melbourne and University of Pennsylvania (mailto:email@example.com)
|Changes since previous version:||
A major rework of the earlier draft. It documents the sources of the codes listed in the schema for the OLAC-Language extension (OLAC-Language.xsd). A significant addition from the previous version is the incorporation of the Linguist List codes for ancient and constructed languages.
Copyright © 2003 Gary Simons (SIL International) and Steven Bird (University of Melbourne and University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.
Language identification is an important dimension of language resource description. However, using the character-string representation of language names as identifiers is problematic for several reasons:
Different languages (in different parts of the world) may have the same name.
The same language may have a different name in each country where it is spoken.
Within the same country, the preferred name for a language may change over time.
In the early history of discovering new languages (before names were standardized), different people referred to the same language by different names.
For languages having non-Roman orthographies, the language name may have several possible romanizations.
Taken together, these facts suggest that a standard based on names will not work. What is needed instead is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier.
The information technology community has a standard for language identification, namely, ISO 639 [ISO639]. Part 1 of this standard lists two-letter codes for identifying about 160 of the world's major languages; part 2 of the standard lists three-letter codes for identifying about 380 languages. ISO 639 in turn forms the core of the standard followed by the Internet Engineering Task Force (IETF), namely, RFC 3066 [RFC3066] (formerly RFC 1766 [RFC1766]). This is the standard used for language identification in the xml:lang attribute of XML [XML-lang] and in the language element of the Dublin Core Metadata Initiative [DCMT]. RFC 3066 provides a mechanism for users to register new language identification codes for languages not covered by ISO 639, but very few additional languages have been registered.
Unfortunately, the existing standard falls far short of meeting the needs of the language resources community: it fails to account for more than 90% of the world's languages, and it fails to adequately document which languages the codes refer to ([CS2000], [Simons2000]). To cover all living languages, OLAC makes use of SIL International's Ethnologue [Ethnologue], which provides a unique three-letter code for every known living language, plus recently extinct languages. To extend coverage to all known languages, OLAC also uses the four-letter codes developed by Linguist List to identify ancient [LL-Ancient] and constructed [LL-Constructed] languages that are outside the scope of the Ethnologue. The extension mechanism of RFC 3066 is used so that these codes can be referenced in a way that is compatible with the IETF's standard.
The complete vocabulary of OLAC language identifiers consists of three sets of codes:
all two-letter codes from ISO 639-1
all three-letter codes from SIL's Ethnologue (prefixed by x-sil- to make them compatible with RFC 3066)
all four-letter codes from Linguist List (prefixed by x-ll- to make them compatible with RFC 3066)
This complete vocabulary is defined in the following XML schema which conforms to the conventions for an OLAC metadata extension as defined in [OLAC-Metadata]:
This schema in turn includes another schema that contains only the simple type that enumerates all the recognized code values. The code list itself is contained in:
The latter schema provides a complete list of all language identifiers recognized by OLAC. Each enumerated value provides both a code (in the value attribute) and an associated language name (in the label attribute) that can be used for display purposes.
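As a rough illustration of how software might read such a code list, the sketch below parses a small schema fragment and collects code and label pairs. The fragment is hypothetical: the simple type name, the element nesting, and the placeholder codes and labels are assumptions made for the example and are not copied from the actual OLAC schema. Only the use of a value attribute and a label attribute follows the description above.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical fragment: the type name, nesting, and entries are placeholders,
# not taken from the real OLAC code list.
FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="example-language-codes">
    <xs:restriction base="xs:string">
      <xs:enumeration value="x-sil-AAA" label="Placeholder language A"/>
      <xs:enumeration value="x-ll-XAAA" label="Placeholder language B"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def read_code_labels(schema_text):
    """Return a dict mapping each enumerated code to its display label."""
    root = ET.fromstring(schema_text)
    labels = {}
    # Walk every xs:enumeration element, wherever it is nested.
    for enum in root.iter(f"{{{XS}}}enumeration"):
        labels[enum.get("value")] = enum.get("label")
    return labels


code_labels = read_code_labels(FRAGMENT)
```

A display layer could then look up `code_labels[code]` to show a language name next to each identifier.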
The SIL Ethnologue [Ethnologue] provides approximately 7,000 three-letter codes, along with detailed information about language names, genetic affiliations, and geographical locus, amongst other things. The easiest way to find the code for a language is to type a language name into the site search form at:
When you know the country the language is spoken in, another approach is to use the country index to find a listing of all the languages in that particular country and then to browse the list:
Linguist List offers a form for searching for three-letter codes, given all or part of a language name:
A three-letter Ethnologue code AAA will be represented in the OLAC language vocabulary as x-sil-AAA.
To build a language search facility into your own software, you may download tables of codes, language names, alternate names, and countries where spoken from the Ethnologue web site [Ethnologue-Codes], and load them into a relational database using the schema described in [Simons2000].
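One possible shape for such a search facility is sketched below using SQLite from the Python standard library. The table layout, the column names, and the placeholder rows are assumptions for illustration only; they are neither the schema described in [Simons2000] nor real Ethnologue data.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database from (code, name, is_primary, country) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE language_name ("
        " code TEXT NOT NULL,"           # three-letter code
        " name TEXT NOT NULL,"           # primary or alternate name
        " is_primary INTEGER NOT NULL,"  # 1 for the primary name, 0 otherwise
        " country TEXT)"                 # country where spoken
    )
    conn.executemany("INSERT INTO language_name VALUES (?, ?, ?, ?)", rows)
    return conn


def find_codes(conn, name_part):
    """Return sorted distinct codes whose primary or alternate name contains name_part."""
    cur = conn.execute(
        "SELECT DISTINCT code FROM language_name WHERE name LIKE ? ORDER BY code",
        (f"%{name_part}%",),
    )
    return [row[0] for row in cur]


# Placeholder rows, invented purely for the example.
PLACEHOLDER_ROWS = [
    ("AAA", "Examplish", 1, "Exampleland"),
    ("AAA", "Old Examplish", 0, "Exampleland"),
    ("BBB", "Samplese", 1, "Sampleland"),
]

conn = build_db(PLACEHOLDER_ROWS)
matches = find_codes(conn, "examplish")
```

Storing alternate names as extra rows under the same code lets a single query cover both primary and alternate names. SQLite's LIKE operator is case-insensitive for ASCII letters by default, which suits a name search form.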
If you do not find the language you are looking for in the Ethnologue and it is an extinct or constructed language, then consult the Linguist List code tables at [LL-Ancient] and [LL-Constructed]. The Linguist List search form cited above finds codes from both the Ethnologue and Linguist List sets. A four-letter Linguist List code XAAA will be represented in the OLAC language vocabulary as x-ll-XAAA.
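The prefixing rules can be sketched as a small helper. The function name is hypothetical, and the sketch assumes that the caller already knows the code comes from one of the three source sets, so that the length of the code is enough to pick the prefix. It does not check that the code is actually listed in the vocabulary.

```python
def to_olac_identifier(code):
    """Map a bare source code to its OLAC form, choosing the prefix by length.

    Assumption: two letters means ISO 639-1, three letters means Ethnologue,
    and four letters means Linguist List.
    """
    if not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an alphabetic code: {code!r}")
    if len(code) == 2:
        return code                # ISO 639-1 codes are used without a prefix
    if len(code) == 3:
        return "x-sil-" + code     # Ethnologue three-letter code
    if len(code) == 4:
        return "x-ll-" + code      # Linguist List four-letter code
    raise ValueError(f"unexpected code length: {code!r}")
```

For example, `to_olac_identifier("AAA")` yields `x-sil-AAA` and `to_olac_identifier("XAAA")` yields `x-ll-XAAA`, matching the forms given in the text.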
See [Ethnologue-Codes] for instructions on how to communicate with the Ethnologue staff about changes that might be needed to the code set. Note that the codes QVA through QZZ are reserved for local use. That is, they will never be assigned by SIL International as language identifiers. Thus, when users feel that a needed code is missing from the code set, they may freely use a code from the local use range as a temporary measure until the outcome of a change request is known.
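Software that processes these codes may want to flag identifiers drawn from the local use range. A minimal sketch follows; the function name is hypothetical, and treating codes case-insensitively is an assumption of the example.

```python
def is_local_use(code):
    """Return True if a three-letter code lies in the reserved range QVA..QZZ."""
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        return False
    # For three-letter uppercase ASCII strings, lexicographic order matches
    # the alphabetical ordering of the codes.
    return "QVA" <= code.upper() <= "QZZ"
```

A metadata checker could use this to warn that a record relies on a temporary, locally assigned code.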
This document is frozen in proposed status for the time being since major changes are anticipated in the near future. When the 15th edition of the Ethnologue is published in mid-2004, it is planned to incorporate a major revision of the three-letter codes that brings them into alignment with Part 2 of the ISO 639 standard. When this happens, nearly ten per cent of the Ethnologue codes will change. The result, however, will be a single universal standard for three-letter language codes that is compatible with ISO 639-2 and has the full coverage of the Ethnologue and the Linguist List codes. At that time it is anticipated that the OLAC Language vocabulary will change to encompass just this universal code set.
We could offer an alternative to LanguageCodes.xsd, e.g. LanguageCodePatterns.xsd, that simply defines patterns for the two-, three- and four-letter codes for use with software that can't handle an enumeration with 8000 values.
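The kind of pattern-based check contemplated here can be sketched with a regular expression. This is an illustration in Python rather than a schema; the pattern and function name are assumptions, and matching is case-insensitive on the assumption that tags are compared without regard to case. A pattern only checks the shape of an identifier, so it accepts well-formed codes that are not in the enumerated list.

```python
import re

# Two-letter code, or x-sil- plus three letters, or x-ll- plus four letters.
OLAC_CODE_PATTERN = re.compile(
    r"(?:[a-z]{2}|x-sil-[a-z]{3}|x-ll-[a-z]{4})",
    re.IGNORECASE | re.ASCII,
)


def has_olac_shape(identifier):
    """Return True if the identifier has the shape of an OLAC language code."""
    return OLAC_CODE_PATTERN.fullmatch(identifier) is not None
```

The trade-off is that a shape check stays small and fast, at the cost of no longer catching mistyped codes that happen to be well-formed.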
|[CS2000]||Constable, Peter, and Gary Simons. 2000. Language identification and IT: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers, 2000-001. Dallas:
|[DCMT]||DCMI Metadata Terms.
|[Ethnologue]||Ethnologue: Languages of the
|[Ethnologue-Codes]||SIL Three-letter Codes for Identifying Languages.
|[ISO639]||Codes for the Representation of Names of Languages-Part 2: Alpha-3
|[LL-Ancient]||Linguist List Codes for Ancient and Extinct Languages.
|[LL-Constructed]||Linguist List Codes for Constructed Languages.
|[RFC1766]||Tags for the Identification of
|[RFC3066]||Tags for the Identification of Languages (replaces
|[Simons2000]||Simons, Gary. 2000. Language identification in metadata descriptions of language archive holdings. Proceedings of Workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA.
|[XML-lang]||Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6 October 2000. Section 2.12, Language Identification.