OLAC Language Extension

Date issued: 2007-11-14
Status of document: Proposed Recommendation. This document is in the midst of open review by the community.
This version: http://www.language-archives.org/REC/language-20071114.html
Latest version: http://www.language-archives.org/REC/language.html
Previous version: http://www.language-archives.org/REC/language-20031213.html
Abstract:

This document specifies the metadata extension used by OLAC to uniquely identify languages. It uses the three-letter identifiers of ISO 639-3 as its controlled vocabulary.

Editors: Gary Simons, SIL International (mailto:gary_simons@sil.org)
Steven Bird, University of Melbourne and University of Pennsylvania (mailto:sb@csse.unimelb.edu.au)
Changes since previous version:

A major rework of the original 2003 recommendation. That recommendation was based on an assembly of identifiers from multiple sources: ISO 639-1, the SIL language identifiers (prefixed by "x-sil-"), and the Linguist List identifiers (prefixed by "x-ll-"). Now with the release of ISO 639-3, which weaves together the codes from these sources into a single coherent code set, the OLAC recommendation is simplified to use that standard.

Copyright © 2007 Gary Simons (SIL International) and Steven Bird (University of Melbourne and University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. The ISO639-3 extension
  3. Looking up the language for a code
  4. Looking up the code for a language
References

1. Introduction

Identifying the languages involved is an important dimension of language resource description. However, using the character-string representation of language names as identifiers is problematic for several reasons:

These facts taken together mean that identifying languages by name will not work. Rather, what is needed is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier. For a deeper discussion of these issues see [CS2000] and [Simons2000].

The information technology community has a well-established standard for language identification, namely, ISO 639. Part 1 of the standard specifies two-letter codes for identifying about 160 of the world's major languages; part 2 specifies three-letter codes for identifying about 380 languages [ISO639-2]. These code sets in turn form the core of the standard followed by the Internet Engineering Task Force (IETF), namely, RFC 3066 [RFC3066]. This is the standard used for language identification in the xml:lang attribute of XML [XML-lang]. The Dublin Core Metadata Initiative [DCMT] defines encoding schemes dcterms:ISO639-2 and dcterms:RFC3066 for use in its Language element.

The above standards fall short of the coverage required by the language archiving community since they focus on the major languages of the world that are most frequently represented in the total body of the world's literature. A new part 3 of the standard [ISO639-3], adopted in 2007, has the purpose of defining three-letter identifiers for all known human languages. With over 7,500 codes, it attempts to provide a comprehensive enumeration of languages, including living, extinct, ancient, and constructed languages, whether major or minor. This is the standard that forms the basis for the OLAC recommendation on language identification in language resource description.

2. The ISO639-3 extension

The vocabulary of language identifiers recommended for use by OLAC consists of all active codes from ISO 639-3 as documented at its official web site [ISO639-3]. This includes both individual language identifiers and macrolanguage identifiers [ISO639-Macro]. The set of valid codes is broader: at any time it is the complete set of codes (both active and retired) that is downloadable from [ISO639-3]. The use of retired codes is deprecated, but not disallowed.

Following the extension mechanism defined in [OLAC-Metadata], a language identifier is expressed as the value of the olac:code attribute and the extension olac:ISO639-3 is named in the xsi:type attribute. A language identifier may be used with the <dc:language> element to identify a language that a resource is written or spoken in. Similarly, a language identifier may be used with the <dc:subject> element to identify a language that a resource is about. In both cases, free text in the element content may be used to identify the specific variety of the language. For instance, the following indicates that a resource is written in Japanese:

<dc:language xsi:type="olac:ISO639-3" olac:code="jpn"/>

The following indicates that a resource is about the Lau language of Solomon Islands, and specifically the Suafa dialect:

<dc:subject xsi:type="olac:ISO639-3" olac:code="llu">Suafa dialect</dc:subject>
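Elements like the two above can also be generated programmatically. The following is a minimal sketch in Python using the standard library; the function name language_element is hypothetical, and the namespace URIs bound to the dc, olac, and xsi prefixes are assumptions that should be confirmed against [OLAC-Metadata].

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions of this sketch; confirm against [OLAC-Metadata].
DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Ask ElementTree to serialize with the conventional prefixes.
for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)


def language_element(tag, code, variety=None):
    """Build a <dc:language> or <dc:subject> element carrying an ISO 639-3 code.

    tag     -- "language" (the resource is in the language) or
               "subject" (the resource is about the language)
    code    -- three-letter ISO 639-3 identifier, e.g. "jpn"
    variety -- optional free text naming the specific variety
    """
    if tag not in ("language", "subject"):
        raise ValueError("tag must be 'language' or 'subject'")
    element = ET.Element(f"{{{DC}}}{tag}")
    # The xsi:type value is a prefixed name; the olac prefix it relies on is
    # declared in the output because the olac:code attribute uses it too.
    element.set(f"{{{XSI}}}type", "olac:ISO639-3")
    element.set(f"{{{OLAC}}}code", code)
    if variety:
        element.text = variety
    return element


japanese = language_element("language", "jpn")
lau = language_element("subject", "llu", "Suafa dialect")
print(ET.tostring(japanese, encoding="unicode"))
print(ET.tostring(lau, encoding="unicode"))
```

The serialized elements carry their own namespace declarations, so they are more verbose than the hand-written examples, which assume the prefixes are declared on an enclosing element.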

The formal definition of the vocabulary is in the following XML schema, which conforms to the conventions for an OLAC metadata extension as defined in [OLAC-Metadata]:

http://www.language-archives.org/OLAC/1.1/olac-ISO639-3.xsd

That schema in turn includes another schema that contains only the simple type that enumerates all the recognized code values, namely,

http://www.language-archives.org/OLAC/1.1/ISO639-3.xsd

The latter schema provides a complete list of all language identifiers recognized by OLAC. Each enumerated value provides both a code (in the value attribute) and an associated language name (in the <annotation> element) that can be used for display purposes.
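Because each enumerated value pairs a code with a display name, a lookup table can be derived from the schema. The sketch below is illustrative only: SAMPLE_SCHEMA is a hypothetical two-entry excerpt, the type name ISO639-3-code is invented for the example, and the placement of the name inside a documentation child of <annotation> is an assumption to be checked against the actual file.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical excerpt standing in for ISO639-3.xsd. The real file is far
# longer, and its exact layout should be checked before relying on this.
SAMPLE_SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="ISO639-3-code">
    <xs:restriction base="xs:string">
      <xs:enumeration value="jpn">
        <xs:annotation><xs:documentation>Japanese</xs:documentation></xs:annotation>
      </xs:enumeration>
      <xs:enumeration value="llu">
        <xs:annotation><xs:documentation>Lau</xs:documentation></xs:annotation>
      </xs:enumeration>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def code_names(schema_text):
    """Map each enumerated code to the language name in its annotation."""
    root = ET.fromstring(schema_text)
    names = {}
    for enum in root.iter(XS + "enumeration"):
        annotation = enum.find(XS + "annotation")
        # Gather all text under <annotation>, so the lookup works whether or
        # not the name is wrapped in a child element.
        text = "" if annotation is None else "".join(annotation.itertext())
        names[enum.get("value")] = text.strip()
    return names


names = code_names(SAMPLE_SCHEMA)
print(names["llu"])
```

To use the real vocabulary, the text of the schema retrieved from the URL above would be passed to code_names in place of the sample.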

3. Looking up the language for a code

Given a particular three-letter code, the ISO 639-3 web site provides a means of looking up documentation for what the code represents. The three letters are appended to the documentation page URL as the value of the id parameter, as in:

http://www.sil.org/iso639-3/documentation.asp?id=abc

In addition to the basic information available on this page, more detailed information is available through a link to the corresponding page on the Ethnologue web site [Ethnologue] for a living or recently extinct language, or on the Linguist List web site for an ancient [LL-Ancient] or constructed [LL-Constructed] language.
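The URL pattern described in this section can be sketched as a small helper. The function name documentation_url is hypothetical, and the check for exactly three lowercase letters is a safeguard added by this sketch; the function only builds the address and does not verify that the page exists.

```python
import re
from urllib.parse import urlencode

DOCUMENTATION_PAGE = "http://www.sil.org/iso639-3/documentation.asp"


def documentation_url(code):
    """Return the documentation page address for a three-letter code."""
    # Reject anything that is not three lowercase ASCII letters before it
    # is placed in the query string.
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"not a three-letter lowercase code: {code!r}")
    return f"{DOCUMENTATION_PAGE}?{urlencode({'id': code})}"


print(documentation_url("jpn"))
```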

4. Looking up the code for a language

The easiest way to find the code for a living or recently extinct language is to type the name of a language, dialect, or region into the Ethnologue's site search form at:

http://www.ethnologue.com/site_search.asp

When you know the country the language is spoken in, another approach is to use the country index to find a listing of all the languages in that particular country and then to browse the list:

http://www.ethnologue.com/country_index.asp

Linguist List offers a language search page based on name and country information downloaded from [Ethnologue-Codes] plus similar information it has compiled regarding ancient and constructed languages:

http://www.linguistlist.org/forms/langs/find-a-language-or-family.html

See [ISO639-3-Changes] for a description of the process by which changes are made to the international standard. Follow this process if you believe that something is missing or in error. Note that the standard reserves codes qaa through qtz for local use. That is, those codes will never be assigned as language identifiers. Thus, when users feel that a needed code is missing from the code set, they may use a local use code in their own database as a temporary measure until the outcome of a change request is known. Note, however, that local use codes are undefined for information interchange and should not be submitted to the OLAC harvester.
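A data provider that uses local use codes internally may want to screen them out before its records are harvested. The following sketch shows one way to detect the reserved range qaa through qtz; the function names is_local_use and codes_safe_to_submit are hypothetical and are not part of any OLAC tool.

```python
import re


def is_local_use(code):
    """Return True if code falls in the reserved local-use range qaa..qtz."""
    if not re.fullmatch(r"[a-z]{3}", code):
        return False
    # For three lowercase letters, plain string comparison matches the
    # alphabetical ordering of the codes, so a range test is sufficient.
    return "qaa" <= code <= "qtz"


def codes_safe_to_submit(codes):
    """Drop local-use codes, which are undefined for information interchange."""
    return [code for code in codes if not is_local_use(code)]


print(codes_safe_to_submit(["jpn", "qaa", "llu", "qtz"]))
```

The filter only removes local use codes; it does not check that the remaining codes appear in the ISO 639-3 code set.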


References

[CS2000] Constable, Peter, and Gary Simons. 2000. Language identification and IT: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers, 2000-001. Dallas: SIL International.
<http://www.sil.org/silewp/2000/001/SILEWP2000-001.html>
[DCMT] DCMI Metadata Terms.
<http://dublincore.org/documents/2006/12/18/dcmi-terms/>
[Ethnologue] Ethnologue: Languages of the World.
<http://www.ethnologue.com/>
[Ethnologue-Codes] Three-letter Codes for Identifying Languages.
<http://www.ethnologue.com/codes/>
[ISO639-2] Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code.
<http://lcweb.loc.gov/standards/iso639-2/langhome.html>
[ISO639-3] Codes for the Representation of Names of Languages-Part 3: Alpha-3 code for comprehensive coverage of languages.
<http://www.sil.org/iso639-3/>
[ISO639-3-Changes] ISO 639-3 Change Management.
<http://www.sil.org/iso639-3/changes.asp>
[ISO639-Macro] "Macrolanguages," in Scope of Denotation for Language Identifiers.
<http://www.sil.org/iso639-3/scope.asp#M>
[LL-Ancient] Linguist List Codes for Ancient and Extinct Languages.
<http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/GetListOfAncientLgs.cfm?RequestTimeout=200>
[LL-Constructed] Linguist List Codes for Constructed Languages.
<http://cf.linguistlist.org/cfdocs/new-website/LL-WorkingDirs/forms/langs/GetListOfConstructedLgs.cfm?RequestTimeout=200>
[OLAC-Metadata] OLAC Metadata.
<http://www.language-archives.org/OLAC/metadata.html>
[RFC3066] Tags for the Identification of Languages.
<http://www.ietf.org/rfc/rfc3066.txt>
[Simons2000] Simons, Gary. 2000. Language identification in metadata descriptions of language archive holdings. Proceedings of Workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA.
<http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/simons.htm>
[XML-lang] Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation 16 August 2006. Section 2.12, Language Identification.
<http://www.w3.org/TR/REC-xml#sec-lang-tag>