OLAC Metadata

Date issued: 2003-05-31
Status of document: Candidate Standard. This document is on track to be accepted by the community as a standard; full adoption is awaiting successful implementation. The version that is ultimately adopted may incorporate changes based on feedback from implementers.
This version: http://www.language-archives.org/OLAC/metadata-20030531.html
Latest version: http://www.language-archives.org/OLAC/metadata.html
Previous version: http://www.language-archives.org/OLAC/metadata-20021211.html
Abstract:

This document defines the format used by the Open Language Archives Community [OLAC] for the interchange of metadata within the framework of the Open Archives Initiative [OAI]. The metadata set is based on Qualified Dublin Core [DCQ], but the format allows for the use of extensions to express community-specific qualifiers.

Editors: Gary Simons, SIL International (mailto:gary_simons@sil.org)
Steven Bird, University of Melbourne and University of Pennsylvania (mailto:sb@csse.unimelb.edu.au)
Changes since previous version:

The approach to OLAC metadata set out in this document was established at the IRCS Workshop on Open Language Archives (December 2002) as a suitable basis for the OLAC standard. Now that the OLAC infrastructure has been re-implemented to work with this approach, the document is upgraded to candidate status, and archives are invited to implement the specification. The document has minor editorial changes throughout, including modifications to the namespaces of the XML examples.

Copyright © 2002 Gary Simons (SIL International) and Steven Bird (University of Melbourne and University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. The metadata set
  3. The metadata format
  4. Using OLAC extensions
  5. Defining a third-party extension
  6. Documenting an extension
References

1. Introduction

This document defines the metadata format used by the Open Language Archives Community [OLAC] to describe language resources and to provide associated services. OLAC uses an XML format to interchange language-resource metadata within the framework of the Open Archives Initiative [OAI].

Section 2 of this document describes the set of metadata elements and qualifiers used in resource description. Section 3 goes on to describe the XML format used to represent metadata. Section 4 describes how OLAC extensions are used. Section 5 describes how a third-party extension is formally defined, while section 6 describes how an extension is documented.

2. The metadata set

The OLAC metadata set is based on the Dublin Core (DC) metadata set and uses all fifteen elements defined in that standard [DCMES]. (The rationale for following DC is discussed in the OLAC white paper [OLAC-WP].) To provide greater precision in resource description, OLAC follows the DC recommendation for qualifying elements by means of element refinements or encoding schemes [DCQ].

The qualifiers recommended by DC are applicable across a wide range of resources. However, the language resource community has a number of resource description requirements that are not met by these general standards. In order to meet these needs, members of OLAC have developed community-specific qualifiers, and the community at large has adopted some of them (following [OLAC-Process]) as recommended best practice for language resource description. These recommended qualifiers are listed in [OLAC-Extensions], and the manner of using them in resource description is documented below in section 4 (Using OLAC extensions).

3. The metadata format

The XML implementation of OLAC metadata follows the "Guidelines for implementing Dublin Core in XML" [DCXML]. The OLAC metadata schema is an application profile [HP2000] that incorporates the elements from the two metadata schemas (Basic DC and DC Qualifiers) developed by the DC Architecture Working Group for implementing qualified DC [DCQXML]. In order to insulate OLAC from the unexpected effects of potential changes to these schemas, the OLAC metadata schema and the schemas for all OLAC extensions use local copies that are stored at:

The most recent version of the OLAC metadata schema (along with a sample record) can be found at:

The container for an OLAC metadata record is the element <olac>, which is defined in a namespace called "http://www.language-archives.org/OLAC/1.0/". By convention the namespace prefix olac is used, and the DC namespace is declared to be the default so that the metadata element tags need not be prefixed. For instance, the following is a valid OLAC metadata record:

<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
   xmlns="http://purl.org/dc/elements/1.1/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/ 
      http://www.language-archives.org/OLAC/1.0/olac.xsd">
   <creator>Bloomfield, Leonard</creator>
   <date>1933</date>
   <title>Language</title>
   <publisher>New York: Holt</publisher>
</olac:olac>

In addition to this basic DC metadata, an element may use a DC qualifier, following the guidelines given in [DCXML]. The element may specify a refinement (using an element defined in the dcterms namespace) or an encoding scheme (using a scheme defined in dcterms as the value of the xsi:type attribute), or both. Note that the metadata record must declare the dcterms namespace as follows: xmlns:dcterms="http://purl.org/dc/terms/". For instance, the following element represents a creation date encoded in the W3C date and time format:

<dcterms:created xsi:type="dcterms:W3C-DTF">2002-11-28</dcterms:created>

The xsi:type attribute is a directive built into the XML Schema standard [XMLS]. It overrides the declared type of the current element with the type definition named in its value. In this example, dcterms:W3C-DTF resolves to the definition of a complex type named W3C-DTF in the XML schema that defines the dcterms namespace.
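As an illustrative sketch only, the following record puts the three combinations described above side by side: an encoding scheme alone, a refinement alone, and both together. The title and the date values are placeholders invented for illustration, and the sketch assumes that the schema named in xsi:schemaLocation accepts these dcterms elements as the text describes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
   xmlns="http://purl.org/dc/elements/1.1/"
   xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/
      http://www.language-archives.org/OLAC/1.0/olac.xsd">
   <!-- Placeholder title, not a real resource -->
   <title>Hypothetical field recording</title>
   <!-- Encoding scheme only: a plain DC date typed as W3C-DTF -->
   <date xsi:type="dcterms:W3C-DTF">2002</date>
   <!-- Refinement only: a creation date whose format is left unspecified -->
   <dcterms:created>late November 2002</dcterms:created>
   <!-- Refinement and encoding scheme together -->
   <dcterms:created xsi:type="dcterms:W3C-DTF">2002-11-28</dcterms:created>
</olac:olac>
```

Note that the dcterms prefix appears both in element names (refinements) and inside xsi:type values (encoding schemes), so the namespace declaration is needed in either case.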

Any element may also use the xml:lang attribute to indicate the language of the element content. For instance, the following represents a title in the Lau language of Solomon Islands and its translation into English:

<title xml:lang="x-sil-LLU">Na tala 'uria na idulaa diana</title>
<dcterms:alternative xml:lang="en">The road to good reading</dcterms:alternative>

Values of the xml:lang attribute are controlled by the Internet Engineering Task Force standard, "Tags for the Identification of Languages" [RFC1766]. That standard includes approximately 140 two-letter codes from the ISO standard for language identification [ISO639]. It also reserves the prefix x- for specifying user-defined language codes. In order to provide codes for all known languages, OLAC uses this mechanism to define codes for more than 7,000 languages, both living and extinct. See [OLAC-Language] for the complete definition of the language codes used by OLAC.

In the absence of the xml:lang attribute, service providers may assume that the language is English. Whenever the language of the element content is other than English, recommended best practice is to use the xml:lang attribute to identify the language. By using multiple instances of the metadata elements tagged for different languages, data providers may offer a metadata record in multiple languages.
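A record offered in more than one language might look like the following sketch. The title pair is taken from the example above; the two description elements and their wording are hypothetical and are included only to show an element repeated with different xml:lang values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
   xmlns="http://purl.org/dc/elements/1.1/"
   xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/
      http://www.language-archives.org/OLAC/1.0/olac.xsd">
   <title xml:lang="x-sil-LLU">Na tala 'uria na idulaa diana</title>
   <dcterms:alternative xml:lang="en">The road to good reading</dcterms:alternative>
   <!-- No xml:lang attribute: service providers may assume English -->
   <description>A reading primer.</description>
   <!-- The same description repeated in French (hypothetical text) -->
   <description xml:lang="fr">Un manuel de lecture.</description>
</olac:olac>
```

The xml prefix is predefined by the XML Namespaces recommendation, so it needs no declaration on the root element.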

4. Using OLAC extensions

The xsi:type mechanism has access to the full power of XML Schema, and may be used for a variety of purposes other than narrowing the meaning of the element, or restricting element content (as done for DC qualifiers). It may do both simultaneously, and it may also define additional attributes, which may in turn be restricted by patterns or enumerations.

OLAC extensions use a convention of defining an olac:code attribute to hold restricted element values. This leaves the element content to be used for an unrestricted comment. When code and content are used together, the content provides an escape hatch for expressing a more precise resource description than is possible with the restricted code value alone. The olac:code attribute is also defined to be optional, which provides a migration path for adding precision to legacy data that is not originally qualified. For instance, the following are three steps in the migration of describing a resource about the Dschang language of Cameroon:

1. <subject>Dschang</subject>
2. <subject xsi:type="olac:language">Dschang</subject>
3. <subject xsi:type="olac:language" olac:code="x-sil-BAN"/>
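The use of code and content together, described above as an escape hatch, can be sketched as a further form alongside these three steps. The comment text in the element content is hypothetical, and the attribute is written with the olac prefix on the assumption that the code attribute is namespace-qualified, as the prose describes it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
   xmlns="http://purl.org/dc/elements/1.1/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/
      http://www.language-archives.org/OLAC/1.0/olac.xsd">
   <!-- The coded value identifies the language; the element content adds
        a free-text comment that the code alone cannot express
        (hypothetical wording) -->
   <subject xsi:type="olac:language"
            olac:code="x-sil-BAN">Dschang as spoken by younger urban speakers</subject>
</olac:olac>
```

In this form a service that understands the code can index the resource precisely, while the content remains available for display to human readers.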

All metadata extensions that have been adopted by OLAC as recommended best practice are defined in the olac namespace schema. See [OLAC-Extensions] for the complete list of the recommended extensions and links to their full documentation.

Some OLAC extensions use externally-defined vocabularies that are not controlled by OLAC and are not subject to the OLAC process (e.g. the language codes). In such cases, the document describing the OLAC extension does not define the vocabulary, but simply refers to the external definition.

Once an extension has been adopted as an OLAC recommendation, subsequent changes must be carefully controlled. Redefining a coded value to mean something different would cause problems for users and repositories that employ the existing coded value. In particular, when the interpretation of a coded value is narrowed in scope, the existing code must be expired and a new code adopted. (If it is not possible to meet this requirement, then the old version retains its adopted status while the new version is assigned candidate status.)

5. Defining a third-party extension

An OLAC metadata record may use extensions from other namespaces. This makes it possible for subcommunities within OLAC to develop and share metadata extensions that are specific to a common special interest. By using xsi:type, it is possible to extend the OLAC application profile without modifying the OLAC schema.

For instance, suppose that a given domain required greater precision in identifying the roles of contributors than is possible with the OLAC Role vocabulary [OLAC-Role], and that its specialized vocabulary included the term commentator. This specialized vocabulary and coded value could be represented as follows in a metadata element:

<contributor xsi:type="example:role" example:code="commentator">Sampson, Geoffrey</contributor>

In order to do this, an organization representing that domain (say, example.org) would define a new XML schema as follows:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema  xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns="http://www.example.org/"
            targetNamespace="http://www.example.org/"
            elementFormDefault="qualified"
            attributeFormDefault="qualified">

  <xs:import namespace="http://purl.org/dc/elements/1.1/"
    schemaLocation="http://www.language-archives.org/OLAC/1.0/dc.xsd"/>

  <xs:annotation>
    <xs:appinfo>
      <olac-extension xmlns="http://www.language-archives.org/OLAC/1.0/olac-extension.xsd">
        <shortName>role</shortName>
        <longName>Code for My Specialized Roles</longName>
        <versionDate>2002-08-16</versionDate>
        <description>A hypothetical extension for an individual archive, defining
          specialized roles not available in the OLAC Role vocabulary.</description>
        <appliesTo>creator</appliesTo>
        <appliesTo>contributor</appliesTo>
        <documentation>http://www.example.org/roles.html</documentation>
      </olac-extension>
    </xs:appinfo>
  </xs:annotation>

  <!-- Type for third party role refinement -->
  <xs:complexType name="role">
    <xs:complexContent mixed="true">
      <xs:extension base="dc:SimpleLiteral">
        <xs:attribute name="code" type="role-vocab" use="required"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Third party role vocabulary -->
  <xs:simpleType name="role-vocab">
    <xs:restriction base="xs:string">
      <xs:enumeration value="calligrapher"/>
      <xs:enumeration value="censor"/>
      <xs:enumeration value="commentator"/>
      <xs:enumeration value="corrector"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>

This schema contains four top-level elements: (1) a directive to import the schema for DC elements (which is needed since the complexType extends a type defined in that schema); (2) an annotation providing summary documentation for the extension; (3) a complexType declaration for the overriding element definition, which defines a code attribute in the extension's namespace with values taken from the specialized vocabulary; and (4) a simpleType declaration which defines the vocabulary itself. Refer to the XML Schema specification [XMLS] and primer [XMLSP] for documentation on how to define types in XML Schema.

The extension schema is associated with a target namespace (namely, http://www.example.org/) and stored in a specific location on the developer's web site (e.g., http://www.example.org/example.xsd). Now an OLAC metadata record using the desired metadata element can be created as follows:

<?xml version="1.0" encoding="UTF-8"?>
<olac:olac xmlns="http://purl.org/dc/elements/1.1/"
           xmlns:dcterms="http://purl.org/dc/terms/"
           xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
           xmlns:example="http://www.example.org/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.example.org/
               http://www.example.org/example.xsd
             http://www.language-archives.org/OLAC/1.0/
               http://www.language-archives.org/OLAC/1.0/olac.xsd">

<!-- Third party extension -->

  <contributor xsi:type="example:role" example:code="commentator">Sampson, Geoffrey</contributor>

  <!-- other metadata elements ... -->

</olac:olac>

Many more examples of extension definitions are provided in http://www.language-archives.org/drafts/third-party/.

Developers of third-party extensions should note that standard OLAC service providers harvest four things from each metadata element: the tag name, the element content, the value of the xsi:type attribute, and the value of the olac:code attribute. Third-party extensions may define other attributes that could be exploited by specialized subcommunity service providers. However, these attributes are ignored by standard OLAC services.

In order to be listed on the OLAC website, a third-party extension must be germane to the mission of OLAC, and it must represent a new perspective on resource description. The latter condition excludes extensions which simply rename all the terms in an existing vocabulary, or which copy an existing vocabulary with minor modifications or additions. Note that, when the purpose of a third-party extension is to augment an existing OLAC extension by adding more vocabulary items, the third-party extension must provide only the new terms, to avoid proliferating copies of OLAC terms.

6. Documenting an extension

Each extension should be accompanied by human-readable documentation that provides the semantics for the vocabulary. Additionally, an extension should provide summary documentation as described below.

This documentation should be placed after the import statement(s) and before the type declarations. The <olac-extension> element should be embedded in <xs:appinfo> within <xs:annotation>. The <olac-extension> element is defined in:

http://www.language-archives.org/OLAC/1.0/olac-extension.xsd

The summary documentation includes six mandatory elements:

shortName

The short name by which the extension is accessed. This should be the same as the name of the <complexType> that defines the extension.

longName

The full name of the extension for use as a title in documentation.

versionDate

The date of the latest version of the extension. The date should be modified only when the extension definition changes in such a way as to alter the set of element instances that would be accepted as valid. The version date should not be modified when only documentation has changed. Use the W3C date format, e.g. 2002-11-30 (for 30 November 2002).

description

A summary description of what the extension is used for.

appliesTo

The content names a Dublin Core element with which the extension may be used. This element is repeated if the extension applies to more than one element.

documentation

The URI for a complete document that defines and exemplifies the extension. If the extension involves a controlled vocabulary, the document should also enumerate and define the terms of the vocabulary.

The information contained in the summary documentation is extracted for display on the OLAC website in the Third Party Extensions document [OLAC-Third-Party].


References

[DCMES] Dublin Core Metadata Element Set, Version 1.1: Reference Description.
<http://dublincore.org/documents/1999/07/02/dces/>
[DCQ] Dublin Core Qualifiers.
<http://dublincore.org/documents/2000/07/11/dcmes-qualifiers/>
[DCQXML] Recommendations for XML Schema for Qualified Dublin Core.
<http://www.ukoln.ac.uk/metadata/dcmi/xmlschema/20021007/>
[DCXML] Guidelines for implementing Dublin Core in XML.
<http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/>
[HP2000] Heery, Rachel and Manjula Patel, 2000. Application profiles: mixing and matching metadata schemas. Ariadne, Issue 25.
<http://www.ariadne.ac.uk/issue25/app-profiles/>
[ISO639] Codes for the Representation of Names of Languages.
<http://lcweb.loc.gov/standards/iso639-2/langhome.html>
See also: <http://www.geo-guide.de/info/tools/languagecode.html>
[OAI] Open Archives Initiative.
<http://www.openarchives.org/>
[OLAC] Open Language Archives Community.
<http://www.language-archives.org/>
[OLAC-Extensions] Recommended metadata extensions.
<http://www.language-archives.org/REC/olac-extensions.html>
[OLAC-Language] OLAC Language Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Process] OLAC Process.
<http://www.language-archives.org/OLAC/process.html>
[OLAC-Role] OLAC Role Vocabulary.
<http://www.language-archives.org/OLAC/role.html>
[OLAC-Third-Party] Third Party Extensions.
<http://www.language-archives.org/NOTE/third-party-extensions.html>
[OLAC-WP] White Paper on Establishing an Infrastructure for Open Language Archiving.
<http://www.language-archives.org/docs/white-paper.html>
[RFC1766] Tags for the Identification of Languages.
<http://www.ietf.org/rfc/rfc1766.txt>
[XMLS] XML Schema, Part 1: Structures.
<http://www.w3.org/TR/xmlschema-1/>
[XMLSP] XML Schema, Part 0: Primer.
<http://www.w3.org/TR/xmlschema-0/>