OLAC Metadata Set

Date issued: 2001-04-25
Status of document: Draft Standard. This is only a preliminary draft that is still under development; it has not yet been presented to the whole community for review.
This version: http://www.language-archives.org/OLAC/olacms-20010425.html
Latest version: http://www.language-archives.org/OLAC/olacms.html
Previous version: http://www.language-archives.org/OLAC/olacms-20010406.html
Abstract:

This document specifies the metadata set used by the Open Language Archives Community [OLAC] for the interchange of metadata within the framework of the Open Archives Initiative [OAI].

Editors: Gary Simons, SIL International (mailto:gary_simons@sil.org)
Steven Bird, Linguistic Data Consortium (mailto:sb@ldc.upenn.edu)
Changes since previous version:

Fixes broken URLs in references.

Copyright © . This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. Attributes
  3. Elements
References

1. Introduction

The OLAC metadata set is based on the Dublin Core metadata set [DCMES1.1]. The rationale for this is discussed in the OLAC white paper [OLAC-WP].

All fifteen Dublin Core elements are used in the OLAC metadata set. In order to suit the specific needs of the language archiving community, the elements have been qualified following principles articulated in [DC-Q] and exemplified in [DCQ-HTML]. A further principle followed in developing the OLAC implementation of qualified DC is that any element may use at most one encoding scheme. In this way, an XML DTD or schema can be used in validating encoded values; by contrast, if the definition of validity for one attribute depended on the value of another, then XML validation mechanisms could not be employed. Thus when all the refinements of an element use the same encoding scheme, the refinement is implemented by means of the refine attribute. However, when a particular refinement requires that the element value use a different encoding scheme, a separate element has been defined. The names for these refined elements have been formed as in [DCQ-HTML] by concatenating the DC element name and the refinement name with an intervening dot.

The most recent version of the XML schema for the OLAC metadata set (though it does not yet match this specification) is as follows:

Section 2 below describes the attributes used in implementing the OLAC metadata set. Section 3 then describes each of the elements that make up the OLAC metadata set.

2. Attributes

Three attributes—refine, code, and lang—are used throughout the metadata set to handle most qualifications to Dublin Core. Some elements in the OLAC metadata set use the refine attribute to identify element refinements. These qualifiers make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope [DC-Q].

Some elements in the OLAC metadata set use the code attribute to hold metadata values that are taken from a specific encoding scheme. When an element may take this attribute, the attribute value specifies a precise value for the element taken from a controlled vocabulary or formal notation described in another OLAC document. In such cases, the element content may also be used to specify a freeform elaboration of the coded value.
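
As an illustration of how a consumer might read an element that carries both a coded value and a freeform elaboration, here is a minimal Python sketch. It is not part of this specification: the function name read_coded_element and the layout of the returned dictionary are assumptions made for this example, and the sample elements are borrowed from the Format.cpu examples in Section 3.

```python
import xml.etree.ElementTree as ET


def read_coded_element(xml_text):
    """Split one metadata element into its coded value and elaboration.

    Hypothetical helper: the dictionary keys are illustrative only.
    """
    element = ET.fromstring(xml_text)
    return {
        "element": element.tag,
        # None when the element carries no code attribute.
        "code": element.get("code"),
        # Collapse line breaks and runs of spaces in the freeform text.
        "elaboration": " ".join((element.text or "").split()),
    }


if __name__ == "__main__":
    print(read_coded_element('<format.cpu code="ppc"/>'))
    print(read_coded_element(
        '<format.cpu code="x86">At least 64M\n    memory</format.cpu>'))
```

The coded value and the elaboration are kept apart so that the code can be checked against a controlled vocabulary while the elaboration is treated as display text.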

Every element in the OLAC metadata set may use the lang attribute. It specifies the language in which the text in the content of the element is written. The value for the attribute comes from the controlled vocabulary defined by [OLAC-Language]. By default, the lang attribute has the value "en", for English. Whenever the language of the element content is other than English, the lang attribute should be used to identify the language. By using multiple instances of the metadata elements tagged for different languages, data providers may offer their metadata records in multiple languages.

In addition, there is a lang attribute on the <olac> element that contains the metadata elements for a given metadata record. It lists the languages in which the metadata record is designed to be read. This attribute holds a space-delimited list of language codes from the [OLAC-Language] controlled vocabulary. By default, this attribute has the value "en", for English, indicating that the record is aimed only at English readers. If an explicit value is given for the attribute, then the record is aimed at readers of all the languages listed.

Service providers should use this information in order to offer multilingual views of the metadata. When a metadata record lists only one alternative language, then all elements are displayed (regardless of their individual languages), unless the user has requested to suppress all records in that language. When a metadata record has multiple alternative languages, the user should be able to select one and have display of elements in the other languages suppressed. An element in a language not included in the list of alternatives should always be displayed (for instance, the vernacular title of a work).
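
The display rules above can be sketched in Python as follows. This is only one possible reading of the rules, not a reference implementation: the function name visible_elements, the sample record, and the French description in it are hypothetical, and the option of suppressing whole records in a given language is left out.

```python
import xml.etree.ElementTree as ET

DEFAULT_LANG = "en"  # default stated above for both lang attributes


def visible_elements(record_xml, preferred):
    """Return (tag, lang, text) for the elements a user should see.

    preferred is the alternative language selected by the user.
    """
    record = ET.fromstring(record_xml)
    # Space-delimited list of languages the record is designed for.
    alternatives = record.get("lang", DEFAULT_LANG).split()
    shown = []
    for element in record:
        lang = element.get("lang", DEFAULT_LANG)
        if (len(alternatives) == 1          # single language: show all
                or lang not in alternatives  # e.g. a vernacular title
                or lang == preferred):       # the user's selection
            shown.append((element.tag, lang, (element.text or "").strip()))
    return shown


# Hypothetical record aimed at readers of two languages.
RECORD = """<olac lang="en x-sil-llu">
  <title lang="x-sil-llu">Na tala 'uria na idulaa diana</title>
  <title lang="en">The road to good reading</title>
  <description lang="fr">Un livre de lecture.</description>
</olac>"""

if __name__ == "__main__":
    for row in visible_elements(RECORD, "en"):
        print(row)
```

With "en" selected, the sketch keeps the English title and the French description (French is not among the record's alternatives) and suppresses the x-sil-llu title.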

3. Elements

Each element of the OLAC metadata set is described in one of the following subsections. The heading gives the generic identifier of the XML tag used to encode the element. Under the heading, the element is described in five ways. Name gives a descriptive label for the tag. Definition is a one-line summary of what the element is used for. Comments offers details on how to use the element; the first paragraph typically repeats the comment from [DCMES1.1], while the remaining paragraphs give further specification for how OLAC uses the element. Attributes describes the XML attributes used with the element. Examples shows samples of properly encoded elements.

In a given metadata record, every element is optional and every element is repeatable.

Definition: An entity responsible for making contributions to the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity. [DCMES1.1]

The name should be given in a form that is ready for sorting within an index. For the names of persons, this means that the name should be given in inverted order with the surname first. For the names of organizations, this means that any initial article should be omitted. When a resource has more than one contributor, use a separate Contributor element for each one.
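
To show why the sort-ready form matters, here is a small Python sketch of how a service provider might build an index from name values. It is an assumption-laden illustration: the helper names are hypothetical, only English articles are handled, and no attempt is made to invert personal names automatically, since that cannot be done reliably.

```python
import re

# Assumption: only the English articles "the", "an", and "a" are stripped.
_INITIAL_ARTICLE = re.compile(r"^(?:the|an|a)\s+", re.IGNORECASE)


def organization_index_form(name):
    """Drop one leading article from an organization name."""
    return _INITIAL_ARTICLE.sub("", name.strip(), count=1)


def build_index(names):
    """Sort names that are already in index-ready form, ignoring case."""
    return sorted(names, key=str.casefold)


if __name__ == "__main__":
    names = [
        "Smith, John L.",
        organization_index_form("The Linguistic Society of America"),
        "Bloomfield, Leonard",
    ]
    print(build_index(names))
```

Because the values are stored in inverted or article-free form, a plain sort is enough to produce the index; no name parsing is needed at display time.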

Contributor is closely related to Creator. The Contributor designation is used for those entities whose role in the creation of the resource is not great enough to merit recognition as a primary source of the intellectual content.

Examples

A generic contributor:

<contributor>Smith, John L.</contributor>

A funding agency:

<contributor refine="funder">National Science Foundation</contributor>

To do

Is it indeed right to list sponsors and funders here?

Definition: The extent or scope of the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and that, where appropriate, named places or time periods be used in preference to numeric identifiers such as sets of coordinates or date ranges. [DCMES1.1]

Examples

To do

We need help from our librarians and archivists here. How do we want to recommend using this element? I suspect we will want to modify the above comment from DCMI. Since our database of the 6,800+ languages of the world already records where they are spoken we don't need to encode that in the metadata. The service providers can supply options to search by country, and then use the language database to find the languages spoken in that country. In the OLAC context, what would be good ways to use the Coverage element?

Definition: An entity primarily responsible for making the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity. [DCMES1.1]

The name should be given in a form that is ready for sorting within an index. For the names of persons, this means that the name should be given in inverted order with the surname first. For the names of organizations, this means that any initial article should be omitted. When a resource has more than one creator, use a separate Creator element for each one.

Creator is closely related to Contributor. In determining whether an entity is a Creator (as opposed to a Contributor), use the same criteria that are followed for deciding that an entity should be listed in the "author" slot of a bibliographic reference as a primary source of the intellectual content. Entities that do not merit that level of recognition should be treated as Contributors.

Examples

A personal author:

<creator>Bloomfield, Leonard</creator>

An institutional author:

<creator>Linguistic Society of America</creator>

An editor:

<creator refine="editor">Sapir, Edward</creator>

To do

Develop the controlled vocabulary and write the OLAC-Role document.

Definition: A date associated with an event in the life cycle of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Following [DCMES1.1], we recommend that the date be encoded in YYYY-MM-DD format as defined in [W3CDTF]. Use two digits for month and day, even when the value is less than ten. This guarantees that a service provider can always do a straight alphanumeric sort to put values into correct chronological order.
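
The sorting guarantee can be checked with a short Python sketch. The function names and the regular expression are assumptions made for this example; the expression accepts only the year, year-month, and full-date forms, not the time-of-day forms that [W3CDTF] also defines.

```python
import re

# Year, year-month, or full date, with two-digit month and day.
_PADDED_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def is_padded_date(value):
    """True if value looks like YYYY, YYYY-MM, or YYYY-MM-DD."""
    return _PADDED_DATE.fullmatch(value) is not None


def chronological(dates):
    """Sort date strings; a plain string sort suffices when padded."""
    bad = [d for d in dates if not is_padded_date(d)]
    if bad:
        raise ValueError(f"not zero-padded dates: {bad}")
    return sorted(dates)


if __name__ == "__main__":
    print(chronological(["1996-10-16", "1992", "1996-02-03"]))
    # Without padding, a string sort puts October before February.
    print(sorted(["1996-2-3", "1996-10-16"]))
```

The second call shows the failure that the two-digit rule prevents: the unpadded value "1996-2-3" sorts after "1996-10-16" even though it is the earlier date.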

Examples

A typical year of publication:

<date>1992</date>

A resource modified on October 16, 1996:

<date refine="modified">1996-10-16</date>

Definition: An account of the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content, or a free-text account of the content. [DCMES1.1]

No formatting conventions are defined within the text of Description. Service providers may format the entire Description as a single paragraph, collapsing adjacent white space characters into a single space.

When there is a URL for a document that describes the resource, use a separate Description element to encode just that URL. A Description that begins with "http:" will be interpreted by service providers as consisting solely of a URL and will be presented as a link in user interfaces. Service providers are not obliged to search other Description text for the occurrence of URLs.
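
A service provider's handling of Description, as described in the two paragraphs above, might look like the following Python sketch. The function name render_description and the "link" and "paragraph" labels are hypothetical; how a link or paragraph is actually rendered is left to the user interface.

```python
def render_description(text):
    """Classify one Description value for display.

    Returns ("link", url) when the value begins with "http:",
    otherwise ("paragraph", text) with white space collapsed.
    """
    # Collapse adjacent white space characters into a single space.
    normalized = " ".join(text.split())
    if normalized.startswith("http:"):
        return ("link", normalized)
    # Other text is not searched for embedded URLs.
    return ("paragraph", normalized)


if __name__ == "__main__":
    print(render_description(
        "http://www.ldc.upenn.edu/Catalog/LDC96S37.html\n"))
    print(render_description(
        "The CALLHOME Japanese corpus of telephone\n"
        "            speech consists of 120 unscripted conversations."))
```

Testing only the start of the value keeps the check cheap and matches the statement that service providers need not look for URLs elsewhere in the text.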

Examples

A prose description of a resource:

<description>The CALLHOME Japanese corpus of telephone
            speech consists of 120 unscripted telephone conversations between native
            speakers of Japanese. All calls, which lasted up to 30 minutes, originated in
            North America and were placed to locations overseas (typically Japan). Most
            participants called family members or close friends. This corpus contains
            speech data files ONLY, along with the minimal amount of documentation needed
            to describe the contents and format of the speech files and the software
            packages needed to uncompress the speech data. </description>

A reference to an existing on-line description:

<description>http://www.ldc.upenn.edu/Catalog/LDC96S37.html
</description>

To do

[DC-Q] defines two refinements for Description: table of contents and abstract. Do we want to introduce this?

Definition: The physical or digital manifestation of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. [DCMES1.1]

Examples

For a digitally encoded dictionary:

<format code="text/xml">5,237 entries in a 1.2M XML file.</format>

For a digitally recorded text:

<format code="audio/wav">Duration: 153 seconds. Size: 3.3M. Sampling: 1 channel, 22 kHz, 8 bits.</format>

To do

We need to develop the vocabulary for Format. It should be based on the list of Internet Media Types [MIME] but we will still want our own vocabulary document at least for the purpose of explaining and exemplifying the use of MIME types. But further than that, we may want to pull out a subset of MIME types. We also may want to add some new categories and subtypes for our purposes in order to cover archive holdings that are not digital, e.g. manuscript, print, microform, and so on. The library or archive world probably has such a controlled vocabulary already.

Definition: The CPU required to use a software resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

This element is used in the description of executable programs to identify the kind of CPU that is needed to run them.

Examples

Software that runs on a Power PC:

<format.cpu code="ppc"/>

Software that runs on the Intel family of processors but needs at least 64 megabytes of memory:

<format.cpu code="x86">At least 64M memory</format.cpu>

To do

We need to develop the vocabulary for CPU.

Definition: An encoded character set used by a digital resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

For a resource that is a digitally encoded text, Format.encoding names the encoded character set it uses. For a resource that is a font, Format.encoding names an encoded character set that it is able to render. For a resource that is a software application, Format.encoding names an encoded character set that it can read as input or write as output. Service providers will use this information to match data files with the software tools that can be applied to them.

Examples

To do

We need to develop the controlled vocabulary for Format.encoding. The IANA registry of character set names [IANA-CS] could be used as a starting point, but we will need to innovate beyond this. For instance, we will need to add something about levels of Unicode conformance as defined by our Character Encoding working group.

Definition: A markup scheme used by a digital resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

For a resource that is a text file including markup, Format.markup identifies the markup system it uses, such as the SGML DTD, the XML Schema, the set of Standard Format markers, and the like. For a resource that is a stylesheet or a software application, Format.markup names a markup scheme that it can read as input or write as output. Service providers will use this information to match data files with the software tools that can be applied to them.

The content of the element should be a URI giving an OAI identifier for the markup scheme itself as a resource in an OLAC archive. Thus, if the DTD, Schema, or markup documentation is not already archived in an OLAC repository, the depositor of a marked-up resource must also deposit the documentation for the markup scheme. A resource identified in Format.markup should not also be listed with the requires refinement of Relation.

Examples

To do

Do we want to go a step further and have markup schemes be deposited at OLAC so that we can try to avoid duplicate ids for the same DTD? They could have identifiers like oai:olac:markup:... and be defined as belonging to a set named markup at the OLAC community data provider so that a single OAI harvesting request would retrieve the complete set of known markup schemes.

Definition: An operating system required to use a software resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

This element is used in the description of executable programs to identify the operating system environment that is needed to run them.

Examples

Software that runs under OS/2:

<format.os code="OS/2"/>

Software that runs only under Windows NT 4.0 or higher:

<format.os code="MSWindows">NT 4.0 or higher</format.os>

To do

We need to develop the vocabulary for OS.

Definition: A programming language of software distributed in source form.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Identifies a programming language used by software that is distributed in source code form.

Examples

Source code that is written in C++:

<format.sourcecode code="C++"/>

Source code that is written in Java using the version 1.2 library:

<format.sourcecode code="Java">Version 1.2 library</format.sourcecode>

To do

We need to develop the vocabulary for Sourcecode.

Definition: An unambiguous reference to the resource within a given context.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Recommended best practice is to identify the resource by means of a string or number conforming to a globally-known formal identification system. Example formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI), and the International Standard Book Number (ISBN). [DCMES1.1]

In the case of a resource that is not electronically encoded, but is housed in a conventional archive, Identifier may be used to give a local shelf or box number, or whatever scheme is used to locate a resource within the collection.

Identifiers that begin with "http:" will be interpreted by service providers as URLs and be presented as links in user interfaces. Note that Identifier is to be used only for a URL that retrieves the actual resource; use Description for a URL that retrieves just a description of the resource.

Do not specify the "oai:" identifier for the resource as a value of Identifier, since it is already given in the header of the metadata record.
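
The guidance on Identifier values can be summarized in a short Python sketch that a data provider might use to check records before exposing them. The function name classify_identifier and its three labels are hypothetical and chosen only for this illustration.

```python
def classify_identifier(value):
    """Classify an Identifier value according to the usage notes.

    "redundant": an "oai:" identifier, already given in the record header
    "link":      begins with "http:", presented as a link to the resource
    "local":     anything else, such as a shelf or box number
    """
    value = value.strip()
    if value.startswith("oai:"):
        return "redundant"
    if value.startswith("http:"):
        return "link"
    return "local"


if __name__ == "__main__":
    for sample in ("http://arxiv.org/abs/cs.CL/0010033",
                   "Shelf 12, Box 7",
                   "oai:somearchive:holding126"):
        print(sample, "->", classify_identifier(sample))
```

A value classified as "redundant" is a candidate for removal from the record, since it repeats what the record header already states.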

Examples

A Uniform Resource Locator for retrieval of an electronically encoded resource:

<identifier>http://arxiv.org/abs/cs.CL/0010033</identifier>

A local identifier for retrieval within a physical collection:

<identifier>Shelf 12, Box 7</identifier>

Definition: A language of the intellectual content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Language is used for a language the resource is in, as opposed to the language it describes (see Subject.language). It is related to the audience for the work in that it identifies a language that the creator of the resource assumes that its eventual user will understand. When a resource is in more than one language, use a separate Language element for each language.

For a work of literature or other monolingual document aimed at the speakers of a particular language, use Language to identify that language. For a sound recording, use Language for the language being spoken in the recording. For a grammatical description, for instance, use Language for the language the grammar is written in; use Subject.language for the language whose grammar is being described. For an annotated text, use Language for the language in which the annotations are made; use Subject.language for the language of the base text that is being annotated. For a bilingual dictionary, use Language for the language in which the definitions are written; use Subject.language for the language whose words are being defined.

Examples

A resource in English about the Sikaiana language:

<language code="en"/>
<subject.language code="x-sil-sky"/>

A Yemba-French dictionary, where the alternate name Dschang is preferred:

<language code="fr"/>
<subject.language code="x-sil-ban">Dschang</subject.language>

The American Heritage Dictionary, which is both in and about American English:

<language code="en-us"/>
<subject.language code="en-us"/>

A resource about a language for which the controlled vocabulary does not yet provide a code:

<subject.language>Ancient Sumerian</subject.language>

To do

Add an example of specifying a dialect.

Write the OLAC-Language document.

Definition: An entity responsible for making the resource available.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Examples of a Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity. [DCMES1.1]

The name should be given in a form that is ready for sorting within an index. For the names of persons, this means that the name should be given in inverted order with the surname first. For the names of organizations, this means that any initial article should be omitted. When a resource has more than one publisher, use a separate Publisher element for each one.

Examples

A typical publisher:

<publisher>Oxford University Press</publisher>

Definition: A reference to a related resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

This element is used to document relationships between resources. Recommended best practice is to reference the related resource by means of an "oai:" identifier; this means that a metadata record for it should be placed in an archive. In cases where the metadata for the related resource is not in an archive, check if Source is the right element. Otherwise, enter a free text description of the related resource.

For a required markup definition (like a DTD or Schema) use Format.markup rather than Relation.

A Relation that begins with "oai:" should be presented by service providers as an active link that retrieves the metadata for that resource.

Examples

A link to a required font:

<relation refine="requires">oai:sil:software/ipafont</relation>

Links to the component pieces of a collected work:

<relation refine="hasPart">oai:somearchive:holding126</relation>
<relation refine="hasPart">oai:somearchive:holding127</relation>
<relation refine="hasPart">oai:somearchive:holding128</relation>
<relation refine="hasPart">oai:somearchive:holding129</relation>
<relation refine="hasPart">oai:somearchive:holding130</relation>

Definition: Information about rights held in and over the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Typically, a Rights element will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions can be made about the status of these and other rights with respect to the resource. [DCMES1.1]

Examples

To do

Write the OLAC-Rights document.

Add examples after we work out the controlled vocabulary.

Definition: A reference to a resource from which the present resource is derived.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

This element describes where the resource came from. For instance, it may be the bibliographic information about a printed book of which this is the electronic encoding or from which the information was extracted. It may be a journal, volume, and page reference if the resource represents a published article. It may be the name and dates of a conference at which a paper was originally presented.

Examples

An encoded edition of a book:

<source>An encoded edition of Lau Dictionary, by Charles E. Fox. Pacific Linguistics C-24, 1974.</source>

A conference paper:

<source>A paper presented at Workshop on Web-based Language Documentation and Description, Philadelphia, PA, 12-15 December 2000.</source>

Data extracted from a published source:

<source>Kwara'ae flora vocabulary extracted from Guide to the Forests of the British Solomon Islands, by T. C. Whitmore. Oxford University Press, 1966.</source>

To do

Need some help from our librarians and archivists here. [DCMES1.1] says "The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system." I find this hard to distinguish from Relation.isVersionOf and Relation.isPartOf. Thus I've come up with a definition that makes Source and Relation distinct. I would also propose to change the last two words of the definition from "is derived" to "originally came". Am I on the right track?

Definition: The topic of the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. [DCMES1.1]

Examples

To do

Need help here from our librarians and archivists. Is there a controlled vocabulary we want to recommend, or should we just recommend freeform use of keywords?

Definition: A language which the content of the resource describes or discusses.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

See Language for a complete discussion (with examples) of using the Language and Subject.language elements.

Examples

Definition: A name given to the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Typically, a Title will be a name by which the resource is formally known. [DCMES1.1]

A translation of the title can be supplied in a second Title element. Use the lang attribute to identify the language of these elements.

Examples

A typical title:

<title>A Dictionary of the Nggela Language</title>

A vernacular title with translation:

<title lang="x-sil-llu">Na tala 'uria na idulaa diana</title>
<title lang="en">The road to good reading</title>

Definition: The nature or genre of the content of the resource.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

Type includes terms describing general categories, functions, genres, or aggregation levels for content. To describe the physical or digital manifestation of the resource, use the Format element. [DCMES1.1]

Examples

The resource is a video recording:

<type code="image"/>

Definition: The nature or genre of the content of the resource from a linguistic standpoint.
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

For a resource that is information in or about a language, Type.data identifies what kind of information it is from a linguistic standpoint. For a resource that is a software tool, Type.data identifies what kind of information it processes. Service providers may use this information to match data files with software tools that might be applied to them.

Examples

The resource describes the grammar of a language:

<type.data code="description/grammar"/>

The resource includes the orthographic transcription of text:

<type.data code="transcription/orthographic"/>

To do

Write the OLAC-Data document.

Definition: Software Functionality
Refinements

There are no refinements for this element.

Schemes

There are no encoding schemes for this element.

Usage notes 

This element is used with resources that are software applications to classify what they are used for.

Examples

To do

Write the OLAC-Functionality document. We may want to base it on the HLT Survey http://cslu.cse.ogi.edu/HLTsurvey/ as advocated by the ACL/DFKI Natural Language Software Registry.

Add examples after we figure out the vocabulary.


To do

There is not yet a provision for handling subject categorization by linguistic classification.

Rights.software has been left in limbo. It may be possible to unify it with Rights.


References

[DC-Q] Dublin Core Qualifiers.
<http://dublincore.org/documents/2000/07/11/dcmes-qualifiers/>
[DC-Type] DCMI Type Vocabulary.
<http://dublincore.org/documents/2000/07/11/dcmi-type-vocabulary/>
[DCMES1.1] Dublin Core Metadata Element Set, Version 1.1: Reference Description.
<http://dublincore.org/documents/1999/07/02/dces/>
[DCQ-HTML] Recording qualified Dublin Core metadata in HTML meta elements.
<http://dublincore.org/documents/2000/08/15/dcq-html/>
[IANA-CS] Internet Character Sets.
<http://www.isi.edu/in-notes/iana/assignments/character-sets>
[MIME] Internet Media Types.
<http://www.isi.edu/in-notes/iana/assignments/media-types/media-types>
[OAI] Open Archives Initiative.
<http://www.openarchives.org/>
[OLAC] Open Language Archives Community.
<http://www.language-archives.org/>
[OLAC-CPU] OLAC CPU Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Data] OLAC Data Type Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Encoding] OLAC Encoding Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Format] OLAC Format Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Functionality] OLAC Functionality Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Language] OLAC Language Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-OS] OLAC Operating System Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Rights] OLAC Rights Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Role] OLAC Role Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-Sourcecode] OLAC Source Code Vocabulary.
<http://www.language-archives.org/OLAC/???>
[OLAC-WP] White Paper on Establishing an Infrastructure for Open Language Archiving.
<http://www.language-archives.org/docs/white-paper.html>
[W3CDTF] Date and Time Formats, W3C Note.
<http://www.w3.org/TR/NOTE-datetime>