Status of document: Proposed Informational Note. This document is in the midst of open review by the community.
In addition to the olac metadata format, the OLAC Aggregator [OLACA] serves records in two other formats: olac_display and oai_dc. This document specifies how an OLAC record is transformed into these two formats. The first is a reader-friendly view of OLAC metadata that may be used by someone building a service that displays OLAC metadata; it translates coded values into their human-readable equivalents. The second is the standard format used by the Open Archives Initiative (OAI) for metadata interchange. Thus the OLAC Aggregator serves as a crosswalk that transforms the olac format records supplied by OLAC's participating archives into oai_dc format records that can be used by the wider OAI community.
Changes since previous version:
This draft describes a total reimplementation of the two formats that was completed in May 2009. The original implementation and specification maintained a one-to-one correspondence between the elements of the original olac record and the elements of the olac_display and oai_dc records. The philosophy of transformation is now very different in that a one-to-many mapping of elements is allowed. The result is oai_dc records that are more in keeping with best practice in the OAI community.
Copyright © 2009 Gary Simons (SIL International). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.
In order to improve recall and precision in searching, the OLAC metadata format [OLAC-Metadata] defines an extension mechanism (involving the xsi:type and olac:code attributes) to support resource description using community-defined controlled vocabularies. Service providers may use these attributes to support precise search. However, those same service providers also need to be able to display metadata records to users in a manner that shows all available information in a form they can understand. This means, for instance, that coded attribute values (such as three-letter language codes) need to be translated into friendly display forms. Still other service providers, such as in the general Open Archives Initiative (OAI) community, will not be interested in the community-specific extensions and will prefer to work with metadata from OLAC participants in the generic oai_dc form without special codes or attributes.
In order to enhance the repurposing of OLAC metadata, the OLAC Aggregator [OLACA] offers such translation services. Neither the OLAC data providers nor potential service providers need worry about the problem of translation. Rather, OLAC data providers need only supply their records in OLAC metadata format to the OLAC Aggregator, which in turn disseminates them to service providers in any of three formats: the olac format, the olac_display format, or the oai_dc format. For instance, the following request to OLACA retrieves a record from the Audio Archive of Linguistic Fieldwork (Berkeley, CA) in olac format as it was supplied by the archive:
By changing the requested metadataPrefix to olac_display, the same record is returned in a format that still conforms to the OLAC metadata standard, but which is enriched by the translation of community-specific codes to human-readable display forms:
Finally, changing the requested metadataPrefix to oai_dc causes the same record to be "dumbed down" into the simple Dublin Core (DC) format that serves as the standard for the OAI community:
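The three requests differ only in the value of metadataPrefix. As a minimal sketch, the following Python function builds an OAI-PMH GetRecord request for each format. The base URL and record identifier are hypothetical placeholders, not the actual OLACA address or the identifier of the archive item discussed above.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: substitute the real aggregator base URL and a
# real OAI identifier. Neither value is taken from this document.
BASE_URL = "https://example.org/olaca"
IDENTIFIER = "oai:example.org:item-1"

# The three formats the aggregator is described as serving.
FORMATS = ("olac", "olac_display", "oai_dc")


def get_record_url(metadata_prefix, base_url=BASE_URL, identifier=IDENTIFIER):
    """Build an OAI-PMH GetRecord request URL for one metadata format."""
    if metadata_prefix not in FORMATS:
        raise ValueError(f"unsupported metadataPrefix: {metadata_prefix!r}")
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    for prefix in FORMATS:
        print(get_record_url(prefix))
```

Only the metadataPrefix argument changes between the three calls; the verb and identifier stay the same.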
Section 2 of this document discusses general design principles that underlie the mapping process for the two formats. Section 3 then gives the specification for the mapping from olac format to olac_display format. Finally, section 4 gives the specification for the transformation to oai_dc format, which in fact is a mapping based on the olac_display format.
The OLAC metadata format is an application profile based on the full set of DC metadata terms, also known as "qualified DC" [DC-Q]. The standard algorithm for "dumbing down" qualified DC into the 15 basic DC elements, or "simple DC" [DC-Simple], is:
Translate dcterms elements (that is, the refinements) to their generic dc equivalent.
Drop all attributes in the element tag (that is, xsi:type for naming encoding schemes and xml:lang for identifying the language of element content).
The OLAC metadata format adds another attribute, olac:code, to hold the value for one of the community-specific vocabularies [OLAC-Extensions]. This is essential information that cannot simply be discarded in a dumb-down process. Thus, the crosswalk needs to augment the above rules to specify what to do with each instance of olac:code. There are five controlled vocabularies for which olac:code is used to hold the value:
Code for Discourse Types (olac:discourse-type)
Code for Identifying Languages (olac:language)
Code for Linguistic Field (olac:linguistic-field)
Code for Linguistic Data Types (olac:linguistic-type)
Code for Participant Roles (olac:role)
In the first four cases, the value of olac:code is the primary value of the metadata element. Thus it must be moved to element content so that it is not lost in the dumb-down to simple DC. In the fifth case, the value of olac:code acts like a refinement of the metadata element. Thus, like other refinements, it is discarded in the dumb-down process and so is not moved to element content.
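The distinction between the two kinds of vocabulary can be expressed as a small lookup. This is an illustrative sketch only; the function and constant names are hypothetical and are not part of the OLAC specification.

```python
# Vocabularies whose olac:code is the primary value of the element, so the
# code must be moved into element content before the dumb-down.
PRIMARY_VALUE_TYPES = {
    "olac:discourse-type",
    "olac:language",
    "olac:linguistic-field",
    "olac:linguistic-type",
}

# Vocabulary whose olac:code behaves like a refinement and is discarded.
REFINEMENT_TYPES = {"olac:role"}


def code_moves_to_content(xsi_type):
    """Return True if the olac:code for this vocabulary is moved to content."""
    if xsi_type in PRIMARY_VALUE_TYPES:
        return True
    if xsi_type in REFINEMENT_TYPES:
        return False
    raise ValueError(f"not an OLAC vocabulary that uses olac:code: {xsi_type!r}")
```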
Another general design principle is that a metadata element containing a value for olac:code may translate into multiple instances of the element. The olac:code and the element content translate to separate instances of the element. Furthermore, if the value of olac:code is an opaque code, an additional instance of the element is generated to hold a display label for the code value.
The purpose of the olac_display format returned by OLACA is to provide a feed that is optimized for metadata display. It is a bridge between the olac format and the oai_dc format. It performs the movement of the olac:code value to the element content and the generation of multiple instances of an element when that is needed. It stops short of the dumb-down process in which refined elements are translated to their generic equivalent and attributes are discarded.
The following principles apply in the transformation to the olac_display format:
The record conforms to the olac metadata schema (that is, it uses the <olac:olac> wrapper with the same set of possible elements and attributes).
No elements or attributes or content are discarded.
No elements are empty; all olac:code values that end up in element content for the oai_dc format are moved (without conversion of underscore) to the element content.
A single element may be transformed to multiple elements as needed for the oai_dc format.
The olac to olac_display transformation is done as follows. If the metadata element matches a pattern in the list below, then perform the operation specified below; otherwise, simply copy the element.
<dc:type xsi:type="olac:discourse-type">. Move the olac:code value to the element content (ignoring any content that may have been there).
<dc:type xsi:type="olac:linguistic-type">. Move the olac:code value to the element content (ignoring any content that may have been there).
<dc:subject xsi:type="olac:linguistic-field">. First, generate a <dc:subject> element that moves the olac:code value to the element content (ignoring any content that may have been there) and that drops xml:lang if it is present. Second, if there was original element content, generate a <dc:subject> element with that original content and no attributes (except the xml:lang if that is present).
<dc:subject xsi:type="olac:language">. First, generate a <dc:subject> element that moves the olac:code value to the element content (ignoring any content that may have been there) and that drops xml:lang if it is present. Second, generate a <dc:subject> with no attributes and with content containing the reference name associated with the code in the ISO 639-3 standard [ISO639-3]; append "language" unless the name of the language already contains that word. Third, if there was original element content and it is different from the language name used in the previous step, generate a <dc:subject> element with that original content and no attributes (except the xml:lang if that is present).
<dc:language xsi:type="olac:language">. First, generate a <dc:language> element that moves the olac:code value to the element content (ignoring any content that may have been there) and that drops xml:lang if it is present. Second, if there is element content, generate a <dc:language> element with that content (and if the content does not already contain the reference name associated with the code in the ISO 639-3 standard [ISO639-3] anywhere in the string, prepend that reference name followed by semicolon and space). Otherwise, in the case that there is no element content, generate a <dc:language> element with the reference name for the language as content.
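The five patterns above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the aggregator's implementation: elements are modeled as a simple hypothetical Element class rather than parsed XML, REFERENCE_NAMES is a two-entry stand-in for the ISO 639-3 table, and the points the rules leave open (the case-insensitive test for the word "language", the comparison used in the third step for dc:subject, the attributes on the second dc:language element, and elements lacking olac:code) are resolved in ways marked as assumptions in the comments.

```python
from dataclasses import dataclass, field

# Stand-in for the ISO 639-3 reference names (two entries, for illustration).
REFERENCE_NAMES = {"eng": "English", "ase": "American Sign Language"}


@dataclass
class Element:
    name: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


def _lang_only(attrs):
    """Keep xml:lang, if present, and drop every other attribute."""
    return {k: v for k, v in attrs.items() if k == "xml:lang"}


def to_display(el):
    """Return the olac_display elements generated from one olac element."""
    xsi_type = el.attrs.get("xsi:type")
    code = el.attrs.get("olac:code")
    if not code:
        return [el]  # assumption: with no code to move, copy unchanged
    key = (el.name, xsi_type)
    # First generated element: the code becomes content and xml:lang is dropped.
    coded = Element(el.name, code,
                    {k: v for k, v in el.attrs.items() if k != "xml:lang"})

    if key in {("dc:type", "olac:discourse-type"),
               ("dc:type", "olac:linguistic-type")}:
        return [Element(el.name, code, dict(el.attrs))]

    if key == ("dc:subject", "olac:linguistic-field"):
        out = [coded]
        if el.text:
            out.append(Element(el.name, el.text, _lang_only(el.attrs)))
        return out

    if key == ("dc:subject", "olac:language"):
        name = REFERENCE_NAMES[code]  # raises KeyError for unknown codes
        # Assumption: the test for the word "language" ignores case.
        label = name if "language" in name.lower() else name + " language"
        out = [coded, Element(el.name, label)]
        # Assumption: original content is skipped if it equals either the
        # bare reference name or the generated label.
        if el.text and el.text not in (name, label):
            out.append(Element(el.name, el.text, _lang_only(el.attrs)))
        return out

    if key == ("dc:language", "olac:language"):
        name = REFERENCE_NAMES[code]
        if not el.text:
            text = name
        elif name in el.text:
            text = el.text
        else:
            text = f"{name}; {el.text}"
        # Assumption: the second element keeps only xml:lang.
        return [coded, Element(el.name, text, _lang_only(el.attrs))]

    return [el]  # no pattern matched: copy the element
```

For example, a dc:subject element with xsi:type="olac:language", olac:code="eng", xml:lang="de" and content "Englisch" yields three dc:subject elements whose contents are "eng", "English language", and "Englisch".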
The olac_display format is the basis for the human-readable displays of metadata on the OLAC site. For instance, an HTML view of the catalog record for the archive item used above as an example in section 1 can be seen at this URL:
The display is made from the olac_display form of the record by showing a label for the metadata element in the left column and the element content in the right column. An attribute, if present, is expressed in the parenthesized string following the metadata element label. If xsi:type="olac:role", then the string in parentheses is the label for the participant role (i.e. the value of olac:code). Otherwise, the string in parentheses is a transformation on the value of xsi:type, which identifies the encoding scheme for the element content. Click on the "OAI-PMH request for simple DC format" link toward the bottom of the page to view the oai_dc form of the record (as described in the next section).
In order to participate in the wider community of OAI service providers, OLAC data providers must also publish their metadata records in the simple Dublin Core format prescribed by the OAI [OAI_DC]. There is no need for OLAC data providers to store the records in both formats, however, since the information in the oai_dc format is a subset of the information in the olac format. An oai_dc record may thus be automatically derived from an OLAC record. A program that transforms a metadata record from one format to another is conventionally called a "crosswalk"; see [Zeng2007] for other examples of crosswalks and pointers to discussions of crosswalking issues.
The OLAC Aggregator also supports the oai_dc format. It thus functions as an OLAC-to-OAI_DC crosswalk since it harvests only OLAC metadata and performs the transformation to oai_dc format upon request. Transforming a metadata record from OLAC format to olac_display format goes most of the way toward implementing the OLAC-to-OAI_DC crosswalk. In order to complete the mapping and transform an element in an olac_display record to the corresponding element of the oai_dc record, the following special cases are observed:
If the element has an xsi:type of olac:linguistic-type, olac:linguistic-field, or olac:discourse-type, change all underscores in the element contents to spaces.
If the element is <dc:type xsi:type="olac:linguistic-type">, prepend "Linguistic type:" to the element contents.
If the element is <dc:type xsi:type="olac:discourse-type">, change it to <dc:description> and prepend "Discourse type:" to the element contents. (The logic behind this rule is that from the standpoint of the cataloging community in general, the OLAC discourse type is more like a description than a type.)
If the element is <dc:contributor olac:code="author">, generate the result as a <dc:creator> element.
If the element is <dc:subject xsi:type="olac:language"> and the record already has a <dc:language> with the same value for olac:code as this <dc:subject> element, then discard this element. Otherwise, generate a <dc:language> element with the value of olac:code as its contents. (The logic behind this rule is that the DC standard expects ISO 639 codes with the <dc:language> element, but not with <dc:subject>.)
Generate only one <dc:date> element using the contents of the first available date-related element in this order of preference: dc:date, dcterms:issued, dcterms:dateCopyrighted, dcterms:created, dcterms:available, dcterms:dateAccepted, dcterms:dateSubmitted, dcterms:modified, dcterms:valid. Discard all other date-related elements. (The logic behind this rule is that a simple DC record should have only one <dc:date> element; for instance, see [DRIVER].)
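The date rule amounts to a search through the elements in a fixed order of preference. A minimal sketch, assuming the record is available as a list of (element name, content) pairs; that representation and the function name are hypothetical:

```python
# Order of preference for the single <dc:date> element, as listed above.
DATE_PREFERENCE = [
    "dc:date",
    "dcterms:issued",
    "dcterms:dateCopyrighted",
    "dcterms:created",
    "dcterms:available",
    "dcterms:dateAccepted",
    "dcterms:dateSubmitted",
    "dcterms:modified",
    "dcterms:valid",
]


def pick_date(elements):
    """Return the content for the single dc:date, or None if no date exists.

    elements is a list of (name, text) pairs in document order. When the
    preferred element occurs more than once, the first occurrence wins.
    """
    for preferred in DATE_PREFERENCE:
        for name, text in elements:
            if name == preferred:
                return text
    return None
```

For a record containing dcterms:modified, dcterms:created, and dcterms:issued, the function returns the content of dcterms:issued, since that element ranks highest of the three.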
Then the following two dumb-down rules apply in general:
If the element is in the dcterms namespace, output it as its more generic dc equivalent.
Discard all attributes.
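The element-level rules can be sketched as a single function that applies the special cases for underscores, prefixes, and authors, and then the two dumb-down rules. This is an illustration under assumptions: the Element class is a hypothetical stand-in for parsed XML, REFINES lists only some of the dcterms refinements, and the prefixes are prepended exactly as quoted in the rules, with no added space. The record-level rules (language subjects and the single date) are not covered here.

```python
from dataclasses import dataclass, field

# Partial table of dcterms refinements and their generic dc equivalents.
REFINES = {
    "dcterms:alternative": "dc:title",
    "dcterms:abstract": "dc:description",
    "dcterms:tableOfContents": "dc:description",
    "dcterms:created": "dc:date",
    "dcterms:issued": "dc:date",
    "dcterms:modified": "dc:date",
    "dcterms:extent": "dc:format",
    "dcterms:medium": "dc:format",
    "dcterms:isPartOf": "dc:relation",
    "dcterms:spatial": "dc:coverage",
    "dcterms:temporal": "dc:coverage",
}

UNDERSCORE_TYPES = {
    "olac:linguistic-type",
    "olac:linguistic-field",
    "olac:discourse-type",
}


@dataclass
class Element:
    name: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


def to_oai_dc(el):
    """Map one olac_display element to its oai_dc counterpart."""
    name, text = el.name, el.text
    xsi_type = el.attrs.get("xsi:type")

    # Special cases that rely on attributes, applied before they are dropped.
    if xsi_type in UNDERSCORE_TYPES:
        text = text.replace("_", " ")
    if name == "dc:type" and xsi_type == "olac:linguistic-type":
        text = "Linguistic type:" + text
    elif name == "dc:type" and xsi_type == "olac:discourse-type":
        name = "dc:description"
        text = "Discourse type:" + text
    elif name == "dc:contributor" and el.attrs.get("olac:code") == "author":
        name = "dc:creator"

    # General dumb-down: generic element name, no attributes.
    name = REFINES.get(name, name)
    return Element(name, text)
```

Applying the special cases first matters because they test xsi:type and olac:code, which the final step discards.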
[DC-Q] DCMI Metadata Terms.
[DC-Simple] Dublin Core Metadata Element Set, Version 1.1.
[DRIVER] DRIVER Guidelines 2.0: Guidelines for content providers - Exposing textual resources with OAI-PMH, November
[ISO639-3] ISO 639-3 Downloads.
[OAI_DC] XML schema for OAI implementation of Dublin Core
[OLAC-Extensions] Recommended metadata extensions
[OLACA] OLACA: The OLAC Aggregator.
[Zeng2007] Zeng, Marcia Lei. 2007. Metadata Crosswalks.