OLAC Metadata Working Group

Dec 2007 review of three draft documents

The following documents were reviewed:

OLAC Metadata Usage Guidelines
http://www.language-archives.org/NOTE/usage-20071115.html
Best Practice Recommendations for Language Resource Description
http://www.language-archives.org/REC/bpr-20071115.html
OLAC Metadata Quality Metrics
http://www.language-archives.org/NOTE/metrics-20071117.html

This document compiles the full thread of discussion on each issue raised in feedback (see METADATA archives – December 2007 for original messages). It then reports, on each issue, the conclusion of the document editors on changes to be made in the next version of the documents.

Issues raised during the review:

Creator vs. Contributor [Conclusion]
Relation [Conclusion]
Enhancements to Linguistic Type [Conclusion]
Granularity Recommendation [Conclusion]
Element Content of Subject Language [Conclusion]
Reference to Subject Language in xml:lang Attribute [Conclusion]
Coverage [Conclusion]
Proposal for Place Element [Conclusion]
Typos

1. Creator vs. Contributor

If we have roles already why do we separate out contributor and creator and not just list them as we do all other roles? Is this a way of enforcing that each item must have a creator and contributor? (Nicholas Thieberger 4-Dec-2007)

I would like to add my support… If use of Creator is deprecated, as the guidelines imply, why keep mentioning it? Let's drop it and make sure the examples for Contributor include obvious 'creators', like 'author' and 'performer'. (Heidi Johnson 5-Dec-2007)

I too am still confused by the distinction between creator and contributor. Our db distinguishes roles, so we have no easy way to map these. We instead map all of them to contributor, as in the alternate usage schema. However, in terms of resource discovery, this is probably not an issue. (Gary Holton 5-Dec-2007)

[I agree] with Heidi on deprecating "creator" -- at least that is what we have done at LACITO. Among "contributors" our documents have a "depositor" and usually a "researcher" (often the same person) and a "speaker", etc. (Boyd Michailovsky 10-Dec-2007)

Conclusion: Creator vs. Contributor

We seem to have a consensus that the Creator vs. Contributor distinction is confusing and that our usage notes need to do better at explaining. The DCMI has also run into this problem. They have not deprecated Creator, so in order to maintain compatibility with the mainstream, we don't think that we should either. However, they have come to the conclusion that Creator is a kind of Contributor (as indicated in the latest version of the DCMI Metadata Terms, issued 2008-01-14) and that a vocabulary to refine roles is applicable only to Contributor (as indicated in the Library Application Profile). Our usage notes have therefore been revised to permit olac:role only on Contributor and to explain that Creator should be used only when the role vocabulary has no suitable more specific role. The OLAC-to-Simple-DC crosswalk in the OLAC Aggregator will be updated to map Contributor roles like author to a Creator element (so as to maintain interoperability with more general metadata standards like the Eprints Application Profile).
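The crosswalk step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Aggregator's actual code: the function name, the tuple layout, and the sample names are hypothetical, and the set of roles treated as creator-like contains only "author" because that is the only role the conclusion names.

```python
# Hypothetical sketch of the Contributor step in an OLAC-to-Simple-DC crosswalk.
# Assumption: only "author" is listed as creator-like, since it is the only
# role named in the conclusion; a real crosswalk would define the full set.
CREATOR_LIKE_ROLES = {"author"}


def crosswalk_contributor(name, role=None):
    """Return a (simple_dc_element, value) pair for one OLAC Contributor."""
    if role in CREATOR_LIKE_ROLES:
        # A creator-like role is emitted as a Simple DC Creator element.
        return ("creator", name)
    # Any other role, or no role at all, stays a plain Contributor.
    return ("contributor", name)


# Hypothetical sample data: (name, olac:role code or None).
contributors = [
    ("Doe, Jane", "author"),
    ("Roe, Richard", "depositor"),
    ("Poe, Pat", None),
]
simple_dc = [crosswalk_contributor(name, role) for name, role in contributors]
```

Keeping the creator-like roles in one named set makes the policy decision visible and easy to revise once the editors settle which roles count.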

2. Relation

While a transcript could be considered under some of these refinements (requires, haspart, hasversion), none of them is specific enough. As discussed on the Olac-Implementers list in December 2005, could OLAC extend its metadata to include the following relation pair:

hasTranscript
isTranscriptOf

(Nicholas Thieberger 4-Dec-2007)

Let me second Nick's suggestion to add isTranscriptOf and hasTranscript for Relation. I've been puzzled by their absence for quite some time. I imagine that there may be a few other relations linguists might ultimately need, but these two seem like pretty basic and necessary additions to me right now. (Jeff Good 4-Dec-2007)

I third the vote to add hasTranscript and isTranscriptOf. I doubt most linguists would regard a transcription as a format variant. (Heidi Johnson 5-Dec-2007)

I want to 'fourth' the recommendation that we add hasTranscript and isTranscriptOf. (Helen Aristar-Dry 5-Dec-2007)

I agree with Nick et seqq. on "hasTranscript" and "isTranscriptOf". (Boyd Michailovsky 10-Dec-2007)

Also, what relation should be used for annotation files, like interlinear text? It's not a format variant or a version. The current list of relations was taken directly from dcterms without, to the best of my recollection, any review or revision by us. I think we should take this opportunity to revisit the whole list. We should at least have straightforward terms to characterize relations among components in a typical language documentation bundle: recording, annotation, photograph. (Heidi Johnson 5-Dec-2007)

hasTranscript and isTranscriptOf are definitely needed. I use these for all types of annotations, whether IT or not. That would include Transcriber, Shoebox, and ELAN. Is that not BP? This is the large view of annotation, where plain-text transcription is simply one point on a continuum. Perhaps hasAnnotation would be more appropriate? Either way, we certainly want to distinguish the transcript/annotation from the resource itself. As Heidi points out, the transcription is not just a format variant. Indeed, a single resource may have multiple different transcripts by different creators (er .. contributors). (Gary Holton 5-Dec-2007)

Conclusion: Relation

Agreed that we need the hasTranscript and isTranscriptOf pair. The trick is how to implement it. Since these are refinements, we would be adding to the set of markup elements defined by the DC schema for expressing qualified Dublin Core in XML (which is what we are following). To this point, all of our metadata extensions have been as encoding schemes that are used in attribute values. We therefore could not implement this without a significant departure from the constraints we have followed to this point. The DCMI is working on a major revision of their guidelines for expressing qualified DC in XML which will provide an easy way for us to add refinements. Thus we want to hold off and handle this kind of change in a future round of more radical changes in which we would align with the DCMI’s new guidelines after they come out. We are thinking of that as a version 2.0 revision of OLAC metadata. In the meantime, our immediate goal is a version 1.1 revision which will have minimal disruption.

3. Enhancements to Linguistic Type

Is this the right time to suggest an enhancement of OLAC-linguistic-type to include more fine-grained details, e.g.

Extensive dictionary / large wordlist / wordlist
Grammatical description / grammar sketch

Making this distinction will help as we may be looking to establish a documentation index from these metadata, in order to know that language XXX has YYY amount of information available about it. (Nicholas Thieberger 4-Dec-2007)

I'm glad he remembered to bring this up, because it's very important. We should do whatever we can to support the development of a documentation index; that could turn out to be the single most useful thing OLAC has ever done. (Heidi Johnson 5-Dec-2007)

Conclusion: Enhancements to Linguistic Type

Agreed. This will take the form of a new version of the Recommendation on OLAC Linguistic Data Type Vocabulary. That is too extensive a change for version 1.1, and so we are listing that among the goals of version 2.0.

4. Granularity Recommendation

I would like to see recommendation on the so-called "granularity" issue. That is, from OLAC's perspective, what is the appropriate unit on which a metadata record should be based. From Rosetta's perspective (whose OAI server has been defunct for a while, and for that my apologies), for example, our own worldview is that our database consists of language nodes with resources attached. However, I think the general OLAC worldview is that the world consists of linguistic resources. Rosetta can serve metadata on our worldview or OLAC's, and I don't have any strong feelings about this. But, I would like to see an OLAC recommendation on this topic. (To bring up another example, is an IMDI session the appropriate thing to expose to OLAC? Or should it be the more fine-grained components of the session? I imagine the corpus is too big.) (Jeff Good 4-Dec-2007)

I would like to see that, too. There was talk, once upon a time, of returning data at some kind of 'collection' level, but I don't remember any resolution on this issue. As AILLA's collection grows, I can clearly see that it would be more useful to searchers to retrieve summaries of collections than scores of repetitive results for individual files. (Heidi Johnson 5-Dec-2007)

There are three levels: the file, the bundle (IMDI session), and the collection (e.g. the Joel Sherzer Kuna language collection.) Can't there be a way for users or archives to specify which level they want to expose? (Heidi Johnson 5-Dec-2007)

I second Jeff's comments--especially the suggestion that the usage document discuss suggested granularity. We have found that this is the most difficult decision that anyone has to make (whether archive or individual) when attempting to describe their collection. We tell individual linguists (who don't want to write too many metadata descriptions) to simply list a corpus of documentation treating a single language as a single resource. On the other hand, this procedure leaves IMDI listed as having only 15 (?) resources. Guidance would be helpful. (Helen Aristar-Dry 5-Dec-2007)

Regarding the granularity issue, I seem to recall us batting this around quite a bit in early OLAC discussions, with no general consensus. Archivists relentlessly complain to me that we language folks are too granular. Some have suggested that there should be just 20 items in the ANLC archive -- one for each language! Well, that's a bit extreme, but collections are useful. ANLC maintains a subcollection field which is useful for searching. This is often a donor, say Irene Reed or Knut Bergsland. Materials within those subcollections also show up in subject.language searches, of course, but often the subcollection is of interest. Indeed, this is the classic type of organization for traditional archives. Other subcollections identify funded projects, or particular speakers. Maybe this is an archive-internal issue and not one which is relevant to resource discovery. However, as Heidi points out, it would be useful for OLAC service providers to be able to offer more or less granular search opportunities, so that users could have the option of not having to wade through hundreds of very closely related records. (Gary Holton 5-Dec-2007)

Certainly it would be nice to be able to access metadata by collection. (Boyd Michailovsky 10-Dec-2007)

The granularity issue is addressed in section 6, “Guidelines concerning relevance and granularity” of our Standard, OLAC Repositories. It points out that too fine a level of granularity degrades the signal-to-noise ratio in searching and that this is legitimate grounds for rejecting the registration application of a repository. It specifically says that creating a separate record for each file that makes up the documentation of a single speech event is not appropriate. Going to the other extreme of having a single record to describe an entire collection is in fact considered good practice at the resource discovery level, since the thing people will be using OLAC search for is primarily to discover the existence of the corpus, rather than to select individual items from within the corpus. The latter is typically done after entering the corpus. (Gary Simons 14-Dec-2007)

I agree with the intuition that 500 [records for individual files in a corpus] is too many, but one seems like too few. Under a scenario where I'm looking for anything on language X, this might seem reasonable but what if, for example, I'm looking for extensive materials from some particular genre. Then, I would want those 500 texts broken down by genre and I would want to have them all bunched together. I don't know what the right balance is here, but this strikes me as shifting the balance too much towards the language-based search. But, I wonder what people who run big archives think about this. ... Something I would find useful here is specific recommendations for mapping records using the IMDI session-corpus model to the OLAC model.

... To sum up, I wonder if somewhere there shouldn't be (i) an enumeration of representative example search scenarios for the data and service provider to consult as a kind of "cheat sheet" guiding them as to how to approach granularity, (ii) a discussion of what the canonical service provider is in the OLAC context, and (iii) a discussion of what kinds of records a service provider should expect from a data provider. (Jeff Good 16-Dec-2007)

Conclusion: Granularity Recommendation

This issue is clearly relevant for OLAC Metadata Usage Guidelines. We have added a new section to the document on “Granularity of Resources.”
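As a rough illustration of moving from the file level to the bundle level discussed in this thread, the sketch below collapses file-level entries into one record per bundle. It is a sketch under assumptions, not a prescribed OLAC procedure: the dictionary keys ("bundle", "file"), the function name, and the sample data are all hypothetical.

```python
from collections import defaultdict


def group_files_by_bundle(file_entries):
    """Collapse file-level entries into one record per bundle (e.g. a session)."""
    bundles = defaultdict(list)
    for entry in file_entries:
        # Every file that documents the same speech event shares a bundle name.
        bundles[entry["bundle"]].append(entry["file"])
    # One record per bundle; the component files are listed inside the record.
    return [{"title": bundle, "files": files} for bundle, files in bundles.items()]


# Hypothetical sample data: three files documenting two speech events.
files = [
    {"bundle": "Story A (session)", "file": "story_a.wav"},
    {"bundle": "Story A (session)", "file": "story_a_transcript.txt"},
    {"bundle": "Story B (session)", "file": "story_b.wav"},
]
records = group_files_by_bundle(files)
```

The same grouping could be applied one level up, with a collection name as the key, for an archive that prefers to expose collection-level records.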

5. Element Content of Subject Language

Should we try to come up with some way of specifying how to interpret the element content of Subject Language when it's there? Right now, two use cases are given: (i) dialect and (ii) alternate name. Since these two cases are so common and have very different semantics, shouldn't there be an (optional) way of specifying what that element content means? It shouldn't be too hard to come up with a way of expressing this. (Jeff Good 4-Dec-2007)

I would like to add my support… (Heidi Johnson 5-Dec-2007)

"For a work of literature or other monolingual document aimed at the speakers of a particular language, use Language to identify that language. ... For an annotated text, use Language for the language in which the annotations are made; use Subject.language for the language of the base text that is being annotated."

This is clear. But, concerning text documents (and not text embedded in a language description), it seems odd to list an annotated text under the language of annotation and the same text unannotated under its own language. Could this cause users searching across OLAC and other documents to fail to find texts in a language they are interested in? Would it not be better to explicitly distinguish annotation language from source text language and to code both right here under DC "Language", which is the first place a user would look? We can maintain our convention (professional deformation?) of considering the "Subject" -- specifically "Subject.language" -- of a text to be the language it is in -- after all, OLAC is primarily for linguists, and languages are our subject. (I recognize that this means having the same information in two places for this type of resource. This is bad, but maybe not too bad if it improves access to our documents.) (Boyd Michailovsky 10-Dec-2007)

We want to show people how to add narrowing or clarifying notes about specific language varieties. I get this sort of note often at AILLA - people will pick the closest ISO code, but add a note explaining how the code isn't quite right. We all know the codes need to be revised, but that will take a long time. It is good for people to make these notes a part of their OLAC records, though, both to encourage their colleagues to think about them and so that code revisers can find them. Maybe something like this:

  <dc:subject xsi:type="olac:ISO639-3" olac:code="zab">
     Tlacolula de Matamoros Zapotec. Ethnologue designation: San Juan Guelavía 
     Zapotec (zab), but this includes many varieties, including Tlacolula de 
     Matamoros Zapotec. Also known as: Valley Zapotec, Tlacolula Valley Zapotec, 
     but these names also include other language varieties.
  </dc:subject>

The note is one that Brook Danielle Lillehaugen, one of my model depositors, pastes into every metadata form. (Heidi Johnson 8-Jan-2008)

Conclusion: Element Content of Subject Language

There are two different issues here. The first regards clarifying the nature of the text content. The first comment points out that there are two main uses of the text content (for an alternate name or for a variety name) and asks if there should be a way to distinguish these two cases. This can be done by means of the wording of the text content; the most straightforward approach is to add the word "dialect" after the name in the case of a variety name. An example like this is given in the document. Other cases of indicating a variety, such as "Women's speech," don't involve a name at all and so do not pose a problem. Still other cases of using the content (like the note that Heidi Johnson gives above as an example) can include multiple names, including both alternate names and variety names.

The second issue has to do with when to identify a language with Language versus with Subject. The usage note for Language has been substantially expanded to clarify various cases.

6. Reference to Subject Language in xml:lang Attribute

In the opening discussion of practices that apply to all elements, the xml:lang attribute is discussed without any reference to Subject Language. I think it wouldn't be a bad idea to distinguish the two immediately, if only in a parenthetical. (Helen Aristar-Dry 5-Dec-2007)

Conclusion: Reference to Subject Language in xml:lang Attribute

Agreed. Both Language and Subject are now mentioned in the discussion of xml:lang in order to distinguish the three places in which language codes can be used.

7. Coverage

I was pleased to see the emphasis on geographical coverage in the usage notes. But--from working with LL-MAP, I have several questions with regard to the following paragraph under 'Coverage':

"In the OLAC context, service providers already have a database that maps languages to the countries in which they are spoken [OLAC-Language]. Coverage should not be used to duplicate this information; rather service providers will support searches concerning languages spoken in a given country by referring to the language database. Coverage should be used geographically only when the language involved has a wide distribution and the resource focuses on its use in a particular region or geopolitical jurisdiction, or conversely, when the resource deals with a topic of study in which the region itself is in focus, e.g., multilingualism, language policy, languages in contact, in a given locale."

Isn't the phrasing of this recommendation rather too restrictive? It seems to enforce use of the Ethnologue/GMI polygons to identify language locales and to discourage collection of any other language locale data. LL (and LL-MAP) do in fact use these polygons and language/country match-ups. But do we want the OLAC usage guidelines to suggest that this is what all service providers should do? What would be the consequences of such standardization? And even if service providers do want to construct their services around a standard set of language locales, what would be the harm in collecting more specific information from archives and individuals? Perhaps we need to discuss this before deciding one way or another. (Helen Aristar-Dry 5-Dec-2007)

I agree with Helen's reservations about relying on Ethnologue for coverage information on specific documents. (Boyd Michailovsky 10-Dec-2007)

The examples in this section don't seem parallel; the second two leave out the Subject element. A small point, but in a usage document the examples should probably be as clear and complete as possible. (Helen Aristar-Dry 5-Dec-2007)

A good example illustrating the difference between coverage and the place of creation is a map of Mexico (what is covered) that was produced in California (not coverage). (Heidi Johnson 8-Jan-2008)

Conclusion: Coverage

We agree with Helen’s comments and have removed the mention of service providers making inferences based on known language-to-country mappings. We have also expanded the examples to include more elements.

8. Proposal for Place Element

There is no obvious way to specify the place where a resource was created. People have told me to use Coverage for this, but first, it's a completely non-transparent interpretation of the term, and second, the guidelines specifically state that Coverage pertains to the intellectual content of the resource:

"Definition: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant."

We should either add a Place element that is parallel in every way to the Date element, or change the definition of Coverage to include this counter-intuitive usage with supporting examples. (Heidi Johnson 5-Dec-2007)

This gap exposes the publication bias of Dublin Core. Nobody cares where a journal article is written, but we do very much care where field recordings are made. Collecting Zapotec vocabulary in Los Angeles is quite different from collecting it in Oaxaca. If we're going to recommend OLAC elements as an adequate basic-level documentation of resources, we have to include all the essential data points. (Heidi Johnson 5-Dec-2007)

A side-question: What about the collection point of a piece of documentation? This isn't "coverage" in the sense of what the subject covers, but it's something we would very much want to know, and sometimes the distinction between subject and collection point blurs (e.g., for Arienne's collection of Mon La songs, where the collection point is part of a map layer whose 'subject' is places where this ceremonial type persists). Is there a place in the OLAC metadata to record the geographic coordinates of the place where a song was collected? Should it be mentioned here, if only as a parenthetical? (I think that cross referencing to emphasize distinctions is helpful to the user, cf. the reference to "Date" as distinguished from temporal coverage.) (Helen Aristar-Dry 5-Dec-2007)

Interesting that we both thought of the collection point problem: point of collection doesn't seem to go in 'Coverage' but we do want to know it. Where should it be indicated? (Helen Aristar-Dry 5-Dec-2007)

I also want to second Heidi's point about Location. We currently map this to OLAC Description, but that's not a very satisfactory solution. Location can be extremely important in sorting out issues such as dialect variation. But an even more important use of this element is for distinguishing displaced language resources, as Heidi points out about Zapotec in Los Angeles. (Although perhaps Zapotec in Paris would be clearer -- there may actually be some Zapotec speaking communities in LA.) The point is that we need a way to exclude these displaced field methods recordings from the search. Coverage does not work because Heidi's Zapotec example still has Oaxaca as its Coverage. (Gary Holton 5-Dec-2007)

I agree with Heidi that place of recording is important, and that it clearly does not belong in the coverage category as defined. But I would keep "coverage" as it is. A "Place" category parallel to the "Date" category would work, but it would mean a new DC category. That leaves "Description" (Gary). This is currently used for free-text information, but could it be stretched to include an OLAC extension for place of recording? (Boyd Michailovsky 10-Dec-2007)

As the DC definitions say, Coverage is for "the spatial or temporal topic of the resource." Thus, when the value is a geocode (like the geospatial coordinates), it is saying that the resource is “about” (in some sense) that place on earth. When the resource is an example of how they talk in that place, then I think that using a geocode as the value of Coverage is completely appropriate. When the place where the resource was collected also happens to be the place where people talk that way, then the Coverage coincides with the place of collection, and since this is the default case, it does not seem necessary to specifically talk about point of collection in the descriptive metadata.

But consider the exceptional case, in which a lifelong resident of X, where an endangered language is spoken, flies to a faraway place Y where recordings are made of that language. The Coverage should still be reported as location X, but in this case the collection point is location Y. People who want to find the resource are going to be looking for X, not Y, which is why Coverage should be X. If the cataloger felt it was necessary to say something about the circumstances of the recording event (including where it took place), that would fall under the umbrella of the Description element. But for the purpose of resource discovery, the record would not even need to mention the collection point. It would be sufficient for this fact to be documented inside the resource, such as in an introduction to the corpus, in much the same way that an author uses the Preface to explain where the book was written. Thus, I think the answer for this issue is to include some discussion like the above in the usage notes for Coverage. (Gary Simons 14-Dec-2007)

There needs to be an example of using the Description field for the Place Created data, in the cases where that's relevant (e.g. recordings.) Or could we get an extension of Description for this? (Heidi Johnson 8-Jan-2008)

Conclusion: Proposal for Place Element

We have added to the usage notes for Coverage to explain how it relates to point of collection, and how Description can be used to add location-related details that do not relate to spatial topic.
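The division of labor between Coverage and Description can be sketched as follows, using the Zapotec example from the thread. This is a minimal illustration under assumptions rather than a normative OLAC record: the "record" wrapper element, the function name, and the wording of the Description text are hypothetical; only the Dublin Core element names and namespace are standard.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace; registering it keeps the "dc" prefix on output.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def describe_recording(spoken_in, recorded_in=None):
    """Build a small record: Coverage holds the spatial topic, and Description
    notes the collection point only when it differs from that topic."""
    # Hypothetical wrapper element, not the actual OLAC container element.
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}coverage").text = spoken_in
    if recorded_in and recorded_in != spoken_in:
        # Hypothetical wording; Description is free text.
        ET.SubElement(record, f"{{{DC}}}description").text = (
            f"Recorded in {recorded_in}."
        )
    return record


# The displaced-recording case: language of Oaxaca, recorded in Los Angeles.
displaced = describe_recording("Oaxaca, Mexico", "Los Angeles, California")
xml_text = ET.tostring(displaced, encoding="unicode")
```

In the default case, where the recording is made where the language is spoken, the function emits Coverage alone, which matches the point that the collection point then needs no separate mention.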

9. Typos

'informaiton' (Nicholas Thieberger 4-Dec-2007)