Sunday, July 19, 2009

Yee: Questions 9-11

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

Question 9
How do we express the arrangement of elements that have a definite order?
Creating order, whether in display or in other system functions, is the job of the application. However, to do so, the information that it needs to create that order must be present in the data. Simple orders, like alphabetical or numeric, are easy to do. Yee, however, gives an example of a non-simple ordering problem:
Could one define a property such as natural language order of forename, surname, middle name, patronymic, matronymic, and/or clan name of a person given that the ideal order of these elements might vary from one person to another?
Well, yes. If you have defined separately all of the elements that you need to take into account in your ordering, then your application can have rules for the order that it uses. If you wish to use more than one set of rules, then you also would have an element for the rule set to be applied.

Many of the problems that we have in today's data are due to the fact that we present the data as a string with the elements in a single sort order, but we don't fully explain what those elements are. Is "Anna Marie" a single forename that happens to be two words, or is her name "Anna" with a middle name "Marie"?

At some point, however, we have to ask ourselves the question: is it worth it to code our data in such great detail? What do we gain from a particular capability of ordering, and what is its value to the users of our catalogs? Is there an easier way to help the users find what they are looking for? Detailed coding of data is expensive, and the cost of precise ordering may be more than the value we obtain from it.

Question 10

How do we link related data elements in such a way that effective indexing and displays are possible?
Yee wants to know how you can say "two oboes and three guitars" in a way that you don't retrieve this item when you search on "two guitars." Again, this isn't directly related to RDF but to the metadata record format you create. When your data is represented just as a character string the only way to prevent the false drop is with a phrase search. That has limitations (e.g. if you search on the phrase "three guitars and two oboes" you won't retrieve the record). With your data coded for machine processing, conceptually as
[instrument = guitar, number = 3]
[instrument = oboe, number = 2]

you can create an application that allows the user to query for the correct number of instruments.

The underlying RDF may not look anything like that example, and that's ok. The application will use the defined RDF entities and properties as it needs. RDF itself should be seen as the building blocks for a metadata record. This means that the element for "instrument" will be defined in RDF as a property that has as its value a selection from a list of terms. The application will create a structure that allows you to input all of the relevant instruments for whatever the metadata is describing, along with a number.

One sentence in this section of Yee's is puzzling, however:
The assumption seems to be that there will be no repeatable data elements.
I think this comes out of a confusion between RDF and the application that uses RDF properties. RDF itself is expressed in what are called "triples." Each triple works like a simple sentence: subject - verb - object. If you have more than one of any of those, you create another triple.
Dick and Jane wrote Fun with Dick and Jane.
becomes two triples, one that says:
Dick wrote Fun with Dick and Jane
Jane wrote Fun with Dick and Jane
This is really no different than creating a bibliographic record with one title field and two author fields. It's just a different way of organizing it under the hood. You actually can take a MARC record and reorganize it as triples.

I think the main point here is that data creators and users may not even be aware that RDF is under the hood. Humans will not be presented with RDF triples -- those are for machines. Only the people creating the systems structures need to be aware of the RDF-ness of the metadata. (Think of this as the difference between programmers who work with fields defined as "character" or "numeric" vs. what users of the data see, such as titles and dates.)

Since RDF uses some fairly abstract concepts, a group of us are working to create design patterns for the most common situations that will be needed to define metadata elements: a simple string; an element that uses a controlled list of terms; etc. These then become the building blocks for metadata element definitions: title will be defined as a string of characters; language of the text will be a term taken from the standard list of languages. Once you have your metadata elements defined then you can begin to build applications.

Question 11

Can a property have a property in RDF?
This is a question about how you create elements like "publisher statement" that themselves contain elements like place, publisher, date of publication. This kind of structure is common in our bibliographic records today. Whether one should create similar structures in RDF is somewhat controversial. One solution is to define your place of publication, date of publication, and publisher as elements, and let the application gather them into a unit as desired. The publisher statement as an element is really just a way to collect them together for display, which could be considered to be the job of the application. By defining your data elements in some detail, there can't be ambiguity between, say, the date of publication and some other date in the same record. However, if you absolutely must gather the elements together as a unit for some reason, then RDF allows you to create something called a "blank node" for that purpose.

Using RDF will require us to rethink some of our data practices. This is hard because we've worked with data that looks like a catalog record for our entire careers. It will be important for future systems that use these re-engineered data elements to present them in an easy-to-understand way to the cataloging community and to catalog users. I'm betting that you could put out an input form that looks exactly like today's MARC record but based on RDA data elements defined in RDF. That wouldn't gain us much in terms of functionality, but the internal guts of the data definitions don't dictate what catalogers or users see on the screen. What we should be looking forward to, though, is what new functionality we can have when we are able to express rich relationships between resources or between persons and resources. Replicating the "old" way of doing things would be a step backward.

1 comment:

Eric said...

I've posted an article about question 11 on my blog. Summary: That's a really good question. Yes.