Friday, July 10, 2009

Yee: Questions 1-2

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

As I mentioned previously, I am going to try to cover each of Martha Yee's questions from her June 2009 ITAL article, "Can Bibliographic Data Be Put Directly onto the Semantic Web?" Here are the first two. As always, these are my answers, which may be incorrect or incomplete, so I welcome discussion of Yee's text as well as of mine. (Martha's article is available here.)

Question 1
Is there an assumption on the part of the Semantic Web developers that a given data element, such as publisher name, should be expressed as either a literal or using a URI ... but never both?
The answer to this is "no," and is explained in greater detail in my post on RDF basics.
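To make that concrete, here is a minimal sketch in Python using the rdflib library (the vocabulary and the identifiers are invented for illustration). Nothing in RDF prevents the same property from taking a literal value in one statement and a URI in another; both triples below are legal RDF.

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/terms/")     # hypothetical vocabulary
book = URIRef("http://example.org/book/123")    # hypothetical resource

g = Graph()
# One statement records a literal string...
g.add((book, EX.publisher, Literal("Harcourt, Brace and Company")))
# ...and another records a URI for a publisher entity; RDF allows both.
g.add((book, EX.publisher, URIRef("http://example.org/corp/harcourt")))

print(g.serialize(format="turtle"))
```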

Yee goes on, however, to state that there is value in distinguishing the following types of data:
  • Copied as is from an artifact (transcribed)

  • Supplied by a cataloger

  • Categorized by a cataloger (controlled)
She then says that
"For many data elements, therefore it will be important to be able to record both a literal (transcribed or composed form or both) and a URI (controlled form)."
This distinction between types of data is important, and is one that we haven't made successfully in our current cataloging data. The example I usually give is that of the publisher name in the publisher statement area. Unless you know library cataloging, you might assume that is a controlled name that could be linked to, for example, a Publisher entity in a data model. That's not the case. The publisher name is a sort-of transcribed element, with a lot of cataloger freedom to not record it exactly as it appears. If we want to represent a publisher entity, we need to add it to our data set. There are various possible ways to do this. One would be to declare a publisher property that has a URI that identifies the publisher, and a literal that carries the sort-of transcribed element. But remember that there are two kinds of literals in Yee's list: transcribed and cataloger supplied. So a property that can take both a URI and a literal is still not going to allow us to make that distinction.

A better way to look at this is perhaps to focus more on the meaning of the properties that you wish to use to describe your resource. The transcribed publisher, the cataloger supplied publisher, and the identifier for the corporate body that is the publisher of the resource -- are these really the same thing? You may eventually wish to display them in the same area of your display, but that does not make them semantically the same. For the sake of clarity, if you have a need to distinguish between these different meanings of "publisher", then it would be best to treat them as three separate properties (a.k.a. "data elements").
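Here is a rough sketch of what that separation might look like, again with made-up property names (publisherTranscribed, publisherSupplied, publisherEntity) standing in for whatever a real vocabulary would define. A display program can still assemble a single publisher line from whichever of these is present, but a program that wants to follow a link to the publisher no longer has to guess which value is which.

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/terms/")     # hypothetical vocabulary
book = URIRef("http://example.org/book/123")

g = Graph()
# Transcribed from the item in hand, warts and all.
g.add((book, EX.publisherTranscribed, Literal("Harcourt, Brace & Co.")))
# Composed by the cataloger when nothing usable appears on the item.
g.add((book, EX.publisherSupplied, Literal("[Harcourt, Brace]")))
# A link to the corporate body itself, usable for machine linking.
g.add((book, EX.publisherEntity, URIRef("http://example.org/corp/harcourt")))
```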

Paying attention to the meaning of the property and the functionality that you hope to obtain with your data can go a long way toward sorting out these areas where you are dealing with what looks like a single complex data element. In library data that was meant primarily for display, making these distinctions was less important, and we have numerous data elements whose values vary considerably in form or that were expected to perform more than one function. Look at the wide range of uniform titles, from a simple common title ("Hamlet") to the complex structured titles for music and biblical works. Or look at how the controlled main author heading functions at once as a display form, a sort key, and a link to an authority record. There will be a limit to how precise data can be, but some of our traditional data elements may need a more rigorous definition to support new system functionality.

Question 2

Will the Internet ever be fast enough to assemble the equivalent of our current records from a collection of hundreds or even thousands of URIs?
I answered this in that same post, but would like to add what I think we might be doing with controlled lists in near-future systems. What we generally have today is a text document online that is updated by the relevant maintenance agency. The documents are human-readable, and updates generally require someone in the systems area of the library or vendor's support group to add new entries to the list. This is very crude considering the capabilities of today's technology.

I am assuming that in the future controlled lists will be available in a known and machine-actionable format (such as SKOS). With our lists online and in a coded form, the data could be downloaded automatically by library systems on a periodic basis (monthly, weekly, nightly -- depending on the type of list and the needs of the community). The downloaded file could be processed into the library system without human intervention. The download could include the list term, display options, any available definitions, and the date on which the term becomes operational. Managing this kind of update is no different from what many systems already do to receive updated bibliographic records from LC or from other producers.
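As a sketch of how mechanical this could be, here is roughly what such a nightly (or weekly, or monthly) job might do, assuming the maintenance agency publishes its list as a SKOS file at a known address (the URL here is invented):

```python
from rdflib import Graph
from rdflib.namespace import RDF, SKOS

# Hypothetical address where a maintenance agency publishes its list.
LIST_URL = "http://example.org/vocab/carrier-types.rdf"

g = Graph()
g.parse(LIST_URL)  # fetch and parse the published SKOS file

for concept in g.subjects(RDF.type, SKOS.Concept):
    label = g.value(concept, SKOS.prefLabel)
    definition = g.value(concept, SKOS.definition)
    # A real system would insert or update its local term tables here.
    print(concept, label, definition)
```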

The use of SKOS or something functionally similar can give us advantages over what we have today: alternate display forms in different languages, links to cataloger documentation that could be incorporated into workstation software, and versioning and history that would make it easier to process records created in different eras.
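For instance, because SKOS labels carry language tags, a system could choose the display form that matches its interface language and fall back to English when no translation exists -- something like this sketch (same invented list as above):

```python
from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
g.parse("http://example.org/vocab/carrier-types.rdf")  # same hypothetical list

def display_label(concept, lang):
    """Prefer a label in the interface language, falling back to English."""
    labels = {lbl.language: str(lbl) for lbl in g.objects(concept, SKOS.prefLabel)}
    return labels.get(lang) or labels.get("en")

for concept in g.subjects(RDF.type, SKOS.Concept):
    print(display_label(concept, "fr"))
```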

There could be similar advantages to be gained by using identifiers for what today we call "authority data." That's a bit more complex however, so I won't try to cover it in this short post. It's a great topic for a future discussion.

2 comments:

arkham said...

Some of what you're talking about here in the answer to #2 sounds a lot like what the presenters at the ALA preconference on RDA, FRBR, and FRAD were discussing. They didn't talk about SKOS at all, and I think in some cases they were thinking of the links as being within a closed system (i.e., one of the ubiquitous library silos), but the concepts seem very similar.

Do you see RDA as having any bearing on making the data more available for this kind of semantic linkage, or do you think RDA is not really going to help get us there?

Karen Coyle said...

Arkham -- RDA (the rules the JSC authored) won't itself get us to linked data, but the changes we will have to make to our data formats and systems to accommodate RDA give us an opportunity to move our data toward compatibility with linked data. RDA encourages that somewhat because it embraces FRBR, and FRBR introduces an entity-relationship view of library data.

In particular, the change gives us a chance to scrutinize our data elements in light of desired linking functions. Actually, just looking at our data elements as DATA, rather than as the pieces of a document that describes a bibliographic resource, would be a step in the right direction. For example, title is more than what we put on the title line of our displays; it could be a key link to other resources such as Amazon or Wikipedia. And author should be seen as a very rich link, going not only to biographies but also (for living authors) to web pages, fan sites, etc. This is the change in thinking that will help us link.

I don't know if anyone mentioned it at the pre-conference, but RDA online is going to link to the Metadata Registry, which has registered each RDA data element and the RDA controlled lists in a linked-data-compliant format (either RDF or SKOS). It only sort-of works, because the RDA elements weren't really designed as linked data components, so I think there will be some evolution before we are producing actual linked data, but it's a start and a way to experiment. Go to the registry and you can view the RDA elements in various formats, including RDF/XML.
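If you want to experiment, the same few lines of rdflib code shown in the post work here too: point the parser at the RDF/XML that the registry serves (the URL below is only a stand-in -- get the real address from the registry itself) and list the registered elements with their labels.

```python
from rdflib import Graph
from rdflib.namespace import RDF, RDFS

g = Graph()
# Stand-in address: substitute the actual RDF/XML URL from the registry.
g.parse("http://example.org/rda/elements.rdf", format="xml")

for prop in g.subjects(RDF.type, RDF.Property):
    print(prop, g.value(prop, RDFS.label))
```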