Coyle's InFormation: 07/01/2009

Monday, July 20, 2009

Yee: Questions 12-13

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

Question 12

How do we document record display decisions?

In a sense we've covered this in the answers to other questions, but to reiterate, in addition to the data elements (called "properties" in RDF) you will need to develop one or more record formats. The record formats will provide the application with the information that is needed to produce the desired displays. I'm assuming that most display rules will be implemented in the application, but I'd be interested in hearing other ideas.

Question 13

Can all bibliographic data be reduced to either a class or a property with a finite list of values?

The answer to this is "no," but I think there's a misunderstanding here about the RDF "class." Table 2 on page 64 of Yee's article equates the RDFS "Class" with RDF "Subject" and I don't think that this is correct. As I understand Class in RDF it has a function somewhat like abstract classes in object-oriented programming: it essentially is the umbrella for a group of like things, but itself never has an actual value. Think of classes as the upper levels of a hierarchy where only the bottom elements actually are filled in with real "stuff." In the Dublin Core Metadata Elements there is a class called Agent. Particular properties like rightsHolder or creator are members of the Agent class. Agent itself isn't a property, it's an organizing feature.

That said, and I can't claim I understand it fully, I'm still not sure if the FRBR entities work as classes. In some cases, like with Person, it seems to work, but in others, like WEMI, I'm less sure.

Back to answering the question: I think we'll have the following types of properties in our library metadata:

1. plain strings, like the transcribed title or a note

2. formatted strings, like dates

3. controlled lists of values, like language lists or media type lists

Then we have one other type of data, and that is where we select a display form from what today we call an "authority record." This is often considered the same as #3 in the above list, but I think there is a significant difference because an authority record is more than a term in a list: it is a rich information resource of its own. This harks back to Yee's question #5 about using cross references in authority control. Yee asks: "how will we design our systems to take advantage of the richness of authority control?" while my question is "how can we design authority control so that systems can make use of it?"

Sunday, July 19, 2009

Yee: Questions 9-11

Question 9

How do we express the arrangement of elements that have a definite order?

Creating order, whether in display or in other system functions, is the job of the application. However, to do so, the information that it needs to create that order must be present in the data. Simple orders, like alphabetical or numeric, are easy to do. Yee, however, gives an example of a non-simple ordering problem:

Could one define a property such as natural language order of forename, surname, middle name, patronymic, matronymic, and/or clan name of a person given that the ideal order of these elements might vary from one person to another?

Well, yes. If you have defined separately all of the elements that you need to take into account in your ordering, then your application can have rules for the order that it uses. If you wish to use more than one set of rules, then you also would have an element for the rule set to be applied.

Many of the problems that we have in today's data are due to the fact that we present the data as a string with the elements in a single sort order, but we don't fully explain what those elements are. Is "Anna Marie" a single forename that happens to be two words, or is her name "Anna" with a middle name "Marie"?
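As a minimal sketch of that idea: the element names, the rule-set labels, and the sample names below are all hypothetical, chosen only to show separately coded elements plus a recorded rule set driving the order.

```python
# Hypothetical rule sets: each lists the order in which separately coded
# name elements are assembled for display.
ORDER_RULES = {
    "given_first": ["forename", "middle_name", "surname"],
    "family_first": ["surname", "forename"],
}


def display_name(person):
    """Assemble a display form using the rule set recorded with the name."""
    order = ORDER_RULES[person["order_rule"]]
    # Skip elements that this particular person does not have.
    parts = [person[element] for element in order if element in person]
    return " ".join(parts)


# "Anna Marie" is coded as a single two-word forename, not forename + middle.
anna = {"forename": "Anna Marie", "surname": "Smith", "order_rule": "given_first"}
wei = {"forename": "Wei", "surname": "Li", "order_rule": "family_first"}

print(display_name(anna))  # Anna Marie Smith
print(display_name(wei))   # Li Wei
```

The application never has to guess where the forename ends, because the data says so.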

At some point, however, we have to ask ourselves the question: is it worth it to code our data in such great detail? What do we gain from a particular capability of ordering, and what is its value to the users of our catalogs? Is there an easier way to help the users find what they are looking for? Detailed coding of data is expensive, and the cost of precise ordering may be more than the value we obtain from it.

Question 10

How do we link related data elements in such a way that effective indexing and displays are possible?

Yee wants to know how you can say "two oboes and three guitars" in a way that you don't retrieve this item when you search on "two guitars." Again, this isn't directly related to RDF but to the metadata record format you create. When your data is represented just as a character string the only way to prevent the false drop is with a phrase search. That has limitations (e.g. if you search on the phrase "three guitars and two oboes" you won't retrieve the record). With your data coded for machine processing, conceptually as

[instrument = guitar, number = 3]
[instrument = oboe, number = 2]

you can create an application that allows the user to query for the correct number of instruments.
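A rough sketch of such a query, assuming a made-up in-memory store and record ids (nothing here reflects a real system or the underlying RDF):

```python
# Each record codes its instrumentation as (instrument, number) pairs
# rather than as a single display string.
records = {
    "rec1": [("guitar", 3), ("oboe", 2)],
    "rec2": [("guitar", 2)],
}


def find(records, instrument, number):
    """Return ids of records that call for exactly `number` of `instrument`."""
    return sorted(
        rec_id
        for rec_id, parts in records.items()
        if (instrument, number) in parts
    )


print(find(records, "guitar", 2))  # ['rec2']
print(find(records, "guitar", 3))  # ['rec1']
```

A search on "two guitars" no longer drops falsely on rec1, and word order in the user's query stops mattering.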

The underlying RDF may not look anything like that example, and that's ok. The application will use the defined RDF entities and properties as it needs. RDF itself should be seen as the building blocks for a metadata record. This means that the element for "instrument" will be defined in RDF as a property that has as its value a selection from a list of terms. The application will create a structure that allows you to input all of the relevant instruments for whatever the metadata is describing, along with a number.

One sentence in this section of Yee's is puzzling, however:

The assumption seems to be that there will be no repeatable data elements.

I think this comes out of a confusion between RDF and the application that uses RDF properties. RDF itself is expressed in what are called "triples." Each triple works like a simple sentence: subject - verb - object. If you have more than one of any of those, you create another triple.

Dick and Jane wrote Fun with Dick and Jane.

becomes two triples:

Dick wrote Fun with Dick and Jane
Jane wrote Fun with Dick and Jane

This is really no different than creating a bibliographic record with one title field and two author fields. It's just a different way of organizing it under the hood. You actually can take a MARC record and reorganize it as triples.
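To make the equivalence concrete, here is a small sketch that moves between the two shapes. The record layout and the "wrote" verb are illustrative assumptions, not MARC and not a real RDF vocabulary.

```python
# One bibliographic "record" with one title field and two author fields.
record = {
    "title": "Fun with Dick and Jane",
    "authors": ["Dick", "Jane"],
}


def to_triples(record):
    """Emit one (subject, verb, object) triple per author."""
    return [(author, "wrote", record["title"]) for author in record["authors"]]


def to_record(triples):
    """Regroup triples about a single title into the record shape."""
    titles = {obj for _, _, obj in triples}
    if len(titles) != 1:
        raise ValueError("this sketch handles a single title only")
    return {"title": titles.pop(), "authors": [subj for subj, _, _ in triples]}


triples = to_triples(record)
for triple in triples:
    print(triple)
```

Converting back with to_record(triples) gives the original record, which is the sense in which the two are the same data organized differently.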

I think the main point here is that data creators and users may not even be aware that RDF is under the hood. Humans will not be presented with RDF triples -- those are for machines. Only the people creating the systems structures need to be aware of the RDF-ness of the metadata. (Think of this as the difference between programmers who work with fields defined as "character" or "numeric" vs. what users of the data see, such as titles and dates.)

Since RDF uses some fairly abstract concepts, a group of us are working to create design patterns for the most common situations that will be needed to define metadata elements: a simple string; an element that uses a controlled list of terms; etc. These then become the building blocks for metadata element definitions: title will be defined as a string of characters; language of the text will be a term taken from the standard list of languages. Once you have your metadata elements defined then you can begin to build applications.

Question 11

Can a property have a property in RDF?

This is a question about how you create elements like "publisher statement" that themselves contain elements like place, publisher, date of publication. This kind of structure is common in our bibliographic records today. Whether one should create similar structures in RDF is somewhat controversial. One solution is to define your place of publication, date of publication, and publisher as elements, and let the application gather them into a unit as desired. The publisher statement as an element is really just a way to collect them together for display, which could be considered to be the job of the application. By defining your data elements in some detail, there can't be ambiguity between, say, the date of publication and some other date in the same record. However, if you absolutely must gather the elements together as a unit for some reason, then RDF allows you to create something called a "blank node" for that purpose.
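A sketch of the blank node approach, using plain tuples rather than an RDF library. The property names and the sample values are hypothetical; the "_:" prefix follows the usual convention for writing a blank node label.

```python
# "_:pub1" is a blank node: it groups the statement's parts but has no
# identifier of its own outside this data. Property names are made up.
triples = [
    ("book1", "publicationStatement", "_:pub1"),
    ("_:pub1", "place", "New York"),
    ("_:pub1", "publisherName", "Example Press"),
    ("_:pub1", "date", "2009"),
]


def gather(triples, subject, prop):
    """Collect the elements grouped under the node that `prop` points to."""
    node = next(o for s, p, o in triples if s == subject and p == prop)
    return {p: o for s, p, o in triples if s == node}


statement = gather(triples, "book1", "publicationStatement")
print(statement)
```

The alternative described above would simply attach place, publisherName, and date to book1 directly and leave the grouping to the display code.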

Using RDF will require us to rethink some of our data practices. This is hard because we've worked with data that looks like a catalog record for our entire careers. It will be important for future systems that use these re-engineered data elements to present them in an easy-to-understand way to the cataloging community and to catalog users. I'm betting that you could put out an input form that looks exactly like today's MARC record but based on RDA data elements defined in RDF. That wouldn't gain us much in terms of functionality, but the internal guts of the data definitions don't dictate what catalogers or users see on the screen. What we should be looking forward to, though, is what new functionality we can have when we are able to express rich relationships between resources or between persons and resources. Replicating the "old" way of doing things would be a step backward.

Yee: Questions 6-8

(Martha's article is available here.)

Question 6

To recognize the fact that the subject of a book or a film could be a work, a person, a concept, an object, an event, or a place (all classes in the model), is there any reason we cannot define subject itself as a property (a relationship) rather than a class in its own right?

As I said in my earlier post, this makes perfect sense to me. In fact, we need to accept that FRBR, while it holds many important concepts, probably needs to be revised in light of the advances in thinking that have taken place since it was first developed in the late 1990s. FRBR is not RDF-compliant, and it has some vestiges of the record and database concept that guided thinking in the past.

This was one of the recommendations of the report on the future of bibliographic control: that we are putting ourselves in a dangerous position basing RDA on FRBR when it does not appear that FRBR has been thoroughly scrutinized, much less tested. You could say, however, that RDA is the test of FRBR, but that means that we must be prepared to do some revision based on what we learn when trying to use RDA for a FRBR view of bibliographic data.

Question 7

How do we distinguish between the corporate behavior of a jurisdiction and the subject behavior of a geographical location?

and...

To distinguish between the corporate behavior of a jurisdiction and the subject behavior of a geographical location, I have defined two different classes for place: Place as Jurisdictional Corporate Body and Place as Geographic Area.

This isn't directly related to RDF, but it's an interesting example of how one can approach the definition of metadata elements. I agree with Martha that jurisdictions and delineated areas on the planet are different entities. For data that is destined to be interpreted by humans, you can talk about, for example, "California state government" and "California rivers" without having to distinguish between political entity and geography. As we read those phrases we adjust our thinking accordingly. But for processing by machine, it is necessary to provide the information that humans derive automatically from the context or their own knowledge.

Political entities are a particularly interesting problem because 1) they are often entirely or somewhat contiguous with geographic entities and may commonly be called by the same name, and 2) they can have different meanings at different times. Saying "Louisiana" when referring to an area in 1810 means something very different from the state of Louisiana that was formed in 1812. Geologic areas also have a time component, but they are much less volatile -- their changes take place in "geologic time."

When we relied entirely on humans to interpret our data, we could create data elements that depended on context and the human ability to read the data in that context. The more that we move toward machine processing of our data, and toward interaction between programs, the more we need to be precise in the definition of our data elements. We can see this somewhat in RDA, where work titles are defined differently from titles of expressions. In our MARC records, we treat these simply as titles, and assume that the people looking at our displays will make sense of them. Sense, of course, is exactly what a computer does not have, so there is an extra burden on us to be clear about our meanings.

Question 8

What is the best way to model a bound-with or an issued-with relationship, or a part-whole relationship in which the whole must be located to obtain the part?

This is primarily a question about FRBR and RDA, but it is also an opportunity to think about how we might use relationships in future systems. The problem with bound-with is that of the logical entity (a book, a journal issue, a pamphlet) and the physical entity that the library holds. In today's catalog, we don't have a way to create relationships between catalog records -- "bound with" becomes a note. In FRBR "bound with" is an item-to-item relationship. Having a way to code explicit relationships between entities should make it possible to help users navigate our catalogs.

Thursday, July 16, 2009

Yee: Questions 3-5

Question 3

Is RDF capable of dealing with works that are identified using their creators?

Yee goes on to say:

We need to treat author as both an entity in its own right and as a property of a work.... Is RDF capable of supporting the indexing necessary to allow a user to search using any variant of the author's name and any variant of the title of a work... etc.

I'm not entirely sure of the point of these questions, but they appear to me to be mainly about applications and system design, not RDF, which is the same advice she has gotten from others. Let me say that as I understand RDF, it is particularly suited to allowing entities like author to be used in a variety of relationships. So a person entity can be the author of one book, the illustrator of another, and the translator of yet another. But there's something here about "identifying a work using the creator" and I think that is entirely a question of how we decide to identify works, and is unrelated to the capabilities of RDF.

The identification of all of the FRBR Group 1 entities raises many interesting questions. The fact is that we do not have a real identifier for any of them, with the possible exception of the barcodes that libraries place on items. But Works, Expressions and Manifestations are lacking in true identifiers. As Jonathan Rochkind has pointed out, we use identifiers like OCLC numbers and LCCNs as pseudo-identifiers for manifestations because most of the time they work pretty well. Many systems rely heavily on ISBNs, which work reasonably well for modern published books and have the advantage of being printed on the books themselves, thus making a connection between the physical object and the metadata. Other than that, though, we're not very well set as far as identifiers go.

Yee talks about the use of the main author + title (or uniform title) as a work identifier, but even those are not a true identifier for the Work, at least not in the sense of a URI. As long as we rely on display forms we won't have an identifier that we can share with anyone whose author or title display may vary from ours (and even within the AACR2 community, there are differences in choices about names and a great gap in the actual use of uniform titles). It should be possible to create an authority-type record for name/title pairs that would include the variants from different practices, and assign a single identifier for it. But we have to stop thinking that we can create identifiers out of display forms -- that's not going to allow us to share our data outside of a tightly controlled cataloging tradition.

What I also read here is a frustration that our current systems do not produce a linear display that is analogous to the display in the card catalog (and is one of the goals of our cataloging practices). I'll pose my own question here, which is: can we create a system design that imitates the linear card catalog and at the same time provides us with the Catalog/Web 2.0 features that some members of our community desire? If not, how do we resolve these apparently conflicting requirements? (BTW, Beth Jefferson of Bibliocommons gave a talk at ALA in which she said that in their usability research, users invariably disliked -- or even hated -- the linear alphabetic display that so many librarians find necessary. I believe that statistics show that the browse function in current catalogs is seldom used. I suspect that most use is by library staff.)

Question 4

Do all possible inverse relationships need to be expressed explicitly, or can they be inferred?

If they are truly reciprocal, they can be inferred. It will require rules (the reciprocal of "parent of" is "child of"; the reciprocal of "is author of" is "has author"). How this is handled internally in applications is something else, that is, whether they create the inverse relationships in local storage or are able to traverse them in any direction using rules on the fly. But I see no need to create the inverse relationships in one's metadata standard.
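A minimal sketch of rule-based inference, assuming a hand-written table of reciprocal pairs (the relationship names are illustrative, not drawn from any standard):

```python
# Rules pairing each relationship with its reciprocal.
INVERSES = {"is author of": "has author", "parent of": "child of"}
# Make the table symmetric so either direction can be looked up.
INVERSES.update({v: k for k, v in list(INVERSES.items())})


def with_inferred(triples):
    """Return the stated triples plus every reciprocal the rules allow."""
    result = set(triples)
    for subj, rel, obj in triples:
        if rel in INVERSES:
            result.add((obj, INVERSES[rel], subj))
    return result


stated = [("Herman Melville", "is author of", "Moby Dick")]
inferred = with_inferred(stated)
for triple in sorted(inferred):
    print(triple)
```

Only the one stated triple needs to live in the metadata. Whether an application stores the inferred reciprocal or computes it on demand is an implementation choice.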

Question 5

Can RDF solve the problems we are having now because of the lack of transitivity or inheritance in the data models that underlie current ILSes, or will RDF merely perpetuate these problems?

I answer this (first post, my #3) when I talk about the inconsistencies in authority data that make it very hard to make the appropriate inferences about relationships between data elements. It is possible that we could use RDF as the basis of our data and create these same ambiguities, but I hope that we will use the opportunity of moving to a new set of rules and a new data format to correctly restructure our data so that it does have the functionality we want.

Friday, July 10, 2009

Yee: Questions 1-2

As I mentioned previously, I am going to try to cover each of Martha Yee's questions from her ITAL article of June, 2009. The title of the article is: "Can Bibliographic Data Be Put Directly onto the Semantic Web?" Here are the first two. As always, these are my answers, which may be incorrect or incomplete, so I welcome discussion both of Yee's text and of mine. (Martha's article is available here.)

Question 1

Is there an assumption on the part of the Semantic Web developers that a given data element, such as publisher name, should be expressed as either a literal or using a URI ... but never both?

The answer to this is "no," and is explained in greater detail in my post on RDF basics.

Yee goes on, however, to state that there is value in distinguishing the following types of data:

Copied as is from an artifact (transcribed)

Supplied by a cataloger

Categorized by a cataloger (controlled)

She then says that

"For many data elements, therefore it will be important to be able to record both a literal (transcribed or composed form or both) and a URI (controlled form)."

This distinction between types of data is important, and is one that we haven't made successfully in our current cataloging data. The example I usually give is that of the publisher name in the publisher statement area. Unless you know library cataloging, you might assume that is a controlled name that could be linked to, for example, a Publisher entity in a data model. That's not the case. The publisher name is a sort-of transcribed element, with a lot of cataloger freedom to not record it exactly as it appears. If we want to represent a publisher entity, we need to add it to our data set. There are various possible ways to do this. One would be to declare a publisher property that has a URI that identifies the publisher, and a literal that carries the sort-of transcribed element. But remember that there are two kinds of literals in Yee's list: transcribed and cataloger supplied. So a property that can take both a URI and a literal is still not going to allow us to make that distinction.

A better way to look at this is perhaps to focus more on the meaning of the properties that you wish to use to describe your resource. The transcribed publisher, the cataloger supplied publisher, and the identifier for the corporate body that is the publisher of the resource -- are these really the same thing? You may eventually wish to display them in the same area of your display, but that does not make them semantically the same. For the sake of clarity, if you have a need to distinguish between these different meanings of "publisher", then it would be best to treat them as three separate properties (a.k.a. "data elements").
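One way this could look in practice, as a sketch only: the three property names, the sample values, and the example.org URI are all invented for illustration.

```python
# Three distinct properties instead of one overloaded "publisher" element.
description = {
    "publisherTranscribed": "Example Press, Inc.",   # copied from the item
    "publisherSupplied": "[Example Press]",          # supplied by a cataloger
    "publisherEntity": "http://example.org/org/example-press",  # controlled
}


def publisher_display(desc):
    """Prefer the transcribed form for display, then the supplied form."""
    return desc.get("publisherTranscribed") or desc.get("publisherSupplied")


def publisher_link(desc):
    """Only the entity property is used for linking; it may be absent."""
    return desc.get("publisherEntity")


print(publisher_display(description))  # Example Press, Inc.
print(publisher_link(description))
```

The display can still bring the values together in one area, but each function draws on the property whose meaning actually supports it.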

Paying attention to the meaning of the property and the functionality that you hope to obtain with your data can go a long way toward solving some of these areas where you are dealing with what looks like a single complex data element. In library data that was meant primarily for display, making these distinctions was less important, and we have numerous instances of data elements that could either have values that aren't exactly alike or that were expected to perform more than one function. Look at the wide range of uniform titles, from a simple common title ("Hamlet") to the complex structured titles for music and biblical works. Or how the controlled main author heading functions as display, enforcement of sort order, and link to an authority record. There will be a limit to how precise data can be, but some of our traditional data elements may need a more rigorous definition to support new system functionality.

Question 2

Will the Internet ever be fast enough to assemble the equivalent of our current records from a collection of hundreds or even thousands of URIs?

I answered this in that same post, but would like to add what I think we might be doing with controlled lists in near-future systems. What we generally have today is a text document online that is updated by the relevant maintenance agency. The documents are human-readable, and updates generally require someone in the systems area of the library or vendor's support group to add new entries to the list. This is very crude considering the capabilities of today's technology.

I am assuming that in the future controlled lists will be available in a known and machine-actionable format (such as SKOS). With our lists online and in a coded form, the data could be downloaded automatically by library systems on a periodic basis (monthly, weekly, nightly -- it would depend on the type of list and needs of the community). The downloaded file could be processed into the library system without human intervention. The download could include the list term, display options, any definitions that are available, and a date on which the term becomes operational. Management of this kind of update is no different to what many systems do today to receive updated bibliographic records from LC or from other producers.

The use of SKOS or something functionally similar can give us advantages over what we have today. It could provide alternate display forms in different languages, links to cataloger documentation that could be incorporated into workstation software, and it could provide versioning and history so that it would be easier to process records created in different eras.
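To illustrate the kind of processing involved, here is a sketch that stands in for a downloaded list with plain Python data. A real SKOS file would carry similar information as RDF. The field names, the "zzz" code, and the dates are assumptions of this sketch.

```python
from datetime import date

# Entries as they might look after parsing a downloaded list.
downloaded = [
    {"code": "eng", "labels": {"en": "English", "fr": "anglais"},
     "operational": date(2009, 1, 1)},
    {"code": "zzz", "labels": {"en": "Not yet valid term"},
     "operational": date(2030, 1, 1)},
]


def load_terms(entries, today):
    """Keep only the terms that are already operational on `today`."""
    return {e["code"]: e["labels"] for e in entries if e["operational"] <= today}


def label(terms, code, lang, fallback="en"):
    """Pick the display form for `lang`, falling back to English."""
    labels = terms[code]
    return labels.get(lang, labels[fallback])


terms = load_terms(downloaded, today=date(2009, 7, 10))
print(label(terms, "eng", "fr"))  # anglais
print(label(terms, "eng", "de"))  # English
```

Nothing in this flow needs a person to retype the list. The term with a future operational date is simply held back until that date arrives.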

There could be similar advantages to be gained by using identifiers for what today we call "authority data." That's a bit more complex however, so I won't try to cover it in this short post. It's a great topic for a future discussion.

Tuesday, July 07, 2009

Yee on RDF and Bibliographic Data

I've been thinking for a while about how I could respond to some of the questions in Martha Yee's recent article in Information Technology and Libraries (June 2009 - pp. 55-80). Even the title is a question: "Can Bibliographic Data Be Put Directly onto the Semantic Web?" (Answer: it already is.) Martha is conducting an admirable gedanken experiment about the future of cataloging, creating her own cataloging code and trying to mesh her ideas with concepts coming out of the semantic web community. The article's value is not only in her conclusions but in the questions that she raises. In its unfinished state, Martha's thinking is provocative and just begging for further discussion and development. (Note: I hope Martha is allowed to put her article online, because otherwise access is limited to LITA members.) (Martha's article is available here.)

The difficulty that I am having at the moment is that it appears to me that there are some fundamental misunderstandings in Yee's attempt to grapple with an RDF model for library data. In addition, she is trying to work with FRBR and RDA, both of which have some internal inconsistencies that make a rigorous analysis difficult. (In fact, Yee suggests an improvement to FRBR that I think IFLA should seriously consider, and that is that subject in FRBR should be a relationship, and that the entities in Group 3 should be usable in any relevant situation, not just as subjects. p. 66, #6. After that, maybe they'll consider my similar suggestion regarding the Group 1 entities.)

I'm trying to come up with an idea of how to chunk Yee's questions so that we can have a useful but focused discussion.

I'm going to try to begin this with a few very basic statements that are based on my understanding of the semantic web. I do not consider myself an expert in RDF, but I also suspect that there are few real experts among us. If any of you reading this want to disagree with me, or chime in with your own favorite "RDF basics," please do.

1. RDF is not a record format; it isn't even a data format

Those of us in libraries have always focused on the record -- essentially a complex document that acts as a catalog surrogate for a complex thing, such as a book or a piece of recorded music. RDF says nothing about records. All that RDF says is that there is data that represents things and there are relationships between those things. What is often confusing is that anything can be an RDF thing, so the book, the author, the page, the word on the page -- if you wish, any or all of these could be things in your universe.

Many questions that I see in library discussions of the possible semantic web future are about records and applications: Will it be possible to present data in alphabetical order? What will be displayed? None of these are directly relevant to RDF. Instead, they are questions about the applications that you build out of your data. You can build records and applications using data that has "RDF Nature." These records and applications may look different from the ones we use today, and they may provide some capabilities in terms of linking and connecting data that we don't have today, but if you want your application to do it, it should be possible to do it using data that follows the RDF model. However, if you want to build systems that do exactly what today's library systems do, there isn't much reason to move to semantic web technology.

2. A URI is an identifier; it identifies

There is a lot of angst in the library world about using URI-structured identifiers for things. The concern is mainly that something like "Mark Twain" will be replaced with "http://id.loc.gov/authorities/n79021164" in library data, and that users will be shown a bibliographic record that goes like:

http://id.loc.gov/authorities/n79021164
Adventures of Tom Sawyer

or will have to wait for half an hour for their display because the display form must be retrieved from a server in Vanuatu. This is a misunderstanding about the purpose of using identifiers. A URI is not a substitute for a human-readable display form. It is an identifier. It identifies. Although my medical plan may identify me as p37209372, my doctor still knows me as Karen. The identifier, however, keeps me distinct from the many other Karens in the medical practice. Whether or not your application carries just identifiers in its data, carries an identifier and a preferred display form, or an identifier and some number of different display forms (e.g. in different languages) is up to the application and its needs. The point is that the presence of an identifier does not preclude having human-readable forms in your data record or database.

So why use identifiers? An identifier gives you precision in the midst of complexity. Author n79021164 may be "Mark Twain" to my users, and "Ma-kʻo Tʻu-wen" to someone else's, but we will know it is the same author if we use the same identifier. And Pluto the planet-like object will have a different identifier from Pluto the animated character because they are different things. It doesn't matter that they have the same name in some languages. The identifier is not intended for human consumption, but is needed because machines are not (yet?) able to cope with the ambiguities of natural language. Using identifiers it becomes possible for machines to process statements like "Herman Melville is the author of Moby Dick" without understanding one word of what that means. If Melville is A123 and Moby Dick is B456 and authorship is represented by x->, then a machine can answer a question like: "what are all of the entities with A123 x->?", which to a human translates to: "What books did Herman Melville write?"
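The Melville example can be sketched in a few lines. A123, B456, and "x->" are the stand-ins from the paragraph above; B789 and the label table are additions made for this illustration.

```python
# Statements use only identifiers; human-readable labels live elsewhere.
statements = [
    ("A123", "x->", "B456"),
    ("A123", "x->", "B789"),
]
labels = {"A123": "Herman Melville", "B456": "Moby Dick", "B789": "Typee"}


def objects_of(statements, subject, relation):
    """Answer: what are all of the entities with <subject> <relation>?"""
    return [o for s, r, o in statements if s == subject and r == relation]


ids = objects_of(statements, "A123", "x->")
print(ids)                       # ['B456', 'B789']
print([labels[i] for i in ids])  # ['Moby Dick', 'Typee']
```

The matching is done entirely on identifiers. The labels are looked up only at the end, when something has to be shown to a person.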

As we know from our own experience, creating identities is tricky business. As we rely more on identifiers, we need to be aware of how important it is to understand exactly what an identifier identifies. When a library creates an authority record for "Twain, Mark," it may appear to be identifying a person; in fact, it is identifying a "personal author," who can be the same as a person, but could be just one of many names that a natural person writes under, or could be a group of people who write as a single individual. This isn't the same definition of person that would be used by, for example, the IRS or your medical plan. We can also be pretty sure that, barring a miracle, we will not have a situation where everyone agrees on one single identifier or identifier system, so we will need switching systems that translate from one identifier space to another. These may work something like xISBN, where you send in one identifier and you get back one or more identifiers that are considered equivalent (for some definition of "equivalent").
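A switching service of that kind might be sketched as follows. This does not describe how xISBN works; the scheme prefixes and every identifier other than n79021164 are invented.

```python
# Sets of identifiers that some agency has declared equivalent.
EQUIVALENCE_SETS = [
    {"lc:n79021164", "schemeB:0001", "schemeC:tw-42"},
]


def translate(identifier):
    """Return the other identifiers considered equivalent to `identifier`."""
    for group in EQUIVALENCE_SETS:
        if identifier in group:
            return sorted(group - {identifier})
    return []  # unknown identifier: nothing to switch to


print(translate("schemeB:0001"))  # ['lc:n79021164', 'schemeC:tw-42']
```

What counts as "equivalent" is decided by whoever builds the sets, which is exactly where the definitional care described above comes in.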

3. The key to functional bibliographic systems is in the data

There is a lot of expressed disappointment about library systems. There is no doubt that the systems have flaws. The bottom line, however, is that a system works with data, and the key to systems functionality is in the data. Library data, although highly controlled, has been primarily designed for display to human readers, and a particular kind of display at that.

One of the great difficulties is with what libraries call "authority control." Certain entities (persons, corporate bodies, subjects) are identified with a particular human-readable string, and a record is created that can contain variant forms of that string and some other strings with relationships to the entity that the record describes. This information is stored separately from the bibliographic records that carry the strings in the context of the description of a resource. Unfortunately, the data in the authority records is not truly designed for machine-processing. It's hard to find simple examples, so I will give a simplistic one:

US (or U.S.) is an abbreviation for United States. The catalog needs to inform users that they must use United States instead of US, or must allow retrieval under either. The authority control record says:

"US see United States"

United States, of course, appears in a lot of names. You might assume then that every place where you find "United States" you'll find a reference, such that United States. Department of State would have a reference from U.S. Department of State that refers the user from that undesirable form of the name ... but it doesn't. The reference from U.S. to United States is supposed to somehow be generalized to all of the entries that have U.S. in them. Except, of course, for those to which it should not be applied, like US Tumbler Co. or US Telecomm Inc. (but it is applied to US Telephone Association). There's a pattern here, but probably not one that can be discerned by an algorithm and quite possibly not obvious to all humans, either. What it comes down to, however, is that if you want machines to be able to do things with your data, you have to design your data in a way that machines can work with it using their plodding, non-sentient, aggravatingly dumb way of making decisions: "US" is either equal to "United States" or it isn't.
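The difference between a generalized reference and data a machine can act on can be shown with a small sketch. The two functions and the explicit set are illustrative assumptions; the headings are the ones mentioned above.

```python
def blanket_rule(heading):
    """Apply "US see United States" to every heading that starts with US."""
    if heading.startswith("US "):
        return "United States " + heading[len("US "):]
    return heading


# The alternative: the data states, heading by heading, where it applies.
EXPANDS_US = {"US Telephone Association"}


def explicit_rule(heading):
    """Expand US only where the data explicitly says it should be expanded."""
    return blanket_rule(heading) if heading in EXPANDS_US else heading


for heading in ["US Telephone Association", "US Tumbler Co."]:
    print(blanket_rule(heading), "|", explicit_rule(heading))
```

The blanket rule wrongly produces "United States Tumbler Co."; the explicit version gets both headings right, but only because someone recorded the decision in the data.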

Another difficulty arises from the differences between the ideal data and real data. If you have a database in which only half of the records have an entry for the language of the work, providing a search on language guarantees that many records for resources will never be retrieved by those searches even if they should be. We don't want to dumb down our systems to the few data elements that can reliably be expected in all records, but it is hard to provide for missing data. One advantage of having full text is that it probably will be possible to determine the predominant language of the work even if it isn't encoded in the metadata, but when you are working with metadata alone there often isn't much you can do.

A great deal of improvement could be possible with library systems if we would look at the data in terms of system needs. Not in an idealized form, because we'll never have perfect data, but looking at desired functionality and then seeing what could be done to support that functionality in the data. While the cataloging data we have today nicely supports the functionality of the card catalog, we have never made the transition to truly machine-actionable data. There may be some things we decide we cannot do, but I'm thinking that there will be some real "bang for the buck" possibilities that we should seriously consider.

Next... I'll try to get to the questions in Martha's article.