Coyle's InFormation: 05/01/2012

Wednesday, May 30, 2012

FRBR, FRAD, ISBD in LD by BNE

To translate the title of this post: the National Library of Spain (BNE) has developed a linked data set for its bibliographic data that makes use of FRBRer, FRAD and ISBD in RDF. There are a number of interesting aspects to their decisions, and we can now compare this solution to the solution used by the British Library for their National Bibliography (diagram in PDF).

There is an English-language article with a short explanation of the project and a high-level design of the data structure. The diagram there didn't print out terribly well, so you might want to consult the Spanish-language version from their ontology documentation page:

What I find particularly interesting about this interpretation of FRBR is that the Work ("obra") and Expression ("expresión") are considered authority entities, while the manifestation, in the darker area below, is a bibliographic entity. (If this is common knowledge, I must admit that I hadn't seen this expressed so clearly before.) The Work and Expression are coded using RDF properties (i.e. "data elements") from FRBRer, FRAD and FRSAD; the Manifestation uses RDF properties from ISBD. These are the same element sets that Pat Riva recently announced on various lists.

There is a fascinating visualization of the entities and links for de Cervantes and his works. Note that you can zoom in and see more detail in the many nodes there. The visualization project is "in progress" and at the moment you cannot move directly from the nodes on the screen to the actual data (except by plugging in the URIs that you see). The resulting displays are similar to those of DBpedia, which means that you basically have to know what you are looking at in order to follow from one entity to another.

To view some actual entity records, Daniel Vila of the OEG-UPM (Ontology Engineering Group) provided these examples that are linked together, a Person, a Work, and an Expression:

- Miguel de Cervantes: http://datos.bne.es/resource/XX1718747

- Don Quijote work: http://datos.bne.es/resource/XX3383563

- Don Quijote de la Mancha Spanish (original) Expression: http://datos.bne.es/resource/XX3383563spa

The Expression leads to many dozens of manifestations, but here is one example:

Obviously this isn't easy to read since the display does not yet substitute the names for the ID numbers of the ISBD and FRBR entities (e.g. isbd:P1004 is "has title proper").

If you look at the Work and Expression records you will see that they consist primarily of links -- the Work links to creators, subjects, and Expressions. The Work also provides some alternate display text and links to some outside sources:

The Expression relates to the Work and the many manifestations. The only thing we would think of as a data element in the Expression in this example is a coded version of the language:

dcterms:language <http://lexvo.org/id/iso639-3/spa>
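To make the "mostly links" point concrete, here is a minimal sketch of the Person, Work, and Expression above as subject-predicate-object triples in plain Python. The URIs and the dcterms:language statement come from the examples in this post; the predicate names prefixed with "ex:" are hypothetical placeholders, not the actual FRBRer properties used by the BNE, and manifestation links are left out.

```python
# Sketch only: the "ex:" predicates are hypothetical stand-ins for the
# FRBRer properties in the real BNE data.
BNE = "http://datos.bne.es/resource/"
PERSON = BNE + "XX1718747"         # Miguel de Cervantes
WORK = BNE + "XX3383563"           # Don Quijote (Work)
EXPRESSION = BNE + "XX3383563spa"  # Spanish (original) Expression

TRIPLES = [
    (WORK, "ex:hasCreator", PERSON),
    (WORK, "ex:hasExpression", EXPRESSION),
    (EXPRESSION, "ex:isExpressionOf", WORK),
    (EXPRESSION, "dcterms:language", "http://lexvo.org/id/iso639-3/spa"),
]


def links_from(subject, triples):
    """Return the (predicate, object) pairs whose subject matches."""
    return [(p, o) for s, p, o in triples if s == subject]


def is_link(obj):
    """True when the object is a URI rather than a text string."""
    return obj.startswith("http://")


if __name__ == "__main__":
    for predicate, obj in links_from(EXPRESSION, TRIPLES):
        print(predicate, obj)
```

Every object in this sketch is a URI, including the language, which is the sense in which these records are made of links rather than text strings.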

I find it particularly interesting to see this example of library data in linked data form because the only thing new here is the use of linking -- the data appears to me to be basically vanilla ISBD, re-organized using FRBR entities. One could conclude that this means that we could move into linked data without taking on the RDA step. I would very much like to see how these examples would be different if RDA were the applied cataloging rule set.

The view of Works, Expressions, plus of course Persons, Corporate bodies, and Subjects, as akin to today's authority records really serves to separate the linking entities from the description (which is in the Manifestation). The manifestation remains primarily text strings, but the examples in the BNE demo suggest that there are significant opportunities to create links between library resources and potentially to non-library resources without major modifications to descriptive cataloging. In fact, it looks like our attention should be on our authority data, which provides the linking opportunities.

More and more I come to the conclusion that in the linked data space the thing we seem to focus on today, descriptive cataloging, will be less useful than the entities that are represented by our authority data. For this reason I see VIAF as a very good start because links to and from those personal and corporate entities should give us a lot of connections to the rest of the world.

Monday, May 21, 2012

Google goes semantic

In a long-awaited move [1], Google has announced that its search will now be "semantic." They don't actually mean "semantic" in the sense of the semantic web, although there are similarities. While what Google is doing may not formally follow the W3C standards for the semantic web, there is no doubt that they are performing acts of "data linking" that make use of the concepts of linked data. The W3C standards for linked data are designed for openness, so that data from disparate communities can come together. Google has no obligation to play well with others and, as we saw with the development of schema.org, is in a position to make its own rules, many of which are known only within the giant Google-verse. They call their technology a "knowledge graph" and talk about "things not strings." I've used this same phrase myself in numerous presentations on linked data.

Google has always been about using links between things on the web to determine its brand of "relevance" of a web resource to a search query. By using existing linked data, via large stores of links like DBpedia, Wikipedia, Freebase, and presumably others, Google can now expand its offerings from a single list of results to additional information about the topic that might be the intended topic of the searcher. I say "might be" without any irony; whether in a web search engine or a library catalog, the communication between the searcher's mind and the device that provides results is always only approximate. What the additional data provides is not only more context but a more ample explanation of the topics that have been retrieved. No longer do users have to guess the meaning of the results in the result set from snippets; they can see a Wikipedia-like entry that not only gives them more information but also contains links to other sources of information on the topic.

Snippet

"Knowledge Graph" result

"Knowledge graph" detail

At a meeting of the Northern California Technical Services Group in Berkeley last Friday, I said to the group:

Imagine that you have an 18-year-old user who finds a novel on your library's shelf by Oliver North. The user looks up the author in your catalog and sees that this person has written a few other books, but oddly always with a "co-author." Is someone so inept worth reading? Now imagine that your catalog also presents the user with the context: Ollie North, Iran Contra, and related persons. Suddenly the user sees where North fits into US history, has a chance to find out what an interesting character he is, and the books take on a whole new meaning.

That was before I saw this Google result.

We treat library users as if they are all-knowing; as if they know each author in our catalog, as if the title of the book and the number of pages are sufficient for them to decide if it is a good read or has the information they need. This is so obviously false that I am at a loss to explain how we continue to work under this illusion.

[1] Google purchased the only linked-data search system, Freebase, in July of 2010, thus tipping their hand that they were moving in that direction. Not only did they acquire Freebase and the skills of its employees, they eliminated a potential rival (although it may be silly to consider that anyone could really be a rival to Google).

Sunday, May 13, 2012

RDA, DBMS, RDF

I have written before about some issues relating to RDA and RDF. Today I want to consider some things that should cause us to question the concept of "RDA in RDF."

For many decades we have been using relational databases to store our bibliographic data, bibliographic data that we create and exchange using the MARC format. Doing so was not by any means natural or intuitive because there is nothing about the structure or content of the MARC record that lends itself to being stored and managed in a relational database. The results were often awkward, inefficient, and unsatisfying.

Part of the reason for this is the unitary and flat nature of MARC. In spite of the long history of creating separate authority files, each MARC record is a complete and closed document with no actual connections to data outside of itself. While some database implementations for MARC do create relational tables for headings, the degree to which a MARC record can be separated out into tables is minimal and gains us very little in terms of the functionality of an RDBMS.

The underlying problem, however, is not in the structure of the MARC record but in the content of our catalog records. Moving from the card to a database for our data requires more than adding mark-up coding around the catalog data; to do so successfully requires re-thinking the data in terms of relational database principles. There are two basic principles to relational database design: repetition and combination.

To design for relational databases you look at your data to see what elements will be repeated in many different records. Rather than carrying those data elements in multiple records, you create a separate database table for each repeating element, and you store that element once. For example, if you are creating a database of mailing addresses, you see quickly that elements like state and zip code will appear in multiple records. You therefore create a table of state names and one of zip codes, and perhaps even one that links zip codes to city names. In this way, your database carries the string "Mississippi" only once, and that string is replaced in the records with a database pointer that uses much less internal storage. Ditto the zip code. And if the zip code is associated in a table with a city name, you also only store city names once, and each address record needs only a pointer to the zip code, not a city name. In fact, with a zip code you can get the city and state, and your design might look like:

In this way you have saved a huge amount of storage space. You have also made selection of your records on zip code, city and state much more efficient than if they were stored in every address record, because a search on a zip code, for example, retrieves a single entry in the zip code table, and that entry has database-managed links to the relevant records.
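As a rough sketch of that design (the original diagram is not reproduced here, so the table names, column names, and sample rows below are illustrative assumptions), the address example could be set up with Python's built-in sqlite3 module like this:

```python
import sqlite3


def build_address_db():
    """Create an in-memory database where state and city are stored once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE states (
            state_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE zip_codes (
            zip TEXT PRIMARY KEY,
            city TEXT NOT NULL,
            state_id INTEGER NOT NULL REFERENCES states(state_id)
        );
        CREATE TABLE addresses (
            address_id INTEGER PRIMARY KEY,
            recipient TEXT NOT NULL,
            street TEXT NOT NULL,
            zip TEXT NOT NULL REFERENCES zip_codes(zip)
        );
    """)
    # Sample data, invented for illustration.
    conn.execute("INSERT INTO states (state_id, name) VALUES (1, 'Mississippi')")
    conn.execute("INSERT INTO zip_codes VALUES ('39201', 'Jackson', 1)")
    conn.executemany(
        "INSERT INTO addresses (recipient, street, zip) VALUES (?, ?, ?)",
        [("A. Reader", "1 Main St", "39201"), ("B. Writer", "2 Oak Ave", "39201")],
    )
    return conn


def full_addresses(conn, zip_code):
    """Rebuild complete addresses for a zip code by following the pointers."""
    rows = conn.execute("""
        SELECT a.recipient, a.street, z.city, s.name
        FROM addresses a
        JOIN zip_codes z ON z.zip = a.zip
        JOIN states s ON s.state_id = z.state_id
        WHERE a.zip = ?
        ORDER BY a.address_id
    """, (zip_code,))
    return rows.fetchall()


if __name__ == "__main__":
    for row in full_addresses(build_address_db(), "39201"):
        print(row)
```

Each address row carries only the zip code; the city and the string "Mississippi" are each stored once and recovered through the joins.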

In a database of customer orders that has your inventory information along with customer addresses, you use the tables in your database to search for things like "all customers in Mississippi who have ordered WidgetX in the last six months." Information about your inventory and information about purchases are all in appropriate sets of tables in your database and you can combine the data elements to develop different views of the data.
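A query of that kind might look like the following minimal sketch, again using sqlite3. The schema, the customer names, and the order dates are all invented for illustration, and the "as of" date is passed in explicitly so that the six-month window does not depend on when the code is run.

```python
import sqlite3


def build_orders_db():
    """Create a tiny customers/products/orders database with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT
        );
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id),
            product_id INTEGER REFERENCES products(product_id),
            ordered_on TEXT  -- ISO 8601 date, so string comparison sorts correctly
        );
    """)
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
        (1, "Ann", "Mississippi"), (2, "Bob", "Mississippi"), (3, "Cy", "Ohio"),
    ])
    conn.executemany("INSERT INTO products VALUES (?, ?)", [
        (1, "WidgetX"), (2, "WidgetY"),
    ])
    conn.executemany(
        "INSERT INTO orders (customer_id, product_id, ordered_on) VALUES (?, ?, ?)",
        [
            (1, 1, "2012-03-01"),  # right state, right product, recent
            (2, 1, "2011-01-15"),  # right state and product, but too old
            (3, 1, "2012-04-01"),  # right product, wrong state
            (2, 2, "2012-04-10"),  # right state, wrong product
        ],
    )
    return conn


def recent_buyers(conn, state, product, as_of):
    """Names of customers in `state` who ordered `product` in the 6 months before `as_of`."""
    rows = conn.execute("""
        SELECT DISTINCT c.name
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id
        JOIN products p ON p.product_id = o.product_id
        WHERE c.state = ?
          AND p.name = ?
          AND o.ordered_on >= date(?, '-6 months')
        ORDER BY c.name
    """, (state, product, as_of))
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(recent_buyers(build_orders_db(), "Mississippi", "WidgetX", "2012-05-13"))
```

The point is that state, product, and date are separate elements that the database combines at query time; none of them had to be combined in advance when the data was entered.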

Where the goal in relational database design is to identify and isolate data elements that are the same, the goal in library cataloging data is exactly the opposite: headings are developed primarily to differentiate at the data creation point rather than allow combination within the database management system. The goal is to have each data point be as unique as possible and to be assigned to as few records as possible. Thus, library cataloging creates headings whose purpose is to distinguish between entries:

Shakespeare, William, 1564-1616. As you like it
Shakespeare, William, 1564-1616. As you like it. 1905
Shakespeare, William, 1564-1616. As you like it. 1911.
Shakespeare, William, 1564-1616. As you like it. 1919.
Shakespeare, William, 1564-1616. As you like it. Czech
Shakespeare, William, 1564-1616. As you like it. French

These headings are counter to the functioning of a database management system. If moved to a database table to facilitate retrieval, they will each point to only one or a very small number of records. This negates the space-saving aspect of database management, and it does not facilitate combination of data elements for retrieval. In the case of headings, the combination of elements is pre-coordinated in the data, rather than post-coordinated in the database retrieval function.

A database approach might break this data into four tables:

In this way one could search for this data by title, by title + author, date + language, or by any other combination of these four data elements. To search the library headings as anything but a single keyworded string, that is, to use these headings to perform searches on title or date or language, would be incredibly inefficient. The upshot is that library headings are not "relational" and do not contribute to the functionality that database management systems can provide. Instead, database management systems make use of the separate coded elements, such as date and language, for combinatorial retrieval. Names and titles, because they are text strings and do not have an identified presence in the stored records, must be searched separately rather than being available for relational combination. The results of this type of searching are less than optimal in speed and accuracy.
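Here is a minimal sketch of that four-table breakdown, applied to the Shakespeare headings above. The original diagram is not reproduced here, so the choice of author, title, date, and language tables, plus a linking "records" table, is an assumption, as are all table and column names. The parsing function is deliberately naive and only handles headings shaped like these six.

```python
import sqlite3

HEADINGS = [
    "Shakespeare, William, 1564-1616. As you like it",
    "Shakespeare, William, 1564-1616. As you like it. 1905",
    "Shakespeare, William, 1564-1616. As you like it. 1911.",
    "Shakespeare, William, 1564-1616. As you like it. 1919.",
    "Shakespeare, William, 1564-1616. As you like it. Czech",
    "Shakespeare, William, 1564-1616. As you like it. French",
]


def parse_heading(heading):
    """Split a pre-coordinated heading into (author, title, date, language)."""
    parts = heading.rstrip(".").split(". ")
    author, title = parts[0], parts[1]
    qualifier = parts[2] if len(parts) > 2 else None
    date = qualifier if qualifier and qualifier.isdigit() else None
    language = qualifier if qualifier and not qualifier.isdigit() else None
    return author, title, date, language


def build_db(headings):
    """Store each distinct author, title, date, and language exactly once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE titles (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
        CREATE TABLE dates (id INTEGER PRIMARY KEY, year TEXT UNIQUE);
        CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            author_id INTEGER REFERENCES authors(id),
            title_id INTEGER REFERENCES titles(id),
            date_id INTEGER REFERENCES dates(id),
            language_id INTEGER REFERENCES languages(id)
        );
    """)

    def lookup(table, column, value):
        # Insert the value if it is new, then return its row id.
        if value is None:
            return None
        conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
        row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,))
        return row.fetchone()[0]

    for heading in headings:
        author, title, date, language = parse_heading(heading)
        conn.execute(
            "INSERT INTO records (author_id, title_id, date_id, language_id) "
            "VALUES (?, ?, ?, ?)",
            (lookup("authors", "name", author), lookup("titles", "title", title),
             lookup("dates", "year", date), lookup("languages", "name", language)),
        )
    return conn


def count_by_title_and_language(conn, title, language):
    """Post-coordinated search: combine two elements at query time."""
    row = conn.execute("""
        SELECT COUNT(*) FROM records r
        JOIN titles t ON t.id = r.title_id
        JOIN languages l ON l.id = r.language_id
        WHERE t.title = ? AND l.name = ?
    """, (title, language))
    return row.fetchone()[0]
```

With the six headings loaded, the author and the title are each stored once and shared by all six records, while the search on title plus language is assembled when the query runs rather than being baked into a string.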

All of this may seem obvious to some of you, so you may be asking yourselves why I bring this up. I bring it up because even though RDA claims to have as its goal the creation of records in a relational design (see scenario one in this JSC document), it continues to instruct catalogers to create pre-coordinated headings like the ones above. Not only will these not be efficient or fruitful in a relational database, but this also brings into question whether RDA is truly modeled on the principles it claims to embrace. If it is not, we have cause to worry: we cannot move forward with data that does not conform to a modern model.

Note that in this post I have been emphasizing the use of relational database design for the data. The current plans for a new bibliographic framework appear to aim at a data model for RDA that is based on semantic web principles. Those principles are yet another significant evolution following on the database model, which is now considered waning technology. Other communities, ones that have been designing for database management requirements for their data for decades, are now looking at ways to transform that data to RDF. It is possible that we can skip the relational database phase of our data development and move directly into a semantic web model. However, to think that data created following RDA instructions, which is not even suitable for a relational database, could be made usable on the semantic web without major modifications is simply wrong. If we create a bibliographic framework that takes RDA as it has been described and ports that, unchanged, to RDF, we will create a data model that does not serve us, does not serve our users, and that cannot reasonably interact with other linked data on the web.

What we need is an analysis of our data, not a transformation of it "as is" to a new technology. If we aren't ready to admit that some traditional practices, like headings, are no longer useful or usable in today's technological environment, we cannot have any hope that our data will be relevant in the future.

(p.s. I anticipate that someone will state that headings are needed for alphabetical displays, to "collocate" the records. To that I reply: 1) you can do the same collocation using the data elements, and in fact you could devise multiple collocations by combining the elements in different ways and 2) a linear, alphabetic display is so anachronistic with today's technology, and so seldom used when available, that it is hard to justify the use of human catalogers to create these fields. If you still believe that library records must contain hand-crafted headings, all I can say is: you can believe what you want, but some of us will be exploring other solutions.)

Thursday, May 03, 2012

Wish list: dump the desk

When I worked for the University of California we moved our offices a number of times, and sometimes into space that was being newly renovated. During each of these moves we were given diagrams and asked to choose a configuration for our cubicles or offices. One of the configurations, at least for offices, was the option to be sitting behind a desk rather than having the desk against a wall. In executive offices, the "behind the desk" configuration is de rigueur. Its purpose is to put a solid barrier between the occupant and the visitor, and it symbolizes the power of the person who sits behind the desk.

At my public library, all of the available staff (except the shelvers) are located behind desks. There is the information desk, the reference desk, and the circulation desk. In this case, the desks do not make the person look powerful; in fact, they make the person look unavailable and powerless. The people behind a desk cannot (easily) leave the desk; they are stuck there. If a person asks for help the staff member can point but can't go with the person and help. Admittedly, some of the seated staff give the impression that just being asked to stand up is a burden.

The desk creates a physical space between the library user and the staff. Think about how it feels to be talking to a person who is behind such a barrier compared to being "corpo a corpo" next to them with no barrier. The social distance created by the desk is huge.

The desk sets up an inequality between the user and the staff member because the user has to go to the staff member; the staff member cannot go to the user. If you ask a reference question, head off to the stacks, then discover that you aren't finding what you need or have thought of another question, you have to go back to the reference desk. In a large library, that can be quite a trek. Do that a couple of times and you are likely to quit asking since it's too much trouble.

I want my library to be more like the Apple store. I want there to be staff visible in the library space but not sitting at desks. I want them to be, for example, near the catalog or at key entrances to the stacks. I want to be able to identify them as staff so I can approach them if I have a question -- it just takes a colorful T-shirt to accomplish this. I want them to be mobile, not glued to one spot. I want them to be in the same space as I am, not separated out to staff-only spaces. I want them to have their tools with them, perhaps a tablet where we can access the catalog and various resources together, right where we are. Even the shelvers could be equipped with the ability to send an SMS to the reference staff and either queue the person up for help or get an answer directly.

Some staff need to stay put, for example the circulation desk staff (note how we almost always call it a desk?). Even having staff at open "pods" rather than behind desks would give a different impression.

So this is my wish: a library that feels like it is staffed by real live people, who walk and talk and mingle with the users. Wow. What a concept.