Friday, March 05, 2010

MARC: from mark-up to data

The main reason that I keep pushing the semantic web is not that I think the semantic web is the answer to all of our problems -- but I do think we need to have something to be moving toward in terms of transforming our data carrier to something both more modern and web-compatible. The semantic web gives us some basic concepts of data design. I'm not sure that the semantic web concepts will hold for data as complex as the library bibliographic record, but there's only one way to find out: do it. That's a huge task, of course.

The first question to be answered is: What are our data elements? In theory, this should be one of the simpler questions, but it's not. I can create a list of all of the MARC fields, subfields, and fixed field elements (which I have, and they are linked from this page of the futurelib wiki), but that doesn't answer the question. Here's why:

Indicators

The indicators in the MARC fields are like a wild card in poker -- they can be used to utterly transform the play. Some of the indicators are simple and probably can be dismissed: the non-filing indicators and the indicators that control printing. Some are data elements in themselves: "Existence in NAL collection" is essentially a binary data element. Many further refine the meaning of the field, allowing the field to carry any one of a number of related subelements:
Second - Type of ring
# - Not applicable
0 - Outer ring
1 - Exclusion ring
Others name the source of the term, such as LCSH or MeSH. It'll take a fair amount of work to figure out what all of these qualifiers mean in terms of actual data elements.
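One way to start on that work might be to treat each combination of tag, indicator position, and indicator value as a candidate data element in its own right. Here is a minimal sketch in Python, assuming a hand-built lookup table. The "Type of ring" values are the second indicator of field 034, and the LCSH/MeSH values are the second indicator of field 650; the element names on the right-hand side are invented purely for illustration and are not drawn from any registered vocabulary.

```python
# Hypothetical lookup: (tag, indicator position, value) -> refined element name.
# The element names are made up for this sketch.
INDICATOR_ELEMENTS = {
    ("034", 2, "0"): "cartographicData.outerRing",
    ("034", 2, "1"): "cartographicData.exclusionRing",
    ("650", 2, "0"): "topicalSubject.LCSH",
    ("650", 2, "2"): "topicalSubject.MeSH",
}


def element_names(tag, ind1, ind2):
    """Return the refined element names implied by a field's indicators."""
    names = []
    for position, value in ((1, ind1), (2, ind2)):
        name = INDICATOR_ELEMENTS.get((tag, position, value))
        if name is not None:
            names.append(name)
    # No refining indicator found: fall back to the bare tag.
    return names or [tag]


print(element_names("650", " ", "2"))  # ['topicalSubject.MeSH']
```

A real table would have to cover every field, and it would still need a human decision about which indicators (non-filing, print control) get dropped altogether.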

Redundancy

There is non-textual (although not non-string) data in the MARC record, primarily in the fixed fields (00X) but also in some of the number and code fields (0XX). Some of these, actually most of these, are redundant with display information in the body of the record. Should these continue to be separate data elements, or can we remove this redundancy and still have useful user displays? Basically, having the same information entered in two different ways in your data is just begging for trouble, and we've all seen fixed field dates and display (260 $c) dates that contradict each other.
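To make the date problem concrete, here is a rough Python sketch of the kind of check a system ends up running. It assumes the 008 field and the 260 $c subfield have already been pulled out of the record as plain strings; the function names are my own, and the sample 008 value is made up and truncated.

```python
import re


def fixed_field_year(field_008):
    """Date 1 sits in character positions 07-10 of the 008 field."""
    return field_008[7:11]


def display_years(subfield_260c):
    """Pull every four-digit run out of a 260 $c string such as 'c2010.'"""
    return re.findall(r"\d{4}", subfield_260c)


def dates_agree(field_008, subfield_260c):
    """True when the coded year also appears in the display date."""
    return fixed_field_year(field_008) in display_years(subfield_260c)


sample_008 = "100305s2010    nyu"  # invented, truncated sample
print(dates_agree(sample_008, "c2010."))  # True
print(dates_agree(sample_008, "2009."))   # False
```

This only catches the simplest contradictions. Display dates like "[19--?]" contain no four-digit year at all, which is part of the argument for not keeping two copies of the same fact.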

Inconsistency

Primarily due to the constraints of the MARC format, the same information has been coded differently in different fields. A personal author entry in the 100 field uses subfields abcdejqu; in the 760 linking entry field, all of that data is entered into subfield a. It's the same data element, and by that I mean that the same contents are contained in the concatenation of abcdejqu as in a. Bringing together all of these crufty bits into a more rational data definition is something I really long for.
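A small Python sketch of what "the same data element" means here: concatenate the name subfields of the 100 field and compare the result with the single subfield a of the linking entry. It assumes a field is represented as a list of (subfield code, value) pairs; that representation, the function names, and the sample name are all invented for illustration.

```python
import re

# Subfields of the 100 field that carry the name, per the list above.
NAME_SUBFIELDS = set("abcdejqu")


def name_string(subfields):
    """Join the name-bearing subfields of a 100 field, in record order."""
    return " ".join(value for code, value in subfields if code in NAME_SUBFIELDS)


def normalize(text):
    """Lowercase and strip punctuation so the two codings can be compared."""
    return re.sub(r"[^\w\s]", "", text).lower().split()


def same_name(subfields_100, subfield_760a):
    return normalize(name_string(subfields_100)) == normalize(subfield_760a)


field_100 = [("a", "Doe, Jane,"), ("d", "1950-"), ("4", "aut")]  # made-up data
print(name_string(field_100))                    # Doe, Jane, 1950-
print(same_name(field_100, "Doe, Jane, 1950-"))  # True
```

Going the other direction, from the single subfield a back to separate name, dates, and so on, is much harder, which is why the subfielded form is the better starting point for a data definition.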

And of course my favorite... data buried in text

So much of our data isn't data, it's text, or it's data buried in text. My favorite example is the ISBN. Everyone knows how important the ISBN is in all kinds of bibliographic linking operations. But there isn't a place in our record for the ISBN as a data element. Instead, there is a subfield that takes the ISBN as well as other information.
020 __ |a 0812976479 (pbk.)
This means that every system that processes MARC records has to have code that separates out the actual ISBN from whatever else might be in the subfield. Other buried information includes things like pagination and size or other extents:
300 __ |a 1 sound disc : |b analog, 33 1/3 rpm, stereo. ; |c 12 in.

300 __ |a 376 p. ; |c 21 cm.
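Here is roughly the kind of code every system ends up carrying around to dig the data out of these strings. It is a sketch in Python that handles only the three examples above; the function names and regular expressions are my own assumptions, and real records contain many more patterns than these.

```python
import re


def split_isbn(subfield_020a):
    """Separate the ISBN from a trailing qualifier such as '(pbk.)'."""
    match = re.match(r"\s*([0-9Xx-]{10,17})\s*(?:\((.*)\))?", subfield_020a)
    if match is None:
        return None, None
    return match.group(1).replace("-", "").upper(), match.group(2)


def page_count(subfield_300a):
    """Return the number before 'p.', or None when the extent is not pages."""
    match = re.search(r"(\d+)\s*p\.", subfield_300a)
    return int(match.group(1)) if match else None


def size(subfield_300c):
    """Return (number, unit) for sizes given in cm or inches."""
    match = re.search(r"(\d+)\s*(cm|in)\.", subfield_300c)
    return (int(match.group(1)), match.group(2)) if match else None


print(split_isbn("0812976479 (pbk.)"))  # ('0812976479', 'pbk.')
print(page_count("376 p. ;"))           # 376
print(page_count("1 sound disc :"))     # None
print(size("12 in."))                   # (12, 'in')
```

Note that none of this validates the ISBN or understands what "1 sound disc" means; it just peels text apart, which is exactly the problem.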



Once this analysis is done (and I do need help, yes, thank you!), it may be possible to compare MARC to the RDA elements and see where we do and don't have a match. I have a drafty web page where I am putting the lists I'm creating of RDA elements, but I will try to get it all written up on the futurelib wiki so it's all in one place. I encourage others to grab this data and play with it, or to start doing whatever you think you can do with the registered RDA vocabularies. And please post your results somewhere and let me know so that I can gather it all, probably on the wiki.

Thursday, March 04, 2010

The Letters Keep Coming In

Today I received a copy of a letter written by Roman Kochan, Dean and Director of Library Services at the California State University, Long Beach (CSULB). It's the perfect day for this, because today is the national day of protest in support of education. This movement has blossomed (exploded?) over the deep cuts the California state legislature has made to the education budget in the state, cuts which are having a devastating effect on the CSU system, with the libraries extremely hard hit.

The letter is addressed to "Link+™ Member Libraries and ILL Partners." The subject line on Kochan's letter reads: Threat to CSULB Library's ILL Participation. He states that, faced with budget cuts not only this year but foreseeably for many years to come, CSULB decided to move to SkyRiver™ as its cataloging utility, anticipating significant savings.

The next three paragraphs are worth quoting in their entirety:
"We notified OCLC of this decision, while at the same time advising them of the Library's intent to continue membership in OCLC, to continue to make use of OCLC interlibrary loan services, and to contribute records for our current and future acquisitions to OCLC for batch upload. OCLC's charge for batch upload was (until recently) posted on the OCLC website as 23¢ per record. That is the amount I referred to in my letter to the organization. I have subsequently learned that:
  • The price schedule for batch downloading [sic, read: uploading] that contained the 23¢ charge has suddenly and mysteriously disappeared from the OCLC website
  • Another academic library that chose to displace OCLC with SkyRiver reports that OCLC has quoted a revised charge for downloading their records that amounts to about $2.85 per record; it is a charge that they report would effectively (and one might not think coincidentally) offset the savings accrued from their change to SkyRiver."
The irony in all of this is that CSULB will still be able to have up-to-date ILL services using INN-Reach and Link+, the Innovative Interfaces (III) ILL service. It's ironic because SkyRiver was founded by Jerry Kline, the owner of III. Link+ is undoubtedly of smaller reach than OCLC's ILL services, but may in the long run grow if more III libraries move to SkyRiver.

Offsetting the cost of having a library move to another vendor may make some economic sense, but this is a matter that will need to get cleared up before other libraries move to SkyRiver thinking that they'll be able to upload their records to OCLC for $.23. MSU and CSULB were caught by surprise, which is very unfortunate.