Tuesday, December 18, 2007

Definitions in RDA Scope

Originally posted to the RDA list.

The RDA scope document defines some basic concepts that presumably will be used throughout RDA. Some of these concepts it takes from the Dublin Core Abstract Model. In particular, it uses "literal value surrogate" and "non-literal value surrogate." These are defined in footnotes of the scope document as:

The term literal value surrogate is used as defined in the DCMI Abstract Model: “a value surrogate for a literal value, made up of exactly one value string (a literal that encodes the value)”.

The term non-literal value surrogate is used as defined in the DCMI Abstract Model: “a value surrogate for a non-literal value, made up of a property URI (a URI that identifies a property), zero or one value URI (a URI that identifies the non-literal value associated with the property), zero or one vocabulary encoding scheme URI (a URI that identifies the vocabulary encoding scheme of which the value is a member), zero or more value strings (literals that represent the value)”.

I found a more concise definition of this in a PPT by Lutz Maicher, University of Leipzig:

- a resource which is a non-literal value is represented by a proxy
- a resource which is a literal value is represented as literal

In the above, "literal" means a text string. So "Melville, Herman" is a literal, while "http://www.loc.gov/names/#n_79006936" is a non-literal proxy (because it points to the authority record, which is where the actual value is held).

The scope document then states:

- A label is represented by a literal value surrogate.
- A quantity is represented by a non-literal value surrogate.
- A quality is represented by a non-literal value surrogate.
- A type is represented by a non-literal value surrogate.
- A role is represented by a non-literal value surrogate.

However, in the element analysis in the scope document, it shows that quantities can be represented identically to labels (and I suspect that all other data types can as well). So that document has (and here there is a diagram that I cannot reproduce in email):

label
[resourceURIref] -> rda:title_proper -> [plain value string]

quantity
[resourceURIref] -> rda:extent -> [typed value string]^^[syntax encoding scheme]
- or -
[resourceURIref] -> rda:non_linear_scale -> [plain value string]

Given that the label example and the second example under quantity are structurally the same, I don't see how one can be a literal and one a non-literal.
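To make the structural point concrete, here is a minimal Python sketch of the two surrogate shapes as quoted from the DCMI Abstract Model above. The class names, field names, and the `example:creator` property are illustrative assumptions, not anything defined by DCMI or RDA. The bracketed strings are the placeholders from the scope document's diagram.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class LiteralValueSurrogate:
    # "made up of exactly one value string"
    value_string: str


@dataclass
class NonLiteralValueSurrogate:
    # A property URI, zero or one value URI, zero or one vocabulary
    # encoding scheme URI, and zero or more value strings.
    property_uri: str
    value_uri: Optional[str] = None
    vocabulary_encoding_scheme_uri: Optional[str] = None
    value_strings: List[str] = field(default_factory=list)


Surrogate = Union[LiteralValueSurrogate, NonLiteralValueSurrogate]
# (subject, property, value surrogate); this tuple layout is an assumption.
Statement = Tuple[str, str, Surrogate]


def is_literal(statement: Statement) -> bool:
    """True when the statement's value is a literal value surrogate."""
    return isinstance(statement[2], LiteralValueSurrogate)


# The label example and the second quantity example from the diagram.
title_statement: Statement = (
    "[resourceURIref]",
    "rda:title_proper",
    LiteralValueSurrogate("[plain value string]"),
)
scale_statement: Statement = (
    "[resourceURIref]",
    "rda:non_linear_scale",
    LiteralValueSurrogate("[plain value string]"),
)

# A non-literal example; "example:creator" is a hypothetical property.
author_statement: Statement = (
    "[resourceURIref]",
    "example:creator",
    NonLiteralValueSurrogate(
        property_uri="example:creator",
        value_uri="http://www.loc.gov/names/#n_79006936",
        value_strings=["Melville, Herman"],
    ),
)

print(is_literal(title_statement), is_literal(scale_statement))
```

Modeled this way, the title and the non-linear scale statements can only be built with the same kind of surrogate, which is the inconsistency I am pointing to.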

I see two possibilities here. One is that all of the above has no real effect on the development of RDA, and therefore any errors in interpretation of the DCMI model can be ignored. The other is that the misunderstanding (which I think it is, but wait to be proven wrong) is significant, and therefore needs to be corrected as part of the development of RDA.

My gut feeling is that it is the former -- I don't see references to these definitions in the RDA text itself, and all values are treated as simple value strings. For example, dates are just text:

Record the date of the expression by giving the year or years alone.
1940 (p. 6-47 5rda_sec2349.pdf)

Quantities also seem to be just text strings:
46 slides
12 cm (from 5rda-parta-ch3rev.pdf)

Thus, at least as far as the RDA text is concerned, there are only literal values.

If this is not the case, would someone please present the argument for a different understanding? Thank you.

Friday, December 07, 2007

Interpretations of FRBR Classes

Because it makes use of an entity-relationship model, FRBR consists of two primary concepts: things and relationships. (I often think of them as nouns and verbs.) In the "things" category, FRBR defines 10, which it calls entities. They are: Work, Expression, Manifestation, Item, Person, Corporate Body, Concept, Object, Event, Place.

This is an admirably short list of basic building blocks for bibliographic data. The question is: is it enough? Can we really express our bibliographic data with just these basic concepts? The answer is: probably not, although we should take a lesson from FRBR and try to keep our set of basic entities small while allowing them to be extended to express more complex concepts.

As an exercise, I took two well-known attempts to model FRBR using formal definitions. One is the FRBR in RDF, the other is FRBRoo. I also took the RDF entries that Martha Yee created for her cataloging rules and added those to the comparison, although it is important to note that Yee's set of RDF statements is intended to go beyond FRBR, since it is an expression of cataloging rules and not just the FRBR model.

In each of these three efforts, the FRBR entities are recorded as classes, and the FRBR relationships are recorded as properties. This is in keeping with the definitions in the RDF schema. What is interesting is the number of classes that are defined:

  • FRBR in RDF: 13 classes
  • FRBRoo: 23 classes, 18 sub-classes, 41 total
  • Yee's schema: 23 classes

These are compared to the 10 classes (entities) defined in FRBR. Since no one defined fewer classes, we need to look at what additional classes were defined. But first, there are a few cases where FRBR classes were not included, usually because they were substituted with a set of more detailed classes.

  • FRBRoo does not include Manifestation, but instead has Manifestation product type and Manifestation singleton
  • Yee's substitutes Event as subject for the FRBR class Event and substitutes Place as geographic area and Place as Jurisdictional Corporate Body for the FRBR Place

FRBR in RDF

FRBR in RDF adds only three classes. Two of these (Endeavor and ResponsibleEntity) are supersets of FRBR classes. Endeavor is a generalization that can be related to a work, expression, or manifestation. Similarly, ResponsibleEntity is a more general term that can relate to either a corporate body or a person. Both of these seem fairly sensible, allowing you to refer to the intellectual content or some actor without having to specify more information. It's like being able to say "it" without having to say exactly what you are referring to.
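The superset idea can be sketched in Python as a small class hierarchy. This is only an illustration of the generalization described above, not the FRBR in RDF vocabulary itself. The `describe` function and the "is responsible for" wording are hypothetical.

```python
class Endeavor:
    """Generalization over Work, Expression, and Manifestation."""

    def __init__(self, label: str):
        self.label = label


class Work(Endeavor):
    pass


class Expression(Endeavor):
    pass


class Manifestation(Endeavor):
    pass


class ResponsibleEntity:
    """Generalization over Person and CorporateBody."""

    def __init__(self, name: str):
        self.name = name


class Person(ResponsibleEntity):
    pass


class CorporateBody(ResponsibleEntity):
    pass


def describe(endeavor: Endeavor, agent: ResponsibleEntity) -> str:
    # Accepts any Endeavor and any ResponsibleEntity without needing
    # to know which specific subclass was supplied.
    return f"{agent.name} is responsible for {endeavor.label}"


print(describe(Work("Moby Dick"), Person("Melville, Herman")))
```

The function works the same whether it is handed a Work or a Manifestation, a Person or a CorporateBody, which is the "it" convenience that the two superclasses provide.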

The third class that is added is Subject. As a matter of fact, all three of these include some instance of subjects as classes in their schemas. FRBR clearly treats subject as a relationship. (And I would like to understand why these three interpreted subject as a class -- so post if you have ideas/knowledge on that, please.)


FRBRoo

FRBRoo is a very interesting interpretation of FRBR. As they state in the document, attempting to re-define FRBR using object-oriented rules rather than entity-relationship rules is a way to test the underlying concepts in FRBR. They also tackle the elements that in FRBR are called "attributes." (Aside: The FRBR attributes are a bit odd, IMO. They seem to be all over the place and there is no explanation of how they were determined or any way to give them some organization. I don't think they actually fit the definition of attributes in E-R, which seem instead to be on the order of identifiers.) The folks working on FRBRoo decided to treat the attributes as properties, that is, relationships between the classes.

FRBRoo defines 23 primary classes with 18 subclasses. They address the issue of complex items, such as articles within serials or collections of essays, by creating classes for aggregate and serial works. Some of the classes seem to be what I would normally understand as genres. As an example, there is a class Performance Plan that is described as:

This class comprises sets of directions to which individual performances of theatrical, choreographic, or musical works and their combinations should conform.

Another example of a new class is Publication Event. This is an action that is part of the work flow of publication, such as

Establishing in 1972 the layout, features, and prototype for the publication of “The complete poems of Stephen Crane, edited with an introduction by Joseph Katz” (ISBN “0-8014-9130-4”), which served for a second print run in 1978.

Being an action, I would tend to express this as a property (a verb). So the layout, features, etc. could be subclasses of a manifestation, there would be an actor (a noun, or a class, probably the publishing house, or more specifically a book designer), and a time. The verb (or property) could be "designed" "typeset" "printed" etc. This makes me wonder about the FRBR class Event as a noun, but I think I could buy into a concept of named events ("WWII" "Election day 2008" "Beatles first appearance on Ed Sullivan"). Interestingly, it does appear that all of these are events as subjects, as Event is defined in FRBR; the FRBRoo event does not appear to have this noun-ish characteristic.

Yee Schema

Martha Yee's set of classes (23 of them, but not the same 23 as FRBRoo) includes Genre/Form as a class. Genre/form seems to be more of an attribute about a work rather than something that has "thingness" in itself. It's hard to imagine how you can have genre/form without it relating to a work. (As opposed to: you can have a person or a corporate body that are things in and of themselves -- that have specific, unique identities.)

It has some classes that might be considered sub-classes. For example, Place as geographical area and Place as jurisdictional corporate body would seem to be sub-classes of Place, although Yee does not include Place itself in her schema. I'm less clear about classes such as Corporate Subdivision, which has a part/whole relationship with Corporate Body, not a sub-class relationship. (Sub-class would be an "is a type of" relationship, and corporate subdivision is not a type of corporate body, it's a part of a corporate body.) Ditto the subject-related terms: Subject, Subject subdivision, Subject chronological subdivision, Subject form subdivision, Subject geographical subdivision, Subject topical subdivision. In FRBR, the subject is a relationship with the work. These look to me to be relationships with the subject heading, although there is no class for subject headings (unless that is what is meant by the class Subject, but I don't think it would be a good idea to equate subject with subject heading because it makes it impossible to include classifications as subjects or keywords as subjects).

What's the upshot? Well, it would take a good sit-down with all involved to hash out the differences, to understand what each group or person was thinking, and to see if we can formulate a theory of how one extends FRBR to meet one's needs. If a number of people turn out to have the same needs, then it may be that the FRBR model itself needs to take in those ideas. The only way to work this out is to keep modeling and sharing. So I thank the three featured here for the extensive work that they have done in this area.

Friday, November 30, 2007

Titles in Retail and Publisher Data

There's been talk and action lately around libraries making use of data provided by publishers or retailers. What little experience I have in this area leads me to understand that we need to do some serious studies of the bibliographic metadata that is created in situations outside of libraries. What I present here is a single bit of work, not a study, and the numbers should be considered valid only for this particular set of data. However, I think that this shows the value that real studies could produce in terms of understanding our relative approaches to metadata.

For those who prefer not to read further, let me give you my conclusions here:

1. Libraries focus on the title as it is given on the title page. Others (publishers, retailers) are more interested in the cover title, both in its promotional role and as that which the buyer and retailer see when handling the product and creating order on shelves.

2. While online bookstores rely heavily on the ISBN to identify the item, and therefore are motivated to correct the ISBN if needed, the Library of Congress records in this study appear to be less often updated to correct an ISBN. (Therefore, it would be interesting to do this comparison with OCLC records to see if they get corrected more frequently than LoC.)

3. Retailers and publishers use the form of the author's name that is on the book itself and do not concern themselves with the unique identification of authors. Only libraries use the authoritative name form, which may not match up to the form used by others.

4. Publishers and libraries have different data points for number of pages, with libraries using the numbered pages and publishers focusing on the total number of sheets.

I should also mention that everyone except libraries seems to use title case for titles. Does anyone know the logic behind the library decision not to use title case?

The Comparison

- 250,000 LC MARC records compared to Amazon online data, matching on ISBN, then comparing titles

The Numbers

- 71,000 records matched on ISBN
- of those, 67,000 also matched on title, or on partial title (left-anchored)
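For readers who want to try something similar, the two-step comparison (match on ISBN, then compare titles with a left-anchored partial match) might look roughly like the Python sketch below. This is not the code that produced the numbers above. The normalization rules, the function names, and the ISBN keys in the usage example are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def title_words(title: str) -> List[str]:
    """Lowercase the title, replace punctuation with spaces, split into words."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return cleaned.split()


def titles_match(marc_title: str, retail_title: str) -> bool:
    """True on an exact match or when one title is a word-level prefix of the other."""
    a = title_words(marc_title)
    b = title_words(retail_title)
    if not a or not b:
        return False
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[: len(shorter)] == shorter


def compare(
    marc_by_isbn: Dict[str, str], retail_by_isbn: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (ISBNs whose titles match, ISBNs whose titles differ)."""
    matched, differing = [], []
    for isbn, marc_title in marc_by_isbn.items():
        retail_title = retail_by_isbn.get(isbn)
        if retail_title is None:
            continue  # no ISBN match, so there is nothing to compare
        if titles_match(marc_title, retail_title):
            matched.append(isbn)
        else:
            differing.append(isbn)
    return matched, differing


# Placeholder keys stand in for real ISBNs.
marc = {"isbn-a": "Texas", "isbn-b": "Walking with dinosaurs", "isbn-c": "Moby Dick"}
retail = {"isbn-a": "State Shapes: Texas", "isbn-b": "Walking With Dinosaurs"}
print(compare(marc, retail))
```

Comparing whole words rather than raw characters is a design choice in this sketch: it keeps a short title from matching a longer one that merely begins with the same letters.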

Reasons for Non-matches

Of those that didn't match, the reason was (based on an unscientific sample, so the percentages are just a rough guide):

1. The Amazon entry includes what libraries consider to be the series title as part of the title. These are often those "publisher series" that would be placed in a MARC 490. They generally appear prominently on the cover of the book, are presented with the title on the cover, and are carried in the cover design. Retailers also seem to add key information that would appear on the cover, such as the fact that the item includes a CD-ROM.

Amazon: State Shapes: Texas
MARC: Texas (series: State Shapes)

(Note, I've gotten a better look at ONIX data and it turns out that in many cases the series is coded as the title, and the book title is coded as the subtitle. So in the example above, the data that Amazon received would have had "State Shapes" as the title and "Texas" as the subtitle.)

Amazon: How to Prepare for the GMAT with CD-ROM
MARC: How to prepare for the graduate management admission test
(In this case there are two versions of the book, one with, one without the CD. MARC has the same title for both)

Number: ~45%

2. Minor differences in wording or spelling errors. These are often on Amazon titles, perhaps those that have been entered in by bookstores or small retailers that sell through Amazon. There are also some obvious differences in practice which may or may not be consistent in the retailer data.

Amazon: Literature of Memory
MARC: Literatures of memory

Amazon: One Eye Laughing, the Other Eye Weeping
MARC: One eye laughing, the other weeping

Amazon: Java(TM) Server and Servlets
MARC: Java server and servlets

Number: ~27%

3. The title in Amazon includes the name of the author; the MARC record separates these into author and title. (Amazon also includes the author name in the author field.) There are also times when this is reversed (e.g., MARC includes the author name in the title, Amazon does not):

Amazon: John Thelwall's the Peripatetic
MARC: The peripatetic

Amazon: BBC Walking with Dinosaurs
MARC: Walking with dinosaurs

Amazon: Southern Christmas
MARC: Emyl Jenkin's southern Christmas

Number: ~10%

4. Titles so entirely different that it appears to be a wrong ISBN. Often it is an ISBN from another book by the same publisher, possibly a mistaken re-use. Of these, the entries on Amazon appear to be correct, while those in LC records often contain an ISBN that retrieves more than one item from that publisher. I wouldn't be at all surprised to learn that the ISBN received by the CIP program is often not the actual final ISBN. When the book then arrives at LoC, it may be hard to determine that the ISBN has changed, or why.

Amazon: Harriet Tubman
MARC: Paul Robeson

Number: ~8%

5. There are differences in the treatment of numbers and abbreviations that appear in titles. In some cases, the title on Amazon has been abbreviated beyond what appears on the book, probably by a bookseller saving keystrokes. It's also my guess that in some Amazon entries the abbreviation or number is spelled out to influence retrieval.

Amazon: Ten Best Teaching Practices
MARC: 10 best teaching practices

Amazon: God, Doctor Buzzard, and the Bolito Man
MARC: God, Dr. Buzzard, and the Bolito Man

Number: ~6%

6. Mysterious, undiagnosed, or possible errors in the comparison algorithm. I'll work more on these.

Amazon: Decisive Treatise and Epistle Dedicatory
MARC: The book of the decisive treatise determining the connection between the law and wisdom

Number: ~14%

Link to the Data

This link takes you to a page with comparisons that link to the MARC record held at the Internet Archive and to the Amazon page for the book. You can see these and other differences by looking at the two sets of data. Again, note that this was a quick comparison and there are some errors in the comparison methodology that we are already aware of.

Some Other Observed Differences

Although not included in this group (which only compares titles) there are other differences that I have observed in ONIX data but that I haven't attempted to measure.

7. Authors. An obvious area of difference is that publisher and retailer data does not use the library name authorities form of the name ("Smith, John, 1837-"). Publishers tend to include a display form of the name in their data ("John Smith") and some of them also include the inverted form ("Smith, John"), but there is no concept of unique identification of authors across time and between different publishers. Talking with publisher representatives, I also have learned that the form of the author's name that will be used on the printed book and in publicity may be designated in the contract between the author and the publisher. This does not mean that publishers cannot include a version of the name as found in the LC name authorities file. However, it is likely that there will be multiple forms of the name in non-library data, rather than the single form found in library records.

8. Pagination. I was looking at pagination as part of a de-duping algorithm because it served us well when de-duping within library data as a way to distinguish different editions. This will not be the case between library and publisher data, at least not with the data that I have seen. Publishers have an entirely different measure of pages, and it is (logically) the actual number of pages in the physical book. This is clearly a matter of cost to them, and also a key piece of data about the manufacture of the book. Libraries, instead, record the printed page numbers. This latter is immediately visible to the cataloger, while the publisher count would mean having to actually hand count the pages in the book. In this case, libraries and publishers are each working with the information that is easily found at hand, but the results differ considerably.

Thursday, November 15, 2007

Future of Bibliographic Control, LC, 11/13

Notes from the meeting on Nov. 13 of the Working Group on the Future of Bibliographic Control.

These are my notes and should NOT be taken to represent accurately the thoughts of the working group, only my quick recording of what I understood at the meeting. Also, I must add the disclaimer that I have been engaged as a consultant to the group for the writing of the report. I attempt in that work to be as faithful to the outcomes desired of the group as I can. However, I admit that pure objectivity is a chimera, so my own opinions may come through in the text below.

There was an introduction explaining the creation of the working group (which you can read about on the working group's web site: http://www.loc.gov/bibliographic-future/). The group presented an interim report to the Library of Congress. The full report will be available by December 1 for public comment. The comment period will end on December 15, and the final report will be presented on January 8, 2008.

The report was commissioned by the Library of Congress, but many of its recommendations involve the library community and other players in its environment. There are over 100 individual recommendations in five general areas.

The working group concluded that there are three major "sea changes" that are needed in the library community:

1. We must redefine bibliographic control broadly to include all materials, a widely diverse community of users, and a multiplicity of venues where information is sought.

2. We must redefine the bibliographic universe to include all stakeholders, including the for-profit organizations that are involved in information delivery and digitization.

3. The role of the Library of Congress must be redefined as a partner with other libraries and with non-library institutions, working to achieve the goals of the library community.

The five areas of recommendations are:

1. Increase the efficiency of bibliographic production for all libraries through cooperation and sharing of bibliographic records and through the use of data produced in the overall supply chain.

2. Transfer effort into high value activity. In particular, provide greater value for knowledge creation through leveraging access for unique materials held by libraries, materials that are currently hidden and under-used.

3. Position our technology by recognizing that the Web is our technology platform as well as the appropriate platform for our standards. Recognize that our users are not only people but also applications that interact with library data.

4. Position our community for the future by adding evaluative, qualitative and quantitative analyses of resources. Work to realize the potential provided by the FRBR framework.

5. Strengthen the library and information science profession through education and through development of metrics that will inform decision-making now and in the future.

Under each of these areas there are sets of recommendations. The full set of recommendations is fairly detailed, and the group presented high level groupings of recommendations in the first four areas. (Area five was not presented in detail at the meeting.)

In area 1, the recommendations are grouped:

1.1 Eliminate redundancies in the production of bibliographic metadata. This means making use of data that is created elsewhere in the supply chain, and increasing the sharing of bibliographic records and modifications to records. In particular, the group asks for an examination of barriers to sharing.

1.2 Increase the distribution of responsibility for bibliographic record production. Increase the number of institutions that participate in shared cataloging activities.

1.3 Collaborate on authority record creation. Similar to 1.2, this recommends that the number of participants in authority record creation be increased, but it also asks that we look at the possibility of sharing across sectors and internationally, to reduce the number of times that an authoritative heading must be created.

Area 2 is called "Enhance Access to Rare and Unique Materials." In this area the group states that any efficiencies gained in other areas should allow the redirection of energy to providing access to unique materials that are held by libraries and other cultural heritage institutions. In particular, the group recommends:

2.2 Integrate access to rare & unique materials with other library materials

2.3 Share bibliographic data relating to these materials. The sharing of bibliographic data must not be limited to those areas where copy cataloging is desired.

2.4 Encourage digitization to allow broad access

Area 3 is about technology and the Web:

3.1.1 Integrate library standards into the Web environment

3.1.2 Extend the use of standard identifiers for bibliographic entities, and include those identifiers in bibliographic records.

3.1.3 Develop a more flexible, extensible metadata carrier that can be readily exchanged with non-library applications.

Area 3 also addresses standards:

3.2.1 Develop standards with a focus on return on investment. Do analysis before beginning the standards process.

3.2.2 Incorporate usage data and lessons from use tests in the standards development process

Area 4 is about positioning the library community toward a more progressive future. In this area there are three main recommendation areas:

4.1 Design for today's and tomorrow's user. This means that we must design into our catalogs and other tools the ability to present evaluative information, and to allow and encourage users to interact with bibliographic data. We must also make use of statistical and other computationally-derived information in our user services.

4.2 Realize FRBR. The framework known as FRBR has great potential but so far is untested. It is being used as the basis for RDA, even though FRBR itself is not clearly understood. The working group recommends that no further work be done on RDA until there has been more investigation of FRBR and the basis it provides for bibliographic metadata. [Note: this recommendation is likely to change such that there will be specific recommendations relating to RDA; FRBR will be treated separately.]

4.3 Optimize LCSH for Use and Re-use. Encourage an analysis of LCSH that would move the system toward a more facetted subject system. Work to create more links between LCSH and other subject heading systems in use. Recognize that with the digitization of works the act of subject assignment may benefit from computational analysis.

In the time that I was at the meeting (I had to leave before the question period ended) there were two questions/comments. The first had to do with the fact that while there are costs to today's methods of bibliographic control, changes in bibliographic control will have costs as well. (Here it would be good to listen again to the talk given by Rick Lugg at the meeting held at LoC. He spoke of the costs of NOT changing, something that is hard to measure but is very real.) The other comment (from Barbara Tillett) mentioned many of the recommendations and stated that LoC is already engaged in, or has rejected, analogous activities. It was acknowledged, however, that LoC had not made these activities public, so the community is generally unaware of the progress made. To me this points out one of the areas that we all need to work on, which is sharing information about our projects and their progress so that the community as a whole can benefit from work done by a single institution.

Wednesday, November 07, 2007

Hierarchy v. Relationships

The use of hierarchy as an organizing principle keeps coming up. I think we are attracted to hierarchy because of its neatness, even though in fact the real world is organized more like fuzzy sets. Fuzzy sets are hard to comprehend, nearly impossible to draw, and can't be slotted neatly into an application.

When people talk about FRBR, they are often focussed on the Group 1 entities, and those are seen as hierarchical. They tend to be shown as:
- Work
-- Expression
--- Manifestation
---- Item

as if we'll fit all of our intellectual works into such a neat hierarchy. T'ain't so. Of all of the relationships that are talked about in FRBR (I almost said "expressed" but that term has now been given a new meaning in this discussion) I think these are the least interesting. And they become even less interesting when we move beyond the traditional inventory control function of the library catalog and begin to see ourselves as navigating in a knowledge universe. But first let me tackle the Group 1 entities.

There are complaints (or remarks, depending on the context) that we don't have an agreed on definition of Work, and that the division between Work and Expression is unclear. They are unclear because in real life there isn't a neat hierarchy that just needs to be modeled. What is a Work is entirely contextual -- when I'm looking for an article, the article is a Work. When I'm subscribing to a journal, the journal is a work. When I'm on iTunes a song is a work, when I'm in the music store the album is a work. A Work is the content I am seeking at that time. In the imaginary universe where I get to create my bibliographic system, a Work will be defined as: anything you wish to talk about, point to, address. So a book-length text is a work, an article in a journal is a work, a journal is a work, a book chapter is a work -- all at the same time and in the same system. For one person the book Wizard of Oz and the movie Wizard of Oz will be a single Work. To a film buff, the director's cut of Blade Runner and the original release are distinct works. To be a Work, it just has to be definable and have a way to name it, that is it has to have an identifier. But anything can be a Work. As a matter of fact, I probably won't use the term Work at all in my universe.

As for Expressions, there will be very obvious Expressions of Works, and there will be fuzzier Expressions. There will be Expressions that express more than one Work. Expression is a relationship, not a subset. If you don't have to organize your bibliographic universe in a hierarchical way, then the need to slot each Expression under a Work goes away, although the relationship can remain.

I'm less sure about Manifestation and Item, even though these are the most concrete of the Group 1 entities. Are they a legitimate focus of a Knowledge Management system, or are they about managing physical objects? When I think about some of the uses of bibliographic data, for instance as citations in a text publication, Manifestation seems to be mainly about locating -- so if I've quoted a passage from a book, I need to cite the manifestation and the page because that's the only way that someone else can find that exact quote. When I include a URL in a document that links to a particular digital manifestation, I am giving the user a direct link to the location. Manifestations and Items will be of interest in some instances, say to rare book collectors, but I'm not at all sure that those instances justify the emphasis they have been given. And if the purpose is primarily inventory control, then I think those relationships will be managed to the extent that they matter to the library. For example, a public library may not terribly care which manifestation of the book Moby Dick is on its shelves, although its inventory system will need to know the barcode, and its acquisitions system will need to store how much the library paid for it and the provider.

The truly interesting relationships in FRBR are those between and among these entities, and those are ones that I have not seen explored. These are the relationships between things: thing1 is a translation of thing2; thing3 is an abridgment of thing4; thing5 extends thing6 in this certain way; thing7 cites thing1; thing8 continues thing3. This is where we get real value, where we provide various interesting paths through which seekers can navigate. This is what we don't provide explicitly in our catalogs today, although a human user may be able to intuit some of these relationships among the works we present.
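As a rough illustration of what "paths through which seekers can navigate" could mean in practice, here is a minimal Python sketch that stores the example relationships above as labeled links and follows them. The class, its method names, and the relationship labels are assumptions made for this sketch, not an existing library or FRBR vocabulary.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple


class RelationshipGraph:
    """Stores (subject, relation, object) links and follows them outward."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._out[subject].append((relation, obj))

    def related(self, subject: str, relation: Optional[str] = None) -> List[str]:
        """Things directly linked from subject, optionally filtered by relation."""
        return [
            obj
            for rel, obj in self._out.get(subject, [])
            if relation is None or rel == relation
        ]

    def reachable(self, start: str) -> List[str]:
        """Everything reachable from start by following links, in discovery order."""
        found: List[str] = []
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for _, obj in self._out.get(current, []):
                if obj != start and obj not in found:
                    found.append(obj)
                    queue.append(obj)
        return found


graph = RelationshipGraph()
graph.add("thing1", "is a translation of", "thing2")
graph.add("thing3", "is an abridgment of", "thing4")
graph.add("thing5", "extends", "thing6")
graph.add("thing7", "cites", "thing1")
graph.add("thing8", "continues", "thing3")

# Starting from thing7, a seeker can reach thing1 and then its original, thing2.
print(graph.reachable("thing7"))
```

Nothing here depends on a Work/Expression/Manifestation hierarchy. Any identifiable thing can be a node, and the value comes from the labeled links.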

We have so narrowly defined bibliographic control in libraries that it doesn't really include the relationships between intellectual products, except to the degree that we might make a note that one thing is a translation of another thing. But we see those relationships as "extra" or "secondary," and yet they are the very essence of knowledge creation. It astonishes me that we have focused so completely on the physical items that we have essentially missed what would make our catalogs intelligent.

Sunday, November 04, 2007

Our subject mess

Lately I've had occasion to work with a few different groups of people who are delving into library bibliographic data for the first time. Believe me, it is quite revealing to view it from the viewpoint of these novices. Novices only in this one area, because they generally are quite savvy about computing and data. Each new revelation gives me a chance to regale them with an amusing story about "how it got that way." I can explain (note: explain, not justify) why we have no identifiers for key elements like authors and works. I can pretty much explain why we seem more concerned about the package than the content. I can reminisce about moments in the history of library systems development that happened before some members of these groups were born. But I get totally stuck when they point out the mess that is our subject access.

We have two classification systems, Dewey (DDC) and Library of Congress (LCC). That in itself is not a problem, and it's fairly easy to explain how they developed in different contexts, always making sure to explain that these systems classify the items in a library, not the world of thought.

What is hard is to try to explain what either of them has to do with the Library of Congress Subject Headings (LCSH). Many folks assume that LCSH is the entry vocabulary into LCC. They expect that if there is a classification code in a record that stands for "vocal music, choruses," there will be a heading in the record that is "vocal music, choruses," and vice versa. They also assume that the two subject systems (classification and subject headings) have the same structure, which would mean that you can "drill down" from music to vocal music then to choruses in either or both. Nothing could be further from the truth. So it is quite confusing to them when they see a record with a call number that would ostensibly be about "vocal music, choruses" based on the classification, but instead the subject heading is "Cantatas, Secular -- Scores." And they are equally confused when the record has another subject heading ("Funeral music") but only the one classification number.

I can't explain this disconnect between the subject headings and the classification scheme, except to say: that's how it is.

Recently, I was browsing through my beloved copy of the DDC from 1899 that still has both its numeric and alphabetical tabs relating respectively to the classification and the "Relativ Subject Index." The RSI is indeed an index to the classification scheme, and it appears that Dewey originally intended it also as the access to the collection:
"HOW TO USE THIS INDEX
Find the subject desired in its alphabetical place in the index. The number after it is its class number and refers to the place where the topic will be found, in numerical order of class numbers, on the shelves or in the subject catalog."
From this I can only presume that the shelves and the subject catalog were in classification order, and the alphabetical index was the index to that classification. I can only guess at this point, from what he says here, that the subject catalog was in classification order, as is the shelf, but also contained the verbal translation of what the decimal classification numbers meant.
"Under this class number will be found the resources of the library on the subject desired. Other subjects near the one sought may often be consulted with profit; e.g., Communism is the topic wanted and the index refers to 335.4, but 335, Socialism, and even the inclusive division 330, Political economy, also contain much on this subject. The reverse is equally true; the full material on socialism can only be had by looking at its divisions 335.3, Fourierism, 335.4, Communism, etc. The topics which are thus subdivided are plainly marked in the index by heavy faced type."
My copy is #3933, originally owned by the Roger Williams Park Museum in Providence, Rhode Island. The current incarnation of the institution appears to be the Museum of Natural History and Planetarium. My copy has many penciled notes in the area of Zoology (DDC 590), which would fit the natural history nature of the institution. (I don't see any evidence of a current library.) By 1900 the "dictionary catalog" would have taken root, so I don't know if the library would have followed Dewey's instructions for the creation of a classified catalog. But I do wonder how we got from a single system that had an alphabetical index to a classification system to a system with an alphabetical index and two classification systems, but in which the index and the classification have essentially each gone their own ways. This is obviously a gap in my education, which I will gladly rectify if you have suggestions for readings.

Meanwhile, no wonder users are confused.

Sunday, October 28, 2007

Bibliographic ER

No, I'm not sending libraries to the emergency room, although there are days when I feel like we're at that point. The ER in the title refers to Entity-Relationship, a way to look at data that emphasizes the general viewpoint that there are things, and those things exist in relation to each other.

In one sense, this is what we have done for over a century with our library data. The bibliographic records that we create have in them many relationships: Person authored Book; Publishing House published Book; Book is in Series; Book has Topics. Those relationships are implicit in our records, but the data isn't formatted in an entity-relationship model. Our records, instead, talk about the relationships but don't make it easy to give the various entities their own existence. So we create a record that contains:

Author
Book title.
Place, publisher, date
Series
Subject A
Subject B

The record represents all of the information about the book, but there is no record that represents all of the information about the author, or all of the information about the publisher, etc. Instead, those "entities" are buried in bibliographic records scattered throughout the file.

An E-R model would give each of these entities an identity on which you could hang information about the entity.



OK, I can't draw worth beans. But basically the idea is that authors, subjects, publishers, topics, all become entries in their own right. This means that you can add information to the author record or the series record, because they have their own place in the design. It also makes it easy to look at your data from many different points of view, while still retaining all of the richness of the relationships. So from the point of view of the person who is the illustrator in the book above, the bibliographic world may look like this:

This type of model is expressed in FRBR, but the E-R aspect of FRBR does not seem to be incorporated into RDA as it stands today. Instead, RDA appears to be aimed at creating the same flat structure that we have in library data today.

If you take a look at the OpenLibrary you will see that books get a page that is about the book, and authors get a separate page that is about the author. This is very simple, but it is also very important. It means that the catalog is no longer just a list of books with authors but can become a rich source of information about authors. You can add bios for authors, link to web sites about the author, launch a discussion group about a favorite author. Because the author is an entity, not just a data element in a record about the book, it becomes a potentially active part of your information system.
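
As a minimal sketch of the entity idea, and assuming nothing about how the OpenLibrary actually stores its data, the Python below gives people and books their own identities and connects them with a named role. The class names, identifiers, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """An entity in its own right, with room for its own information."""
    person_id: str
    name: str
    bio: str = ""
    links: list = field(default_factory=list)


@dataclass
class Book:
    """A book entity; contributors are references to Person ids, not name strings."""
    book_id: str
    title: str
    # (person_id, role) pairs, e.g. ("p1", "author")
    contributors: list = field(default_factory=list)


def books_for(person_id, books):
    """View the data from the person's side: each book and the role played in it."""
    return [(b.title, role)
            for b in books
            for pid, role in b.contributors
            if pid == person_id]


# Hypothetical sample data
people = {
    "p1": Person("p1", "Example Author"),
    "p2": Person("p2", "Example Illustrator"),
}
books = [
    Book("b1", "First Book", [("p1", "author"), ("p2", "illustrator")]),
    Book("b2", "Second Book", [("p2", "author")]),
]

# Information can be hung on the person without touching any book record.
people["p2"].bio = "Illustrates and occasionally writes."

print(books_for("p2", books))
```

The same data answers both "who made this book?" and "what has this person made?", which a flat record per book does not do easily.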

In the future, I hope that we can give life to many more entities in the OpenLibrary, and also that we can give them meaningful relationships between each other. This would mean taking a semantic web approach to library data. I don't have a clear picture of where we'll end up, but I'm glad that folks there are interested in experimenting. If you've already thought this through or have ideas in this direction, please step forward. I'd love to hear from you.

Saturday, October 20, 2007

Great Minds...

As if in response to my post on name authorities, OCLC has come up with a version of the Virtual International Authority File (acronym VIAF). Type in "Fitzgerald, Michael" and you'll see that each name has associated with it what they are calling a "sample title." The titles are unattractive, being normalized forms, but still give you some idea of what each author has written, and you might be able to sort the Michael Fitzgerald who writes on XSL from the one who has written the guide to better business letters. At this point, the fact that authority control has already determined that these are different people is incredibly valuable; that value was much harder to see when all you had were names and dates.

Friday, October 12, 2007

Cataloging as Industry

Something pointed me to this paper by Alan Danskin of the British Library:
Tomorrow never knows: the end of cataloging?

It has some well-spoken statements about the great increase in materials, the need to collaborate better with others in the publishing supply chain, etc. But what really stood out for me was this:
The future of cataloguing depends on transforming the process from a craft into an industry.

He qualifies this by saying
This requires unambiguous identification at different levels of granularity to facilitate repurposing of metadata created at the different stages of the process of creating and publishing resources. It also means we may have to be less precious about some of our cherished practices.

I can't disagree with what he says here, but I must say that I have a different take on the idea of industrialization of cataloging, and that is that we should consider taking cataloging out of the library and giving it to others who will actually industrialize it. Just as we don't hand craft our own library shelves, and we don't hand craft our own library systems, perhaps we shouldn't be hand-crafting our own catalog records.

What I refer to here would probably come under the rubric of "outsourcing," some of which already takes place, especially for works in less common or more difficult languages. But what if, just what if, someone could develop a cataloging service that was cheaper than what libraries can do themselves, and had comparable quality? Is there any reason why we shouldn't go for it?

Sunday, September 30, 2007

Glut? Gunk!

You've probably had the experience of participating in some activity that was later covered by print or TV news. In many cases, the report of the event is so wrong, so different to what you experienced, that you could hardly recognize it as being the same event. Similarly, when reporters write about something you know intimately, the reports are almost always aggravatingly wrong.

The same is true about books, of course. I thoroughly enjoyed Bill Bryson's A Short History of Nearly Everything, which drove real scientists nuts for everything it got wrong. Now I'm going out of my mind reading Alex Wright's Glut, which I can only describe as poorly researched, and in some cases just outright wrong.

I became suspicious when I read on page 21
"It is no coincidence that snakes have been a leading cause of human mortality throughout our species' history, so it should come as no surprise that the occurrence of serpent imagery tracks closely to the prevalence of poisonous snakes in particular regions."
I don't doubt that snakes are scary creatures and they sure do seem to show up in all kinds of ancient imagery and tales, but "a leading cause of human mortality"? I don't think so. Famine, pestilence, war -- those are leading causes of human mortality. Snakes? A drop in the bucket.

OK, we all can slip up when we get going at the keyboard, and I figured that his editors just hadn't paid attention. Then I got to page 79 where he says:
"... a new form of document: the codex book, so named because it originated from attempts to 'codify' the Roman law in a format that supported easier information retrieval."
Codex comes from "codify"? Were the Romans speaking English? And besides, I'd recently read a few books on book history myself and those all referred to that origin as being from the Latin term "caudex" referring to wood used as the first book covers. The use of "code" for groups of laws came from the term "codex," not vice versa. I began to wonder where he would have gotten such a definition, and on a hunch decided to look at the Wikipedia entry on Codex. There had been some confusion between codex and code in an early Wikipedia version of the codex page, and it was removed:
"Mistaking Codex for Code

I moved this mis-stated misunderstanding here: "A legal text or code of conduct is sometimes called a codex (for example, the Justinian Codex), since laws were recorded in large codices." This is simply an error, one that doesn't come into educated or official discourse. --Wetman 20:14, 9 May 2006 (UTC)

I have no idea if that is where Wright got his information, but this statement makes the same mistake that Wright does.

I have kept reading, I guess because I wanted to get to his treatment of more modern times. I've gotten as far as Panizzi, but had to get all of this out of my system before going on. On page 167, Wright quotes a biographer, one Louis Fagan, on Panizzi's appearance. I looked at the citation for the quote and found:
"3. Louis Fagan, quoted in Teresa Negrucci, 'Historiography of Antonio Panizzi,' 2001, http://www.gseis.ucla.edu/faculty/maack/Panizzi.doc"
I looked up the paper online, and Ms. Negrucci was a student in the UCLA library school at the time of writing this paper, done for IS 281 "Historical Methodology for Library and Information Science." (The citation above is no longer valid. You can find it linked from this page of student writings.) A perfectly fine school paper, but probably not an authoritative source. Plus, I was taught that you only took quotes from someone else if the original is terribly hard to get to. Fagan's book is available in at least 80 US libraries, according to WorldCat, although today I was able to get to it online. Now, I admit that the book may not have been available via Google Book Search when Wright was composing his work, but by no means is the original inaccessible. In fact, if he had looked at the original, rather than the student paper, he would have understood that Fagan was quoting someone else in his description of Panizzi, not making the statement himself, as Wright states.

It's not an important point nor a particularly important passage, but it is sloppy scholarship. It means he took his information from someone else and did not verify the original source. In fact, of the roughly 260 citations in the book (and I'm counting all of the "ibid's" in this) a full 52 are "quoted in" or "cited by," and mainly the former. The entire first half of the book, which is on ancient and medieval history, uses modern sources almost exclusively. One chapter, on memory, cites only six discrete works, and takes quotes of Thomas Aquinas, Giulio Camillo, John Willis, John Wilkins, and Francis Bacon second-hand from books published mainly in the 1990's. In that chapter, only one "ancient" quote is from an original source. One of the citations referring to Wilkins is to a BBC web site page. It's no longer available. I might be just being mean, but I can find the BBC page cited on the Wikipedia entry for John Wilkins in the Wikipedia version prior to the date of Wright's citation, although it has since been removed. I don't at all mind people using Wikipedia for its basic purpose: to give one a clue and lead one on to sources. And of course we all jump on to the nearest bit of information on the web. But when researching a well-known historical figure, it really is important to cite a good, permanent resource, and in the case of Wilkins, other resources should be available.

As for Panizzi, Wright talks about his creation of a schedule of tiered subject headings. On page 168 he has a quote from Elaine Svenonius that implies some criticism of Panizzi's work.
"Some would argue [the subject headings] were too ambitious -- that there was no need to construct elaborate Victorian edifices since jerrybuilt systems could meet the needs of most users most of the time."
The bracketed words "the subject headings" were added by Wright. In fact, Svenonius was not referring to Panizzi's headings. The quoted passage is about "systems produced during the second half of the nineteenth century," ("Victorian" should be a hint) which would be after Panizzi, whose primary work was done earlier in that century. And the full quote, with no reference to subject headings, is:
"The systems produced during the second half of the nineteenth century, a period regarded as a golden age of organizational activity, [cites Cutter 1904] were ambitious, full-featured systems designed to meet the needs of the most demanding users. Some would argue that they were too ambitious -- that there was no need to construct elaborate Victorian edifices since jerrybuilt systems could meet the needs of most users most of the time. [cites Coffman]" Svenonius, p. 3
The Cutter reference is to his 4th edition of Rules for a Dictionary Catalog. The sentence quoted by Wright is a reference to an American Libraries article by Steve Coffman called "What If You Ran Your Library Like a Bookstore?".

Must I go on? I was able to check this one reference carefully because I happened to have the Svenonius book on my own bookshelf. I have no reason to believe that the rest of his text is any more accurate or faithful to the sources he cites. I suppose the one consolation is that in spite of his MLS from Simmons, Alex Wright calls himself an Information Architect, eschewing the "L" word. I wouldn't want people to think that librarians don't know how to do research.

Saturday, September 29, 2007

Name authority control, aka name identification

Libraries do something they call "name authority control". For most people in IT, this would be called "assigning unique identifiers to names." Identifying authors is considered one of the essential aspects of library cataloging, and it isn't done in any other bibliographic environment, as far as I know. When a user goes to a library catalog, they will find all of the works of T.C. Boyle under a single name, even though he has variously used T.C. Boyle and T. Coraghessan Boyle on his books, and was born with the name Thomas John Boyle. Authority control puts all of his works under one name, with references from other forms of his name: TC Boyle, see: T. Coraghessan Boyle. When there are two authors with the same name, one of them (the second one to be added to the authority file, generally) is distinguished using a middle initial or the year of birth. Thus you can have
Smith, John
Smith, John 1709
Smith, John 1936
Smith, John A.

There are some problems with the current method used by libraries to realize authority control, not the least of which is that it is a difficult and expensive process and the number of authors is growing rapidly as we all become creators in this information age. I want to address here three aspects of name authority control that are especially non-functional: 1) the use of dates as distinguishing characteristics is not easy for the catalogers creating the authority record; 2) the use of dates as distinguishing characteristics does not help the users; 3) the name heading is not a legitimate identifier because it may change.

Date of Birth is Hard for Catalogers

We hear that authority control, including name authority control, is responsible for upwards of 40-50% of the time it takes to catalog a book. Part of this is in determining if you do indeed have a new author to enter into the system. Another part is in creating the unique entry. Take the case of Michael Fitzgerald, editor of a book called Touching All Bases. Touching All Bases is a collection of columns by sports writer Ray Fitzgerald. His sons, Michael and Kevin, gathered the columns after their father's death in 1982 and published them. Because there have been other Michael Fitzgeralds as authors, the year of his birth had to be added to his name. Here's the authority record for Michael:

LC Control Number: n  83124260
Cancel/Invalid LCCN: n 97055382 no 90013838
HEADING: Fitzgerald, Michael, 1955-
Found In: Fitzgerald, R. Touching all bases, c1983 (a.e.) CIP t.p.
(Michael Fitzgerald)
Call to publisher, 6/27/83 (Accountant, b. 2/22/1955)

Michael Fitzgeralds seem to be in great abundance. There was even another one who wrote a book and was also born in 1955. To distinguish between them, Michael Fitzgerald 1955 #2 has his full date of birth added to his name:

LC Control Number: n 2003097483
LC Class Number: PS3556.I8345
HEADING: Fitzgerald, Michael, 1955 June 11-
Found In: The Creative circle, 1989: t.p. (Michael Fitzgerald) p. 241
(teaches at Shenandoah College in Virginia)
Earth circle, c2003: CIP t.p. (Michael Fitzgerald) data
sheet (b. 06-11-55)
His book, Creative Circle, is about art, music and literature from a Baha'i perspective. We see that at the time the authority record was created he was teaching at Shenandoah College.

So here we have two authors whose works would never be mistaken for each other, yet who have the same name. The authority records are evidence of why it is so time consuming to create these identifiers. Because the date of birth is generally not one of the pieces of information about an author that is included in the book nor in the promotional material provided by publishers, the librarians establishing the name heading often must resort to contacting the publisher or the author or the author's institution to determine that information.

Date of Birth May Not Help Users

In a time when few people wrote books, and when users may have come to the library with some knowledge of the famous intellectual whose works they were seeking, the distinction between two John Smiths, one born in 1709 and one born in 1936, may have been an obvious one. We are now, however, in a time of author abundance. Anyone can, and many do, write books, and many of those writing are not known in wider circles. Reading is now considered a "popular" activity, as the bookshelves of any chain bookstore will evidence. So a user of a library catalog may find himself facing a daunting choice among authors, such as these, all named "Michael Fitzgerald":

Fitzgerald, Michael
Fitzgerald, Michael, 1768-1831
Fitzgerald, Michael, 1859-
Fitzgerald, Michael, 1918-
Fitzgerald, Michael, 1937-
Fitzgerald, Michael, 1946-
Fitzgerald, Michael, 1955-
Fitzgerald, Michael, 1955 June 11-
Fitzgerald, Michael, 1957-
Fitzgerald, Michael, 1958-
Fitzgerald, Michael, 1959-
Fitzgerald, Michael, 1970-
FitzGerald, Michael A.
etc.
The Michael Fitzgerald born June 11, 1955, will be able to find himself in this list, but other than members of his immediate family, no one else will know which of these he is. Catalogers have to call publishers or authors to find out the author's date of birth because it's not included on the book, so there is no reason to believe that the date is available to users of the library catalog. All of that time and effort is expended to create a distinction that often doesn't help the user.

All That, and It's not Even a Valid Identifier

The final blow to name authority control is that the name heading (as the name entry is called, e.g. Smith, John A.) can change. Sometimes it might change because a mistake was made in creating the heading, or even in the printing of the book, other times it changes because the library rules for creating name headings change. The heading performs multiple functions: it is the display form in displays of the book's data, it is used as the string to search on in a catalog, and it identifies the author. If a new display form is needed, then the identifier itself changes. When this happened on a grand scale a few decades ago, due to a change in the library cataloging rules, all of the connections between names and books broke, and names in library records all over the country (and beyond) had to be changed. A true identifier only identifies, and if display forms change the identifier stays the same. John Smith is the same person even if the library entry changes from Smith, John A. to Smith, John Arthur.
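
The separation argued for here can be sketched in a few lines of Python. This is an illustration only: the identifier scheme ("person:0001"), the class, and the sample book record are hypothetical and do not describe any existing authority file.

```python
from dataclasses import dataclass, field


@dataclass
class NameAuthority:
    """One person: a stable identifier plus changeable display and variant forms."""
    identifier: str            # never changes once assigned
    heading: str               # preferred display form; may be revised
    variants: set = field(default_factory=set)

    def revise_heading(self, new_heading):
        """Change the display form, keeping the old one as a reference."""
        self.variants.add(self.heading)
        self.heading = new_heading


# Book records point at the identifier, not at the heading string.
smith = NameAuthority("person:0001", "Smith, John A.")
authority_file = {smith.identifier: smith}
book_records = [{"title": "An Example Title", "author_id": "person:0001"}]

# A rule change alters the display form; no book record has to be touched.
smith.revise_heading("Smith, John Arthur")

for record in book_records:
    person = authority_file[record["author_id"]]
    print(record["title"], "/", person.heading)
```

Because the link from book to person is the identifier, the heading can change as often as the rules require without breaking a single connection.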

What Now?

It seems pretty clear that we won't be able to deal with our author abundance using the current name authority methods. There are too many new authors appearing for us to spend time calling around to determine birthdates. There are also too many new authors for those dates of birth to be useful as a way to distinguish between persons. To add to that, we really need a true identifier for authors.

Library catalogs attempt to maintain uniformity throughout, so the idea of treating contemporary authors differently from historical ones is a very disruptive concept. However, the notion is beginning to circulate that we could have contemporary authors identify themselves in some way. Something to the effect of: Yes, I am the same Michael Fitzgerald who authored that book on Art, and that's the identifier for me. After all, who better than the author knows his own identity?

That doesn't solve the problem that users have of identifying the author they seek from a long list of persons with essentially the same name. Perhaps the days of looking at lists of authors' names are over. Maybe users need to see a cloud of authors connected to topic areas in which they have published, or related to book titles or institutional affiliations. In this time of author abundance, names are not meaningful without some context.

Wednesday, September 19, 2007

Wish list: Pimp my hard drive

I can't believe that it's 2007 and I'm still staring at nested displays of little yellow folder icons, which then open up to show me, of all things, file names. Or I can get little thumbnails that tell me no more than I know from the file name extension.

Shouldn't we be beyond this? Here's what I want to see when I look at my hard drive:

Titles. Most of the documents on my hard drive have titles. Some of them even have those titles coded in some way as titles - such as the html files, and files in various word processing formats. I'm sure that it is possible to make an algorithmic guess at titles (or at least first lines) for just about any file with text.

Authors. OK, it won't really be authors, but if nothing else I should be able to distinguish files I created from those created by others. There are automated (and frequently erroneous) "owners" in files, but that ownership is often a good clue as to the provenance of the file. I want to see that. (Meanwhile, I'm going to start storing what I write in folders apart from what I have downloaded. No, I don't use "my documents." I hate that folder. I renamed it once and really screwed things up, so I have my own top level folder under c:.)

Snippets.
Is there any reason why I shouldn't be able to see snippets from my own files as I browse a folder? An opening line or beginning paragraph would be fine. I shouldn't have to manually open every file to see what's in it.

Most used.
The "recent documents" function in Windows is useless. OK, pretty much useless. I want to be able to see the files I have most frequently (but not necessarily recently) opened. I can't tell you how much time I've spent hunting for local copies of certain files.

Tags. I want to tag my files, and I want the tags to be available external to the files themselves, a kind of delicious for my hard drive. (And don't tell me to get Windows Vista -- that's not what I want, in more ways than one.) I want to see tag clouds and tag lists.

"Like." This may be pushing it, but I really want to see clusters of documents that are like each other. This will be the usual statistical reliance on the imprecision of language, but it would reveal connections in the documents that could be useful.

Folder names. I'm not sure I can explain this one but... I have a folder named "FRBR" and a folder named "MARC". When I want to look in one of those folders I don't want to have to go through the hierarchy of folders to find them -- especially because I never can remember where I've put them. Why can't I just type "MARC" and see the folder or folders with "MARC" in the name? Why do I always have to run through the whole hierarchy? (If you have found a way to do a folder name search only on Windows XP, please let me know.) Or maybe folder names could be treated as tags, once tagging is working.
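
A folder-name-only search of this kind can be sketched with a short Python script. This is a workaround run from outside the file browser, not a feature of Windows itself; the function name find_folders and the sample folder names are made up for the illustration.

```python
import os
import tempfile


def find_folders(root, text):
    """Return paths of all folders under root whose name contains text.

    The match is case-insensitive and looks only at folder names,
    never at file names or file contents.
    """
    needle = text.lower()
    matches = []
    for current_dir, subdirs, _files in os.walk(root):
        for name in subdirs:
            if needle in name.lower():
                matches.append(os.path.join(current_dir, name))
    return sorted(matches)


# Demonstration on a throwaway directory tree (hypothetical folder names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "work", "standards", "MARC"))
    os.makedirs(os.path.join(root, "old", "marc-notes"))
    os.makedirs(os.path.join(root, "work", "FRBR"))
    for path in find_folders(root, "MARC"):
        print(os.path.relpath(path, root))
```

To use it on a real drive, call find_folders with your own top-level folder in place of the temporary one.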

There are undoubtedly many other things I could wish for, but basically what it comes down to is that there needs to be a better interface to the hard drive. Some of this can be found in Google Desktop, but I have found it unsatisfactory, generally.

Tuesday, September 04, 2007

Wish list: on-line in the stacks

This is my workspace. It's messy, I know, but the key thing is that my main "desktop" is on the screens. The physical workspace is primarily for setting things down, not for working.



Basically, everything happens on these screens -- I search, I read, I write, I converse (both text and voice). I can't imagine doing my work without the Internet. So I find myself in a dilemma when I go to the library, because I am cut off from my "place of work." I go into the stacks, perhaps with a scribbled note containing a call number, and I stand in front of shelves with fewer capabilities than I have in my own home office. If I don't find the book I want I can't check to see if I wrote the call number correctly; I can't look to see if there's a "second best" book that I'd like; I can't determine if there's another area of the stacks where I might find something else I'd like to read; and I can't search within the text of the bound volumes in front of me, even if digitized versions do happen to be available on-line. I stand there wishing I could go on-line.

Essentially, going into the library means leaving behind my ability to find. Yes, there are a few computers in the stacks, but they are too far away to make it possible to be usefully on-line and at the shelf at the same time.

Libraries made a great effort to get on-line and to reach out to users beyond their walls. What we haven't done, however, is to combine the on-shelf and on-line resources in a useful way. It makes sense to me that I should be able to stand amid bound journal volumes and do a keyword search. Or that I could pull a book off the shelf, see a citation, and check to see if the library has that item.

What would make this possible? First, many more access points within the physical stacks. Access to the catalog or other resources shouldn't be more than a few steps away. Heck, find a way to tie down one of those $100 computers at the end of each row, or create a place where a user can easily lean their laptop (and have the wireless access reach within the shelves). Instead of telling people to turn off their cell phones, remind them that if they have net access they can combine the power of the library's catalog, the library's on-line resources, and the items on the shelves. Encourage people to work with physical and digital resources together. If I could do that, I'd spend more time in the library.

Friday, August 24, 2007

Information architects v. librarians

I have discovered a key difference between information architects and librarians: information architects write books that you can find at a bookstore; librarians write books that you can only find (at best!) at the ALA store twice a year, assuming you attend ALA meetings.

There is meaning behind this statement beyond book distribution. It has to do with the insularity of the library world and our tendency to only speak to each other. It also reveals an underlying assumption that what we know and what we think isn't of interest to anyone outside of our profession. At least, I hope that's the reason, because another possibility, which would be even worse, would be that we don't think that anyone outside the library world is worth speaking to. That would be truly tragic.

Tuesday, August 14, 2007

MarcXchange

It had been announced a while back that folks from a Danish standards body were proposing an ISO standard for an XML version of ISO 2709, which is the ISO standard for what we think of as MARC. I couldn't figure out at the time why an ISO standard was needed since we have MARCXML. I found the draft of the ISO standard (ISO/DIS 25577) online, and learned some important things.

To begin with, I have never seen a copy of ISO 2709, even though the standard is referenced in just about every document that relates to the MARC format. In fact, you often see references to "Z39.2, also known as ISO 2709." Z39.2 is available from the NISO web site, and is the basis for what those of us in the U.S. think of as MARC. So I assumed that ISO 2709 was essentially the same as Z39.2. It turns out that there are some differences that are evidenced in this new standard. They may just be differences in terminology, but here's what shows up in ISO 25577:
  • the "Leader" is called "record label" in ISO 2709
  • the "control fields" (those beginning with "00") are called the "identifier field" and "reference fields" in ISO 2709
  • what we call "variable" fields in Z39.2 are called "data fields" in ISO 2709
I agree that these may be minor differences, but now I have to go back and try to fix the Wikipedia article on ISO 2709. And I have no idea if there are other differences that didn't show up in this particular standards document. I am really annoyed -- no, more than annoyed -- that ISO standards are not open. (And if anyone wants to violate copyright and license and send me a copy of 2709, I will not tell anyone it was you.)

OK, over that hump: MarcXchange (ISO 25577) is an XML format for ISO 2709. MARCXML is an XML format for MARC21. The difference is that ISO 25577 is much broader than MARCXML. Tags can be anything from 001 to 999 and 00A to ZZZ. And you can have up to nine indicators on a field.

The significance? Well, since you are creating records in XML, certain limitations in the ISO 2709 format do not exist (like field lengths). And you don't have the limitations of MARC21, like limiting tags to 000-999 or having exactly two indicators on every variable field. In this schema, you could create an instance that has no indicators on some fields, and the fields that have indicators wouldn't need to have the same number of them. Think of all of those fields where both indicators have been used and you'd like to add another one. (I don't have the schema in a machine-readable format, but it looks like indicators are limited to one character. I'd love to see that changed so you could have multi-character indicators -- hey, why not?)
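To make the contrast concrete, here is a minimal sketch in Python of the two sets of constraints as I've described them. It is not taken from either schema: the function names are made up, the reading of "001 to 999 and 00A to ZZZ" as "any three digits or uppercase letters except 000" is my assumption, and so is the one-character limit on indicators.

```python
import re

# Assumption: a MarcXchange tag is any three characters drawn from
# digits and uppercase letters, excluding "000".
MARCXCHANGE_TAG = re.compile(r"[0-9A-Z]{3}")
# MARC21 tags are three digits.
MARC21_TAG = re.compile(r"[0-9]{3}")


def valid_marcxchange_field(tag, indicators):
    """Check a field against the looser MarcXchange limits (hypothetical helper)."""
    if tag == "000" or not MARCXCHANGE_TAG.fullmatch(tag):
        return False
    if len(indicators) > 9:  # zero to nine indicators per field
        return False
    # Assumption: each indicator is exactly one character.
    return all(len(ind) == 1 for ind in indicators)


def valid_marc21_field(tag, indicators):
    """Check a field against the stricter MARC21 limits (hypothetical helper)."""
    if not MARC21_TAG.fullmatch(tag):
        return False
    if tag.startswith("00"):  # control fields carry no indicators
        return len(indicators) == 0
    # Every other field has exactly two one-character indicators.
    return len(indicators) == 2 and all(len(ind) == 1 for ind in indicators)
```

With these definitions, a field tagged "24A" with three indicators passes the MarcXchange check and fails the MARC21 one, which is the kind of headroom I'm talking about.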

No, I'm not advocating that we drop MARC21 for MarcXchange, but could we at least brainstorm on whether MarcXchange could help us out in expanding our bibliographic record where it's needed? No, you couldn't round-trip it, but eventually we have to move forward and quit circling back. Would something like this help us out?

Thursday, August 09, 2007

Wish list: ONIX records

There are things that I wish existed, but don't, so I'm going to start posting my wishlist here, one piece at a time. Some of these things might not be possible for various reasons, and some may already exist but I'm just not aware of them (but I hope you'll clue me in). For those that could be done, let's talk about how we could make them happen.

The first one that I'm posting is a desire for a database of available ONIX records.

A few years ago I looked at some ONIX records that were being created for e-books and I have to say that they were so poor as to be almost unusable. Recently I've been reviewing some ONIX records received at the Internet Archive for the OpenLibrary project. There are only about a half dozen publishers represented there, but it's obvious to me that they are producing useful data. The basic bibliographic data is there, plus there is data that fits into the "book promotion" realm: blurbs, author bios, subject categories. This is data that is sent to online booksellers and to bookstores. It would be useful for libraries and for anyone else keeping data about books. But I don't know of anyone who is aggregating it, much less making it public.

What we need is:
  • a database that receives ONIX feeds
  • that keeps the records up to date
  • that has a Z39.50 capability and an API for retrieving data
  • that can output in a couple of different common formats
It seems that this could be a great companion to CoverThing, a project proposed by LibraryThing creator Tim Spalding (and perhaps in the works?). In any case, it's like there's a bunch of bibliographic data that is being created and then flushed down the drain. Let's find a way to save it and use it. (And I sure hope the publishers feel this way, too.)

Wednesday, August 01, 2007

Deceptive Copyright Notices

I have often pointed out some of the deceptive copyright notices that libraries and archives put on materials, such as the many statements on digitized public domain materials that tell users that they cannot make copies of the digital file without the permission of the holding library. (Yes, there is debate as to whether that constitutes a license and its agreement, but let's not go there for the moment.) I also have some wonderful examples of real-life copyright notices that are questionable at best, such as this notice which appears on the back cover of a... blank book:



Now an organization called the Computer and Communications Industry Association (CCIA) has filed a complaint with the FTC stating that NFL, NBC, DreamWorks, Harcourt, and others, are misrepresenting the rights of consumers through their copyright notices. They do have some delightfully egregious examples in their document, and the web site allows you to view the video-related ones, such as the NFL's statement that any "account of the game without permission is prohibited." Wonderfully, they posted those clips on YouTube. Included in the complaint are those "FBI" warnings at the beginning of DVDs. There actually are aficionados of the FBI warning screens and their variations over time (blue phase, green phase) as well as numerous parodies like this one.

At the meeting on copyright at the University of Maryland, Fred von Lohmann of the EFF (whose talk was outstanding, and sadly is not available online even though it was webcast) showed a video with a modified FBI warning that says:

WARNING. Federal law allows citizens to reproduce, distribute, or exhibit portions of copyright motion pictures, video tapes, or video discs under certain circumstances without authorization of the copyright holder. This infringement of copyright is called "Fair use" and is allowed for purposes of criticism, news reporting, teaching, and parody.

This perfectly conveys the message that seems to be sought by this complaint, which is to point out that the truth is very different from the messages that we see every day.

The complaint talks about the "chilling effect" of the false statements about copyright. I think there's also a numbing effect -- the ridiculousness of the claims means that we just ignore them all, and leads folks to see copyright itself as ridiculous.

The complaint's "Request for Relief" is mainly a call for the FTC to make the companies stop making these false and misleading statements about user rights. Like the "punishment" meted out to the tobacco companies, the complaint also calls for the offending companies to be required to engage in some honest consumer education about copyright. (Are pigs flying yet?) There's another relief requested that I probably shouldn't point out because I suspect it's the real payoff:

Order the Rights-holder Corporations to forebear from attempting to force consumers into waiving their rights through contractual instruments, including contracts of adhesion.
The FTC action would only be against the companies named in the complaint, but if it were to become common practice, libraries and archives would be among those who have to clean up their act when it comes to statements about user rights.

So who or what is this CCIA? The list of members includes Google, Oracle, Microsoft, Sun, Fujitsu, Intuit, and many others. I admit that confuses me -- these are not the organizations that I would expect to engage in a campaign of this nature. The web site claims that the organization has existed for three decades, and it appears to be primarily a lobbying organization for "policy and legislation" on Capitol Hill. That part makes sense, but I'm baffled by their campaign for public rights. If they are trustworthy in this endeavor, I would like to see them prevail. But there's that "if" that nags at me.

Friday, July 27, 2007

Worst title change

The ALCTS Serials Section, the folks who give the award for the Worst Serial Title Change each year, have just announced their own title change: they will now be known as the Continuing Resources Section. This does not further our profession's ability to communicate with the world around us.

Meanwhile, I was looking at the Bibliographic Ontology, work being done by some individual academics who wish to create a standard for the expression of academic citations. Their vocabulary is also notable: they have documents (article, book, patent...) and they have collections (journal, magazine, periodical). I've been in discussions before where people were declaring journals and magazines as separate document types and I've never gotten a definition that I found satisfactory, although there's no question that if you put Journal of Immunological Methods beside Vogue, no one would have trouble seeing them as different publication types.

Unfortunately the ontology defines a journal as "A collection of journal Articles," and a magazine as "A collection of magazine Articles." I have to say that's not very ontological of them.

Some of the suggestions made at the recent meeting on the Future of Bibliographic Control encourage us to get more bibliographic creation integrated into the authoring and publication workflows. I have long felt the need for a standard bibliographic model for citations which would make linking between citations and their cited documents easier, and one that could be used by common document creation software. The Bibliographic Ontology unfortunately internalizes too much nerdy academic practice, but at least it uses words that most academics might understand. It's patently clear that we librarians cannot go out into the world talking about "continuing resources" and hope to meet with any comprehension. This is just one small illustration of the gap that we need to cross before we can talk to anyone outside of our own secret cabal.

Friday, July 20, 2007

Copies, duplicates, identification

In at least three projects I'm working on now I am seeing problems with the conflict between managing copies (which libraries do) and managing content (which users want). Even before we go chasing after the FRBR concept of the work, we are already dealing with what FRBR-izers would call "different items of the same manifestation." Given that the items we tend to hold were mass produced, and thus there are many copies of them, it seems odd that we have never found a way to identify the published set that those items belong to.

"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.

You probably aren't saying, "What about the BICI?" which was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serials issues, and serials articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both of these were identifiers that could be derived either from the piece or from metadata, which is in itself not a bad idea. What was a less than good idea is that the BICI only could be derived for books that have ISBNs, but if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. But as a book identifier, it doesn't do much for us.

Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely with the ability to run a computation on the digitized book there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. There would be the need to connect the derived book identifier to the physical copies of the book, but I'm confident we can do that, even if over a bit of time.
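As a thought experiment, here is a minimal Python sketch of what a computed "incipit" might look like. Everything in it is an assumption of mine rather than an existing scheme: the function names, the 200-character span, and the choice of SHA-1 are only for illustration. It normalizes the text, takes the opening and closing characters, and hashes them.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and drop everything but ASCII letters and digits."""
    text = unicodedata.normalize("NFKD", text).lower()
    return re.sub(r"[^a-z0-9]+", "", text)


def derive_book_id(full_text, span=200):
    """Derive a short identifier from the start and end of a digitized text.

    Hypothetical sketch: `span` is how many normalized characters to take
    from each end, in the spirit of an incipit plus closing words.
    """
    norm = normalize(full_text)
    incipit, explicit = norm[:span], norm[-span:]
    digest = hashlib.sha1((incipit + "|" + explicit).encode("ascii")).hexdigest()
    return digest[:16]
```

Two scans that differ only in case, spacing, or punctuation get the same identifier under this sketch. A single OCR misreading inside either span would change the hash, though, so a real scheme would need fuzzier matching than an exact hash; that is the "more clever" part.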

Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy-level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Beryl De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.

Meanwhile, there is a rumor circulating that there is discussion taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)

I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.

Thursday, July 12, 2007

FoBC Meeting 3, Detailed Notes

Introduction
Speaker: Deanna Marcum, Associate Librarian for Library Services, Library of Congress

This is the 3rd public session. Comments can still be sent to the committee or via the web site until the end of July.

The question is turning out to cover more than bibliographic control. Instead the broader question is: what is librarianship about in the web world?

When MARC was introduced, libraries were concerned that using MARC would have implications for their own local cataloging, and weren't sure they wanted to use this standard for their own local cataloging. Conforming to the standard meant giving up local practice. But we have gotten many benefits.

In the web world, users have the opportunity to use their own language for searching, and they are being successful. So what contributions can users make, and what will make things more effective for our users?

The theme today is economics and organization. Many librarians believe that cataloging should not be an economic issue. In "this" world, it is not possible for us to ignore the economic implications of cataloging.

The Library of Congress provides cataloging as a service, and that helps other libraries economically. But Library of Congress has no budget line for that service.

Speaker: José-Marie Griffiths, Chair, Working Group, University of North Carolina at Chapel Hill

This is the third of three meetings, each with a different theme.
1. Who uses bibliographic data produced by libraries, and what are the needs of users?
The meeting showed that there is a wide variety of users and uses.
2. Standards and structures
One issue that came out at that meeting is whether the process serves the needs of the community.
3. Economics and Organization
One study that the speaker has conducted was to determine the actual costs of "free" services.

Speaker: Judith Nadler, Working Group Member, University of Chicago Library
Judy described the meetings as being about Who, What, and How. We are now at the How.

Setting the Stage
Rick Lugg, Partner, R2 Consulting

He used to always say that there is no such thing as a bibliographic emergency. However, in the past few years he has found himself working as a bibliographic trauma specialist. As consultants, R2 gets called in to see things that aren't working. In the cataloging area, he has seen huge backlogs that are so well-established they have sophisticated inventory systems. With hard copy backlogs you can go into a storage room and see the huge amount of material there. In the digital world you can't see the backlogs. Broken links aren't visible. You don't know what isn't getting done. We don't have a measure of how far behind we are in the digital world.

He said that the cost of bibliographic control is disproportionate to its benefits. [kc: It would be great to have a way to measure that, or at least to measure what parts of the bibliographic record produce the greatest benefits.]

The MARC record for a basic monograph is a commodity. It is estimated that the creation of the MARC record costs $150-$200. The book is cataloged once, and the cataloging is used many times. Libraries have contained costs by using different levels of staff for copy cataloging. But there are still a lot of duplicative costs in the system.

We have a cult of perfection with the following beliefs:
1 – bibliographic perfection is attainable
2 – cataloging is still about the arrangement of print books on the shelf

Bibliographic Perfection
One of the main barriers to cost savings is the desire to create the perfect record: people change bibliographic records, or at least check all of the details. They change call numbers and use custom Cuttering schemes. Many still write the call number in pencil on the verso of the title page. Some check the reported size with rulers. We focus on the record itself rather than what the record is for.
We have a narrow view of quality – we see quality as being about the record, but not about timeliness. (Thus, the backlogs.)
What is good enough? The question should be: does this error impede access?
We need to take advantage of work done elsewhere in the supply chain.

Shelf arrangement still influences cataloging, but many items are in storage where shelf order doesn't matter. We still create unique call numbers, but duplicate call numbers don't prevent access. We need to think about browsing online, not just on the shelf.

We also need to consider the total cost of bibliographic control. There are the initial costs, but we also need to consider full lifecycle cost. Records are changed at various points, for example as we move items offsite, or move a book out of reference. Most of these changes are done manually. In serials, as we move from print to electronic and end or modify print subscriptions, records have to be updated. Much of this is inventory control, but still means record changes.

There are opportunity costs: What are we not doing that we should be doing? Answer: special collections cataloging, cataloging unique materials, and rare books, manuscripts and archives.

Another opportunity cost: we have no capacity for non-MARC metadata – no one has time to learn MODS, METS, DC. The cost is a delay in moving in new directions.
We are involved in mass digitization, but we haven't started working on discovery of full text.
Catalogers are not involved in systems development early on, which affects how systems are developed.
How can we collaborate with others (not just other libraries) to create a richer bibliographic record?

Q: I asked: To what extent is complexity of MARC an issue? His answer was rather vague, so I think he hadn't really thought about this in detail. It would be interesting to know how much time is spent on things like fixed fields, or figuring out subfielding. It would also be interesting to do more experimentation with interfaces. Later speakers brought up the idea of using systems better to help catalogers work faster.

Speaker: Lizanne Payne - Library Consortium
Executive Director Washington Research Library Consortium

Lizanne Payne talked about how consortia can affect costs. Their main role is often providing joint licensing of digital materials, but they are also involved in ERMs and ILL workflow. They usually share a common OPAC to facilitate borrowing, and sometimes have a common ILS. The latter allows them to share the cost of IT staff by centralizing systems. If you don't share an ILS, then you have duplication between the local catalogs and the union catalog. You need three levels of bibliographic control: (1) master record; (2) individual library records (e.g., for special subject control); (3) holdings, shelving, etc.

Where libraries share a storage facility, searching for duplicates before sending to storage is very expensive.

[This talk brought up some interesting thoughts about duplication – of materials and of catalog records. Duplication keeps coming up for me in various projects I am working on, and it seems to have cost implications at a lot of levels, especially those areas where duplication in the user view is not desirable, but duplication that exists in the real world also serves users where access is concerned.]

I also learned from Payne's talk that MFHD is pronounced "muffhead."

Speaker: Mary Catherine Little - Public Library
Director, Technical Services Department Queens Borough Public Library

Little gave some good arguments for matching your cataloging to your actual need. She manages a huge and active public library with 65 different languages represented in the collection. She doesn't have the ability to produce cataloging in all of those languages so she relies on vendor-supplied copy and doesn't augment it. Her bottom line is to know what the library owns and give users access to it. She asks herself: am I creating data I'm not likely to use? Am I creating enough data for the ILS to function today? Tomorrow?

And, would this item be replaced if lost? (Many of her books are popular reading that are used for a few years then discarded when the item is worn out.) She even has some un-cataloged collections that are accessed at the shelf only. But fewer users today are in the library. [Note: there were various mentions that digital materials require more and better metadata, but no one really connected this to the fact that our collections are increasingly digital.]
She called for more sharing of vendor data – which of course means a change on the part of vendors.

Speaker: Susan Fifer Canby - Special Library
Vice-President, Library and Information Services National Geographic Society

The Special Library case was quite different from either public or academic libraries.
Some special libraries hold proprietary data that cannot be shared. They are focused on service to their organizations and often have considerable collections of archival and organizational records. They may have responsibility for all or part of the organization's web site. They may also use their collection for e-commerce, as is the case with the National Geographic Society's photo archives.

On the other hand, an organization can require that internal data providers attach certain metadata (like subject headings) to items they store.

The special library is not seen as a general good by the organization. It is a cost center, therefore has to produce value. Bibliographic control is not a major activity for them.

Questions and comments
Q: There seems to be a distinction being made between bibliographic control v. inventory control
Lugg: That starts way back in the chain. For vendors it's about inventory and sales. In systems, the overhead of using MARC as a transaction vehicle is too much, so the transaction areas of systems tend to keep less data and match it up to MARC when needed. However, libraries often see transaction data as part of the MARC record (because they display together). There are different needs within the system, and the MARC record shouldn't change when items circulate.
Q: The committee has done some thinking about atomizing the MARC record, removing some complexity and creating different structures for the different functions.
Payne: MARC was designed for transmittal, not for daily use. And there's no standardization for how it is broken apart and used in our systems, which makes system upgrades difficult. There are lots of areas of our systems that we haven't standardized.
Lugg: This really shows up in the holdings area. Libraries make different choices as to how that is structured and stored and displayed. Some of this is showing up as libraries try to go to WorldCat Local.
Lorcan Dempsey (OCLC): The problem is not MARC, but the fact that we want to do more sharing, so all of these local options are showing up more as problems. It isn't the technology but the social way that we decide what goes into records (often designed for a single application, which we now want to reuse for a different one). Think of data as something that applications use rather than people.
Q: There are greater expectations for the sophistication of access. How much of that is part of shared bibliographic control and how much is local?
Little: Social tagging can represent the cultural aspects of language – the social spin on things.

The Stakeholders' Perspective
Speaker: Bob Nardini - The Vendor
Group Director, Client Integration and Head Bibliographer, Coutts Information Services

It is good that vendors are included in discussions of bibliographic control. Vendors produce a lot of bibliographic information. Coutts employs catalogers and is providing 280,000 bibliographic records this year. Other vendors are even larger. 63% of libraries obtain records from book vendors (based on a survey).

He spoke of the CIP program as one where vendors contribute data. Publishers produce metadata for their audience, for example publishers are very aware of the metadata needs of Amazon, since that translates to sales. He said that he would like to see more of a use of the metadata record in a marketing role. (I'm not sure what that means for libraries.)

Speaker: Mechael Charbonneau - PCC and Large Research Library
Director of Technical Services and Head, Cataloging Division Indiana University, Bloomington

Cataloging is seen as a high-cost activity, thus the Program for Cooperative Cataloging is a way to save labor. PCC is an international coalition coordinated by the Library of Congress, and a major stakeholder in the bibliographic data future. It relies on voluntary cooperation between libraries. Today, about 35-45% of shared records are being produced outside of the Library of Congress.
She mentioned a need to include non-MARC metadata (but didn't say which ones). She also talked about the need to internationalize authority files, and mentioned the Virtual International Authority File project at OCLC.

Speaker: Linda Beebe - Abstracting and Indexing Services
Senior Director of PsycINFO, American Psychological Association

A&I services create metadata for discovery. There is little emphasis on description in the library cataloging sense. There are particular needs in the different subject areas.

She suggested that we need to look at the "meeting points" of linked systems to see if there is a way we can simplify workflow. [She didn't give any detail, but I have thought that we need to define what our linking elements will be so that we can concentrate on those, and maybe skip non-linking data in some instances.]

One of the problems they are running into is the increase in supplemental audio visual files that need to be linked to the print resource.

She talked about the difference between customers and librarians. Librarians like controlled vocabulary, but users simply want to search on the terms they know. This means that systems need to handle lots of synonyms. We have to discard the notion that it takes special knowledge to find things in the literature. This isn't dumbing down, but making our systems work harder.

Questions and Comments
Q: In the past, vendors have been reluctant to allow their records to be merged with other vendor records because they lost branding. Is this still an issue?
Beebe: This is becoming less of an issue.
Q: What about the different treatments of author names?
Beebe: Searching for author is the most complicated thing. There are author profiles that some are putting together to help this. Social tagging might also help here.
Todd Carpenter (NISO): There is an ISO group working on an international standard name identifier. This is being driven by the publishing community because of their interest in tracking royalties.
A: Crossref is working on author identifiers (also looking at institutional identifiers).
Q: Vendors don't use LCSH. Vendors put in more marketing tags and readership levels, plus formats (e.g. textbooks). Maybe this is something that Library of Congress should stop putting in records, but should take from the vendor records.

Speaker: Karen Calhoun - OCLC
Vice-President, OCLC WorldCat and Metadata Services

Response to the Background Paper

She was speaking from the view of OCLC as a stakeholder.
There are 7 economic challenges: productivity, redundancy, value, scale, budgets, demography, collaboration.
1. productivity
Fred Kilgour created a dramatic enhancement in the productivity of cataloging.
2. redundancy
OCLC shared cataloging removed duplication of effort; the Internet and web make other efficiencies possible.
3. value
We talk about quality, but we all mean different things depending on our point of view. To the bibliographic control expert it means adherence to rules. To the library decision-maker, quality has to do with stewardship of library funds and budgets, producing value for communities.
4. scale
Users look at and beyond individual library collections when seeking answers to questions. We must not narrow our scope to what we have done in the past.
5. budgets
Budget restrictions are not surprising – especially as libraries move into new areas but have the same budgets.
6. demography
The famed "retirement wave" for a generation of bibliographic experts begins in 2010. We will have to change hiring practices.
7. collaboration
These challenges won't be met by libraries working alone.

She then outlined some future potential for OCLC to respond to these challenges.
Metadata is like money – it is a medium of exchange; it points to the value of things.
OCLC might build grid services along the supply chain for creation and augmentation of metadata. The publication supply chain could be an interdependent flow of reusable metadata on the grid.

Where does metadata come from? From bibliographic control experts, publishers, authors, reviewers, readers, selectors. Where could metadata come from? WorldCat is a large unexplored resource, as evidenced by its terminology services and WorldCat Identities. OCLC could run a contract cataloging service. OCLC might help libraries by incrementally moving selected technical services functions to the network, e.g. by building on the ILL fee management service to move into the acquisitions area, creating a kind of PayPal for libraries. This could make libraries less dependent on local systems.

Speaker: Beacher Wiggins - Library of Congress
Director, Acquisitions and Bibliographic Access Library of Congress

The Library of Congress has explored the use of bibliographic data from a number of sources. PCC is the largest and most successful of these operations. They have increased their cataloging output at the same time that their staffing in the cataloging area has been cut. In the current congressional climate, they must do more without any increased funding for staff.

He mentioned the precipitating event of the Library of Congress dropping authority control for series entries as an example of how they have to cut back. They are re-organizing their cataloging staff such that technicians will do all descriptive cataloging and librarians will do authorities and subject analysis. They are shifting costs, not reducing costs.

He also mentioned the problem of not being able to share vendor data. Apparently there was a rather nasty incident between Library of Congress and Casalini Libri over the reuse of Casalini's supplied bibliographic records. (Something that no one talked about was the systems issues: identifying which records you cannot share. That itself must have some overhead.)

Questions and Comments
Q: There are many small libraries that cannot afford to be in OCLC. How can they be included if OCLC expands its services?
Calhoun: We're looking into that.
Q: What is the cost of leadership, such as standards maintenance?
Q: What is being done to increase training/continuing education?
Wiggins: ALCTS, Library of Congress and ALA are organizing continuing education in this area.


PUBLIC TESTIMONY
Speaker: Diane McCutcheon for NLM
Ideas on how to improve cost-effectiveness

NLM does both cataloging and A&I indexing.

She agrees that cataloging is a public good, but that service has costs. An institution has a particular obligation to create cataloging in a cost-effective way. How? Fully utilize descriptive metadata that is available electronically, mainly from publishers and vendors. Basic descriptive data. Eliminate rote keying tasks. NLM uses metadata from journal publishers rather than re-keying – have realized cost savings. Publishers supply data in a standard format because they want to be in Medline. Need to convince publishers that it is to their advantage to be cited in catalogs.
Getting metadata earlier in the chain. Can't use MARC – need to use an XML format (ONIX) – but library systems can't handle non-MARC data. Use crosswalks instead. There is a need for those crosswalks to be available to others.

Making more use of automated data. Current cataloging is like hand crafting furniture or clothing. Need to move into mass production. Some materials may not have electronic data, but we should take advantage of those that do. Need to make more use of machine assistance. Catalogers are often working in subject areas where they aren't expert, so machines can help with subject heading and classification assignment. They've been working with an automated system that suggests MeSH terms.

New economic model: libraries create data, then share for the cost of sharing it via OCLC. Libraries and vendors have little incentive to do original cataloging.

Need faster standards development. Can't take 2-5 years.

Speaker: Chris Cole for the National Agricultural Library

NAL also acts as both a library and an A&I publisher. Indexing uses basic metadata supplied by the publisher. This saves a considerable amount of cost. No suffering of quality. Use of publisher data is both possible and necessary. Metadata should be created from data supplied by publishers, with libraries adding value.

NAL contributes to CIP on agriculture related titles. Many libraries use the CIP record because they aren't connected to the network.

Current process isn't economically feasible. We can also get data from the music and sound recording industry. If we can move from transcription to adding value, we can tap those resources. This is especially true for digital files, which cannot be discovered without metadata.

Focus of RDA is on traditional materials and traditional procedures, unfortunately. RDA is not recommending an abandonment of standards but a transformation. Do not focus on the record but on a clear set of data elements that can be used by libraries, vendors and others and that can be reassembled as needed for different uses.

Lorcan Dempsey: the majority of records in WorldCat are not from Library of Congress, but the majority of holdings are on Library of Congress-produced records.

Q: cost and value of creation of thesauri and classification
NAL: we have a thesaurus, and have found that others want to use it and offer to help (in the sciences)
NLM: authors should identify themselves; publishers aren't in our discussions and they need to be here.

We put too little value on our work. It costs $130 to catalog a book but we sell it for 6 cents. What do we offer to people to make it worth their while to contribute?

Regina Reynolds (LC): one economic model is bartering. Could we barter our data in trade for expertise?

Dan Chudnov: [after some in audience rejected the idea of "non-expert" social tagging] The user, in social tagging, becomes an access point. Hard to reconcile with privacy, but somehow we have to do that. LibraryThing has social tagging around Library of Congress data. Also, we need the involvement of technology folks in the discussion about bibliographic control.

Speaker from U Penn on subject analysis: the issue is how to make it more efficient. Not the creation of the string, but the aboutness of the work. There is no way to contribute actual subject headings (in cooperative cataloging) in the same way as name authority files. Social tagging: "expert" tagging defeats the purpose. There's a value in letting users decide; people tag for various reasons, have points of view.

Wiggins: We are looking at the pre-coordination of Library of Congress subject headings. Will issue a report looking at simplification.

Speaker from Folger Shakespeare Library: How do we know when we have accomplished our goals? What are our evaluation mechanisms?

Speaker from Library of Congress, education office: Most speakers seem to say that a librarian manages, a user finds. The problem is that we don't use our own products.

Library of Congress staff member: was viewing the meeting on the webcast and came up to say something. We keep talking about incrementally changing how we process bibliographic records so we can create more of them. Library of Congress and OCLC are metadata repositories. We should think more radically about what kind of metadata repository we want and need. Create a repository for all of the ONIX data that publishers are creating, one that allows a way to use that data. Let libraries download the information they need. Do this rather than item-by-item workflow.

Karen Calhoun: OCLC is exploring a way to use ONIX data, enrich it and send it back to publishers, and then create MARC records, and let users add enrichment.

WRAP-UP
Summary of the Day
Speaker: Robert Wolven, Working Group Member, Columbia University Library

Themes of the day:
  • We have focused on today and the near-term future.
  • We've thought mainly about trade monographs and efficiencies there.
  • But opportunity costs are about less standard areas of collection. Economics there are more local and individual; less opportunity for collaboration. Are we looking at an economic shift to areas where we won't have large scale economies?
Disconnects:
  • We look carefully at individual records we are creating, then we go off and load hundreds of thousands of records in sets.
  • Different approach to name authorities in cataloging and A&I databases.
  • We need to think about the lifecycle of a resource. We tend to think about initial process, not later changes. Some lifecycles are short, like public reading, others longer term, like making decisions about off-site storage.
  • The MARC record is a commodity; we need appropriate distributions of costs. How to compensate vendors for cataloging; vs "free riders" in the chain. Do we recoup the costs from those who benefit? Or do some bear the costs?
  • We don't want to pay for metadata we get but talk about getting value for our metadata. This implies retaining control.
  • We propose that value is tied to its use, yet a lot of our effort goes into metadata that isn't used much. Do we focus on areas where we have the most sharing, or on the long tail? With the long tail there is less ability to share costs.
Education as an economic factor: how we educate and re-educate staff. But we also expect our users to learn. Education as a barrier.

Digital backlogs – we don't have ways to understand what they are and how they are treated. We don't have measures of this.

Final Thoughts
Speaker: Deanna Marcum

What will they say a hundred years from now talking about the choices we had in 2007? The choices we make at the Library of Congress will make a difference. Library of Congress has focused on cataloging those materials that will be most used by other institutions. Of 130 million items at Library of Congress, only 30 million have records in the catalog. Many are set up as mediated collections. Many are unique or rare materials. Not sharable like books and journals. Users now expect to get access to these materials.

Library of Congress is going to identify performance measures that are quantitative (as much as possible). We have to report back to Congress on benefits and who has benefited. This is a much more detailed report than Library of Congress has ever had to do before.

What do we all have in common? That we are the institutions in which society has placed its trust that we will figure out: what should be saved, how will it be saved, how will we make it available over time.