Monday, April 26, 2010

Social aspects of subject headings

You've probably played the "my favorite subject heading" game when geeking out with librarian friends. Here's some additional fuel in case you've run out of zingers.

The Open Library takes the LC subject headings and breaks them apart at the subfield level into subjects, persons, places, genres, and times. It also includes some BISAC headings retrieved from Amazon, so the subject list is not "pure." The separate subject entries obtained are similar to, but not the same as, OCLC's FAST headings, and look much like some facets that appear in library catalogs.
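To make the subfield-level split concrete, here is a minimal sketch in Python. It is not Open Library's actual code: the function name `split_heading` and the mapping tables are assumptions based on common MARC 21 usage ($v form, $x general, $y chronological, $z geographic subdivisions), and the real rules may well differ.

```python
from collections import defaultdict

# Assumed mapping of subdivision subfield codes to facets (common MARC 21
# usage; not confirmed as the rules Open Library applies).
SUBDIVISION_FACETS = {
    "v": "genres",    # form subdivision
    "x": "subjects",  # general (topical) subdivision
    "y": "times",     # chronological subdivision
    "z": "places",    # geographic subdivision
}

# Assumed mapping of the main heading ($a) by field tag.
MAIN_FACETS = {
    "600": "persons",   # personal name used as a subject
    "650": "subjects",  # topical term
    "651": "places",    # geographic name
}


def split_heading(tag, subfields):
    """Split one 6XX heading into a dict of facet -> list of terms.

    `subfields` is a list of (code, value) pairs in record order.
    Subfields with no mapping in the tables above are ignored.
    """
    facets = defaultdict(list)
    for code, value in subfields:
        value = value.strip()
        if code == "a":
            facet = MAIN_FACETS.get(tag)
        else:
            facet = SUBDIVISION_FACETS.get(code)
        if facet and value:
            facets[facet].append(value)
    return dict(facets)


# Illustrative heading (made up for this example, not taken from the dump).
example = split_heading(
    "651",
    [("a", "France"), ("x", "Politics and government"), ("y", "20th century")],
)
print(example)
```

Each piece of the heading lands in its own bucket, which is why a qualifier like "Politics and government" ends up counted as a subject in its own right.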

The Open Library database currently holds about 24 million records for books (at least partially de-duped). In a recent dump of subjects, the total number of different subjects came out as 1,278,539. Of those, 336,638 were of the "topical" variety, that is, either a 650 $a or a 65X $x. The top 25 are as follows:

825168 History
322928 Biography
212822 Politics and government
206519 Congresses
192968 History and criticism
184183 Fiction
123838 Law and legislation
119333 Bibliography
95555 Juvenile literature
93364 Description and travel
90866 Economic conditions
84787 Criticism and interpretation
74878 Claims
71468 Social life and customs
70926 Social conditions
70563 Catalogs
69205 Private Bills
69191 Private bills
66480 Education
63410 Exhibitions
63301 World War, 1939-1945
60235 Foreign relations
60068 Philosophy
56219 Dictionaries
55460 Study and teaching

I find it interesting that with the exception of "World War, 1939-1945" these appear to have the function of qualifiers, and I'm thinking that it would be interesting to contrast the $a and $x terms. My guess is that these are $x, but that not all $x are of this nature.

Of the subfields, 164,342 appear only once in the database. These are a great source of interesting and unusual headings, including "Social aspects of adzes" and "Deer as pets." In fact, the "Social aspects..." tail is so amusing that I have made a file of those with a count of 1.

The full file of topical subjects is 8 megabytes, but it can probably yield innumerable hours of library cocktail hour amusement. (The text is in the format "count - tab - subject".) I will also look into names, organizations, places, and times as subjects.
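For anyone who wants to go digging, the "count - tab - subject" format is easy to process. The sketch below is one way to pull out the single-use "Social aspects" headings. The function names and the sample data are assumptions for illustration; the sample lines reuse counts and headings quoted in this post.

```python
import io


def read_subject_counts(lines):
    """Yield (count, subject) pairs from lines of "count<TAB>subject"."""
    for line in lines:
        line = line.rstrip("\n")
        count, sep, subject = line.partition("\t")
        # Skip blank or malformed lines rather than failing on them.
        if not sep or not count.strip().isdigit():
            continue
        yield int(count), subject


def singletons_with_prefix(lines, prefix="Social aspects"):
    """Return subjects that occur exactly once and start with `prefix`."""
    return [
        subject
        for count, subject in read_subject_counts(lines)
        if count == 1 and subject.startswith(prefix)
    ]


# Small in-memory sample standing in for the 8 megabyte file.
SAMPLE = (
    "825168\tHistory\n"
    "322928\tBiography\n"
    "1\tSocial aspects of adzes\n"
    "1\tDeer as pets\n"
)

print(singletons_with_prefix(io.StringIO(SAMPLE)))
```

To run it against the real file, pass an open file object instead of the sample, for example `open("topical_subjects.txt", encoding="utf-8")`, where the file name is hypothetical.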

Friday, April 09, 2010

OCLC record use policy

OCLC has issued a new draft of its record use policy for member comment. As others have remarked, while the draft is better worded and seemingly less draconian than the previous policy (the one that was withdrawn), the substance has not changed one iota. There are many things wrong with the policy itself, but the primary problem is not the text of the policy but the way that OCLC has chosen to define the problem it is trying to solve. Here are some of the issues I have with the approach:

1. Pushing the river
The central issue is that OCLC wants to limit downstream use of bibliographic data that is stored in WorldCat. This simply cannot be done. The same data is also stored in individual library catalogs, some union or consortial catalogs, and in bibliographic software used by many hundreds of thousands of researchers around the world. It also often closely resembles data created outside of OCLC's sphere, such as through publisher and retailer channels. Sharing of this data is absolutely necessary for the furtherance of intellectual pursuits and scientific progress, as well as the market for new and used items. Ironically, the policy would restrict use of the data by OCLC members without restricting its use by the multitude of non-members. It would be unacceptable even if it were workable, which it isn't.

2. One-sided
The policy has a section on member rights and responsibilities, but no such section on OCLC's rights and responsibilities. (Nope, I was wrong about that. The section does exist, I must have missed it.) The policy carries the assumption that, if anything, members are the problem, OCLC the solution, and gives no sense of the policy being the result of an agreement between the parties. OCLC can make unilateral decisions about record use, such as its agreement with Google, but members must ask permission of OCLC for many uses. There is nothing here that acknowledges that there could be a situation where the interests of a library and the interests of OCLC are in conflict, nor how that would be resolved. All-in-all, it reads as if the purpose of membership were to sustain OCLC (instead of the purpose of OCLC being to support libraries).

3. Transparency
OCLC, or one of OCLC's governing groups, will make decisions. Yet there are no criteria given for making these decisions, no timelines, no reporting back to members, no mechanism for feedback. Will members know how "their" WorldCat records are being used? Will they have any choice in the matter? Will there be a way to know what requests for use have come in to OCLC, which ones have been accepted, which turned down? If WorldCat is such a "community good" shouldn't the community at least have this information about the use of that good?

4. No options
In most agreements there is some give and take. If you do X, you will get Y. The OCLC record use policy does not give members options. An example of an option would be: if you do your cataloging on OCLC, ILL will cost you $X; if you do not do your cataloging on OCLC, uploading your records will cost you $Y and ILL will cost you $Z. With clear options, libraries can decide what is best for them in their particular situation. Without clear options libraries have no way to make rational decisions about their participation in OCLC. It's not a religion, it's a business relationship, and it should be treated like one.

5. Avoids facing the problem
The problem that OCLC is trying to fix arises, as far as I can tell, because of OCLC's particular mix of revenue and expenses. Most of the revenue comes in to OCLC from its cataloging service, so having members choose to catalog elsewhere is the problem. Exhorting members to keep their records in their databases so that others cannot create a large database of bibliographic data is not a solution to this problem. Large bibliographic databases do and will exist. If their existence is a threat to OCLC, then the jig is already up. Rather than stew about what others are doing with bibliographic data, OCLC needs to find a balance of income and expenses that meets the needs of its member libraries, and that might include making some hard decisions about OCLC services.

6. Ignores market forces
If someone can do it better, cheaper, more conveniently, why should libraries stick with OCLC as their vendor? For the purchase of materials or library systems or other services, libraries move to new vendors when they see advantages. With the economic downturn there is a scramble by libraries to cut costs wherever they can. No amount of loyalty to the "collective" can overcome the economic situation libraries find themselves in today. In a sense, OCLC seems to expect the libraries to act irrationally by sticking with the service even if something more economical comes along. Libraries obviously cannot afford to do this.

I cannot tell what steps OCLC's members can take at this point. The web site points to a community forum where people can post comments, but posting comments on the policy doesn't begin to solve the underlying problems as presented here. If I were a member, I think I would feel like a row boat hitching a ride behind the Titanic, hoping it will get me through the ice floes. Nothing is unsinkable, as we have unfortunately found out in the past.

Wednesday, April 07, 2010

After MARC

The report on the Future of Bibliographic Control made it clear that the members of that committee felt that it was time to move beyond MARC:
"The existing Z39.2/MARC “stack” is not an appropriate starting place for a new bibliographic data carrier because of the limitations placed upon it by the formats of the past." p. 24

The recent report from the RLG/OCLC group Implications of MARC Tag Usage on Library Metadata Practices comes to a similar conclusion:
"5. MARC itself is arguably too ambiguous and insufficiently structured to facilitate machine processing and manipulation." p.27

We seem to be reaching a point of consensus in our profession that it is time to move beyond MARC. When faced with that possibility, many librarians will wonder if we have the technical chops to make this transition. I don't have that worry; I am confident that we do. What worries me, however, is the complete lack of leadership for this essential endeavor.

Where could/should this leadership come from? Library of Congress, the maintenance agency for the current format, and OCLC, the major provider of records to libraries, both have a very strong interest in not facilitating (and perhaps even in preventing) a disruptive change. So far, neither has shown any interest in letting go of MARC. The American Library Association has just invested a large sum of money in the development of a new cataloging code. It has neither the funds nor the technical expertise to take the logical next step and help create the carrier for that data. Yet a code without a carrier is virtually useless in today's computer-driven networked world. NISO, the official standards body for everything "information," is in the same situation as ALA: it cannot fund a large effort, and it has no technical staff to guide such a project.

It seems ironic that there have been projects funded recently to develop library-related software based on MARC even though we consider this format to be overdue for replacement. The one effort I'm aware of to obtain funding for the development of a new carrier was rejected on the grounds that it wasn't technically interesting. In fact, the technology of such an effort isn't all that interesting; the effort requires the creation of a social structure that will nurture and maintain our shared data standard (or standards, as the case may be). It requires an ongoing commitment, broad participation, and stability. Above all, however, it requires vision and leadership. Those are the qualities that are hard to come by.