Sunday, September 30, 2007

Glut? Gunk!

You've probably had the experience of participating in some activity that was later covered by print or TV news. In many cases, the report of the event is so wrong, so different to what you experienced, that you could hardly recognize it as being the same event. Similarly, when reporters write about something you know intimately, the reports are almost always aggravatingly wrong.

The same is true about books, of course. I thoroughly enjoyed Bill Bryson's A Short History of Nearly Everything, which drove real scientists nuts for everything it got wrong. Now I'm going out of my mind reading Alex Wright's Glut, which I can only describe as poorly researched, and in some cases just outright wrong.

I became suspicious when I read on page 21:
"It is no coincidence that snakes have been a leading cause of human mortality throughout our species' history, so it should come as no surprise that the occurrence of serpent imagery tracks closely to the prevalence of poisonous snakes in particular regions."
I don't doubt that snakes are scary creatures and they sure do seem to show up in all kinds of ancient imagery and tales, but "a leading cause of human mortality"? I don't think so. Famine, pestilence, war -- those are leading causes of human mortality. Snakes? A drop in the bucket.

OK, we all can slip up when we get going at the keyboard, and I figured that his editors just hadn't paid attention. Then I got to page 79 where he says:
"... a new form of document: the codex book, so named because it originated from attempts to 'codify' the Roman law in a format that supported easier information retrieval."
Codex comes from "codify"? Were the Romans speaking English? And besides, I'd recently read a few books on book history myself and those all referred to that origin as being from the Latin term "caudex" referring to wood used as the first book covers. The use of "code" for groups of laws came from the term "codex," not vice versa. I began to wonder where he would have gotten such a definition, and on a hunch decided to look at the Wikipedia entry on Codex. There had been some confusion between codex and code in an early Wikipedia version of the codex page, and it was removed:
"Mistaking Codex for Code

I moved this mis-stated misunderstanding here: "A legal text or code of conduct is sometimes called a codex (for example, the Justinian Codex), since laws were recorded in large codices." This is simply an error, one that doesn't come into educated or official discourse. --Wetman 20:14, 9 May 2006 (UTC)

I have no idea if that is where Wright got his information, but this statement makes the same mistake that Wright does.

I have kept reading, I guess because I wanted to get to his treatment of more modern times. I've gotten as far as Panizzi, but had to get all of this out of my system before going on. On page 167, Wright quotes a biographer, one Louis Fagan, on Panizzi's appearance. I looked at the citation for the quote and found:
"3. Louis Fagan, quoted in Teresa Negrucci, 'Historiography of Antonio Panizzi,' 2001,"
I looked up the paper online, and Ms. Negrucci was a student in the UCLA library school at the time of writing this paper, done for IS 281 "Historical Methodology for Library and Information Science." (The citation above is no longer valid. You can find it linked from this page of student writings.) A perfectly fine school paper, but probably not an authoritative source. Plus, I was taught that you only took quotes from someone else if the original is terribly hard to get to. Fagan's book is available in at least 80 US libraries, according to WorldCat, although today I was able to get to it online. Now, I admit that the book may not have been available via Google Book Search when Wright was composing his work, but by no means is the original inaccessible. In fact, if he had looked at the original, rather than the student paper, he would have understood that Fagan was quoting someone else in his description of Panizzi, not making the statement himself, as Wright states.

It's not an important point nor a particularly important passage, but it is sloppy scholarship. It means he took his information from someone else and did not verify the original source. In fact, of the roughly 260 citations in the book (and I'm counting all of the "ibid's" in this) a full 52 are "quoted in" or "cited by," and mainly the former. The entire first half of the book, which is on ancient and medieval history, uses modern sources almost exclusively. One chapter, on memory, cites only six discrete works, and takes quotes of Thomas Aquinas, Giulio Camillo, John Willis, John Wilkins, and Francis Bacon second-hand from books published mainly in the 1990s. In that chapter, only one "ancient" quote is from an original source. One of the citations referring to Wilkins is to a BBC web site page. It's no longer available. I might just be being mean, but I can find the BBC page cited on the Wikipedia entry for John Wilkins in the Wikipedia version prior to the date of Wright's citation, although it has since been removed. I don't at all mind people using Wikipedia for its basic purpose: to give one a clue and lead one on to sources. And of course we all jump on to the nearest bit of information on the web. But when researching a well-known historical figure, it really is important to cite a good, permanent resource, and in the case of Wilkins, other resources should be available.

As for Panizzi, Wright talks about his creation of a schedule of tiered subject headings. On page 168 he has a quote from Elaine Svenonius that implies some criticism of Panizzi's work.
"Some would argue [the subject headings] were too ambitious -- that there was no need to construct elaborate Victorian edifices since jerrybuilt systems could meet the needs of most users most of the time."
The bracketed words "the subject headings" were added by Wright. In fact, Svenonius was not referring to Panizzi's headings. The quoted passage is about "systems produced during the second half of the nineteenth century," ("Victorian" should be a hint) which would be after Panizzi, whose primary work was done earlier in that century. And the full quote, with no reference to subject headings, is:
"The systems produced during the second half of the nineteenth century, a period regarded as a golden age of organizational activity, [cites Cutter 1904] were ambitious, full-featured systems designed to meet the needs of the most demanding users. Some would argue that they were too ambitious -- that there was no need to construct elaborate Victorian edifices since jerrybuilt systems could meet the needs of most users most of the time. [cites Coffman]" Svenonius, p. 3
The Cutter reference is to his 4th edition of Rules for a Dictionary Catalog. The sentence quoted by Wright is a reference to an American Libraries article by Steve Coffman called "What If You Ran Your Library Like a Bookstore?".

Must I go on? I was able to check this one reference carefully because I happened to have the Svenonius book on my own bookshelf. I have no reason to believe that the rest of his text is any more accurate or faithful to the sources he cites. I suppose the one consolation is that in spite of his MLS from Simmons, Alex Wright calls himself an Information Architect, eschewing the "L" word. I wouldn't want people to think that librarians don't know how to do research.

Saturday, September 29, 2007

Name authority control, aka name identification

Libraries do something they call "name authority control". For most people in IT, this would be called "assigning unique identifiers to names." Identifying authors is considered one of the essential aspects of library cataloging, and it isn't done in any other bibliographic environment, as far as I know. When a user goes to a library catalog, they will find all of the works of T.C. Boyle under a single name, even though he has variously used T.C. Boyle and T. Coraghessan Boyle on his books, and was born with the name Thomas John Boyle. Authority control puts all of his works under one name, with references from other forms of his name: TC Boyle, see: T. Coraghessan Boyle. When there are two authors with the same name, one of them (the second one to be added to the authority file, generally) is distinguished using a middle initial or the year of birth. Thus you can have
Smith, John
Smith, John 1709
Smith, John 1936
Smith, John A.
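
For readers who think in code, here is a minimal Python sketch of the "see" references described above. It is only an illustration, not how any library system is actually built: the dictionary, the `normalize` helper and the `authorized_heading` function are all made-up names, and real authority files carry far more information than a lookup table.

```python
from typing import Optional

# Hypothetical, simplified authority file: every variant form of a name
# points to the one authorized heading used in the catalog.
AUTHORIZED = "T. Coraghessan Boyle"

see_references = {
    "T.C. Boyle": AUTHORIZED,
    "Thomas John Boyle": AUTHORIZED,
    AUTHORIZED: AUTHORIZED,
}


def normalize(name: str) -> str:
    """Fold case, spacing and punctuation so 'TC Boyle' matches 'T.C. Boyle'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Index the variants by their normalized form for forgiving lookups.
_index = {normalize(variant): heading for variant, heading in see_references.items()}


def authorized_heading(name: str) -> Optional[str]:
    """Return the authorized heading for a name, or None if it is unknown."""
    return _index.get(normalize(name))
```

With this table, `authorized_heading("TC Boyle")` and `authorized_heading("Thomas John Boyle")` both lead to the same heading, which is all a "see" reference really does.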

There are some problems with the current method used by libraries to realize authority control, not the least of which is that it is a difficult and expensive process and the number of authors is growing rapidly as we all become creators in this information age. I want to address here three aspects of name authority control that are especially non-functional: 1) the use of dates as distinguishing characteristics is not easy for the catalogers creating the authority record; 2) the use of dates as distinguishing characteristics does not help the users; 3) the name heading is not a legitimate identifier because it may change.

Date of Birth is Hard for Catalogers

We hear that authority control, including name authority control, is responsible for upwards of 40-50% of the time it takes to catalog a book. Part of this is in determining if you do indeed have a new author to enter into the system. Another part is in creating the unique entry. Take the case of Michael Fitzgerald, editor of a book called Touching All Bases. Touching All Bases is a collection of columns by sports writer Ray Fitzgerald. His sons, Michael and Kevin, gathered the columns after their father's death in 1982 and published them. Because there have been other Michael Fitzgeralds as authors, the year of his birth had to be added to his name. Here's the authority record for Michael:

LC Control Number: n  83124260
Cancel/Invalid LCCN: n 97055382 no 90013838
HEADING: Fitzgerald, Michael, 1955-
Found In: Fitzgerald, R. Touching all bases, c1983 (a.e.) CIP t.p.
(Michael Fitzgerald)
Call to publisher, 6/27/83 (Accountant, b. 2/22/1955)

Michael Fitzgeralds seem to be in great abundance. There was even another one who wrote a book and was also born in 1955. To distinguish between them, Michael Fitzgerald 1955 #2 has his full date of birth added to his name:

LC Control Number: n 2003097483
LC Class Number: PS3556.I8345
HEADING: Fitzgerald, Michael, 1955 June 11-
Found In: The Creative circle, 1989: t.p. (Michael Fitzgerald) p. 241
(teaches at Shenandoah College in Virginia)
Earth circle, c2003: CIP t.p. (Michael Fitzgerald) data
sheet (b. 06-11-55)
His book, The Creative Circle, is about art, music and literature from a Baha'i perspective. We see that at the time the authority record was created he was teaching at Shenandoah College.

So here we have two authors whose works would never be mistaken for each other, yet who have the same name. The authority records are evidence of why it is so time consuming to create these identifiers. Because the date of birth is generally not one of the pieces of information about an author that is included in the book nor in the promotional material provided by publishers, the librarians establishing the name heading often must resort to contacting the publisher or the author or the author's institution to determine that information.

Date of Birth May Not Help Users

In a time when few people wrote books, and when users may have come to the library with some knowledge of the famous intellectual whose works they were seeking, the distinction between two John Smiths, one born in 1709 and one born in 1936, may have been an obvious one. We are now, however, in a time of author abundance. Anyone can, and many do, write books, and many of those writing are not known in wider circles. Reading is now considered a "popular" activity, as the bookshelves of any chain bookstore will evidence. So a user of a library catalog may find himself facing a daunting choice among authors, such as these, all named "Michael Fitzgerald":

Fitzgerald, Michael
Fitzgerald, Michael, 1768-1831
Fitzgerald, Michael, 1859-
Fitzgerald, Michael, 1918-
Fitzgerald, Michael, 1937-
Fitzgerald, Michael, 1946-
Fitzgerald, Michael, 1955-
Fitzgerald, Michael, 1955 June 11-
Fitzgerald, Michael, 1957-
Fitzgerald, Michael, 1958-
Fitzgerald, Michael, 1959-
Fitzgerald, Michael, 1970-
FitzGerald, Michael A.
The Michael Fitzgerald born June 11, 1955, will be able to find himself in this list, but other than members of his immediate family, no one else will know which of these he is. Catalogers have to call publishers or authors to find out the author's date of birth because it's not included on the book, so there is no reason to believe that the date is available to users of the library catalog. All of that time and effort is expended to create a distinction that often doesn't help the user.

All That, and It's Not Even a Valid Identifier

The final blow to name authority control is that the name heading (as the name entry is called, e.g. Smith, John A.) can change. Sometimes it might change because a mistake was made in creating the heading, or even in the printing of the book; other times it changes because the library rules for creating name headings change. The heading performs multiple functions: it is the display form in displays of the book's data, it is used as the string to search on in a catalog, and it identifies the author. If a new display form is needed, then the identifier itself changes. When this happened on a grand scale a few decades ago, due to a change in the library cataloging rules, all of the connections between names and books broke, and names in library records all over the country (and beyond) had to be changed. A true identifier only identifies, and if display forms change the identifier stays the same. John Smith is the same person even if the library entry changes from Smith, John A. to Smith, John Arthur.
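
To make the difference concrete, here is a small Python sketch of what separating the identifier from the display form could look like. Everything in it is assumed for illustration: the `Author` class, the "a0001" identifier and the book title are invented, and this is not a description of how any existing catalog stores its data.

```python
from dataclasses import dataclass


@dataclass
class Author:
    identifier: str  # hypothetical opaque ID; it only identifies and never changes
    heading: str     # display form; may be corrected or re-formed under new rules


# Made-up data: book records link to the identifier, not to the heading string.
authors = {"a0001": Author("a0001", "Smith, John A.")}
books = [{"title": "An Example Title", "author_id": "a0001"}]


def display(book: dict) -> str:
    """Build a display line by looking up the current heading for the book's author."""
    return f'{authors[book["author_id"]].heading}: {book["title"]}'


before = display(books[0])

# The cataloging rules change and the heading is revised in one place.
authors["a0001"].heading = "Smith, John Arthur"

after = display(books[0])
```

The book record is never touched. Only the heading changes, and every display that goes through the identifier picks up the new form.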

What Now?

It seems pretty clear that we won't be able to deal with our author abundance using the current name authority methods. There are too many new authors appearing for us to spend time calling around to determine birthdates. There are also too many new authors for those dates of birth to be useful as a way to distinguish between persons. To add to that, we really need a true identifier for authors.

Library catalogs attempt to maintain uniformity throughout, so the idea of treating contemporary authors differently from historical ones is a very disruptive concept. However, the notion is beginning to circulate that we could have contemporary authors identify themselves in some way. Something to the effect of: Yes, I am the same Michael Fitzgerald who authored that book on Art, and that's the identifier for me. After all, who better than the author knows his own identity?

That doesn't solve the problem that users have of identifying the author they seek from a long list of persons with essentially the same name. Perhaps the days of looking at lists of authors' names are over. Maybe users need to see a cloud of authors connected to topic areas in which they have published, or related to book titles or institutional affiliations. In this time of author abundance, names are not meaningful without some context.

Wednesday, September 19, 2007

Wish list: Pimp my hard drive

I can't believe that it's 2007 and I'm still staring at nested displays of little yellow folder icons, which then open up to show me, of all things, file names. Or I can get little thumbnails that tell me no more than I know from the file name extension.

Shouldn't we be beyond this? Here's what I want to see when I look at my hard drive:

Titles. Most of the documents on my hard drive have titles. Some of them even have those titles coded in some way as titles - such as the html files, and files in various word processing formats. I'm sure that it is possible to make an algorithmic guess at titles (or at least first lines) for just about any file with text.
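
As a rough idea of how little it would take, here is a Python sketch of that kind of guess: use the HTML title if the file has one, otherwise fall back to the first non-blank line. The function name `guess_title` is made up, it relies only on the standard library's `html.parser`, and it assumes the file has already been read in as text. Word processing formats would need their own readers, which this does not attempt.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside the first-level <title> element(s)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def guess_title(text: str) -> str:
    """Return a coded HTML title if present, else the first non-blank line."""
    parser = _TitleParser()
    parser.feed(text)
    parser.close()
    if parser.title.strip():
        return parser.title.strip()
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""
```

For a plain text file this simply returns the opening line, which is often good enough to tell two files apart.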

Authors. OK, it won't really be authors, but if nothing else I should be able to distinguish files I created from those created by others. There are automated (and frequently erroneous) "owners" in files, but that ownership is often a good clue as to the provenance of the file. I want to see that. (Meanwhile, I'm going to start storing what I write in folders apart from what I have downloaded. No, I don't use "my documents." I hate that folder. I renamed it once and really screwed things up, so I have my own top level folder under c:.)

Is there any reason why I shouldn't be able to see snippets from my own files as I browse a folder? An opening line or beginning paragraph would be fine. I shouldn't have to manually open every file to see what's in it.

Most used. The "recent documents" function in Windows is useless. OK, pretty much useless. I want to be able to see the files I have most frequently (but not necessarily recently) opened. I can't tell you how much time I've spent hunting for local copies of certain files.

Tags. I want to tag my files, and I want the tags to be available external to the files themselves, a kind of delicious for my hard drive. (And don't tell me to get Windows Vista -- that's not what I want, in more ways than one.) I want to see tag clouds and tag lists.
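
Here is a minimal Python sketch of what "external to the files" could mean: a single JSON file that maps file paths to tags, so nothing is written into the documents themselves. The function names and the store layout are invented for this example, and it ignores real-world problems such as files being moved or renamed.

```python
import json
from collections import Counter
from pathlib import Path


def load_tags(store: Path) -> dict:
    """Read the tag store ({file path: [tags]}); empty if no store exists yet."""
    if store.exists():
        return json.loads(store.read_text(encoding="utf-8"))
    return {}


def save_tags(store: Path, tags: dict) -> None:
    """Write the tag store back out as readable JSON."""
    store.write_text(json.dumps(tags, indent=2, sort_keys=True), encoding="utf-8")


def add_tag(tags: dict, file_path: str, tag: str) -> None:
    """Attach a tag to a file, ignoring duplicates."""
    current = tags.setdefault(file_path, [])
    if tag not in current:
        current.append(tag)


def files_with(tags: dict, tag: str) -> list:
    """List every file carrying the given tag."""
    return sorted(path for path, file_tags in tags.items() if tag in file_tags)


def tag_counts(tags: dict) -> Counter:
    """Count how many files carry each tag: the raw numbers behind a tag cloud."""
    return Counter(tag for file_tags in tags.values() for tag in file_tags)
```

A tag list is then just `files_with(tags, "frbr")`, and a tag cloud is a matter of sizing each tag by its count.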

"Like." This may be pushing it, but I really want to see clusters of documents that are like each other. This will be the usual statistical reliance on the imprecision of language, but it would reveal connections in the documents that could be useful.

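One very simple way to get at "likeness" is to compare word counts between documents. The Python sketch below uses cosine similarity over raw word counts, which is my choice for illustration and only one of many possible measures. The file names and their contents are invented, and a real tool would at least drop common words like "and" before comparing.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased words, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (0.0 = nothing shared)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_like(name: str, docs: dict) -> str:
    """Return the name of the other document most similar to the named one."""
    target = word_counts(docs[name])
    others = [n for n in docs if n != name]
    return max(others, key=lambda n: cosine(target, word_counts(docs[n])))


# Made-up file names and contents, for illustration only.
docs = {
    "frbr-notes.txt": "FRBR works expressions manifestations items",
    "frbr-talk.txt": "slides on FRBR works and expressions",
    "marc-fields.txt": "MARC fields tags and subfields",
}
```

Here `most_like("frbr-notes.txt", docs)` picks out the other FRBR file, because the two share several words while the MARC file shares none with the notes.
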
Folder names. I'm not sure I can explain this one but... I have a folder named "FRBR" and a folder named "MARC". When I want to look in one of those folders I don't want to have to go through the hierarchy of folders to find them -- especially because I never can remember where I've put them. Why can't I just type "MARC" and see the folder or folders with "MARC" in the name? Why do I always have to run through the whole hierarchy? (If you have found a way to do a folder name search only on Windows XP, please let me know.) Or maybe folder names could be treated as tags, once tagging is working.
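
A folder-name-only search is not hard to sketch. The Python below walks a directory tree with the standard library's `os.walk` and keeps only the folders whose names contain the search term. The function name `find_folders` is made up, and this is a brute-force walk with no index behind it, so it would be slow on a large drive.

```python
import os


def find_folders(root: str, term: str) -> list:
    """Return paths of folders under root whose name contains term (case-insensitive)."""
    term = term.lower()
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if term in name.lower():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling something like `find_folders("C:\\", "MARC")` would list every folder with "MARC" in its name, wherever it sits in the hierarchy.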

There are undoubtedly many other things I could wish for, but basically what it comes down to is that there needs to be a better interface to the hard drive. Some of this can be found in Google Desktop, but I have found it unsatisfactory, generally.

Tuesday, September 04, 2007

Wish list: on-line in the stacks

This is my workspace. It's messy, I know, but the key thing is that my main "desktop" is on the screens. The physical workspace is primarily for setting things down, not for working.

Basically, everything happens on these screens -- I search, I read, I write, I converse (both text and voice). I can't imagine doing my work without the Internet. So I find myself in a dilemma when I go to the library, because I am cut off from my "place of work." I go into the stacks, perhaps with a scribbled note containing a call number, and I stand in front of shelves with fewer capabilities than I have in my own home office. If I don't find the book I want I can't check to see if I wrote the call number correctly; I can't look to see if there's a "second best" book that I'd like; I can't determine if there's another area of the stacks where I might find something else I'd like to read; and I can't search within the text of the bound volumes in front of me, even if digitized versions do happen to be available on-line. I stand there wishing I could go on-line.

Essentially, going into the library means leaving behind my ability to find. Yes, there are a few computers in the stacks, but they are too far away to make it possible to be usefully on-line and at the shelf at the same time.

Libraries made a great effort to get on-line and to reach out to users beyond their walls. What we haven't done, however, is to combine the on-shelf and on-line resources in a useful way. It makes sense to me that I should be able to stand amid bound journal volumes and do a keyword search. Or that I could pull a book off the shelf, see a citation, and check to see if the library has that item.

What would make this possible? First, many more access points within the physical stacks. Access to the catalog or other resources shouldn't be more than a few steps away. Heck, find a way to tie down one of those $100 computers at the end of each row, or create a place where a user can easily lean their laptop (and have the wireless access reach within the shelves). Instead of telling people to turn off their cell phones, remind them that if they have net access they can combine the power of the library's catalog, the library's on-line resources, and the items on the shelves. Encourage people to work with physical and digital resources together. If I could do that, I'd spend more time in the library.