Tuesday, January 17, 2012

Google dashboard

Google has an ad in today's New York Times. Taking up over half a page (and with lots of white space), it shows a cartoon of a guy up to his waist in water calling a plumber. The plumber who answers says: "I'm on my way. See you in 15 hours." The rest of the text goes:

"You live in Peoria. Do you really need a plumber from New York? We didn't think so.... That's why search engines, including Google, give you results based on your city or region. They can do this by using your computer's IP address. It's a number like 209.85.229.147, which acts like a zip code to tell them the rough area your computer is in.

To find out more about how websites get to know you better go to google.com/goodtoknow"
The text vs. subtext in this ad is stunning. Although it is justifying a Google practice, the ad speaks of that practice in the third person: "they" use your IP address; your IP address tells "them" the area your computer is in. The message is: everyone does it. It's not a Google thing, it's an Internet thing. Don't blame us.

The site at "goodtoknow" uses the same cartoon figures and has very little text; most information is given via videos. The site is a fairly good round-up of information topics, from phishing to securing your home wifi network. (The irony of that being that Google was caught picking up open wifi traffic in Germany.) I could imagine it as a "go-to" place for novices needing information on online privacy. Much of it isn't about Google at all: the video on "Stay safe online" gives five rules about passwords and avoiding phishing and never mentions Google. It also doesn't mention that when you log into a site with a secure password, everything you do is observable by the owner of the site. Believe it or not, many people do not understand that. They think that the password makes their activities private, even to the site owner.

The page on "Manage your information" includes a link to Google Dashboard, which was also mentioned in one of the videos, and which, if I'd known about I had forgotten. Google Dashboard is a list of some of the things that Google knows about you, in particular which Google services you have accounts on. It shows your settings on these services. I found some services I had played with and forgotten about, which I can now delete.

Of course, Dashboard is only the tip of the iceberg in terms of what Google knows about us. I turned off Web history in 2007, so I don't see my searches there. If you are at all concerned about privacy, visit Dashboard and make some adjustments. Google warns you that you will get results that are less customized for your interests. However, if you are reading this you are probably an information professional, and my guess is that you can find the ad for that printer just as well searching privately (if real privacy exists) without also letting Google know your political, sexual, and religious interests.

You often hear that people don't really care about their privacy and they are quite happy to give Web sites their information in exchange for services. I also observe that behavior, but I'm not convinced that the majority of Web users are truly aware of how much information about them is being gathered. I also doubt that most users know how to take advantage of things like the private browsing options in browsers. (I'm not sure I trust that private browsing is truly private. I also don't know how to find out how private it really is.) I do find myself giving out information about myself to Web sites, but it's not because I don't care: it's because I get rushed and don't want to take the extra step, or I forget, or I'm not given a choice and I need to access that site right now. I don't believe in blaming users for the lack of privacy, because the privacy options are always opt-out, not opt-in, and are often hard to find.

And, yes, I know I am writing this on a Google-owned blog site. I've had on my task list for a very, very long time to figure out a way to port this content over to my own web site. It's not so much for privacy purposes (it'll still be a public blog) but because I want the content to be mine even though I'm more likely to lose it than Google is.  The Web has become my workplace and the choice I make is not privacy vs. better ads but privacy vs. getting my work done.  Making it all about advertising trivializes the reality that our personal and professional lives are intertwined with systems we have no control over. This dependency is as frightening as the privacy issue.




Wednesday, January 11, 2012

Bibliographic Framework: RDF and Linked Data

With the newly developed enthusiasm for RDF as the basis for library bibliographic data, we are seeing a number of efforts to transform library data into this modern, web-friendly format. This is a positive development in many ways, but we need to be careful to make this transition cleanly, without bringing along baggage from our past.

Recent efforts have focused on translating library record formats into RDF with the result that we now have:
    ISBD in RDF
    FRBR in RDF
    RDA in RDF

and will soon have
    MODS in RDF

In addition there are various applications that convert MARC21 to RDF, although none is "official." That is, none has been endorsed by an appropriate standards body.

Each of these efforts takes a single library standard and, using RDF as its underlying technology, creates a full metadata schema that defines each element of the standard in RDF. The result is that we now have a series of RDF silos, each defining data elements as if they belong uniquely to that standard. We have, for example, at least four different declarations of "place of publication": in ISBD, RDA, FRBR and MODS, each with its own URI. There are some differences between them (e.g. RDA separates place of publication, distribution, and manufacture, while ISBD does not) but clearly they should descend from a common ancestor:
RDA: place of publication
RDA: place of distribution
RDA: place of manufacture
FRBRer: has place of publication or distribution
ISBD: has place of publication, production, distribution
This would be annoying, but not unworkable, if these different instances of "place of publication" could be treated as having some meaning in common, such that one could link a FRBRer element to an ISBD element. But they cannot. The reason they cannot is that each of these vocabularies constrains its elements in a particular way that ties them to a single data context (what we generally think of as a "record structure"). The elements are not independent of that context, which means that each can be used only within that particular context. This is the antithesis of the linked data concept, in which data sets from diverse sources share metadata elements. It is this re-use of elements that creates the "link" in linked data. To achieve this, metadata elements need to be unconstrained by any particular context.
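
To make the silo problem concrete, here is a minimal sketch, written with Python's rdflib library and using made-up example namespaces and property names rather than the actual published URIs, of how two vocabularies can each declare their own "place of publication" element with a domain that binds it to that vocabulary's own record context.

    # A minimal sketch of the "silo" problem. Namespaces and property names
    # are hypothetical, not the actual published ISBD or RDA URIs.
    from rdflib import Graph, Namespace, RDF, RDFS

    ISBD = Namespace("http://example.org/isbd/")
    RDA = Namespace("http://example.org/rda/")

    g = Graph()
    g.bind("isbd", ISBD)
    g.bind("rda", RDA)

    # Each vocabulary declares its own property, with an rdfs:domain that binds
    # the property to that vocabulary's own resource or record class.
    g.add((ISBD.placeOfPublicationProductionDistribution, RDF.type, RDF.Property))
    g.add((ISBD.placeOfPublicationProductionDistribution, RDFS.domain, ISBD.Resource))

    g.add((RDA.placeOfPublication, RDF.type, RDF.Property))
    g.add((RDA.placeOfPublication, RDFS.domain, RDA.Manifestation))

    # The two properties have distinct URIs and no declared relationship, so an
    # application that understands one learns nothing from data that uses the other.
    print(g.serialize(format="turtle"))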

Linking can also be achieved through vertical relationships, similar to "broader" and "narrower" in thesauri. This is less direct, but makes it possible to mix data sets that have differing levels of granularity. In our case, the ISBD "place of publication, production, distribution" could be defined as broader than the three RDA elements that treat those separately. Unfortunately, that is not possible because of the way that ISBD and RDA have been defined in RDF. (I'll post more detail about this later for those who want more.)
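
For those who would like to see what such a vertical link could look like, here is a hypothetical sketch (again in rdflib, with made-up namespaces; as just noted, the published ISBD and RDA vocabularies do not actually declare this) using rdfs:subPropertyOf, RDF Schema's way of saying that one property is a narrower specialization of another.

    # Hypothetical sketch: the three RDA place elements declared as narrower
    # specializations (rdfs:subPropertyOf) of the broader ISBD element.
    # Namespaces are made up; the published vocabularies do not declare this.
    from rdflib import Graph, Namespace, RDFS

    ISBD = Namespace("http://example.org/isbd/")
    RDA = Namespace("http://example.org/rda/")

    g = Graph()
    g.bind("isbd", ISBD)
    g.bind("rda", RDA)

    broad = ISBD.placeOfPublicationProductionDistribution
    for narrow in (RDA.placeOfPublication,
                   RDA.placeOfDistribution,
                   RDA.placeOfManufacture):
        g.add((narrow, RDFS.subPropertyOf, broad))

    # An RDFS-aware application could then treat any statement made with one of
    # the narrower RDA elements as also implying the broader ISBD element.
    print(g.serialize(format="turtle"))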

The result is that we now have a series of RDF silos: expressions of our data in RDF that lack the linking capabilities of linked data because they are bound to specific data structures. Clearly we gain little in terms of linked data by creating mutually incompatible bibliographic views. Not only are these RDF schemes incompatible with each other, but none will be linkable to bibliographic data from communities outside of libraries that publish their data on the Web. That means no linking to Amazon, to Wikipedia, to citations within documents.

Given where we are in the development of linked data for libraries, we now have two options:

1) Define 'super-elements' that float above the record formats and that are not bound by the constraints of the RDF-defined records. In this case there would be a general "place of publication" that is super- to all of the "place of publication" elements in the various records, and would be subordinate to a general concept of "place" that is widely used (possibly a property of GeoNames). To implement linking, each record element would be mapped up to its super-elements.
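
As a rough illustration of what this first option might look like (hypothetical URIs throughout; the "bibcore" namespace below is invented for the example), the super-element would be declared with no domain constraint, and the existing record-bound elements would point up to it:

    # Sketch of option 1: a free-floating super-element with no record constraints.
    # All namespaces and names here are hypothetical.
    from rdflib import Graph, Namespace, RDF, RDFS

    SUPER = Namespace("http://example.org/bibcore/")   # the new, unconstrained layer
    ISBD = Namespace("http://example.org/isbd/")
    RDA = Namespace("http://example.org/rda/")

    g = Graph()

    # The super-element has no rdfs:domain, so it is not tied to any record structure.
    g.add((SUPER.placeOfPublication, RDF.type, RDF.Property))
    # It could in turn be subordinate to some widely used notion of "place"
    # (GeoNames is one possibility); a generic stand-in is used here.
    g.add((SUPER.placeOfPublication, RDFS.subPropertyOf, SUPER.place))

    # The record-bound elements point up to the super-element.
    g.add((ISBD.placeOfPublicationProductionDistribution, RDFS.subPropertyOf, SUPER.placeOfPublication))
    g.add((RDA.placeOfPublication, RDFS.subPropertyOf, SUPER.placeOfPublication))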

2) Define our data elements outside of any particular record format first, then use these in the record schemas. In this case there would be only one instance of "place of publication" and it would be used throughout the various bibliographic records whenever an element with that meaning is needed. Those records would be interchangeable as linked data using their component data elements, and would interact with other bibliographic data on the Web using the RDF-defined elements and their relationships.
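
A comparable sketch of the second option (again with invented URIs): a single "place of publication" element is defined once, with no record-format constraints, and is then simply used in instance data created under different record schemas. Because both data sets use the identical property URI, their statements link as soon as the graphs are merged, with no cross-walk needed.

    # Sketch of option 2: one shared element, re-used in data from two different
    # "record" regimes. All URIs are hypothetical.
    from rdflib import Graph, Namespace, URIRef, Literal

    SHARED = Namespace("http://example.org/bibelements/")

    g = Graph()
    g.bind("bib", SHARED)

    book_a = URIRef("http://example.org/catalogA/record/1")
    book_b = URIRef("http://example.org/catalogB/item/99")

    # Two descriptions, created under different record schemas, use the same property.
    g.add((book_a, SHARED.placeOfPublication, Literal("Peoria, Ill.")))
    g.add((book_b, SHARED.placeOfPublication, Literal("New York")))

    print(g.serialize(format="turtle"))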

My message here is that we need to be creating data, not records, and that we need to create the data first, then build records with it for those applications where records are needed. Those records will operate internally to library systems, while the data has the potential to make connections in linked data space. I would also suggest that we cease creating silo'd RDF record formats, as these will not move us forward. Instead, we should concentrate on discovering and defining the elements of our data, and begin looking outward at all of the data we want to link to in the vast information universe.


_____
* Note on RDA: RDA in RDF includes two "versions" of each data element: one bound to FRBR and one not. The latter has potential for re-use outside of a FRBR environment, and was designed for this purpose by the DCMI/RDA task force. Its relationship to "official" RDA is somewhat unclear at this time but hopefully will gain support as the linked data concept is absorbed into the bibliographic framework.



Monday, January 02, 2012

Google Book Search Redux

The document I referred to in the previous post would have been so much clearer if I had read the two preceding documents. Now that I have, the story is even more dramatic.

On December 12, 2011, the Authors Guild filed a fourth amended complaint (PDF) against Google. This complaint is nearly identical to the first one, filed on September 20, 2005 (PDF). The two complaints between these (October 28, 2008, and November 16, 2009) included the Association of American Publishers, as did the two attempts at settling the case (October 28, 2008, and November 13, 2009). The publishers had had their own complaint in 2005 before combining forces with the Authors Guild. Now the Authors Guild is again standing alone against Google's book digitizing efforts.

This fourth amended complaint brings us pretty much back to square one, with the addition of more libraries' involvement and the creation of HathiTrust as a way for the libraries to store their (allegedly) ill-gotten copies. The library copies are a key element of the suit because they are proof that Google has not only digitized the library books but has also made copies (the purview of copyright law) and distributed them to others.

The most interesting document of this latest group, and the one with the greatest detail about Google's actions, is the Memorandum in support of the class certification. This document is the explanation of why the Authors Guild should be considered by the court to be a valid representative of all authors in a class action suit. The document has a number of quotable moments, of which my favorite is the "tell it like it is in plain language" opening:
This litigation arises from Google's business decision to gain a competitive edge over its rivals in the search engine market by making digital copies of millions of "offline" printed materials. ... Rather than obtaining licenses from copyright owners for the digital use of their printed works, Google instead entered into agreements with libraries to gain access to these works. A number of university libraries allowed Google to make digital copies of the books in the libraries' collections, including in-copyright books. In exchange, Google provided digital copies of the books to the libraries. Google refers to this massive copyright infringement as its "Library Project." (p.1)
Most folks commenting on this latest development in the now 6-year-old case assume that the settlement is dead. We are therefore back to the question of whether Google's book scanning is or is not Fair Use. This question, though, is being asked only by the authors, not the publishers, and if anyone has inside knowledge of what approach the publishers are taking, I would love to hear it. It is clear that the position of publishers in relation to Google has changed greatly over the 5-6 years since the suit was originally filed. There are now reportedly thousands of publishers who are using Google Books to promote and sell their works. It also makes sense that publishers, as corporations, are better able to negotiate with Google than are individual authors. A large publisher with numerous books in print and in its backlist has clout that a single person does not have. In addition, large publishers have lawyers, or access to legal counsel. At least some publishers have made their peace with Google and see the relationship as advantageous.

Looking at this from the library point of view I wonder what will happen to the millions of library books already scanned by Google. I also wonder what this awkward and failed attempt to create the overly broad settlement between Google and the AG/AAP will mean for future digitization projects. There are strong arguments for digitization for scholarly purposes, and the creation of a computational capability over millions of texts could be a positive step for research, especially in the social sciences and humanities. I hope that the botched attempt to commercialize the contents of libraries will not prejudice the future of digital research.