Saturday, June 29, 2013

FRBR and schema.org

The FRBR structure for what it calls the Group 1 entities (Work, Expression, Manifestation, and Item, hereafter written as WEMI) presents quite a few problems for data modeling. One of the many issues is that this division is not universally recognized, even within library data, and is definitely not recognized outside of libraries. This has particular impact for library data as part of the linked data space, where a primary goal is interlinking with data from diverse sources. It is unlikely that online bookstores or academic citations will begin to use the WEMI structure.

One area where library bibliographic data and bibliographic data from other sources may mingle is in schema.org markup in web pages. Schema already has a basic class that can be used for bibliographic data, called "CreativeWork." CreativeWork contains the common elements for this type of description, like author, title, publisher, pages, subject, etc. Problems arise when trying to express either WEMI or the simplified BIBFRAME Work and Instance (hereafter bf:Work, bf:Instance) in this model. CreativeWork is a unified model that includes all descriptive elements in a single set; BIBFRAME separates those elements into two entities, and each entity contains only a defined set of the descriptive elements. Thus, where CreativeWork will have information for author, title, publisher, pages, subject, in BIBFRAME author and subject must be described in the bf:Work entity, and title, publisher and pages in the bf:Instance entity. Between MARC, FRBR, BIBFRAME, and schema.org, a full bibliographic description may require one entity (MARC, schema.org), two (BIBFRAME), or four (FRBR).
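To make the mismatch concrete, here is a minimal sketch in Python of what happens when a single flat CreativeWork-style description has to be divided into a bf:Work part and a bf:Instance part. The field assignments follow the paragraph above; the function name, the dictionary layout, and the sample values are all invented for illustration and are not part of schema.org or BIBFRAME.

```python
from typing import Dict, Tuple

# Field assignments as described in the text: author and subject go to
# bf:Work; title, publisher and pages go to bf:Instance.
# These sets are an illustration, not an official BIBFRAME mapping.
WORK_FIELDS = {"author", "subject"}
INSTANCE_FIELDS = {"title", "publisher", "pages"}


def split_description(creative_work: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: divide one flat description into two entities."""
    work = {k: v for k, v in creative_work.items() if k in WORK_FIELDS}
    instance = {k: v for k, v in creative_work.items() if k in INSTANCE_FIELDS}
    return work, instance


# Made-up sample values; one entity carries the whole description.
creative_work = {
    "author": "Example Author",
    "title": "Example Title",
    "publisher": "Example Press",
    "pages": "250",
    "subject": "Example Subject",
}

work, instance = split_description(creative_work)
print(work)      # the bf:Work part: author and subject only
print(instance)  # the bf:Instance part: title, publisher and pages only
```

Going in this direction is mechanical. The harder problem, discussed below, is the reverse: deciding which entity a CreativeWork description corresponds to when all you have is whatever fields happen to be present.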

[Figure: comparison of MARC, FRBR, BIBFRAME, and CreativeWork]

The OCLC report on BIBFRAME and schema.org proposes that one could use CreativeWork for different FRBR (or presumably BIBFRAME) entities, making the determination based on what fields are present:
"In this scheme, it would be possible to say that when only titles, subjects, and creators are mentioned, the description for a Schema:CreativeWork refers to a FRBR Work; and when copyright dates and genres are present, the description is equivalent to a FRBR Expression." (p. 14)
While that makes sense from a pure logic point of view, and would probably work in a library database, it has problems within the web and linked data contexts of schema.org. I should note, before going on, that schema.org is metadata markup for any web site, and CreativeWork will be used for books, films, music, art, and other forms of creation by anyone and everyone on the web. This is not a library-specific standard.

First, many sites have a search results page with limited information about each item, requiring the user to click through for details. A search results page for books on Amazon or eBay gives only the author and title, but it does not represent the Work; it simply omits the full data in order to fit more results onto the page. Therefore, the absence of information from one web page does not mean that the brief description shown there is the complete description.
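A rough sketch of the heuristic quoted from the OCLC report shows how this goes wrong. The property names and the function below are hypothetical, and reading "copyright dates and genres are present" as "both are present" is my own assumption about the quote.

```python
# Hypothetical property names, not actual schema.org terms.
WORK_LEVEL = {"title", "subject", "creator"}
EXPRESSION_MARKERS = {"copyright_date", "genre"}


def guess_frbr_entity(properties):
    """Guess a FRBR entity purely from which properties are present."""
    present = set(properties)
    if present and present <= WORK_LEVEL:
        # Only titles, subjects, and creators are mentioned.
        return "Work"
    if EXPRESSION_MARKERS <= present:
        # Copyright date and genre are both present (assumed reading).
        return "Expression"
    return "undetermined"


# A search results listing shows only author and title for an item,
# even though the page behind it describes a specific edition for sale.
search_result = {"creator": "Example Author", "title": "Example Title"}

# A fuller description of the same item.
detail_page = {
    "creator": "Example Author",
    "title": "Example Title",
    "subject": "Example Subject",
    "copyright_date": "2013",
    "genre": "Fiction",
}

print(guess_frbr_entity(search_result))
print(guess_frbr_entity(detail_page))
```

The brief listing is classified as a Work only because the site chose to display less, not because the thing described is any more abstract.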

Second, there is no "record" in schema.org, merely a number of coded statements with values within a web page. Any web page can contain information about any number of "things" and information about those things may be placed anywhere on the page, possibly far from each other and not coded as a single unit. It may not be possible to know how complete a description is.

Third, web site owners can opt to mark up only part of their data. In schema.org markup that I have encountered on commercial sites, markup reflects the owner's view. For example, Google (one of the originators of schema.org) does not mark up the bibliographic data in its Books pages, but instead emphasizes user ratings, images, and subjects. (This shows the markup using the Google rich snippet testing tool.) In comparison, the extracted schema.org elements for an IMDb page are much more detailed, an indication that it considers itself an information site more than a sales site.

Finally, although this is somewhat beyond schema.org, should the data in web pages be incorporated into the linked data space, it will go there as individual triples that are part of a huge graph of data. That graph is theoretically limitless and makes use of a principle called the "open world assumption." In an open world it is not possible to base your assumptions on what is missing from the graph. The open world does not have a concept of completeness because there is always the possibility that there is more information than what you are seeing at any given moment in time.
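The problem with reasoning from absence can be sketched with a few made-up triples. Everything here (the identifiers, the predicate names, the helper functions) is invented for illustration; real linked data would use URIs and an RDF library.

```python
def predicates_for(subject, triples):
    """Collect the predicates asserted about a subject in a set of triples."""
    return {p for s, p, o in triples if s == subject}


def looks_like_work(subject, triples):
    # A closed-world style test: it draws a conclusion from the ABSENCE
    # of statements such as publisher or pages.
    return predicates_for(subject, triples) <= {"title", "creator", "subject"}


# Triples extracted from one web page (made-up data).
page_triples = {
    ("ex:book1", "title", "Example Title"),
    ("ex:book1", "creator", "Example Author"),
}

# Triples about the same thing that arrive later from another source.
other_triples = {
    ("ex:book1", "publisher", "Example Press"),
    ("ex:book1", "pages", "250"),
}

before = looks_like_work("ex:book1", page_triples)
after = looks_like_work("ex:book1", page_triples | other_triples)
print(before, after)
```

The conclusion flips from true to false as the graph grows. In an open world there is never a point at which you can be sure the second set of triples does not exist somewhere, so the first answer was never safe to act on.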

These may not be the only arguments against the use of CreativeWork for different FRBR or BIBFRAME entities, but in my mind they are sufficient to make the case: if it is desirable to encode FRBR or BIBFRAME entities in schema.org, they must be represented by different schema.org classes and cannot be inferred from data elements in CreativeWork.

Before I end, let me make clear that I do not favor an imposition of FRBR-like separations of bibliographic data on the linked data world. Even the BIBFRAME two-part bibliographic description will have problems interacting with the one-entity model that is used outside of libraries. I do think that we can find a way to talk virtually about works without stripping such key elements as authors and subjects from the description of the package that carries the content. That package is, after all, what I hold in my hand when I read something, and it is a whole, with author, title, subjects, pages, binding, publisher, etc. That is, however, a topic for another post.

2 comments:

Harry Lawe said...

Very good article, Karen. But the "CreativeWork." link doesn't work. Maybe schema.org changed its mind.

Karen Coyle said...

Thanks, Harry. The URL looked good, but something mysterious was wrong. I re-copied it, and now it's fine. (Gremlins?)