Tuesday, September 30, 2008

More Puzzling over FRBR

FRBR came up often in our discussions at DC2008. In particular, there were many attempts to clarify what FRBR means in a technical environment. Since FRBR is about entities and relationships, it seems to be perfectly positioned as the first step in the transformation of library data to the semantic web.

After each such event where ideas about FRBR are thrown around, I go back to FRBR and try to understand it. Each time it's as if I'm reading and thinking about an entirely different model. So here's this week's entry into the multiple personalities of FRBR.

During this reading I focused on the relationships between the entities, such as the relationship between expression and work, and the various work/work, expression/expression (etc.) relationships. What struck me immediately is that there is a fair amount of detail in the explication of the relationships between different Group 1 entities (work/work, etc.). These turn out to be the richest set of relationships in FRBR. At the same time, the relationships between Work-Expression-Manifestation-Item are covered by a single sentence each:

Work: a distinct intellectual or artistic creation

Expression: the intellectual or artistic realization of a work

Manifestation: physical embodiment of an expression of a work

Item: exemplar of a manifestation


Only one example is given for each. Compare that to table 5.1 on page 63 of the FRBR document, which gives these relationships between works:

successor

supplement

complement

summarization

adaptation

transformation

imitation


It seems much easier to see the real world applicability of these relationships than "intellectual...realization of a work." The sum of the lists of the relationships inherent in the Group 1 entities embodies much of the network of bibliographic interactions that will interest us in the semantic web. In fact, these relationships are probably closer to the needs of users than those of WEMI. I would like to explore these relationships further to understand what they reveal in terms of the development of a navigable route through the bibliographic world.

Meanwhile, I have some comments on those relationships. To begin with, I find it very interesting that there is no privileged "first expression" of a work. Admittedly the first expression may not be known, but in fact the expressions are all equal and all have the same relationship with the work. This means that you can indicate that one expression is a translation of the other, and the translation then has the same relationship to the work as does the expression in the original language (which may - or may not - be the original expression). This seems to defy the concept of the uniform title or work title, which represents the original language of expression, and therefore says something about the "originality" of the first expression.

Another thing is that the list of relationships I give above is also valid between expressions. This adds to the obscurity of the difference between works and expressions. There are, however, times when it makes sense to me: you could have a film version of Romeo and Juliet that is based on a particular expression of Shakespeare's work. The "workness" of the film also has some relationship with the "workness" of the play (adaptation? transformation?). Yet in general I have trouble with work-work relationships since there really is no work without an expression, therefore it's hard to say that work A adapts work B. I suspect this is just the general uneasiness with the abstractness of the work, but it seems amplified when you try to add relationships to this very fuzzy concept. The relationships between expressions make more sense to me.

Something else that occurs to me is that the transformative relationships make sense between expressions (translation, adaptation) and the intellectual relationships make sense between works (imitation, successor, others?). That an imitation is based on a particular expression (whatever the imitator had in hand) is almost a secondary relationship. What this means is that the work-work relationships and the expression-work or expression-expression relationships with the same name may not be identical. In fact, they couldn't possibly be identical because they refer to different types of entities. So although they have the same names, I would argue that they are not the same relationships, in the same way that the work title and the manifestation title are distinct even though they are both called 'title'.

I end this lengthy, rambling brain dump with the thought that we might be able to create a rich network of expressions, linked handily to their respective works, that would be very useful for those seeking information. And that the network of expressions could help us identify the appropriate work for each expression, because once an expression is found to be a translation of another, then they must logically be expressions of the same work.
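That last inference can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the identifiers are hypothetical, the translation links are assumed to be already recorded, and the union-find grouping is one possible technique, not something FRBR prescribes.

```python
def group_expressions(translation_pairs):
    """Group expressions that are connected by 'is translation of' links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in translation_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for expr in list(parent):
        groups.setdefault(find(expr), set()).add(expr)
    return list(groups.values())


def assign_works(translation_pairs, known_works):
    """Give every expression in a group the one work known for that group."""
    assigned = {}
    for group in group_expressions(translation_pairs):
        works = {known_works[e] for e in group if e in known_works}
        if len(works) == 1:  # skip groups with no known work or conflicting works
            work = works.pop()
            for expr in group:
                assigned[expr] = work
    return assigned


# Hypothetical identifiers, for illustration only.
translations = [
    ("expr:quixote-en", "expr:quixote-es"),
    ("expr:quixote-fr", "expr:quixote-es"),
]
known = {"expr:quixote-es": "work:quixote"}
works = assign_works(translations, known)
```

Every expression in a chain of translations ends up with the same work identifier. Groups with no known work, or with conflicting ones, are left unassigned so that a person can look at them.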

Forgive me if this is all a re-hash of the obvious. For some reason, nothing in FRBR comes to me easily.

Monday, September 29, 2008

DC2008

I recently attended the annual Dublin Core conference in Berlin. I would have blogged the sessions but in fact I spent most of my time in the hallways chatting with folks. The main message of the conference was: Semantic Web. This included an interesting talk by Martin Malmsten on turning MARC records into RDF triples. (See also the work at Talis in this area.)

For me the big deal of the conference was a meeting with some of the Dublin Core folks who developed the DC Abstract Model and the DC Application Profile model. We had a nice long talk about the distance between those views and the actual production of library metadata. What we concluded was that we will work together to bridge this gap, in part by creating simple, re-usable modules that are easy to understand and that, when hooked together, provide the information necessary to engineer a fully functioning, DCAM-compliant application profile.

Yes, I know we need more of an explanation, and I'll be working on that very soon. Don't go too far away.

Wednesday, September 17, 2008

Functional Requirements for App Profiles

In preparation for DC2008 (9/22-26, Berlin), I've been thinking about application profiles. The DC folks have developed a structure for application profiles which I have attempted to use for the DC-RDA work. I ran into some difficulties, in part because the library community has its own particular needs. So I thought it would be a good idea to articulate these needs in preparation for discussions I hope we will have next week.

I'm going to use some terminology from FRBR, and some from the DC work. Mainly, I'll use "entity" in the FRBR sense. I'll use "property" in the sense that it is used in the DCAM and in RDF.

Here's my first pass at what we need to express in an application profile for the library community:

entities


We need to define the entities that will be in our metadata environment. It would be ideal to be able to re-use entities where possible. So if two APs can use the same Person entity, they just need to be able to identify it. At the same time, it must be possible to create a different person entity and to give it a new identifier.

relationships between entities

In some cases it will be desirable to constrain the relationships that can exist between entities. Both RDA and FRBR constrain which Group 2 entities have relationships with a Work as opposed to an Expression. This is an area of some disagreement among sub-communities, so there will be some APs that will define the relationships differently.

properties of entities

Entities have properties. These are metadata elements that have been defined outside of the AP. Each property must have a unique identifier. It is the detailed information about the properties that will make up the bulk of the AP. Here is a first list of what that information needs to be:
  • property identifier
  • property is mandatory/optional
  • property is repeatable/not (within entity)
  • properties are cumulative/mutually exclusive -- a way to say that you can use property A or B or C, or that you can use any combination of A, B, C.
  • property value is controlled/uncontrolled -- this distinguishes between free text (e.g. an abstract, user tags) and a constrained set of values (authority list, or a designated format). If controlled, then there needs to be a way to give some information on the type of control: URI of a list of terms; URI of a standard format for the data (e.g. date type format, or AACR2 name heading format).
  • property value is transcribed/supplied -- transcribed data are taken directly from the resource itself; supplied means source of the information is not the resource. (Title can be transcribed; subject headings are supplied.)
  • for controlled property values that use a set list of values, it has to be possible to state the vocabularies that are valid, and whether or not they are mandatory or optional. It may also be necessary to define whether one can extend the vocabulary in the metadata (e.g. use an unlisted value if a new value is needed). It needs to be stated whether the entire vocabulary is to be used. If not, the AP needs to define which values from the full vocabulary are valid. It also needs to be possible to create a list of values within the AP for any element. In this case there is no external controlled list.
Other?
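
To make the list above a little more concrete, here is a minimal sketch of how a few of those property-level rules might be recorded and checked. The class, its field names, and the example.org URIs are all assumptions made for illustration; none of this comes from the DC application profile documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyRule:
    property_id: str                        # URI identifying the property
    mandatory: bool = False
    repeatable: bool = True
    controlled: bool = False
    vocabulary_uri: Optional[str] = None    # where the list of terms lives, if controlled
    allowed_values: frozenset = frozenset() # the subset of the vocabulary the AP accepts
    extensible: bool = False                # may an unlisted value be used?


def validate(description, rules):
    """Check a description (dict of property id -> list of values) against the rules."""
    problems = []
    for rule in rules:
        values = description.get(rule.property_id, [])
        if rule.mandatory and not values:
            problems.append(f"missing mandatory property {rule.property_id}")
        if not rule.repeatable and len(values) > 1:
            problems.append(f"property {rule.property_id} is not repeatable")
        if rule.controlled and rule.allowed_values and not rule.extensible:
            for value in values:
                if value not in rule.allowed_values:
                    problems.append(f"value {value!r} not allowed for {rule.property_id}")
    return problems


# Hypothetical property and vocabulary URIs.
TITLE = "http://example.org/ap/title"
LANGUAGE = "http://example.org/ap/language"

rules = [
    PropertyRule(TITLE, mandatory=True, repeatable=False),
    PropertyRule(LANGUAGE, controlled=True,
                 vocabulary_uri="http://example.org/vocab/languages",
                 allowed_values=frozenset({"eng", "fre"})),
]

good = {TITLE: ["Moby Dick"], LANGUAGE: ["eng"]}
bad = {LANGUAGE: ["xx"]}
```

Calling `validate(good, rules)` returns an empty list, while `validate(bad, rules)` reports the missing title and the unlisted language value. The cumulative/mutually exclusive and transcribed/supplied rules are left out of this sketch.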

I'm musing over whether we need to be able to define a "record," mainly to say what the minimum is that someone could expect to receive.

I'm also considering the need to define relationships between records -- like the FRBR work/work and work/part relationships. As I said in my post on linking, I see a difference between dependent and independent links, and these, in my mind, would be independent links, and may point beyond a particular database or system. I'll think more about this, and welcome comments.

Saturday, September 13, 2008

Thinking About Linking

In my previous post on affordances, I included inter- and intra-metadata links. I feel like there's a lot of confusion in this area (some of which I may myself have contributed to), so I'm going to do a bit of a disorganized brain dump here as an attempt to start a conversation in this area, and to see if maybe we (or I) can arrive at some clarity.

In the FRBR vision that RDA has embraced, there is something called the "relational/object-oriented model." I have some basic problems with this because I perceive relational and object-oriented designs to be quite distinct. This concept of relational/object-oriented gives me one of those "blank brain" moments -- when something sounds like it should make sense but I just can't make sense out of it. So I'm going to treat it as a set of relationships within a bibliographic record.

In the FRBR/RDA model there are entities: Work, Expression, Manifestation, Item (WEMI), and Person, Corporate body, Concept, Object, Event, Place. The interesting thing about these is that none of them is intended to stand alone. This is a very inter-dependent group of entities, not a set of separate records. This is hard for us to imagine because today's model is indeed of separate records for bibliographic data and authority data (covering names and subjects). However, our view is colored by the fact that the bibliographic record carries headings from the authority records, and therefore is complete in itself. Authority records, if you think about them, even those for names, are of the nature of a controlled vocabulary. The view of these vocabularies as contributing to the bibliographic description means that we have to have a way to express both the entities themselves and the links between them.

In addition, we have to decide what one defines as a record. If, to describe a work, one must also describe the creator, then it does seem that the Work entity and Person (or Corporate) entity must be part of the same record. Otherwise, the record cannot stand alone. So what does it mean to include the Person entity, and where does that entity reside? Or is an unresolved link to a (presumed) entity sufficient to complete the bibliographic record? In other words, if the bibliographic record has, as part of the work, a link to a Person entity that resides elsewhere, is that bibliographic record complete?

Note: I read back through FRBR and FRANAR regarding the Person entity. FRBR includes only the "name heading" in its Person entity, while the FRANAR Person entity has many more elements. This parallels today's difference between the personal name field and the name authority record.
There are other kinds of relationships that are between bibliographic entities. To my mind there are two types of relationships here: dependent and independent. The dependent relationships are between the WEMI entities, none of which is considered complete in itself. In fact, I consider the WEMI to be a single entity with dependent parts. (Admittedly, this is how current library cataloging views it, with a single flat record that contains information on all of these bibliographic levels which exist simultaneously in a single object.) To me, these are indivisible -- you can't have any one of them without the others.
[Note that I consider the WEMI to be a single entity in terms of library cataloging records. The levels of this entity do have meaning on their own. For example, a literary critic will often refer to the Work, perhaps to the Expression. A publisher or bookstore advertises the Manifestation. A library identifies and circulates the Item, and a rare book seller deals almost exclusively in Items.]

The independent relationships are those between different bibliographic entities:
  • Work-Work, two works that reflect or reference each other (cited, cites; works based on other works, like parodies or sequels)
  • Whole-Part, works in which one can be contained in the other (article and journal, chapter and book, volume and series)
  • Item-Item, reproductions of all types
To a large degree, these relationships can all be expressed as properties: isCreatorOf, isExpressionOf, isCitedBy. But I can't shake the feeling that there are at least two distinct kinds of relationships: those that fill in what otherwise would be gaps in a metadata record, and those that inform relationships between bibliographic items. I also wonder about links with and between complex entities. For example, imagine a bibliographic record that links to a member of a subject vocabulary that is stored in SKOS format. The SKOS record has numerous fields covering preferred and alternate headings, definitions, links to broader and narrower terms, and all of this in various languages. What if the property in the bibliographic record has the meaning "definition of term in French"? What does one link to? Or is the only possible link to the vocabulary member as a whole?
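
One way to picture the "two kinds of relationships" question is to write the links as simple subject/property/object triples and then sort them by kind. This is a rough sketch only: the identifiers are hypothetical, `isPartOf` is an invented property name for the whole-part case, and the assignment of each property to a group is an assumption, not a settled answer.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Assumed grouping: links that fill gaps in a record vs. links between resources.
COMPLETES_RECORD = {"isCreatorOf", "isExpressionOf"}
RELATES_RESOURCES = {"isCitedBy", "isPartOf"}


def partition(triples):
    """Split triples into record-completing links and resource-to-resource links."""
    completing = [t for t in triples if t.predicate in COMPLETES_RECORD]
    relating = [t for t in triples if t.predicate in RELATES_RESOURCES]
    return completing, relating


# Hypothetical identifiers, for illustration only.
triples = [
    Triple("person:melville", "isCreatorOf", "work:moby-dick"),
    Triple("expr:moby-dick-en", "isExpressionOf", "work:moby-dick"),
    Triple("work:moby-dick", "isCitedBy", "work:some-critical-study"),
]
completing, relating = partition(triples)
```

Both kinds of link have exactly the same shape as data. The difference lies only in how a system would treat them, which may be why the distinction is hard to pin down.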

So these are a few of the questions I have. Hopefully some of them can be cleared up quickly. I'm interested in hearing how others think about these issues. For those attending DC2008, if this interests you I'm game for some discussion.

Monday, September 08, 2008

Metadata Affordances

In my last post, I promised to spend some time thinking about metadata affordances -- that is, a view of metadata based on what you can do with it. My hope is that this will inform a metadata model that serves our needs (whoever "we" are, but admittedly this will tend toward the metadata needs of the library community). Here are the categories that I have come up with, all open to comment, discussion, correction, etc., so please comment freely.

None (opaque text)

Some metadata will necessarily be of this category, with no particular affordances inherent in the contents. At times plain text is used because that is the nature of the particular metadata element, like the recording of the first paragraph of a text, or transcribing a title from the piece. At other times plain text is used because the metadata community has chosen not to exercise control over the particular metadata element. An example of this is user-input tags. Although human intelligence may be applied to plain text fields, it requires knowledge that is not inherent in the metadata structure itself.

Structure and rules (typed strings)

Typed strings are things like formatted dates (YYYYMMDD) and currency formats ($9,999.99). There are other possible formatted strings, such as the common identifiers like ISBN and ISSN. The affordance of these strings is that you can exercise control over the input of them, forcing the consistency of the values. With consistent values you can perform accurate operations, like adding up a set of figures, sorting or searching by date, etc. Some controlled list values may also have structure: the standard format for personal names used by libraries includes structural rules ("family name followed by comma, then forenames") that facilitate the use of alphabetically ordered lists of names.
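
As a small illustration of that affordance, here is a sketch of checking two typed strings in Python: a YYYYMMDD date and a 13-digit ISBN. The function names are made up for this example, and the ISBN check uses the standard ISBN-13 check digit rule (alternating weights of 1 and 3, total divisible by 10).

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Return a date for a valid YYYYMMDD string, or None if it is not one."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:  # right shape, but not a real calendar date
        return None


def is_valid_isbn13(value):
    """Check the length and check digit of an ISBN-13, ignoring hyphens and spaces."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```

Once values pass a check like this they can be sorted, compared, or rejected at input time, none of which is possible with an unconstrained text field.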

List membership/vocabulary control

One way to assure consistency in metadata is to require that the metadata value be selected from a fixed list of values, rather than being open to free text. This tends to take the form of a list of like terms: languages of text, country names, colors, physical formats.

Although it provides consistency, list membership alone does not provide much in terms of capabilities for data processing. Other information is needed to provide affordances for list members:
  • access to display and indexing forms of the term
  • access to alternate forms, including other languages
  • access to definitions of terms

The information that is needed, therefore, for any list and its members is:
  • list identifier
  • member identifier
  • location of services relating to this list/member, and what services are available

If there are no automated services, then a system will need to provide its own, which is what we generally do today by creating a copy of the list within the system and serving display forms and other features from that internal list. In a web-enabled environment, however, one could imagine lists with web services interfaces that can be queried as needed.
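
The "copy of the list within the system" approach might look something like the following sketch. The class names, the example.org list URI, and the sample members are all assumptions for illustration; a web service version would answer the same questions over the network instead of from local memory.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    member_id: str
    display: dict            # language code -> display form
    definition: str = ""


@dataclass
class ControlledList:
    list_id: str                                  # identifier for the list itself
    members: dict = field(default_factory=dict)   # member identifier -> Member

    def add(self, member):
        self.members[member.member_id] = member

    def is_member(self, member_id):
        return member_id in self.members

    def display_form(self, member_id, lang="en"):
        """Return the display form in the requested language, or None if unavailable."""
        member = self.members.get(member_id)
        if member is None:
            return None
        return member.display.get(lang)


# Hypothetical list identifier and members.
languages = ControlledList("http://example.org/lists/languages")
languages.add(Member("eng", {"en": "English", "fr": "anglais"}))
languages.add(Member("fre", {"en": "French", "fr": "français"}))
```

Here `languages.display_form("fre", "fr")` gives the French display form, and an unknown member identifier simply returns `None`.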


Inter- and intra-metadata links

There is a need to create functional links within metadata segments to other metadata segments or records. For example, the use of name and subject authority records implies a link between those records and the bibliographic metadata records that contain the names and subjects as values. There are also links needed between bibliographic records themselves. These latter represent a number of different relationships, which have been articulated in the FRBR documentation. Some examples are: work-work relationships, work-expression relationships, and part-whole relationships (chapters within books, articles within journals).

There may be other kinds of links that are needed as well, but I think that the main need is to distinguish between identifiers and links. Some identifiers, like ISBNs, can be used to retrieve metadata in a variety of situations, but those should be seen as searches, not links. Searching is appropriate in some circumstances, but the ability to create stable links is a separate affordance and should be treated as such.

Note: These categories of affordances are not mutually exclusive. Some metadata values will provide more than one type of affordance. Each should be clearly and separately articulated, however, and we should think about the advantages and disadvantages of having metadata values serve multiple functions.

Friday, September 05, 2008

Literals and non-literals, take 2

Jason Thomale responded to my previous post with his insights into literals and non-literals, and I have to say that this really hit me up-side the head, in the best of ways. Here are some paragraphs from Jason's comment (which is worth reading in full):

A literal is a value that references nothing other than itself. You could consider it the "end of the line" when you're thinking about linked data. It's data that isn't linked to anything else. For example, the property "FirstName" would probably have as its value a literal. Consider "FirstName=Karen"--Karen isn't referencing the person, it's a literal string (or "value string") that tells what the FirstName is. The FirstName property, in turn, would probably be part of a description set that describes the resource--the person--that could be identified by the string "Karen Coyle."

A non-literal, on the other hand, is a value that serves as a reference to something else. Hence "non-literal"--it's not a data value to be taken literally. It's a pointer--a link--that refers to something else. Properties whose value would logically be another resource should contain non-literals. "Author," for example. Even when we say, "The Author of this blog posting is Karen Coyle," we're not referring to the literal string "Karen Coyle." That string didn't write the posting. We're using that string as a name that references the actual person. The person authored the blog posting. "Karen Coyle" is just a convenient reference--or non-literal--that points to the person. So--first of all, that's the difference between a literal and a non-literal.

...

Since I'm an RDBMS guy at heart, it's easy to think of it in those terms. A non-literal would be like a foreign key. The value itself may or may not mean anything--it just references a record in another table. A literal would be a cell that isn't a foreign or primary key. It's the actual data.

Now--this certainly isn't unambiguous. Going back to the FirstName example, one might use a non-literal for this property if you're actually thinking about first names as entities/resources in your data model. Maybe you have a separate description of each name, complete with history, related names, etc. In this case you could use a URI to identify each name, or each resource that describes any given name, or you could keep using the value string "Karen"--but in the latter case you would also need a URI associated with it that identifies how to interpret that value. Otherwise it's just a literal. So--in this case, you have the same value string ("Karen") that we could use for the same property as a literal or as a non-literal. From my understanding, what matters is whether or not it you're using it as an identifier to refer to something else and whether or not you include a URI that describes the identifier--not whether or not it's "structure data."

What Jason does here is to look beyond the way that DCAM defines the structures of literals and non-literals and instead to focus on what UI folks would call the "affordances." In other words, what do these types of values do for me in a linked environment? Although I've heard DC folks talk about this aspect of the DCAM, it is not brought out in the DCAM document itself.

Where I think that my concept of this differs from that which circulates in the DC world is that I'm not at all interested in refining philosophical points about the fine lines between literal and non-literal. This comes up in a second comment of Jason's that I reply to. I believe that Jason's analysis is in agreement with the DCAM definitions, which, however, don't work for me:
Jason: "If I said, "This book's author is Karen Coyle," then the real value of "author" is *the person,* and "Karen Coyle" is being used like a non-literal value to identify *the person.*"

Karen: I believe that you can indeed say: "this book's author is [literal value 'Karen Coyle']." Simple metadata does that all of the time. I think that the distinction is *not* in the string or even in the fact that you put it in classic RDF-triple terms, but in the intended use. So in a MARC record following AACR2, an author name in the 100 field is a non-literal because it represents a heading in the authority file. In a [metadata] record that is not using any particular cataloging rules (or where you as a recipient have no idea what the rules are), the value in the [author or creator] field, even if it is identical to the entry in the AACR2 record, is a literal because you can make no inference about what it might represent outside of the metadata record.
The difference that I see here is between a theoretical non-literal ("author of this book is Karen Coyle") and a value that one can actually act on ("author of this book is person identified in library land by LC Control Number: n 89613425"). I realize that this means that the context of the data has an effect on whether one would call the data literal or non-literal, but in fact, I'm less concerned with what you would call it than what I can do with it at any given moment in time. It's this knowing what I can do with a value that to me is of prime importance, and finding a way to convey to people and machines what they can do with a value is my main goal. (I don't know if Jason would disagree with this, but he knows how to comment, so I'll let him speak for himself.)

I am now arriving at the conclusion that if we focus on real affordances for linking, rather than structure, then we can have a very useful discussion of types of metadata affordances that serve our purposes. These may or may not exactly parallel the DCAM structure, but I don't think that adoption of the DCAM is our task -- I think our task is to create a useful model for the next generation of library data. What DCAM provides us with is an existing model that we can poke at, dissect, try to work with, and throw our own ideas at. Then, once we have defined our affordances we can figure out a way to structure our data profiles so they reveal those affordances to human and machine users.

Tuesday, September 02, 2008

Semantic Dementia

"Semantic dementia" is a term for something many of us of advanced age experience: forgetting words we once knew. It brings to my mind, however, the kind of demented semantics that we often encounter in standards in our field, and the use of or creation of words that obscure the meaning of the standard.

I understand the need that standards have to be very precise in their terminology; to give terms specific meaning. There often is a conflict, however, between that desire for precision and the need to communicate well with the users of the standard. An example of this is the OpenURL* standard, which pioneered the "Context object" and its ever-obscure children like the Referent and the Referring Entity. Quick: give me a definition for Referent.... right, it's not exactly on the tip of anyone's tongue.

I'm going to say that there are two kinds of people in the world: those who think that using a standard should require many hours of study leading to a complete understanding and absorption of the concepts and terminology, so that there cannot be any possible mis-use of the standard; and those who think that a standard should be fairly understandable in a single reading, and usable shortly thereafter. Members of the former group seem to feel that the ideas in their standard are so clever, so unique, that they cannot be comprehended easily. Members of the latter group (to which I obviously belong) assume that standards recombine previous concepts into new structures, and, deep down, are generally simple ideas that one could express simply.

Jeff Young, who clearly has an element of Type 2 in him, managed to unbundle the studied obscurity of the OpenURL with the opening post to his blog, Q6. He replaced the OpenURL terms with Who, What, Where, Why, When, How. I believe that for many people, the light bulb suddenly lit up upon reading his explanation.

A similar simplification is needed for the Dublin Core Abstract Model, and I'm going to attempt that even though I think it's a dangerous thing to do. DCAM defines a set of metadata types that can help us communicate to each other about our metadata. It should simplify crosswalking of metadata sets, and make standards more understandable across communities. Unfortunately, it has not done so, at least in part because of some rather demented semantics.

DCAM, Simplified

First, you need to understand that the DCAM is about metadata structure, not meaning, or at least not meaning in the human sense of the term. It describes a generalized underlying format for machine-readable metadata. In the most simple terms it provides the information that a program would need to determine what operations it can perform on the metadata that it receives. In this sense, it is a general metadata analogy to the OpenURL's Context Object: a formalized message about something.

The basis of the DCAM is key/value pairs, each of which is called a statement, which is the terminology from RDF. Any group of statements describes a single resource. A resource can be just about anything; it is whatever your metadata ultimately describes. Examples are: a book; a person; an event. The set of key/value pairs that describes a resource is called a description. If you will describe more than one resource in your metadata record, then you will have multiple descriptions. These make up the description set, which is the sum of descriptions that you have defined for your purpose. These descriptions can be packaged into a record. It all looks something like this:
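
As a minimal sketch, the nesting of record, description set, description, and statement can be written out in Python. Only those four terms come from the DCAM; the class layout, the field names, and the sample keys and values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    key: str      # the property, e.g. "title"
    value: str


@dataclass
class Description:
    resource: str                                   # what is being described
    statements: List[Statement] = field(default_factory=list)


@dataclass
class DescriptionSet:
    descriptions: List[Description] = field(default_factory=list)


@dataclass
class Record:
    description_set: DescriptionSet


# Two descriptions (a book and a person) packaged into one record.
book = Description("a book", [Statement("title", "Moby Dick")])
person = Description("a person", [Statement("name", "Herman Melville")])
record = Record(DescriptionSet([book, person]))
```
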
The statement level is where we get to the real meat of the DCAM, and the part that I think holds great potential. The actual DCAM diagram is very large and filled with terminology that makes it difficult to grasp the meaning of the concepts. I'm going to simplify that meaning here, with the understanding that there is more to DCAM than this simple explanation. Consider this Step 1, not the whole enchilada.

Essentially you have key/value pairs. A key/value pair can look something like:
title = Moby Dick
where "title" is the key and "Moby Dick" is the value.

The first rule in the DCAM is that the key must be identified with a URI, a Uniform Resource Identifier. Here's a URI that you might use for this key/value pair:
http://purl.org/dc/elements/1.1/title

There is nothing new in this; using URIs is a very common convention. It's in the definition of the value that DCAM adds something. Values can be "literals" or not. DCAM makes use of the RDF definition of a literal:
"Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals."

Literals can be plain or typed, as defined in the RDF documentation. An example of a typed literal is a date or currency. A typed literal gives you some control over the format of the string, such as "YYYYMMDD."

This is contrasted with the non-literal values. The non-literal values are not defined in the RDF documentation, except to imply that they are everything that is not a literal. The DCAM goes further and defines non-literals as being of two types: the non-literal value is either a URI that names the value or it is a value from a named, controlled vocabulary. So you can have:
http://purl.org/dc/dcmitype/Event

which is a URI for a controlled value, in this case the DCMI type value, Event. Or you can have:
[URI for ISO 639-1, language codes] + "en" for English
This latter is similar to what we often do in MARC records, which is to record a code as a string and indicate what controlled list that string is from.
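
The kinds of values described so far can be sketched as a single small structure. This is a simplification under stated assumptions: the class and field names are invented here, the code list URI is a hypothetical placeholder, and the real DCAM has more parts than this. The title and DCMI type URIs are the ones given above, and the XML Schema date URI is a commonly used RDF datatype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    string: Optional[str] = None          # the value string, if there is one
    datatype_uri: Optional[str] = None    # set for a typed literal
    value_uri: Optional[str] = None       # set when a URI names the value
    vocabulary_uri: Optional[str] = None  # set when the string comes from a named list

    def kind(self):
        if self.value_uri or self.vocabulary_uri:
            return "non-literal"
        if self.datatype_uri:
            return "typed literal"
        return "plain literal"


TITLE = "http://purl.org/dc/elements/1.1/title"
CODE_LIST_URI = "http://example.org/code-list"  # hypothetical URI for a controlled list

plain = Value(string="Moby Dick")
typed = Value(string="2008-09-02",
              datatype_uri="http://www.w3.org/2001/XMLSchema#date")
named = Value(value_uri="http://purl.org/dc/dcmitype/Event")
coded = Value(string="en", vocabulary_uri=CODE_LIST_URI)

statement = (TITLE, plain)  # a key/value pair whose key is a URI
```

In this sketch the last two values both count as non-literals: one is named by its own URI, the other is a code string paired with the URI of the list it comes from.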

Obviously, if this were all there was to DCAM, we'd all be all over it by now. What happens next is that we start trying to apply this in a metadata world that is not as neat as we would like. For example, what do we do with an ISBN -- is it a structured value? yes. Is it a member of a controlled vocabulary? sometimes, yes, because there is a finite list of ISBNs and each one of them is known (at least to Bowker and other ISBN agencies). So, is it a typed literal, or a nonliteral?

In the end, however, perhaps it doesn't matter that this definition of "nonliteral" leaves us with some ambiguity. Perhaps what really matters is that we distinguish between these three kinds of values:
  1. Plain strings. These will be necessary for values that simply cannot be controlled or are not being controlled.
  2. Structured strings. In these, the range of values can be great, and is not being housed in a finite list, but because of their structure they can often be acted on for functions like quality control, transformations, etc.
  3. Identified values. An identified value is the essence of the semantic web. It allows algorithms to make connections between values even though those algorithms are not machine intelligences, are not understanding the human meaning behind the value.
Our mission, as information professionals, if we choose to accept it, is to move our data, where possible, from point 1 to points 2 or 3 so that there are more possibilities to act on the data in the larger web environment.
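
Moving a value from point 1 toward points 2 or 3 could be sketched like this. The lookup table and its example.org URIs are hypothetical, and a real system would draw on a maintained vocabulary rather than a hard-coded dictionary.

```python
import re

# Hypothetical mapping from free-text language names to identifiers.
LANGUAGE_URIS = {
    "english": "http://example.org/lang/en",
    "french": "http://example.org/lang/fr",
}


def upgrade(value):
    """Return (kind, value), promoting a plain string when something better is known."""
    text = value.strip()
    uri = LANGUAGE_URIS.get(text.lower())
    if uri is not None:
        return ("identified", uri)      # point 3: a value with an identifier
    if re.fullmatch(r"[0-9]{8}", text):
        return ("structured", text)     # point 2: has the shape of YYYYMMDD
    return ("plain", text)              # point 1: nothing more can be said
```

For example, `upgrade(" English ")` yields an identified value, `upgrade("20080902")` a structured one, and `upgrade("a whale of a tale")` stays a plain string.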

I welcome discussion, criticism, additions... whatever comes to your mind. Really.

* NOTE: I'd give you a link to the OpenURL standard, but NISO has gone with one of those content management systems that produce unusably long links. So you'll have to go to the NISO site and look for it under Standards.