Is RDF any good without a web of linked data?

Stefano Mazzocchi used to work at our SIMILE project here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge. He has since gone to work at Metaweb and, it seems, become much more friendly to their “top down” approach of trying to create a centralized repository of structured data with consistent identifiers, as opposed to letting that data grow all over the place any which way and get linked together afterwards. In particular, he argues for the critical importance of relational density in the data. His point is that when there are many distinct, unlinked identifiers for the same object, what one person says about one of those identifiers (“Chicago”) won’t be visible to someone looking at a different identifier (“the Windy City”). He opines that “without it [relational density] there would be very little value in it compared to what traditional search engines are already doing”.

Being argumentative by nature, I wanted to highlight some of the benefits of the looser, sloppier approach to data sharing that we took for SIMILE.   Obviously, being able to link data from multiple sources, and feed it into a search engine as Stefano describes, is a great thing.  But there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s.

The first is interaction.  As shown with our Exhibit framework (created by David Huynh, now also at Metaweb), structured data enables rich visualization.  If my data objects have coordinates, I can plot them on a map.  If they have dates, I can put them on a timeline.  If they have colors, I can filter or sort by color.  It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.
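To make that concrete, here is a minimal Python sketch. It is not Exhibit's actual API; the property names, the `roles` table, and the sample values are all hypothetical and only illustrate the idea. A view needs to know which of my private property names plays which role, and nothing more.

```python
# A data set in its author's own private vocabulary (illustrative values only).
items = [
    {"label": "the Windy City", "northSouth": 41.88, "eastWest": -87.63,
     "sinceTheCreation": 1837},
    {"label": "Beantown", "northSouth": 42.36, "eastWest": -71.06,
     "sinceTheCreation": 1630},
]

# The only agreement required: which local property plays which role.
# No shared, global vocabulary is involved.
roles = {"lat": "northSouth", "lng": "eastWest", "date": "sinceTheCreation"}


def map_points(items, roles):
    """Return (label, lat, lng) tuples, ready to hand to a map view."""
    return [(it["label"], it[roles["lat"]], it[roles["lng"]]) for it in items]


def timeline(items, roles):
    """Return labels ordered by whichever property plays the 'date' role."""
    ordered = sorted(items, key=lambda it: it[roles["date"]])
    return [it["label"] for it in ordered]
```

Renaming `northSouth` to `latitude` would change one entry in `roles` and nothing else, which is the sense in which only internal consistency matters here.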

The second benefit is portability. If I publish some interesting data as part of an HTML document, then anyone who wants to use that data for something else (to rebut my argument, to mash it up with some other data, to put it to some use I never thought of) has the unpleasant job of scraping said data out of the HTML into a usable form. This generally requires a programmer, and even for them it’s a tedious task that distracts them, and may deter them, from what they really want to do with the data. But if that data is published as data, even in something as old-fashioned as a spreadsheet, it becomes way easier to grab it and reuse it. Look at how much of the blogosphere is made up of cross-references, trackbacks, and responses to other blog postings. If you’re going to argue about something involving data (for example, whether a single payer system is going to end up saving or costing money, or whether today’s perfect game is all that unusual), you probably want to publish that data to support your argument. At which point, someone who wants to refute your argument is going to want to use that same data. That’s going to be a lot easier if they can get that data from your posting. That’s the theory behind our Datapress project, which aims to let you post data sets (and visualizations of them) in your WordPress blog, and lets other people refer to and reuse that data. In that sort of one-on-one debate over data, it really doesn’t matter whether I use the same identifiers as Freebase: you can take my identifiers and use them to build your rebuttal.

Uniformity does start to matter when someone wants to mash up data from multiple sources. If those sources haven’t agreed on identifiers beforehand, then the masher has some work ahead; this is a case where a centralized vocabulary is really helpful. But again, getting the data at all is such a big jump over the current state of affairs. Imagine how grateful mashup makers would be if all they had to do was merge some identifiers instead of retyping a whole spreadsheet from scratch. The point here is that unlike Stefano’s hypothetical search engine, which wants to issue a query against all the world’s data at once, your typical mashup author just needs to deal with a couple of (probably small) data sets. His or her particular data integration problem is quite manageable a posteriori.
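As a rough illustration of how small that a posteriori job can be, here is a Python sketch. The two data sets, the `same_as` table, and the `mash_up` function are hypothetical names made up for this example; they do not come from any SIMILE or Freebase tool.

```python
# Two small, independently published data sets with unlinked identifiers.
mine = {"Chicago": {"state": "Illinois"}}
yours = {"the Windy City": {"baseball_team": "Cubs"}}

# The masher's a posteriori work: a hand-written table saying which of
# "your" identifiers refer to the same thing as one of "mine".
same_as = {"the Windy City": "Chicago"}


def mash_up(primary, secondary, same_as):
    """Merge two data sets, renaming secondary identifiers via same_as."""
    # Copy the primary data so the inputs are left untouched.
    merged = {key: dict(props) for key, props in primary.items()}
    for key, props in secondary.items():
        canonical = same_as.get(key, key)  # unmapped identifiers stay as-is
        merged.setdefault(canonical, {}).update(props)
    return merged
```

With only a couple of small data sets, the `same_as` table is a few lines that a person can write by hand, which is a far cry from reconciling all the world's identifiers up front.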

I’ll also dust off an argument David Huynh once made to me, even if it might get him in trouble with his current employer. Unification is not absolute, but contextual. Whether two things are the same may change depending on what you are doing with them. Continuing my never-before-attempted forays into sports analogies, are the Brooklyn Dodgers the same as the L.A. Dodgers? If you want to talk about the team that moved from Brooklyn to LA, the answer must be yes! But in a different context you might be interested in comparing the lifetime records of these two distinct teams. (In fact, Freebase tries to have it both ways: it asserts that the Brooklyn Dodgers were “later known as” the Los Angeles Dodgers (implying they are the same team with a name change) but also asserts that the Los Angeles Dodgers were founded in 1958, which clearly isn’t true of the Brooklyn Dodgers that folded in ’57.)
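One way to picture contextual unification is to keep a separate equivalence table per context instead of a single global one. The Python sketch below is only an illustration under that assumption; the context names, the records, and `group_by_identity` are invented for this example.

```python
from collections import defaultdict

# Hypothetical records keyed by whatever name the source happened to use.
records = [
    {"team": "Brooklyn Dodgers", "season": 1957},
    {"team": "Los Angeles Dodgers", "season": 1958},
]

# One equivalence table per context, rather than one global answer.
contexts = {
    # Telling the story of the franchise: the two names are one team.
    "franchise_history": {
        "Brooklyn Dodgers": "Dodgers",
        "Los Angeles Dodgers": "Dodgers",
    },
    # Comparing the two eras: keep them as distinct teams.
    "compare_eras": {},
}


def group_by_identity(records, same_as):
    """Group seasons by team identity as defined by the given context."""
    groups = defaultdict(list)
    for rec in records:
        identity = same_as.get(rec["team"], rec["team"])
        groups[identity].append(rec["season"])
    return dict(groups)
```

The same records yield one group under `contexts["franchise_history"]` and two under `contexts["compare_eras"]`, so "are they the same?" is answered by the question being asked rather than by the data.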

This is obviously one of those half-empty half-full debates:  We both recognize the value of both approaches, but are compelled by different aspects.  Stefano looks at the amazing things that could be done with a single consistent data universe, and worries about how to create it.  I look at the amazing things that can already be done with a host of disjoint but internally-consistent data microverses, and find that compelling enough to allay any worry about whether we’ll ever need http://www.freebase.com/view/en/lion to unify with http://en.wikipedia.org/wiki/Lamb .

7 Responses to “Is RDF any good without a web of linked data?”

  • Peter Keane says:

    Terrific post. I couldn’t agree more, esp. with that last sentence. I’ve been involved for the last 5 years in a project at UT Austin (daseproject.org) which began as a digital image repository, but has grown into a framework for building media-rich web sites & presentations. We had to incorporate a mix of legacy collections (old Filemaker databases, spreadsheets, physical card catalogs & slides) and new collections contributed by faculty and others. By *not* enforcing a top-down metadata strategy, we’ve managed to have all of these sets play well together: individual collections are consistent, and compelling websites can be built (elucy.org plus many more to come). More importantly, the entire system itself has a level of internal consistency due to an abstract data model for all collections (basically just key-value pairs, with each collection determining its own “keys”). Adding new collections is quite straightforward, and mapping out to a global metadata scheme can always (and often does) happen after the fact. We do not actually use RDF, but rely on Atom formats (the entire system is built around Atom/AtomPub), which provide a nicely consistent (tree-based!) data format to build on. We’ve begun using SIMILE code quite often and have found the transform from Atom to SIMILE-style Javascript/RDF to be quite straightforward.

  • David Karger says:

    Thanks for your comment. Regarding Dase, I would encourage you to use, or at least export, RDF. Although my post argues against pushing _semantic_ standards, I think a standard _syntax_ is incredibly valuable. While non-programmer humans are great at saying “this data property is the same as that year property”, they have a much harder time saying “to convert this data into a syntax my software can read, run the following XSLT transform”. Agreeing on one syntax for those “properties and values” you mention in your comment makes it way easier for someone to take data from one source and move it to another.

  • Peter Keane says:

    Thanks! I have, actually, used RDF and I find it useful in many cases. As I’ve argued before (http://blogs.law.harvard.edu/pkeane/2008/06/26/oai-ore-atom/), I think Atom is actually better for many cases (and certainly ours). We’ve had great success getting uptake from other units on campus, creating interoperability based NOT on any custom data formats, but by simply using Atom/AtomPub as format and protocol. I acknowledge there is wide disagreement between the RDF and non-RDF worlds. Since the web is based on REST principles, I would say that RDF’s lack of a useful/usable containment model presents a very real problem (cf. http://dret.typepad.com/dretblog/2009/05/rest-and-rdf-granularity.html). There is simply no foundation at this point on campus (tools or people) ready to push RDF around. Tools for Atom/AtomPub are in wide use. There is always a possibility of using Atom as an envelope for RDF, but I argued strongly against that in discussions around OAI’s Object Reuse and Exchange protocol for their Atom serialization. I think the OAI-ORE folks wound up with a fine solution w/ a pure Atom serialization that hits the 80/20 mark quite nicely and a separate RDF serialization as well.

    By the way, I should note, arbitrary key-value pairs are implemented as atom:category (term=key, contained text = value). When a key has a URI (e.g., authority data) we use atom:link.

  • Peter Keane says:

    P.S. Forgot to mention: while we don’t currently use RDF, I look forward to a time when that will be useful & important. I’m keeping an eye on RDFa in HTML5 (fingers crossed) and for now, using GRDDL to extract RDF from our Atom is fairly trivial.

  • [...] David Karger flatters me by writing his first (citing him) ‘blog rebuttal‘ and asking whether or not RDF is any good without a web of linked data: [..] there are some [...]

  • [...] priori” approach (trying to keep the identifiers merged at the beginning).  I posted a response arguing the value of a more anarchic “a posteriori” approach where you let anyone [...]

  • [...] This disagreement goes right back to our first blog debate where I argued for the value of small structured data fragments and Stefano for the necessity of linking them up to get any [...]