In Defense of a Semantic Web Wild West

A month ago Stefano Mazzocchi published an interesting article on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized “a priori” approach (trying to keep the identifiers merged from the beginning).  I posted a response arguing the value of a more anarchic “a posteriori” approach where you let anyone create whatever identifiers and relations they want, and worry about detecting linkages later.  Stefano responded to that, but by then I was busy chairing the submissions for the 2009 International Semantic Web Conference.  Now that that’s over (I hope you will attend what should be an interesting meeting—October 25-29 near Washington DC), I’d like to pick up the discussion again.

I argued in favor of letting individuals make their own RDF collections (using, for example, our Exhibit framework) and worry about merging them with other people’s data later.  Stefano’s response accused me of using “RDF” and “structured data” interchangeably, asserting that Exhibit is really just a nice UI over spreadsheet (tabular) data—that although it can export RDF, it is “not properly using RDF” because it has “lost the notion of globally unique identifiers (and in that regard, is much more similar to Excel than to Tabulator)”.  Tim Berners-Lee has made similar complaints to me about Exhibit not using RDF.

This argument highlights for me yet another important ambiguity about what RDF is.  I occasionally have to help people understand that RDF is a model, not a syntax.  That some data can be RDF even if it isn’t serialized to RDF/XML.  That the key is to have items named by URIs, connected by relations named by URIs.  Stefano’s argument suggests a different blurring: between the model and its intended use.  Stefano’s “not properly using” phrase implies that if you don’t intend to merge your data into the global namespace, then even if you implement the model and write it down as RDF/XML to boot, you won’t be “properly using RDF”.

I want to address both these claims: that Exhibit is just a UI over spreadsheets, and that using RDF this way isn’t proper.

RDF and spreadsheets

Regarding the spreadsheet claim, I’ll begin by admitting that Stefano is absolutely right:  Exhibit is a visualization tool for tabular (spreadsheet) data.  But notice that all RDF is spreadsheet data—I can take all the RDF in the world and throw it into one spreadsheet.  In fact, I only need three columns to contain the subject (tail), object (head), and predicate (link) for each RDF statement.  Admittedly none of today’s spreadsheets would have enough rows, but that’s an engineering detail.  So, the spreadsheet model isn’t the problem.   And we also agree that Exhibit’s interface is nothing like spreadsheets’, and far better for the collection visualization tasks it is designed for.
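
To make the three-column claim concrete, here is a minimal sketch in Python (the URIs and values are invented for illustration, not drawn from any real dataset) that flattens a few RDF statements into exactly such a table:

    # A minimal sketch: any collection of RDF statements can be flattened into a
    # three-column "spreadsheet" of subject, predicate, and object.  The URIs and
    # values below are invented for illustration.
    import csv

    triples = [
        ("http://example.org/video/42", "http://example.org/vocab#choreographer", "Israel Yakovee"),
        ("http://example.org/video/42", "http://example.org/vocab#danceType", "circle dance"),
        ("http://example.org/video/42", "http://example.org/vocab#yearChoreographed", "1985"),
    ]

    with open("triples.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["subject", "predicate", "object"])  # three columns suffice
        writer.writerows(triples)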

I think instead that what Stefano is objecting to is a usage characteristic of spreadsheets versus RDF.  When I open a spreadsheet, the data it shows me is right there, in a file on my own system.  Global identifiers don’t matter because the data is all there (and presumably self-consistent) in the one spreadsheet.   In contrast, in Stefano’s image of RDF (and in Tim’s, as one can see from the Tabulator project) the data about a particular entity is spread all over the web, and it is the globally unique identifier that lets you go out, gather all that data together, and know that it is all about the same entity.

This is certainly an appealing vision.  But I want to argue that a focus on globally unique identifiers neglects two benefits of RDF that I consider equally important: data portability and schema flexibility.

Spreadsheets suffice

To illustrate this argument, I’ll hark back to a previous post where I discussed a data integration problem that should have been easy but wasn’t.  I keep an Exhibit of folk dance videos on the web.  Recently, Nissim Ben Ami posted a collection of 511 new dance videos on YouTube.  I wanted to incorporate it into my site.  But it quickly became apparent that said incorporation would basically require my entering all 511 video descriptions manually into my system, and I still haven’t gotten around to it.

The major barriers were twofold.  The first was syntactic: the structured descriptions of the videos were delivered as XML.  That meant that in order to get at the data, I was going to have to learn XSLT—something I’ve been putting off for years.  The second was semantic: YouTube has the wrong schema for my folk dance videos.  I care about choreographer, dance type, and year choreographed; YouTube only offers slots for submitter and submission date of the video.  So, as you can see from this example, the contributor takes the usual approach: he takes his nice structured data and shoves it into the generic comment (info) field as free text.  All that structure is instantly lost.

Suppose instead that spreadsheets (or, in a pinch, RDF) were the accepted framework for publishing information on the web.  The YouTube “spreadsheet” would contain submitter and submission date information, but Nissim could just add “artist” and “composition-date” columns to hold the data he wanted to enter.   I would then be in a great position to download his data and incorporate it into my own catalog (spreadsheet).  What would I have to do?  After opening his spreadsheet and mine, I’d have to match columns—perhaps he called his “artist” and “composition date” while mine are “choreographer” and “year”.  But a simple copy and paste fixes that discrepancy.  Merging entities is not much harder than merging properties: a simple global replace will convert his choreographer “Israel Ya’akovi” to my “Israel Yakovee”.  The local consistency of his data and mine means that I only have to work once per choreographer (and in most cases I won’t have to: there’s a standard spelling for almost every choreographer’s name, which serves as a unique identifier in this context even if it isn’t a URL).
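
For concreteness, here is a minimal sketch of that pairwise merge, assuming both catalogs can be exported as CSV; the file names, column map, and spelling map below are hypothetical stand-ins, not Nissim’s or my actual data:

    # A sketch of pairwise merging under the assumptions above: rename his columns
    # to match my schema, then normalize the few choreographer spellings that differ.
    import csv

    COLUMN_MAP = {"artist": "choreographer", "composition-date": "year"}  # his columns -> mine
    NAME_MAP = {"Israel Ya'akovi": "Israel Yakovee"}                      # his spellings -> mine

    with open("nissim.csv", newline="") as src, open("merged.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=[COLUMN_MAP.get(c, c) for c in reader.fieldnames])
        writer.writeheader()
        for row in reader:
            renamed = {COLUMN_MAP.get(k, k): v for k, v in row.items()}
            if renamed.get("choreographer") in NAME_MAP:
                renamed["choreographer"] = NAME_MAP[renamed["choreographer"]]
            writer.writerow(renamed)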

Overall, my work has been reduced by an order of magnitude.  Instead of laboriously entering 511 new records, I just download a spreadsheet and match up a handful of properties (columns) and a few tens of choreographer names.

Stepping back, observe that I’ve relied on two things.   First, on data portability—my being able to download the data in a convenient form: not XML, which is a programmer’s friend but an end-user’s enemy; rather, something I can just look at and understand.  Second, on schema flexibility—on Nissim’s being able to add whatever columns/properties he decides are important, instead of being limited to those used on the hosting web application.

I’m also relying on some features of this particular scenario, but I believe they often hold.  I am relying on Nissim’s data having only a small number of properties, so that I can map them manually to mine.  I also rely on there being a small number of choreographers, and hope to take advantage of most of them having matching names in his data and mine—these names certainly aren’t globally unique identifiers, but they are “unique enough” when considering just my data and his.  Critically, I am not thinking of pulling all the data about a given dance from a multitude of different web sites—that would demand globally unique identifiers to link the data, since I would never have the patience.  Rather, I am considering a pairwise data acquisition: taking the data I want from one internally consistent site.

Such pairwise acquisition is commonplace: any time a scientist wants to pull a data set from some other scientist’s lab, or a consumer wants to download product information about several cameras from a review site, or a student wants to include a Wikipedia data set in a report they are writing, there is an obvious single source and target for a data merger.  And there’s a human being who has the incentive, and with the right tools the capability, to do the limited amount of work needed to accomplish that merger.

This is a simple low-hanging fruit argument.  It would be wonderful to be able to automatically merge data from thousands of different sources into a coherent whole.  And this is a problem Freebase will need to solve, if they want to become the hub for aggregation of structured data.  But right now we can’t even manually merge data from two sites without doing a ridiculous amount of grunt work—so perhaps we should give some attention to that easier problem on our way to solving the hard one.

Don’t skip the wild west

I’d like to see these efforts proceed in parallel, but I’m worried about enthusiasm for the more ambitious goal blocking movement toward the low-hanging fruit.  I recently submitted a proposal to NIH on the topic of data integration that reflected my perspective above.  I argued that the current efforts in the biology community to force everyone to adopt a common ontology (and sometimes repository) for their experimental data are being resisted by biologists who think they know best how to present their data.  I suggested as an alternative that we give biologists tools, such as Exhibit, that would encourage them to publish their data in a common structured syntax, and worry about integrating all that data after it has become available in structured form.  The proposal rejection was accompanied by a review that said, on the one hand, “The benefit of the proposed approach is that it is very different from some multi-institutional data sharing projects (like caBIG), which have used a very rigid, top-down approach to creating semantics. Even if this project is unsuccessful it could bring to light new ideas and strategies that might make those large-scale projects more responsive to investigators and more successful.”  At the same time, it argued for rejection because “The absence of any control over the information models and ontologies – truly a semantic wild west – is daring and may ultimately be the downfall of this project.”

I’m fascinated to see, in the same review, a recognition of the problems that the current centralized approach is bringing (lack of buy-in to common ontologies by individual scientists who think they know better and probably do), and an unwillingness to tolerate the contrary (anarchic) solution.  I also love the metaphor of the “semantic wild west” because I think it supports my argument.  Would anyone have suggested establishing a city of several million people just after the west was opened for settlement?  The west’s early wildness was an unavoidable phase of its evolution towards the thickly settled and uniformly governed area it is now.    In the same vein, I think that our semantic web is best grown by encouraging individual semantic-web settlers to create their own data homesteads and begin looking for the trails that connect them to neighboring collections.  We need to get the data into plain view first.   Later we can send in the data sheriffs and place all those data sets under uniform governance.

9 Responses to “In Defense of a Semantic Web Wild West”

  • Eric Jain says:

    Spreadsheets are no doubt an improvement over blobs of unstructured text, and no one should punt on offering that because a more sophisticated solution is too much effort. But spreadsheets have a lot of limitations — even for self-contained data sets (e.g. expressing relationships between items or 1:n).

    The “a-priori reconciliation” Stefano mentions may be easier to implement. But even if for most practical purposes Person A’s Paris is equivalent to Person B’s Paris, “most” is not “all” — especially in a scientific context. So what grant agencies should be encouraging is not the indiscriminate use of “standard” ontologies (with people using the same terms in different ways), but that people 1. create structured data with stable identifiers, and 2. map this data to other ontologies in a descriptive manner.

  • [...] different blogs posts that appeared recently on the Haystack Blog: one by Prof. David Karger that continues our blog debate on the dynamics around RDF and one by one of his grad students, Edward Benson, about the differences between Microdata and [...]

  • Peter Keane says:

    In a comment on an earlier post, I mentioned the DASe project at UT Austin, which uses Atom/AtomPub as a standard data format; you responded that we should think about using RDF, and I replied (essentially) that we had no use for RDF just now. But, as this current post correctly points out, RDF is a model, not a syntax. Because Atom is used for its useful, pragmatic properties (update, title, id, links, etc.), we essentially use it as simple structured data, with atom:category providing the key->value pair (i.e., triples) mechanism. Every “collection” is internally consistent, with its own set of attributes (which a collection owner can create at will). Some collection attributes *do* have global semantics, though most do not. We *are* in essence using the RDF model, and it’d be trivial to convert to its syntax as well. But it is certainly the wild west, and all the more successful in capturing the valuable data that faculty create because of it. (I’ll note that going the other way, RDF-to-Atom into DASe, is also straightforward, as demonstrated by our conversion of the entire set of Library of Congress Subject Headings RDF into our system: http://github.com/pkeane/lcsh-atom ).
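
    For instance, here is a sketch of one plausible reading of that atom:category mapping; the entry, attribute URIs, and values below are invented for illustration rather than taken from DASe:

        # One plausible reading of the atom:category -> key/value (triple) mapping
        # described above; the entry, scheme URIs, and terms are invented.
        import xml.etree.ElementTree as ET

        ATOM = "{http://www.w3.org/2005/Atom}"
        entry_xml = """
        <entry xmlns="http://www.w3.org/2005/Atom">
          <id>http://example.org/dase/item/123</id>
          <title>Sample item 123</title>
          <category scheme="http://example.org/dase/attribute/creator" term="Jane Doe"/>
          <category scheme="http://example.org/dase/attribute/course" term="ARH 301"/>
        </entry>
        """

        entry = ET.fromstring(entry_xml)
        subject = entry.find(ATOM + "id").text
        triples = [(subject, c.get("scheme"), c.get("term"))
                   for c in entry.findall(ATOM + "category")]
        print(triples)  # each atom:category yields one (item, attribute, value) triple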

    I think a real key is the realization that as far as the *data* is concerned, there is no difference between bottom-up “internally consistent” data and top-down data with “global identifiers.” It’s in how that data may/may not be used that the distinction comes into play. The goal should be to make the pathways between those poles easier to traverse. And to allow the two sorts of data to mix freely (give me a Library of Congress Subject Headings controlled field, but also give me a chance to add “keywords” OR even add my own fields). If the data is structured, mapping can happen any time. And I’d suggest that there is too much valuable data out there already AND being produced every day that does not have a handy ontology to fit within. It’d be a shame to lose out on that.

    (A quick plug for XSLT — it’s an ideal tool in many cases for the sorts of transformations that are needed to make all of this happen. It’s maddeningly difficult for experienced programmers to pick up, since it is so different from procedural programming. But after a few days of pulling one’s hair out, the a-ha moment occurs and XSLT is your friend :-) ).

  • My attempt to reorganize the issues:

    Inside a Dataset, Relational Density Is Data’s Best Friend. Outside a Dataset, It’s Too Dark to Read

    http://www.furia.com/page.cgi?type=log&id=332

  • Also:

    Your response to Stefano’s claim that Exhibit is basically Excel-like is to say “all RDF is spreadsheet data”, because you could put triples in a three-column table.

    But this is dopey. A three-column triples spreadsheet in Excel would be almost completely useless. And that’s not even vaguely what’s spreadsheet-like about Exhibit, anyway. The spreadsheetiness of Exhibit is that its items-array basically corresponds to rows, and the key/value pairs in an item to cells.

    The structurally-small but practically-huge advantage of Exhibit’s format over a conventional spreadsheet is that an Exhibit value can be a list. Excel has no real notion of a cell being a list; obviously you can cobble things together with commas or CRs or something, so they appear on the screen as lists, but all Excel functions treat them as single strings. Human information uses lists pervasively, so this difference means that Exhibit’s format is significantly better at expressing human data relationships than a spreadsheet.
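
    Concretely, a minimal sketch of that difference, with an invented item:

        # An Exhibit-style item can hold a genuine list of values...
        exhibit_item = {"label": "Debka Kurdit", "performer": ["Group A", "Group B"]}

        # ...while a spreadsheet cell can only fake a list with a delimiter, and
        # spreadsheet functions then see a single string.
        spreadsheet_cell = "Group A, Group B"

        print(len(exhibit_item["performer"]))  # 2 distinct values
        print(spreadsheet_cell)                # one string that merely looks like a list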

    The major non-RDF aspect of Exhibit’s format is that the values (whether single or in lists) are literals, not node-references. So an Exhibit file may have {“college”: “Harvard”} in your item, and {“college”: “Harvard”} in mine, but there’s no assertion about whether these two values represent the same concept or not. In a real graph-model this would be three facts, not two (glenn-college-harvard, david-college-harvard, harvard-name-”Harvard”), and thus would know that we went to the same college, not two different ones with the same name.

  • David Karger says:

    As you say, a three-column spreadsheet in Excel would be almost completely useless. I introduced the idea in order to highlight precisely that fact—to show that the “shape” of a model tells you very little about its usefulness. Most users will be unaware of the specifics of the model; what matters to them is the interaction with it through some suitable user interface (I’ve made this point visually at http://www.flickr.com/photos/pshab/291406665/). A spreadsheet is generally NOT the right interface for interacting with RDF data. I originally wrote, but then cut, the observation that you could make a “better” spreadsheet by, as you say, making one row per item with one column per property. While this is better than three columns, it still isn’t good for the kind of interaction we try to support with Exhibit. As we can see from nearly all domain-specific applications, while a tabular layout may occasionally be useful, an “object-oriented” layout (like Exhibit’s) is more natural.

    Your point about the power of lists in Exhibit is interesting, but again I don’t think it is the key. Raw RDF can represent lists (http://www.w3.org/TR/rdf-schema/#ch_list) which means that lists too can be shoved into spreadsheets. Again, the problem is that the user-interface experience would be awful.

    Finally, your point about values being literals is incorrect. Values in Exhibit can be either literals or node references, as the author chooses. Working from your example, if my exhibit included a node with {“label”: “Harvard”, “location”: “Cambridge”}, then it would be valid for me to create a facet (filter) on “.college.location” that would show “Cambridge” as one of its values. This works because Exhibit allows authors to use object labels as shorthand for object URIs. You are absolutely correct that my “Harvard” would resolve differently from your “Harvard”, but the same might happen if we used “real” RDF, where my “http://www.harvard.edu/” reference resolved differently from your “http://en.wikipedia.org/wiki/Harvard_University”. Conversely, if I did want to ensure a matching resolution, I could use Exhibit’s “uri” property to override the default URI generated from the label and instead specify your URI as the item I am talking about.
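
    To make that concrete, here is a sketch of the two items involved (property names other than “label” and “uri” are illustrative):

        # A sketch of the Exhibit data described above.  "Harvard" in the first item
        # is a label reference that resolves to the second item, so a facet on
        # ".college.location" surfaces "Cambridge".
        exhibit_data = {
            "items": [
                {"label": "David", "college": "Harvard"},
                {"label": "Harvard",
                 "location": "Cambridge",
                 # optional: point this item at an external identifier instead of the
                 # URI Exhibit would otherwise generate from the label
                 "uri": "http://en.wikipedia.org/wiki/Harvard_University"},
            ]
        }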

  • You’re right, it’s not that you can’t put full graphs into Exhibit format, it’s that it’s only easier than RDF if you don’t. Exhibit allows you to dodge the URIs, violating Tim BL’s rule #1 for Linked Data. The value-uniqueness thing is a clever shortcut you can often get away with in small datasets, but it’s a disaster in larger sets and especially over time.

    As for lists, I think you’re underestimating the issue. RDF lists exist, but suck. N3 has the right idea, but lists really need to be treated as first-class structures, both in the data-model and the query-language, not syntactic shorthand for rdf:first/rdf:rest chains that only a LISP programmer could love. SPARQL is none too good with lists, either, most obviously in returning results as a denormalized variable-binding table instead of a graph. I think this is much more than a UI issue: it affects the power/complexity curve of the query-language, and thus directly affects the nature of the applications that get developed.

  • [...] and wanted to debate what kind of ontologies were needed. Those who’ve followed my slow conversation with Stefano Mazzocchi won’t be surprised at my reaction—a trip to the microphone to voice a strong [...]

  • David Karger says:

    I’ll have to disagree with this claim. You can easily encode graphs into Exhibit because different entities can refer to each other—the value of a property of an item can be (a reference to) another item. And Exhibit doesn’t sidestep URIs; indeed, it ensures they exist even if users are too lazy to create them. While a user can specify a “uri” property for an item, if they don’t, the “uri” for that item is automatically generated by Exhibit, by appending the item “label” to the base URL of the data file.