A month ago Stefano Mazzocchi published an interesting article on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized “a priori” approach (trying to keep the identifiers merged at the beginning). I posted a response arguing the value of a more anarchic “a posteriori” approach where you let anyone create whatever identifiers and relations they want, and worry about detecting linkages later. Stefano responded to that, but by then I was busy chairing the submissions for the 2009 International Semantic Web Conference. Now that that’s over (I hope you will attend what should be an interesting meeting—October 25-29 near Washington DC) I’d like to pick up the discussion again.
I argued in favor of letting individuals make their own RDF collections (using, for example, our Exhibit framework) and worry about merging them with other people’s data later. Stefano’s response accused me of using “RDF” and “structured data” interchangeably, asserting Exhibit is really just a nice UI over spreadsheet (tabular) data—that although it can export RDF, it is “not properly using RDF” because it has “lost the notion of globally unique identifiers (and in that regard, is much more similar to Excel than to Tabulator)”. Tim Berners-Lee has made similar complaints to me about Exhibit not using RDF.
This argument highlights for me yet another important ambiguity about what RDF is. I occasionally have to help people understand that RDF is a model, not a syntax. That some data can be RDF even if it isn’t serialized to RDF/XML. That the key is to have items named by URIs, connected by relations named by URIs. Stefano’s argument suggests a different blurring: between the model and its intended use. Stefano’s “not properly using” phrase implies that if you don’t intend to merge your data into the global namespace, then even if you implement the model and write it down as RDF/XML to boot, you won’t be “properly using RDF”.
I want to address both these claims: that Exhibit is just a UI over spreadsheets, and that using RDF this way isn’t proper.
RDF and spreadsheets
Regarding the spreadsheet claim, I’ll begin by admitting that Stefano is absolutely right: Exhibit is a visualization tool for tabular (spreadsheet) data. But notice that all RDF is spreadsheet data—I can take all the RDF in the world and throw it into one spreadsheet. In fact, I only need three columns to contain the subject (tail), predicate (link), and object (head) for each RDF statement. Admittedly none of today’s spreadsheets would have enough rows, but that’s an engineering detail. So, the spreadsheet model isn’t the problem. And we also agree that Exhibit’s interface is nothing like spreadsheets’, and far better for the collection visualization tasks it is designed for.
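To make the point concrete, here is a minimal sketch (with hypothetical example URIs) of that three-column spreadsheet: every RDF statement, from any source, fits the same subject/predicate/object row shape.

```python
# Any RDF graph flattens into a three-column "spreadsheet" of
# (subject, predicate, object) rows. The URIs below are invented
# examples, not real vocabulary terms.
triples = [
    ("http://example.org/dance/debka",
     "http://example.org/vocab#choreographer",
     "http://example.org/people/yakovee"),
    ("http://example.org/dance/debka",
     "http://example.org/vocab#year",
     "1990"),
]

# Every statement occupies exactly one row of three cells.
for subject, predicate, obj in triples:
    print(subject, "|", predicate, "|", obj)
```

The only thing distinguishing this from an ordinary spreadsheet is scale and the convention that the cells hold URIs.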
I think instead that what Stefano is objecting to is a usage characteristic of spreadsheets versus RDF. When I open a spreadsheet, the data it shows me is right there, in a file on my own system. Global identifiers don’t matter because the data is all there (and presumably self-consistent) in the one spreadsheet. In contrast, in Stefano’s image of RDF (and in Tim’s, as one can see from the Tabulator project) the data about a particular entity is spread all over the web, and it is the globally unique identifier that lets you go out, gather all that data together, and know that it is all about the same entity.
This is certainly an appealing vision. But I want to argue that a focus on globally unique identifiers neglects two benefits of RDF that I consider equally important: data portability and schema flexibility.
To illustrate this argument, I’ll hark back to a previous post where I discussed a data integration problem that should have been easy but wasn’t. I keep an Exhibit of folk dance videos on the web. Recently, Nissim Ben Ami posted a collection of 511 new dance videos on YouTube. I wanted to incorporate it into my site. But it quickly became apparent that said incorporation would basically require my entering all 511 video descriptions manually into my system, and I still haven’t gotten around to it.
The major barriers were twofold. The first was syntactic: the structured descriptions of the videos were delivered as XML. That meant that in order to get at the data, I was going to have to learn XSLT—something I’ve been putting off for years. The second was semantic: YouTube has the wrong schema for my folk dance videos. I care about choreographer, dance type, and year choreographed; YouTube only offers slots for submitter and submission date of the video. So, as you can see from this example, the contributor takes the usual approach: he takes his nice structured data and shoves it into the generic comment (info) field as free text. All that structure is instantly lost.
Suppose instead that spreadsheets (or, in a pinch, RDF) were the accepted framework for publishing information on the web. The YouTube “spreadsheet” would contain submitter and submission date information, but Nissim could just add “artist” and “composition-date” columns to hold the data he wanted to enter. I would then be in a great position to download his data and incorporate it into my own catalog (spreadsheet). What would I have to do? After opening his spreadsheet and mine, I’d have to match columns—perhaps he called his “artist” and “composition date” while mine are “choreographer” and “year”. But a simple copy and paste fixes that discrepancy. Merging entities is not much harder than merging properties: a simple global replace will convert his choreographer “Israel Ya’akovi” to my “Israel Yakovee”. The local consistency of his data and mine means that I only have to work once per choreographer (and in most cases I won’t have to: there’s a standard spelling for almost every choreographer’s name, which serves as a unique identifier in this context even if it isn’t a URL).
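The two-step merge described above—match columns, then reconcile entity names—can be sketched in a few lines. This is a toy illustration with invented sample rows; the property names and the one spelling variant come from the scenario above.

```python
# Pairwise merge of two small "spreadsheets", each a list of
# property -> value rows. Sample rows are hypothetical.
nissim_rows = [
    {"title": "Debka Example", "artist": "Israel Ya'akovi",
     "composition date": "1975"},
]
my_rows = [
    {"title": "Another Dance", "choreographer": "Israel Yakovee",
     "year": "1990"},
]

# Step 1: match columns -- map his property names onto mine
# (the "copy and paste" that fixes the column discrepancy).
column_map = {"artist": "choreographer", "composition date": "year"}

# Step 2: merge entities -- a global replace per spelling variant,
# done once per choreographer.
name_map = {"Israel Ya'akovi": "Israel Yakovee"}

merged = list(my_rows)
for row in nissim_rows:
    renamed = {column_map.get(col, col): val for col, val in row.items()}
    who = renamed.get("choreographer")
    renamed["choreographer"] = name_map.get(who, who)
    merged.append(renamed)
```

The work is proportional to the number of distinct columns and choreographers, not the number of videos—which is exactly why the manual effort stays small.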
Overall, my work has been reduced by an order of magnitude. Instead of laboriously entering 511 new records, I just download a spreadsheet and match up a handful of properties (columns) and a few tens of choreographer names.
Stepping back, observe that I’ve relied on two things. First, on data portability—my being able to download the data in a convenient form: not XML, which is a programmer’s friend but an end-user’s enemy; rather, something I can just look at and understand. Second, on schema flexibility—on Nissim’s being able to add whatever columns/properties he decides are important, instead of being limited to those used on the hosting web application.
I’m also relying on some features of this particular scenario, but I believe they often hold. I am relying on Nissim’s data having only a small number of properties so that I can map them manually to mine. I also rely on there being a small number of choreographers, and hope to take advantage of most of them having matching names in his data and mine—these names certainly aren’t globally unique identifiers, but they are “unique enough” when considering just my data and his. Critically, I am not thinking of pulling all data about a given dance from a multitude of different web sites—this would demand globally unique identifiers to link data since I would never have the patience. Rather, I am considering a pairwise data acquisition: taking data I want from one internally consistent site.
Such pairwise acquisition is commonplace: any time a scientist wants to pull a data set from some other scientist’s lab, or a consumer wants to download product information about several cameras from a review site, or a student wants to include a Wikipedia data set in a report they are writing, there is an obvious single source and target for a data merger. And there’s a human being who has the incentive, and with the right tools the capability, to do the limited amount of work needed to accomplish that merger.
This is a simple low-hanging fruit argument. It would be wonderful to be able to automatically merge data from thousands of different sources into a coherent whole. And this is a problem Freebase will need to solve, if they want to become the hub for aggregation of structured data. But right now we can’t even manually merge data from two sites without doing a ridiculous amount of grunt work—so perhaps we should give some attention to that easier problem on our way to solving the hard one.
Don’t skip the wild west
I’d like to see these efforts proceed in parallel, but I’m worried about enthusiasm for the more ambitious goal blocking movement toward the low-hanging fruit. I recently submitted a proposal to NIH on the topic of data integration that reflected my perspective above. I argued that the current efforts in the Biology community to force everyone to adopt a common ontology (and sometimes repository) for their experimental data are being resisted by biologists who think they know best how to present their data. I suggested as an alternative that we give biologists tools, such as Exhibit, that would encourage them to publish their data in a common structured syntax, and worry about integrating all that data after it has become available in structured form. The proposal rejection was accompanied by a review that said, on the one hand, “The benefit of the proposed approach is that it is very different from some multi-institutional data sharing projects (like caBIG), which have used a very rigid, top-down approach to creating semantics. Even if this project is unsuccessful it could bring to light new ideas and strategies that might make those large-scale projects more responsive to investigators and more successful.” At the same time, it argued for rejection because “The absence of any control over the information models and ontologies – truly a semantic wild west – is daring and may ultimately be the downfall of this project.”
I’m fascinated to see, in the same review, a recognition of the problems that the current centralized approach is bringing (lack of buy-in to common ontologies by individual scientists who think they know better and probably do), and an unwillingness to tolerate the contrary (anarchic) solution. I also love the metaphor of the “semantic wild west” because I think it supports my argument. Would anyone have suggested establishing a city of several million people just after the west was opened for settlement? The west’s early wildness was an unavoidable phase of its evolution towards the thickly settled and uniformly governed area it is now. In the same vein, I think that our semantic web is best grown by encouraging individual semantic-web settlers to create their own data homesteads and begin looking for the trails that connect them to neighboring collections. We need to get the data into plain view first. Later we can send in the data sheriffs and place all those data sets under uniform governance.