Stefano Mazzocchi used to work at our SIMILE project here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge. He has since gone to work at Metaweb and, it seems, become much more friendly to their “top down” approach of trying to create a centralized repository of structured data with consistent identifiers, as opposed to letting that data grow all over the place any which way and get linked together afterwards. In particular, he argues for the critical importance of relational density in the data. His point is that when there are many distinct, unlinked identifiers for the same object, what one person says about one of those identifiers (“Chicago”) won’t be visible to someone looking at a different identifier (“the Windy City”). He opines that “without it [relational density] there would be very little value in it compared to what traditional search engines are already doing”.
Being argumentative by nature, I wanted to highlight some of the benefits of the looser, sloppier approach to data sharing that we took for SIMILE. Obviously, being able to link data from multiple sources, and feed it into a search engine as Stefano describes, is a great thing. But there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s.
The first is interaction. As shown with our Exhibit framework (created by David Huynh, now also at Metaweb), structured data enables rich visualization. If my data objects have coordinates, I can plot them on a map. If they have dates, I can put them on a timeline. If they have colors, I can filter or sort by color. It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.
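To make that concrete, here is a minimal sketch in Python. It is not Exhibit's actual API; the records, the `role_to_property` mapping, and the helper functions are all made up for illustration. The only point is that a view needs to be told which of my property names plays which role, not that my names match anyone else's.

```python
# Hypothetical records using my own idiosyncratic property names.
records = [
    {"label": "the Windy City", "northSouth": 41.88, "eastWest": -87.63,
     "elementOfTheRainbow": "blue"},
    {"label": "Boston", "northSouth": 42.36, "eastWest": -71.06,
     "elementOfTheRainbow": "red"},
    {"label": "Nowhere", "elementOfTheRainbow": "blue"},  # no coordinates
]

# The one piece of configuration a view needs: which of MY names fills each role.
role_to_property = {
    "lat": "northSouth",
    "lng": "eastWest",
    "color": "elementOfTheRainbow",
}


def map_points(records, roles):
    """Return (label, lat, lng) tuples for records that have both coordinates."""
    points = []
    for record in records:
        lat = record.get(roles["lat"])
        lng = record.get(roles["lng"])
        if lat is not None and lng is not None:
            points.append((record["label"], lat, lng))
    return points


def filter_by(records, roles, role, value):
    """Keep only the records whose property for the given role equals value."""
    return [r for r in records if r.get(roles[role]) == value]


points = map_points(records, role_to_property)
blue_labels = [r["label"] for r in filter_by(records, role_to_property, "color", "blue")]
```

Renaming `northSouth` to `latitude` would change one entry in `role_to_property` and nothing else, which is why internal consistency is all the visualization asks for.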
The second benefit is portability. If I publish some interesting data as part of an HTML document, then anyone who wants to use that data for something else (to rebut my argument, to mash it up with some other data, to put it to some use I never thought of) has the unpleasant job of scraping said data out of the HTML into a usable form. This generally requires a programmer, and even for them it’s a tedious task that distracts them, and may deter them, from what they really want to do with the data. But if that data is published as data, even in something as old-fashioned as a spreadsheet, it becomes way easier to grab it and reuse it. Look at how much of the blogosphere is made up of cross-references, trackbacks, and responses to other blog postings. If you’re going to argue about something involving data (for example, whether a single-payer system is going to end up saving or costing money, or whether today’s perfect game is all that unusual), you probably want to publish that data to support your argument. At which point, someone who wants to refute your argument is going to want to use that same data. That’s going to be a lot easier if they can get that data from your posting. That’s the theory behind our Datapress project, which aims to let you post data sets (and visualizations of them) in your WordPress blog, and lets other people refer to and reuse that data. In that sort of one-on-one debate over data, it really doesn’t matter whether I use the same identifiers as Freebase: you can take my identifiers and use them to build your rebuttal.
Uniformity does start to matter when someone wants to mash up data from multiple sources. If those sources haven’t agreed on identifiers beforehand, then the masher has some work ahead; this is a case where a centralized vocabulary is really helpful. But again, getting the data at all is such a big jump over the current state of affairs. Imagine how grateful mashup makers would be if all they had to do was merge some identifiers instead of retyping a whole spreadsheet from scratch. The point here is that unlike Stefano’s hypothetical search engine, which wants to issue a query against all the world’s data at once, your typical mashup author just needs to deal with a couple of (probably small) data sets. His or her particular data integration problem is quite manageable a posteriori.
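As a rough illustration of how small that after-the-fact job can be, here is a sketch in Python. The two data sets, their values, and the `aliases` table are invented for the example; the assumption is that the mashup author writes the alias table by hand after looking at both sources.

```python
# Two small, independently published data sets (made-up values).
mine = {
    "Chicago": {"rating": 4},
    "Boston": {"rating": 5},
}
theirs = {
    "the Windy City": {"visits": 12},
    "Beantown": {"visits": 7},
    "Seattle": {"visits": 3},
}

# Hand-written by the mashup author: their identifier -> my identifier.
aliases = {
    "the Windy City": "Chicago",
    "Beantown": "Boston",
}


def merge(mine, theirs, aliases):
    """Fold `theirs` into a copy of `mine`, renaming identifiers via `aliases`."""
    merged = {key: dict(props) for key, props in mine.items()}
    for their_id, props in theirs.items():
        my_id = aliases.get(their_id, their_id)  # unmapped ids pass through unchanged
        merged.setdefault(my_id, {}).update(props)
    return merged


merged = merge(mine, theirs, aliases)
```

The alias table grows with the number of identifiers that actually collide in these two data sets, not with the size of the world's data, which is what keeps the problem manageable.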
I’ll also dust off an argument David Huynh once made to me, even if it might get him in trouble with his current employer. Unification is not absolute but contextual. Whether two things are the same may change depending on what you are doing with them. Continuing my never-before-attempted forays into sports analogies, are the Brooklyn Dodgers the same as the L.A. Dodgers? If you want to talk about the team that moved from Brooklyn to L.A., the answer must be yes! But in a different context you might be interested in comparing the lifetime records of these two distinct teams. (In fact, Freebase tries to have it both ways: it asserts that the Brooklyn Dodgers were “later known as” the Los Angeles Dodgers, implying they are the same team with a name change, but also asserts that the Los Angeles Dodgers were founded in 1958, which clearly isn’t true of the Brooklyn Dodgers that folded in ’57.)
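One way to picture contextual unification is as a separate equivalence table per task, rather than one global table. The sketch below is only an illustration of that idea; the context names and the `SAME_AS` structure are hypothetical and do not describe how Freebase or any other system models identity.

```python
# Hypothetical per-context equivalence tables: identifier -> canonical identifier.
SAME_AS = {
    # Talking about the franchise that moved from Brooklyn to L.A.
    "franchise_history": {
        "Brooklyn Dodgers": "Dodgers",
        "Los Angeles Dodgers": "Dodgers",
    },
    # Comparing the records of the two eras: nothing is unified.
    "compare_eras": {},
}


def canonical(identifier, context):
    """Resolve an identifier to its canonical form within one context."""
    return SAME_AS[context].get(identifier, identifier)


def same(a, b, context):
    """Two identifiers are 'the same' only relative to a context."""
    return canonical(a, context) == canonical(b, context)


moved = same("Brooklyn Dodgers", "Los Angeles Dodgers", "franchise_history")
distinct = same("Brooklyn Dodgers", "Los Angeles Dodgers", "compare_eras")
```

Here `moved` and `distinct` answer the same question differently, and neither answer is wrong for its own task.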
This is obviously one of those half-empty, half-full debates: we both recognize the value of both approaches, but are compelled by different aspects. Stefano looks at the amazing things that could be done with a single consistent data universe, and worries about how to create it. I look at the amazing things that can already be done with a host of disjoint but internally consistent data microverses, and find that compelling enough to allay any worry about whether we’ll ever need http://www.freebase.com/view/en/lion to unify with http://en.wikipedia.org/wiki/Lamb.