In a post last week I argued that the key to making structured data pervasive on the web was tools that make it easy for people to create interesting data visualizations that share their data by default, without added effort. This prompted a pair of responses that I’d like to address here. One, from Glen McDonald in the comments of that post, argued that simply making the data available isn’t enough if people don’t have tools to process the shared data, massaging it into different forms through useful queries. Another, from Stefano Mazzocchi on his blog, argued essentially that people won’t want to share their data, and will act to block sharing, unless we give them some quid pro quo.
Mobility versus Queryability
Let’s look first at data processing. Glen McDonald has built a nice tool called Needle that offers a query language over a graph-shaped data model. As a query engine, Needle’s focus is not on providing a variety of visualizations (though it has some) but on letting you ask questions of the data and get specific answers. Glen argues that this is more important than data export—that when a publisher is in charge of the visualization, they control the way you look at the data, and thus control the conclusions you draw. Only if you make your own queries, says Glen, can a reader really test the author’s argument the way it ought to be tested.
I consider the difference between a query language and a visualization to be one of degree, not kind. People only experience data through visualizations—they can’t see the platonic data object. And any kind of non-static presentation gives a user the ability to construct “queries”—different ways of looking at the data. Conversely, a query-based interface is just a particularly powerful way to construct visualizations, but that power tends to come with complexity that can make it too difficult for most users.
Most importantly, I agree with Glen about the importance of letting the reader drive, but I think that supports my argument about the greater importance of data mobility over rich query languages. No one site or tool can ever offer all imaginable data-query interactions—there’s always a new visualization or a new query function lurking around the corner. On the other hand, given the richness of the web, the reader’s favorite query/visualization is sure to be available on some site. So the key is for the user to be able to move the data from where it is to where it needs to be. In particular, if the data the user wants to look at is spread over two different sites, then clearly it has to be exported from at least one of them to be useful.
(De)motivations for Sharing
Stefano agrees that sharing data is key, but questions my assumption that it will “just happen” if we build default data export into the tools people use to author. Stefano argues that “normal people get a weird feeling when they think that others can just take their entire work and run with it”; he feels that people need a driver, some direct benefit that will incentivize them to share their data. In support of this argument, Stefano talks like a programmer about the horror of having your entire web site codebase copied somewhere else, and about the development of GPL and CC-share-alike licenses that try to force people who copy and adapt your code to share those adaptations back to the community.
I think this is an argument that will only resonate with programmers. And since most of my hoped-for content creators and sharers aren’t programmers, it’s not really relevant. If we look at the broader population of content creators on the web, we see blogs and photo- and video-sharing sites like Flickr and YouTube. The driver for these authors is clear, and Stefano points it out himself: “the ability for people to publish something to the entire world with dramatically reduced startup costs and virtually zero marginal costs”. Do these authors care about how their content gets used? It seems not. Oshani Seneviratne, a student at MIT, recently did a study of licensing on content-sharing sites. She found that amidst the tremendous amount of reuse going on at those sites, only a tiny fraction of content producers and consumers paid any attention to questions of licensing. You might suspect that this is because most people have never had their content appropriated. But the reason isn’t important: the fact is that a huge population has published without feeling a need to defend their content, and this seems likely to carry over to structured content. If it does become an issue, we have the same answer as we have for text: attribution. The web is full of blogs quoting other blogs, and that seems to satisfy just about everyone. There’s no reason we can’t do the same with structured data. Indeed, our Datapress project (a WordPress extension that lets you put structured data visualizations in your blog) automatically creates “footnotes” in an article linking back to the source of any data it imports.
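To make the attribution idea concrete, here is a minimal sketch of what automatic data footnotes could look like. This is a hypothetical illustration, not Datapress’s actual code: the function name, the (title, url) source format, and the HTML structure are all my own assumptions.

```python
# Hypothetical sketch: given the external sources a post imports structured
# data from, emit an HTML footnote list linking back to each source.
def attribution_footnotes(sources):
    """sources is a list of (title, url) pairs; returns an HTML <ol> block."""
    items = "\n".join(
        f'  <li id="data-source-{i}"><a href="{url}">{title}</a></li>'
        for i, (title, url) in enumerate(sources, start=1)
    )
    return f'<ol class="data-sources">\n{items}\n</ol>'

# A tool embedding a visualization would append this block to the article,
# so reuse of the data automatically carries a citation of its origin.
print(attribution_footnotes([
    ("City population dataset", "https://example.com/populations.json"),
]))
```

The point of the sketch is only that attribution can be generated mechanically at publish time, with no effort from the author—the same near-zero cost that made blog-quoting-blog attribution ubiquitous.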
In fact, if we look back to the earliest days of the web, I think Stefano’s got his history a bit wrong. Stefano argues that the difficulty of copying and pasting the entire web-site codebase protected web authors from having all their hard work appropriated, and that is the reason they were not deterred from putting content on the web. But I’d say this is a relatively recent phenomenon. In the earliest days of the web there was no codebase—all a web site offered was a set of static pages, and it was indeed straightforward to copy the entire content of the web site to another location. Clearly this did not pose much of a limit to the growth of content on the web.
I think Stefano’s concern stems partly from the particular model of structured data sharing that his company, Metaweb, is pursuing. At Metaweb, the goal is to get everyone contributing to one giant shared data repository. By its nature, that means you lose the ownership connection to the data you deposit. So it’s very important for you to feel like you’re getting something back for your “donation” of this data—that just like in the open source community, or on Wikipedia, the payback for your donation is that lots of other people will be donating as well. But I’m focused on a more personal and distributed model of data sharing, where individuals publish their data through their own outlets, and where their ownership of it can survive others’ copying (hopefully with attribution) and reuse.
Programmers versus Everyone Else
I think these two discussions both reflect a difference in mindset. Stefano and Glen are both high-powered programmers, as you can see from the wonderful tools they’ve built. I, on the other hand, am not—just check my thesis if you want proof. If you’re a programmer, you naturally think about designing complex interactions with data. That means you need a powerful query language, which you as a programmer have the skill to learn and use. And once you’ve invested the programming work in creating your powerful visualization, you’re going to be pretty upset if someone takes all that work without giving anything back to you or the community. But if you’re a blogger, or an enthusiast about a particular kind of data, then what really drives you is the opportunity to communicate your enthusiasm to everyone else. As a non-programmer, you’ll be excited by even elementary data interactions like maps and faceted browsing. And, having invested less effort in tossing some structured data into your page, you’ll be less sensitive to others’ reuse of that data, especially if they cite your work, as is typical of blogs today.
So why isn’t structured data being published by non-programmers? Because we are currently missing a collection of tools that let regular people do the easy work with structured data. In particular, we need tools to let regular users create really elementary structured data visualizations on their own web sites, and fill those visualizations using basic copy and paste of structured data they find elsewhere on the web. If we can figure out how to make it easy enough, then I think we’ll see structured data explode.