I just ran into a lovely and frustrating open-government-style map of stimulus funding put together in Colorado. The same tool is used in a number of other states, listed in Brady Forrest’s blog post at O’Reilly Radar. Lovely because it’s always nice to look at maps; frustrating because that’s all I can do. Where’s the data? That is, the little table consisting of project name, geographic coordinates, category, and dollar amount? I can’t find it anywhere on the page, or even on the site. I don’t know if this data set was created in Colorado; I’m betting it was actually assembled from information at data.gov. (As evidence, another map on the site claims “The reports were compiled from a variety of sources, including data received directly from government agencies and information posted on the federal Recovery Act website.”) Clearly the data exists, since it’s needed to drive the application. But there’s no apparent way for me to get at it: the visualization is a Flash application that, as far as I can tell, has the data compiled into the body of the Flash app itself, where the only means of access would be a Flash decompiler.
This is an example of what I’m going to call “Open Data Entropy,” or perhaps “Opentropy”: the natural tendency of open data to decay into closed data over time. While it’s often understandable why certain data has never been opened (the cost of that initial preparation may simply be too high), it’s a lot harder to justify closing off data that’s already been opened. But it happens a lot. Sometimes this is an active decision on the part of the author, driven by greed (bringing eyeballs to the site), pride (thinking only his own visualization is good enough), or lust (wanting to be engaged in all uses of the data).
But I want to focus on another likely culprit: sloth. Many authors of data visualizations simply don’t care whether the data underlying those visualizations is open. They just want to publish the visualization, and will do the minimum necessary to get there. This throws responsibility for open data back to us tool developers. If we build tools where the user has to do something extra to open the data, they won’t bother. On the other hand, if we build tools where the user has to do something extra to close the data, they also won’t bother!
This perspective is part of the genius of the Exhibit data visualization framework that David Huynh built while he was still my student at MIT. Exhibit doesn’t say anything about open data. Instead, it focuses on giving authors an incentive: beautiful data visualizations that can be created with ease. But as a side effect, any Exhibit created by any author automatically makes its data open, through a simple copy button that appears by default when you hover over the visualization. Many authors probably don’t even notice it’s there. Those who do can dig through the manual to figure out how to turn it off, but very few bother.
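What gets copied is just Exhibit’s underlying data file: a plain JSON document of items. As a sketch of what the Colorado data might look like in that form, here is a file with exactly the little table described above (project name, coordinates, category, dollar amount). Every record and field name here is invented for illustration; only the `items`/`label` structure is Exhibit’s.

```json
{
  "items": [
    {
      "label": "Sample Highway Resurfacing Project",
      "category": "Transportation",
      "amount": 1500000,
      "latlng": "39.74,-104.99"
    },
    {
      "label": "Sample School Renovation",
      "category": "Education",
      "amount": 800000,
      "latlng": "38.84,-104.82"
    }
  ]
}
```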
Indeed, Exhibit has all the features necessary to replicate the Colorado map: a map view, an icon-based facet for selecting categories, and a pie chart. Plus, they could have thrown in an expenditure timeline and a pivot table for exploring the data. I wonder how much time or money they spent on their custom-built Flex application, with its side effect of closing off the data.
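To give a sense of how little authoring that takes, here is a minimal sketch of such an Exhibit page, in Exhibit 2 syntax as I remember it, showing just the map view and a category facet. The file name `stimulus-projects.json` and the field names are the hypothetical ones from the data sketch above.

```html
<html>
  <head>
    <title>Colorado Stimulus Projects (sketch)</title>
    <!-- The data lives in a separate JSON file, linked in rather than
         compiled in, so it stays open to any visitor -->
    <link href="stimulus-projects.json" type="application/json"
          rel="exhibit/data" />
    <script src="http://static.simile.mit.edu/exhibit/api-2.0/exhibit-api.js"></script>
    <script src="http://static.simile.mit.edu/exhibit/extensions-2.0/map/map-extension.js"></script>
  </head>
  <body>
    <!-- Facet for filtering projects by category -->
    <div ex:role="facet" ex:expression=".category"
         ex:facetLabel="Category"></div>
    <!-- Map view plotting each project at its coordinates,
         colored by category -->
    <div ex:role="view" ex:viewClass="Map"
         ex:latlng=".latlng" ex:colorKey=".category"></div>
  </body>
</html>
```

That’s the whole page: the author writes a couple of annotated divs, and the open data file comes along for free.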
At the tail end of a nice post on the cool new Gridworks tool he built with that same David Huynh, Stefano Mazzocchi muses on the challenge of getting people who download data and improve it to share it back. He points at this tweet from someone pondering the pros and cons, and wonders how to push such people to play nice. This is an important question, but I think it misses a much larger and easier target: the people who just don’t care. No matter how willing someone is to share their data, it isn’t going to happen if it’s too hard. On the other hand, if we make open data a default part of our authoring tools, we’ll see it popping up all over.