Is Structured Data like Text or like Code?

My last few posts have discussed structured data for end users. Given the glaringly obvious (at least to me) benefits of structured data, there must be some barrier in place that is preventing its pervasive use by end users. Identifying that barrier is the crucial first step to breaking through it. I’ve argued that the (technical) barrier is the lack of good authoring tools for this structured data; this has sparked an argument with Stefano Mazzocchi, who has focused on the (social) barrier of reluctance to share data that might be appropriated without compensating value to the author.

I claimed by analogy that people will be happy to share structured data (given the right authoring tools) the same way they have been happy to share unstructured data. Stefano responded that data is different, with a disincentive to share because people might walk off with your data. I answered that these disincentives arise only in a class of “open source collaboration” settings that leave out a large number of sharing scenarios. In his latest rebuttal, Stefano suggests that his “questioning whether the nature of the content can influence the growth dynamics of a sharing ecosystem makes David dismiss it as being related to a particular class of people (programmers) or to a particular class of business models (my employer’s).” I wasn’t aiming to dismiss this argument, but rather to dispute it with the contrary assertion that it is context, not content type, that influences people’s willingness to share. Whether the information is text, data, or even source code, there are certain common usage contexts where sharing is natural. As Stefano hints in his last post, we are considering different contexts of data use, and that might explain our differing perspectives on the main obstacle. In particular, while Stefano is thinking primarily about collaborative maintenance of a large shared structured data repository (which frames the repository as the end goal), I am thinking about individual authors using structured data as a tool for communicating their ideas (so the structured data is a means, not an end). I think it’s worth elaborating on those different contexts here.

In short, I want to fight the assumption that structured data is destined to be used in one particular way (big data repositories, open linked data) and instead assume that structured data will be used in all the myriad ways that text is currently used, structured microblogging included.

Contribution versus Attribution

For hundreds of years, attribution has served as the default compensation for use of other authors’ content.  We’ve evolved a detailed model of how and when to cite the authors we use, and severe public disapproval of plagiarism and other violations of the attribution standard.  This doesn’t apply just to text: many scientific data sets are published and then used in their entirety by other scientists—exactly the kind of “wholesale appropriation” that Stefano worries can deter publication.  The original creators of the data do not become authors on the new work, but they do get cited, and that’s (sometimes) enough.

The emergence of open-source software collaboration posed some challenges for this attribution model. The original BSD and Apache open source licenses contained an “advertising clause” requiring acknowledgment of the original authors in any derived works, but did not require that modified code be contributed back to the community—an exact analog of the classical attribution model. A problem with this attribution model arose in projects with many contributors: the difficulty of keeping track of all of them as code gets massively mixed. Thus, payback by attribution has given way to payback by contribution: later versions of the BSD and Apache licenses dropped the advertising clause, while licenses such as the GPL arose that require contributing changes back to the community. I’m not convinced that attribution poses a fundamental problem: certain scientific disciplines frequently publish papers with author lists in the hundreds, and the same could be done for source code. I think the problem may instead be with the “non-decaying” nature of the attribution clause, which requires the attribution to carry forward through arbitrarily many generations of derived works. In text, citation is not infinitely transitive: you cite the works you’ve used, but don’t cite the works that they’ve cited unless you are using them directly.

It seems, then, that the breakdown of attribution as a suitable compensation has to do with the size of the collaboration rather than the particular form of the content (code versus text). I think the text citation model can work just fine for data: you cite the sources of the data you are using, but not your sources’ sources. It might even work for code, if we could get away from black-and-white “licenses” to more flexible notions of attribution. And this attribution model, which has served just fine as a sufficient incentive for authors of text, seems likely to be just as good an incentive for authors of structured information.

Collaboration versus Publication

Stefano is focused on collaboration around structured data. But structured data is a powerful tool for communication as well. Obviously, communication involves sharing information. But publishing is a kind of one-way sharing that is quite different from the notion of working together to create a shared artifact.

At a Semantic Mediawiki workshop that I recently hosted at MIT, I had a brief discussion with David Mason, who does some work helping nonprofits put their data online. He observed that one of the obstacles he’d run into is that those groups are often reluctant to put their data into a Semantic Mediawiki because of the sense that they will somehow lose control of it as others begin to edit it. David speculated that the data-blogging approach of Datapress might be more palatable to them. Obviously, one can make a wiki read-only to the public, but something about the blog model, and the way it gives the author more control over the order and timing of presentation, seems appealing. Conversely, putting the data on your blog makes it available for anyone to copy, but it does allay the concern that your “authoritative” copy will be modified by others.

Of course this doesn’t always solve the problem. A few years ago I spent an hour mocking up an Exhibitized version of a video projector review web site. The authors were very interested in the resulting visualization. But when they found out that it exposed the underlying data, they lost interest. Apparently, some months before, someone had scraped all the information out of their site and set up a competing site over the same information. They didn’t want to do anything to enable that sort of behavior. But notice some features of this case. First, this web site is a for-profit information broker: they are trying to charge (via advertising) for making the information they have available. Thus they are hypersensitive to its appropriation. Contrast this with a blogger whose goal is to communicate, to make a point. The data is being delivered in service of that communication, and if the communication succeeds, then what happens to the data afterward doesn’t matter. Or consider Amazon, which is very happy to expose structured data about their products. They’re selling goods, and sharing the information about those goods only increases their opportunity to sell. Second, note that the copying of the site’s data happened even though the data was not easily accessible in structured form. Anybody who cares enough can scrape the information they need out of any web site that presents it. So keeping the data unstructured guards only against the casual user, who is unlikely to be the competing business. Finally, I would argue that the main value added by the projector site is not the facts about projectors, which can after all be found on many different sites, but rather the reviews of each projector that were carefully authored for this site. As original content, this material can be protected by copyright.

Size Matters

Collaboration, which poses some problems for attribution-based sharing, arises when you’ve got a problem that’s too big to be solved by one person. And certainly creating a giant structured repository of all the facts in the world fits that description. But the problem here is size, not structure. When Stefano suggests we need new sharing incentives around structured data, he posits that “incentive economies around sharing change dramatically when collaboration becomes a necessary condition to sustainability”. The assumption is that structured data requires collaboration for sustainability. This is connected to his assertion that “the impact of mistakes in hypertext are localized, while the impact of mistakes in structured data or software are not.” These points are true only for large structured data collections. Anyone can post a list of books, authors, and ratings on their blog without any collaborative assistance (see the sketch below), and an error in that list has much the same impact on the world whether the list is text or structured data. It’s when we try to create a large and consistent corpus that we need many eyes, because one error can have impact throughout the rest of the corpus. This disagreement goes right back to our first blog debate, where I argued for the value of small structured data fragments and Stefano for the necessity of linking them up to get any value.
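To make that concrete, here is the kind of small fragment I mean, sketched as a Python literal purely for illustration (a blogger would more likely paste a table or a bit of JSON); the titles are real books, but the ratings are invented:

```python
# A hypothetical one-person dataset: small enough to author alone,
# structured enough to sort, filter, and visualize. Ratings are invented.
books = [
    {"title": "Moby-Dick",           "author": "Herman Melville", "rating": 4},
    {"title": "Pride and Prejudice", "author": "Jane Austen",     "rating": 5},
    {"title": "The Time Machine",    "author": "H. G. Wells",     "rating": 3},
]

# Even this tiny fragment supports operations plain text cannot,
# such as sorting by rating:
for book in sorted(books, key=lambda b: b["rating"], reverse=True):
    print(f"{book['rating']} stars: {book['title']} ({book['author']})")
```

No collaborators, no repository, no license negotiation: just one author’s data, published the way one publishes a paragraph.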

Which is more significant—the giant, collaboratively authored information cathedrals like Wikipedia and Freebase, or the chaotic bazaar of blogs and personal web pages?  This is like asking whether air or water matters more.  But here are two hypotheses: that the amount of content in collaborative authoring systems is a small fraction of that in individual authoring systems (compare the roughly nine million articles in all the world’s Wikipedias to the 133 million blogs indexed by Technorati), and that people spend more time consuming individually-authored content than mass-collaboratively authored content (I make plenty of use of Wikipedia, but that is dwarfed by my consumption of blogs like Stefano’s).  My point is not that collaborative content is unimportant, but that individually authored content is important.

Text as Code and Code as Text

Hopefully, I’ve clarified my position that context, rather than content type, determines people’s perspective on sharing. Open source software projects typically reflect one context: a large-scale collaborative edit of a shared artifact, where attribution is too complicated to track (though note that most open source projects do keep a list of contributors) and is replaced by a share-alike contribution model. Historically, information sharing through text has reflected another context, where individuals work alone on content they “own,” which others can make use of with proper attribution. More recently, Wikipedia has demonstrated that the open source model can apply to text artifacts—that a textual content type does not force a particular mode of sharing. Freebase is doing the same for structured data.

I think there are also examples of more traditional text authoring models being applied to code. For example, I can point to my brother Amir Karger’s Scriptome project. Scriptome is a repository of one-line Perl programs for biologists, to do the basic file format and data transformations that they often need. It isn’t collaborative—just a stack of one-liners my brother put together, with a cute web interface that lets the visitor specify parameters (filenames, column numbers) that get filled into one-liners (or sometimes five-line “protocols”) generated on the fly so you can copy and paste them.
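To make the mechanism concrete, here is a minimal sketch of that fill-in-the-parameters pattern in Python (the template, function name, and filenames are my own invented illustrations, not Scriptome’s actual code):

```python
# A minimal sketch (not Scriptome's actual code) of the pattern:
# a parameterized one-liner template, filled in on the fly to produce
# a command the visitor can copy and paste.
TEMPLATE = "perl -F'\\t' -lane 'print $F[{column}]' {filename}"

def render_one_liner(filename: str, column: int) -> str:
    """Fill user-supplied parameters into the one-liner template."""
    return TEMPLATE.format(filename=filename, column=column)

# Hypothetical usage: extract the third column of a tab-separated file.
print(render_one_liner("genes.tab", 2))
# prints: perl -F'\t' -lane 'print $F[2]' genes.tab
```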

That’s right, copy and paste. This is the opposite of a version-controlled source code repository that everyone is supposed to check out from and, just as important, check back into. It encourages multiple “forked” versions of the programs (with different filenames and other parameters) to spread to many locations. These one-liners don’t come with licenses; it’s obvious that use of them is unrestricted. Suppose some of these programs turned up on other web pages. I doubt my brother would be upset. He’d surely be gratified by an acknowledgment of their Scriptome origin, but he wouldn’t be angry about its absence: the individual programs are too small to care about.

In a similar vein, the CoScripter project at IBM challenges Stefano’s claim that “there are no successful software projects that use a wikipedia model for collaboration and allow anybody that shows up to be able to modify the central code repository”. In fact, that’s exactly what CoScripter offers: a wiki of small scripts for automating tasks on the web that anybody can show up to and modify. A “serious programmer” might sniff at the idea that the globally modifiable scripts on CoScripter are real programs, but I’d say that the only difference is one of size. Because CoScripter’s scripts are just a few lines long, and often generated by a macro recorder, nobody seems likely to care much about attribution. Instead, they see the benefit of contribution—that the small script they contribute with little effort can be improved by the community to benefit the original contributor.

Both of these are examples of efforts in the direction of End User Programming, which aims to give end users the ability to create or change simple code artifacts. Some great work in this space happens in the UID group led by my colleague Rob Miller. These days the research community is getting even more ambitious. I expect that as this field matures, we’ll see more of the traditional text approach to publication and attribution.

Authoring Tools

Hopefully I’ve made a convincing case that there are plenty of already-existing incentives for sharing structured data on the web, and plenty of contexts where concerns about payback and ownership already have well-tested solutions. So where’s the barrier to structured data authoring? I’m convinced that it’s in the tools. Where tools exist to manage a certain kind of structured data—Semantic Mediawikis, content management systems, web sites devoted to recipes, photos, or music—people author that structured data. The problem lies in the limited reach of those tools. Creating a Semantic Mediawiki or other content-specific site is way beyond most users, and right now that’s the only way to create an environment for authoring the data you care about. We need tools that integrate into the ones we already use—semantic email that lets you send and consume structured data in email messages, automation tools that can create and consume structured data feeds around your social information spaces, and of course tools for publishing structured data in your blog. When structured data is as easy to work with as text, it will be as common as text in end-user authoring.
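To gesture at what the last of those might look like, here is a hypothetical sketch of a blog post that carries a small dataset along with its prose, reusing the reading list from the earlier sketch. The element ids and attributes are invented for illustration; they are not the actual format of Datapress, Exhibit, or any other real tool:

```python
import json

# Invented illustration (not Datapress's or Exhibit's actual format):
# a blogging tool embeds the author's dataset in the post itself as a
# JSON "data island" that a client-side library could render as a
# sortable table, a chart, or a timeline.
books = [
    {"title": "Moby-Dick", "author": "Herman Melville", "rating": 4},
    {"title": "Pride and Prejudice", "author": "Jane Austen", "rating": 5},
]

post_html = f"""<p>My winter reading, with ratings:</p>
<script type="application/json" id="reading-list">
{json.dumps(books, indent=2)}
</script>
<div class="data-view" data-source="reading-list" data-view="table"></div>"""

print(post_html)
```

The point of the sketch is that the data travels inside the post, under the author’s control, exactly the way a paragraph does.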
