Keynote at ESWC Part 2: How the Semantic Web Can Help End Users

I’ve just returned from the European Semantic Web Conference, where I gave a keynote talk on “The Semantic Web for End Users”.   My talk addressed the problem that has interested me for eighteen years: making it easier for end users to manage their information.  The thesis was that

  • The current state of tools for end users to capture, manage, and communicate their information is terrible (yesterday’s post), and
  • The Semantic Web presents a key part of the answer to building better tools (this post), but
  • Not enough work is being directed toward this problem by the community (tomorrow).

Since I had a lot to say (217 slides) I’m breaking the summary into three separate posts aligned with these three bullets.  Come back tomorrow for the next.


Our story so far

Yesterday, I discussed the dire state of information management for end users.  I argued that our traditional applications are designed around fixed schemas, and that whenever an end user wants to use their own schema, or connect information from different schemas, these traditional applications fail them.  Users are forced to settle for generic tools like spreadsheets and spread their data over multiple tools.  Voida et al.’s Homebrew Database paper (a must read) shows how awful the results are.


The Semantic Web can Help

The Haystack Client managing a schema-mixed collection of email messages, news items, and people. All are displayed in a list collection, but each type is shown using its own “lens”. Facets can be used to filter the collection. Context menus expose operations suited to each type of item.

Our first attempt to address the “schema diversity” problem was Haystack, a tool that could be used to store and visualize arbitrary information.  Haystack could store arbitrary user-defined entities with arbitrary properties and relations to other entities, and also allowed its user to customize visualizations of those entities.  You could create something that looked quite like a traditional application, over whatever schema you decided was useful.

We created the first version of Haystack before the Semantic Web was envisioned.  However, it was obvious after the fact that Haystack was a Semantic Web application (more specifically, a Semantic Desktop), and when RDF was released as a web-standard data model, we adopted it as the native model for later versions of Haystack.

Haystack reflects what I consider the key novel perspective of the Semantic Web community—the idea of a web of data reflecting vast numbers of distinct schemas.  While the database community has devoted significant effort to data integration, their canonical example has been, e.g., the combination of a few large corporate databases when two companies merge.  It hasn’t really addressed the far more anarchic situation of a different schema on each web site.

I believe that this setting demands a new kind of application.  Instead of traditional applications with their hard-coded schemas and interfaces, we need applications like Haystack whose storage and user interface can effectively present and manipulate information in any schema that their user encounters or creates.  This is a challenging task since we tend to rely on knowing the schema to create good user interfaces; however, I believe the challenge can be met.


Concrete Examples

The Related Worksheets system being used to display information about courses. Each course has readings and sections, with presentation nested inside the relevant cells of the courses table.

To support this argument, I presented three of these flexible-schema Semantic Web applications.  The first is Related Worksheets, being developed by my student Eirik Bakke.  Eirik recognized the incredible dominance of spreadsheets as a schema-flexible data management tool, and asked how we can make spreadsheets better for this task without changing their fundamental nature.  His approach is to improve spreadsheets to better present and navigate the entities and relationships represented in them.

A typical spreadsheet may have, e.g., one table consisting of university courses (one row per course) referring to another table consisting of course readings (one row per reading) and another table of course instructors.  In a traditional spreadsheet this “reference” is just a textual correspondence—there’s a cell in the course table that names the title of a reading that’s in the readings table.  But if you recognize that the reading is actually an entity, you can do better. First, you can present information about each reading nested inside the cell in the course listing table, so you can immediately see more information about the reading without having to go find it in the readings table.  Second, you can “teleport” from the reading shown in the course table to the corresponding row in the readings table, where you can see or modify more data about the reading (and, e.g., teleport onward to the author of the reading).  A user study showed that these features can significantly improve end users’ speed at extracting information from the worksheet.
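The core idea can be sketched in a few lines of code. This is my own toy illustration, not Eirik Bakke’s actual implementation: a cell holds a reference to an entity row rather than a bare text string, which is what enables both the nested inline summary and the “teleport” jump.

```python
# Toy sketch: spreadsheet cells hold entity references, not just text.
class Row:
    def __init__(self, table, **fields):
        self.table = table      # which worksheet this row lives in
        self.fields = fields

    def summary(self):
        # Nested presentation: show the row's fields inline in a referring cell.
        return ", ".join(f"{k}: {v}" for k, v in self.fields.items())

readings = {"r1": Row("readings", title="On Spreadsheets", author="Bakke")}

# The course's "reading" cell stores a reference to the reading entity.
courses = {"6.001": Row("courses", name="Intro", reading=readings["r1"])}

cell = courses["6.001"].fields["reading"]
print(cell.summary())   # nested display inside the course table
print(cell.table)       # "teleport": the reference leads to the readings table
```

With a bare textual correspondence, neither operation is possible without a lookup by the user; with an entity reference, both fall out for free.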

    The Exhibit framework being used to present information about presidents of the United States, plotted on a map and a timeline. Facets on the left offer filtering by religion and party, along with text search.

I then presented Exhibit, a tool that lets end users author interactive data visualizations on the web.  The motivation for Exhibit was the recognition that while professional organizations are able to create fancy data-interactive web sites that offer templating, sorting, faceted browsing, and rich visualizations, end users generally lack the programming and database administration skills necessary to do so, and thus tend to publish only text and static images.

My student David Huynh recognized that a large number of the professional sites fit a common pattern, and that it was possible to add quite a small extension to the HTML vocabulary that was sufficient to describe these professional sites just in HTML.  The vocabulary describes common elements such as views (lists, tables, maps, timelines), facets (for filtering the data shown in the views), and lenses (HTML templates for individual items).  Any end user can drop these extended HTML elements into a web page, point them at a data file (spreadsheet, JSON, or CSV), and instantly publish an interactive data visualization.  To make it even easier, Ted Benson and Adam Marcus created Datapress by integrating Exhibit into WordPress, so you can “blog your data” using WordPress’ built-in WYSIWYG editor.
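To give a flavor of the vocabulary, here is a rough sketch of an Exhibit page. I’m reconstructing the syntax from memory of Exhibit 2.x, so the attribute names, the file names, and especially the script path should be treated as illustrative placeholders rather than exact:

```html
<html>
  <head>
    <!-- placeholder path: load the Exhibit API script from wherever it is hosted -->
    <script src="exhibit-api.js"></script>
    <!-- point Exhibit at a data file; JSON here, but spreadsheets and CSV work too -->
    <link href="presidents.json" type="application/json" rel="exhibit/data" />
  </head>
  <body>
    <!-- a facet: filter the items by their "party" property -->
    <div ex:role="facet" ex:expression=".party" ex:facetLabel="Party"></div>
    <!-- a view showing the filtered items; a lens template would customize each one -->
    <div ex:role="view" ex:viewClass="Tile"></div>
  </body>
</html>
```

The point is that this is all declarative HTML: there is no JavaScript for the author to write, no server-side code, and no database to administer.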

There are now over 1800 exhibits online, covering an incredible spectrum of data sets—from ocarinas to failing bridges, European Court of Human Rights cases, pollution measurements in Spain, map stores, classical music composers, strange sports, mining information, teacher bonuses in Florida, and an Urdu-English dictionary.

By the way, anybody who wants to try Exhibit for themselves can just copy one of the ones on the web and start playing with it.  For example, if you’re an academic, perhaps you could use a nicer publications page.  Just download mine and replace the data with your own.  But if you want a more careful introduction, take a look at this tutorial I put together.

Atomate being used to generate the rule “remind me to take out the trash when I get home on Tuesday evening.” The user adds one word at a time; each step uses a dropdown/autocomplete to ensure only valid words are added.  Click the image for a video demonstration.

The last tool I described was Atomate, built by my student Max van Kleek and Brennan Moore to demonstrate how end users could author automation rules to reduce their effort handling incoming social media and other information streams.  For example, a user might want to be notified when their calendar shows that a certain band is performing and their social media stream reports that a particular friend is in town, so that they can attend the performance together.  A big challenge is coming up with a query language that is simple enough for end users.  We settled on a controlled natural language—a query language that looks like English but is actually unambiguous filters over the properties and values in the user’s structured data collection.  Drop-down menus and autocomplete ensure that the user is only able to create meaningful queries.  You can click the image on the right to see a demonstration video.
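A controlled natural language like this can be sketched very simply. The following is my own toy illustration, not Atomate’s actual code: each slot in the sentence offers only valid continuations (playing the role of the dropdowns), and the finished “sentence” compiles to an unambiguous filter over properties in the user’s structured data:

```python
# Toy sketch of a controlled-language rule: constrained word choices
# compile into an unambiguous predicate over structured properties.
VOCAB = {
    "entity":   ["my location", "my calendar"],
    "property": {"my location": ["place"], "my calendar": ["day"]},
}

def make_rule(entity, prop, value, action):
    assert entity in VOCAB["entity"]          # dropdown restricts entity choices
    assert prop in VOCAB["property"][entity]  # autocomplete restricts properties
    # The "English-looking" rule is really just an exact filter plus an action.
    return lambda world: action() if world[entity][prop] == value else None

reminders = []
rule = make_rule("my location", "place", "home",
                 lambda: reminders.append("take out the trash"))

rule({"my location": {"place": "work"}})   # no match: nothing happens
rule({"my location": {"place": "home"}})   # match: the reminder fires
print(reminders)
```

Because every word the user can pick maps to a known entity, property, or value, there is no parsing ambiguity to resolve, which is exactly what makes the approach feasible for end users.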

A user study of Atomate revealed that users were able to create meaningful queries when given a specific task, that they recognized the general utility of this system, and that they were able to envision specific ways (particular rules they could write) to use it for their own benefit.

Since the publication of the Atomate work, we’ve seen some of its approach appear at If This Then That, a web site that can connect to numerous web sites to pull and push data, and that lets end users specify triggers to activate when that data changes and actions to take that may modify other data.



I’ve now outlined four applications that, in my mind, leverage the “special sauce” of the Semantic Web—the idea that applications must be designed to work effectively over whatever schemas their users choose to create or import.  This creates major challenges in the design of user interfaces, since we often want to leverage a hard-coded schema to determine the ideal domain-specific interface.  But there are ways around this problem, either using generic interfaces like spreadsheets (Related Worksheets) or natural language (Atomate), or putting more of the user interface authoring in the hands of the end user (Haystack and Exhibit).  Each of these tools demonstrates that it is possible to give end users a tool that can work with arbitrary schemas.

Given the potential, I’m disappointed with the level of effort being invested in this kind of work by the Semantic Web community.  In my next post, I’ll discuss what work I think is missing, how to do it well, and changes we might make to our Semantic Web conferences to encourage it.


Keynote at the European Semantic Web Conference Part 1: The State of End User Information Management

I’ve just returned from the European Semantic Web Conference, where I gave a keynote talk on “The Semantic Web for End Users”.  The slides are here.  My talk addressed the problem that has interested me for eighteen years: making it easier for end users to manage their information.  The thesis was that

  • The current state of tools for end users to capture, communicate, and manage their information is terrible (this post), and
  • The Semantic Web presents a key part of the answer to building better tools (tomorrow), but
  • Not enough work is being directed toward this problem by the community (Monday).

Since I had a lot to say (217 slides) I’m breaking the summary into three separate posts aligned with these three bullets.  Come back tomorrow for the next.


The Situation is Dire

I began my talk by trying to convince people of how bad things currently are.  For this, I didn’t rely on my own work, but on presenting Voida, Harmon, and Al Ani’s fascinating CHI 2011 talk on Homebrew Databases. Thanks to Amy Voida for sharing her slides and her script!  Choosing a specific domain, the authors spent a bunch of time in volunteer-driven nonprofit organizations of varying sizes. They extensively interviewed the volunteer coordinators—responsible for managing information about volunteers, skills, needs, and tasks—to learn about how they did their jobs.  The results were painful to hear.  Because there was no application specifically designed to manage the information these coordinators used, they were forced into a baroque assemblage of Excel spreadsheets, Outlook lists, paper, index cards, and binders.  With this mix of tools they had terrible versioning problems, wasted inordinate amounts of time on data entry and transfer, and struggled to organize, query, and visualize their information.

The tasks these volunteer coordinators wanted to support were not complicated—they weren’t doing Big Data Analytics.  Rather, they were trying to answer elementary questions like “which volunteers are available for the following activity” or “what’s a summary of all the work this volunteer has done.”  Questions that would be trivial for a good database administrator with a well-maintained SQL database.  Unfortunately, few users fit that profile.


I consider it a major embarrassment for all of us in databases (and the Semantic Web) that this is the current state of the art.  This paper ought to be required reading for anyone in these fields, helping us to realize that we’ve got our heads in the clouds while people are stuck in the dirt.  For those who argue that these users should “know better” and learn the right database tools for managing their data, I defer to famed designer Don Norman, who observes in The Design of Everyday Things:

When you have trouble with things—whether it’s figuring out whether to push or pull a door or the arbitrary vagaries of the modern computer and electronics industries—it’s not your fault. Don’t blame yourself: blame the designer.

The designer, of course, is us.


What’s the Problem?

The Homebrew Database paper focuses on symptoms, but I have a strong opinion about the causes.  When I first showed up at MIT, I intended to do research in information retrieval.  But I rapidly concluded that the real problem wasn’t retrieval.  Rather, it was that our computers were actively getting in the way of people recording and organizing their information.  If they can’t record it, they certainly can’t retrieve it!

In particular, in our traditional model each application is developed with a fixed schema in mind.  This schema determines both what information can be stored and how it will be presented and manipulated.  Any user whose information is or ought to be in a different schema is out of luck—they can’t record it properly (my physical therapist recently observed how frustrated she was struggling to enter all the data about her patients in the electronic medical record—until she discovered she could put it all in the comments!).  Thus, users who have these nonstandard schemas are generally forced into the small set of tools that can handle arbitrary schemas, most frequently spreadsheets.  The Homebrew Database work highlights how severe the consequences are and even observes that schemas frequently need to change on the fly as underlying information needs change.

Fixed-schema applications also pose a severe barrier for users who want to connect information from multiple applications—for example, linking a person in your address book to the soundtracks in your media player that that person composed.  Since these applications are unaware of each other’s schemas, they can’t do anything with (or even refer to) each other’s data.  I discuss this issue further in a paper on data unification.


A Semantic Web Fix?

Now that I’ve argued that there’s a severe problem to be solved, the next step is to propose an approach to solving it.  Tomorrow, I’ll argue that ideas at the core of the Semantic Web offer a way forward, and justify my claim with a number of example Semantic Web applications targeting end users.


Try out Habitbug!

One of the things that’s been interesting to us for a while now is how we can use our friends to help us get things done – friendsourcing. (See some example previous posts here, here, here, and here.)

We’re excited to launch habitbug, which is a Twitter app that helps you form and maintain habits by holding you accountable to all of your friends.

Since this is an experimental study, there are some slight differences for different users, but here’s the basic system:

  1. Pick a habit. This should be something you want to do (or avoid) every day. My current goal is to meditate, even for just a few minutes.
  2. Set a time that you want us to remind you. We’ll send you an @reply when it’s time.
  3. Check in every day that you do your habit. To make it easy, you can just reply to your reminder from @habitbug with any text you want.

With this system, we’re hoping to learn more about how we can use your friends as an accountability mechanism. I’d love to hear more about what you think! Try it out here! If you find any bugs or issues, please send us an email.

Converging Online Education and Online Journalism

The Nieman Journalism Lab recently collected a number of opinions on interesting trends in online journalism.  You can read the whole set here, but for those too lazy to click, here’s my own contribution.

Massive open online courses (MOOCs) are widely believed to be revolutionizing education. But I think they also suggest some really interesting futures for journalism.


In particular, I’m excited about the online discussion forums that accompany the MOOCs. These forums transform students from passive consumers of information into a community of inquiry who are actively engaged in asking questions and collaboratively working out answers. We need the same in journalism.

Too often, the forums hanging off news sites are troll-filled wastelands, where the best content one can hope for is a particularly well crafted putdown. In contrast, the MOOC forums exhibit high quality discussion where questions are asked, answers proposed and critiqued, and conclusions drawn in a style that supports and encourages other students. We’ve even seen the emergence of student leaders who are particularly adept at guiding others to find or construct needed information.

For most people who’ve finished school, journalism is probably the primary source of new information. What can we do to improve the news consumer’s “education”? Can the news “anchor” become the course “teacher”? With current events as the source material, what kind of MOOC in foreign affairs or government policy could be taught by a big-name journalist? Driven purely by interest in learning, thousands of MOOC students are doing “homework” to improve their knowledge, exercises that are graded by the computer and essays graded by peers in the class. What assignments could the journalist create to enhance a student’s understanding of a foreign country or a difficult budget or policy question? What would it be like if readers could submit peer-graded essay responses instead of grouchy complaints about biased media? Could this student-authored content actually start contributing to the news?

Journalism and education are siblings: if you’re informed but not educated, you have no context to interpret the information you’re getting; if you’re educated but not informed, you’re living in an ivory tower. In MOOCs I see the beginnings of a trend that might draw these two information-delivery mechanisms together in a powerful way.

Two Funny Things at the 2012 International Semantic Web Conference

I spent last week at the 2012 International Semantic Web Conference.  This conference addresses the important topic of structured data on the web.  I had two “funny” experiences; one humorous and one peculiar.

At the beginning of the conference, I was amused to see that ISWC, whose central theme is linking the web’s data together into a coherent whole, had more trouble than any other conference I’d been to in picking a Twitter hashtag for the event.  Most conferences just announce one at the beginning, but at ISWC it was left to emerge “organically”, which meant tweets were inconsistently tagged as #iswc, #iswc12, #iswc2012, or #iswcboston.  I tweeted a joke to this effect.  The responses that I got back were classic.  Reflecting the philosophy of the Semantic Web, various individuals argued that this was a good thing; that expecting everyone to agree on a single vocabulary was contrary to the Semantic Web vision of linking disparate ontologies.  Another pointed out that if only Twitter were “doing things right”, treating its hashtags as ontological entities and letting different people label each entity differently, then we wouldn’t have this problem.  These responses are completely logical but ignore reality.  We may know better than Twitter how things ought to be, but in the meantime there’s an easy solution (that most other conferences have adopted) that works fine with the way Twitter is now.

The more seriously funny experience was at the ISWC demo session.  The two demos that most impressed me were systems for (i) browsing upcoming events (concerts etc.) and (ii) browsing academics and their publications.  Both of these systems were characterized by rich data models and nicely designed user interfaces that delivered valuable information and insights from their chosen domains.

The funny part is that neither of these applications should really be called a “Semantic Web application.”  Someone unaware of the Semantic Web, tasked with building these applications, would see a traditional data management and visualization problem that they would solve using traditional database tools (SQL) and web APIs.  The fact that these tools are storing their data in a triple store instead of a SQL database is irrelevant to the user experience.  And the fact that at least one of them is exposing a SPARQL endpoint for querying the data they are managing is good citizenship, helpful to the next project, but not important for this one.

This story fits what I argued in a talk at an ISWC workshop on programming the semantic web.  The original description of the Semantic Web envisioned applications that could wander through a linked world of hundreds of different ontologies, discovering and learning new ontologies as they went and combining information from all of them to produce valuable answers.  It seems to me that the vast majority of applications don’t need this power.  Instead, these applications have fixed ontologies imposed by their creators at creation time.  They can therefore be created using traditional techniques.

This begs an obvious broader question: what kind of work is Semantic Web research that should appear at ISWC?  I ask this not in the interest of jealously guarding the “purity” of a discipline—I like breadth—but in the interest of directing research to the venue where it can be best disseminated and evaluated.  The Semantic Web technology stack is pretty mature at this point.  But that means that using a Semantic Web back-end doesn’t automatically turn your project into Semantic Web research.  If I build a traditional interactive application on top of a triple store, my contribution is the application, and it should probably go to a conference like CHI that specializes in assessing human-computer interaction.  A system that uses natural language processing or machine learning to recognize entities in text doesn’t suddenly become a Semantic Web contribution by outputting its results in RDF; instead it should be submitted to a venue like NAACL or ICML where it can be assessed by the best researchers in NLP and ML.

One might worry that the Semantic Web is going to suffer the same image problem as AI: that as soon as it works, it isn’t Semantic Web.  But I don’t think that’s the case.  There are certain research questions that are, and will continue to be, core to the Semantic Web.

With regard to the Semantic Web’s role in traditional applications, I would love to see at ISWC some studies that compared the relative developer effort required to build applications using the traditional and Semantic Web tool stacks.  Nobody’s going to argue against making data easier to reuse.  But the Semantic Web community still has to prove, I think, that their approach to reusability is better than others.  If I’m going to build a traditional application that consumes and manipulates data from one or two fixed sources, does using a triple store instead of a SQL (or NoSQL) database make it easier for me to build that application or maintain it later?  Most applications hide their databases behind object-relational mappers, so will it even be noticeable which underlying database technology I am using?  When I want to pull data from my target source, does it help me to have that data available via a SPARQL endpoint, or would it be just as effective to present it to me via a SQL endpoint, or an API that returns JSON objects?

If we are able to make a case that the Semantic Web really does help with reuse of data, then there’s a host of ISWC-relevant questions around transitioning the legacy of traditional data repositories to the Semantic Web.  For example, this paper shows how to “scrape” a traditional web API so it can be used with other Semantic Web tools.

Then there are the true Semantic Web applications, pan-schematic systems with no built-in assumptions about the schemas to which they are applied.  Almost by definition, these systems aren’t designed for domain-specific tasks; however, they can be really useful for general-purpose information seeking, browsing, or organization.  Tools like Tabulator try to support generic data browsing; semantic desktops like our old Haystack system provide personal information management over arbitrary schemas.  There’s also the Semantic Web search problem, of being able to search data that is structured but has no particular schema, more effectively than we can via text search.  Progress on these problems has been far slower than I expected or hoped; it seems like we’re mostly still stuck in the world of “big fat graph” visualizations.  This is a place where I’d really like to see ISWC focus its attention.  Perhaps this could serve to define a Semantic Web Challenge for next year: build an application that would let you win a scavenger hunt over the linked open data web.



“Living with Big Data: Challenges and Opportunities”, Jeffrey Dean and Sanjay Ghemawat, Google Inc.

As part of the Big Data Lecture Series — Fall 2012, Google’s Jeff Dean gave a talk on how Google manages to deliver services which involve managing huge amounts of data. In order to make things work over the distributed infrastructure of Google’s several data-centers, they use services and sub-services. Each service uses a protocol to communicate with other services. These protocols are language-independent. Dean gave an example of a simple spell correction service which takes a request, such as correction{query:”……”}. The advantage of this model is that it is independent of the client and it’s easy to make changes with no ripple effect. For instance, to add a language feature to their spell correction service, they just need to add an extra optional request parameter: correction{query:”…….”, lang:”en_US”}. It also allows them to build each service independently.
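The value of optional fields is easy to see in miniature. This is my own sketch, not Google’s actual protocol code: old clients simply omit the new field, the service falls back to a default, and nothing else in the system has to change.

```python
# Sketch: an optional, defaulted request field keeps old clients working
# while letting new clients opt in to the new feature.
def correction(request):
    query = request["query"]             # required field
    lang = request.get("lang", "en_US")  # new optional field, with a default
    return f"corrected({query!r}, lang={lang})"

print(correction({"query": "speling"}))                   # old-style client
print(correction({"query": "speling", "lang": "fr_FR"}))  # client using the new field
```

In Google’s case the same property comes from the protocol layer (optional fields in the wire format) rather than from dictionary lookups, but the backward-compatibility argument is the same.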

Since Google has a lot of clusters in different datacenters, the list of potential things that can go wrong is long – rack failure, router failure, hard drive failure, machine failure, link failure (especially long-distance links, which are susceptible to external hazards like wild dogs and drunken hunters) are just a few! So, the software itself must provide reliability and availability. Replication allows them to handle hardware failures and issues such as data loss, slow machines, excessive load, and bad latency. In order to tolerate latency, they primarily use two techniques – cross-request adaptation and within-request adaptation. The cross-request adaptation technique examines recent behaviour and accordingly makes decisions for future requests. On the other hand, the within-request adaptation technique copes with slow subsystems in context; it uses “tied requests”, i.e. each request is sent to two servers (the second with a delay of 2ms). As soon as one of the two starts processing the request, it notifies the other to stop. Google ran experiments and found that latency improves substantially at the small cost of a few extra disk reads.
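The tied-requests idea can be illustrated with a small simulation. This is my own simplified sketch, not Google’s implementation (in particular, real systems send a cancellation message over the network rather than sharing a lock):

```python
import threading
import time

# Sketch of "tied requests": the same request goes to two replicas, the
# second after a small delay; whichever replica starts first claims the
# request and the other one backs off.
def replica(name, delay, state, results):
    time.sleep(delay)
    with state["lock"]:
        if state["winner"] is not None:
            return                      # the other replica already started: stop
        state["winner"] = name          # first to start claims the request
    results.append(f"served by {name}")

state = {"winner": None, "lock": threading.Lock()}
results = []
t1 = threading.Thread(target=replica, args=("replica-a", 0.0, state, results))
t2 = threading.Thread(target=replica, args=("replica-b", 0.002, state, results))  # 2 ms tie delay
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # exactly one replica ends up serving the request
```

The payoff is in the tail: if the first replica is stuck behind a long queue, the delayed copy starts on the other replica instead, so slow outliers are absorbed at the cost of occasional duplicate work.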

In order to manage huge amounts of data over distributed infrastructure, Google has several cluster-level services, such as GFS/Colossus, MapReduce, the cluster scheduling system, and BigTable. Although these services solve many problems, they also introduce several cross-cluster issues. To solve these cross-cluster issues, Google has built Spanner, a large-scale storage system that can manage data across all of Google’s data-centers. Spanner has a single global namespace for data. It supports consistent replication across data-centers and auto-migration to meet various constraints, such as a resource constraint (“file system is getting full”). A migration could be driven by an app-level hint — “place this data in Europe”. The key idea is to build high-level systems which provide a high level of abstraction. This black box is incredibly valuable since applications don’t need to deal with low-level issues.

Monitoring and debugging are crucial in a distributed environment. Every server at Google supports request tracing (call graphs), online profiling, debugging variables, and monitoring. Google has a tool called Dapper which allows them to monitor and debug their infrastructure.

Much of Google’s work is approximating AI. Recently, they have been working on infrastructure for deep learning. Deep learning is an algorithmic approach to automatically learning high-level representations from raw data. It can learn from both labelled and unlabelled data (the latter via unsupervised learning). A model can have billions of parameters and require many CPUs to train. To deal with this scale, Google partitions the model, adding another dimension of parallelism, with multiple model instances communicating with each other. Google has built a deep network for machine learning (learning image representations, and natural language processing for both speech and text) that significantly reduces training time. They in fact trained an acoustic model for speech recognition in approximately 5 days with 800 machines in parallel.
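The data-parallel half of this scheme can be sketched in miniature. This is my own toy illustration, not Google’s system: several model replicas each compute gradients on their own data shard and push updates into a shared parameter store (run sequentially here; in the real system the replicas run on different machines and update asynchronously).

```python
# Toy sketch of data-parallel training: replicas share one parameter store
# and each pushes gradient updates computed on its own data shard.
params = {"w": 0.0}

def replica_step(shard, lr=0.1):
    # gradient of a trivial loss (w - x)^2 averaged over this replica's shard
    grads = [2 * (params["w"] - x) for x in shard]
    params["w"] -= lr * sum(grads) / len(grads)   # push update to shared params

shards = [[1.0, 1.2], [0.8, 1.0]]   # different data on each replica
for _ in range(100):
    for shard in shards:            # sequential here; parallel/async in practice
        replica_step(shard)
print(round(params["w"], 2))        # settles near the overall data mean, 1.0
```

Model partitioning is the complementary dimension: when a single model’s parameters don’t fit on one machine, the model itself is split across machines, which must then exchange activations and gradients at the partition boundaries.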

Faculty Summits and Industry-Faculty Collaborations

By some statistical fluke this summer I got invitations to and attended faculty summits at Google, Microsoft, and Facebook within a period of two weeks.  All were well run and a lot of fun, but left me wondering whether there are better ways to foster collaborations between faculty and these great companies.

Each company put on an admirable event.  The scales were different—400 attendees at Microsoft, 100 at Google, and 30 at Facebook.  But the overall structure was pretty similar for all three.  The bulk of the time was devoted to conference-type presentations by company engineers and researchers, highlighting a bunch of the interesting work they were doing.  A few faculty also presented on the results of their collaborations with the company.  There was a presentation on funding opportunities.  And, in the evenings, conversational dinners.

The summits were very well done and entertaining to attend.  But I’m not convinced they took best advantage of the opportunity presented by the gathering of faculty.  This was really brought home to me at a Google summit presentation on online education: It explained how the traditional model of education, with one faculty member presenting to a room full of passive students, has become outmoded now that such presentations can be recorded for online consumption by anyone.  Of course, this presentation was given to a room full of passive summit attendees.   And it felt a lot like a classroom, with many “listeners” directing their attention to their email.

Given these summits’ strong similarity to conferences, centered on a sequence of talks, it’s worth remembering that the real value of conferences is in the hallways where attendees can have one-on-one conversations.  I saw many of these conversations happening at the summits, but the majority seemed to be among faculty who already knew each other, and who have plenty of opportunity to talk to each other at real conferences.

What was disappointingly scarce was the one thing that these summits seem distinctively suited to generate: faculty-industry dialog.  Emphasizing this point, each summit offered one event that did support such dialog; in each case I found it to be the most valuable part of the summit, which highlighted the limits of the rest.  Microsoft held a demo/poster session, where faculty circulated among various Microsoft groups who presented the projects they were working on.   With the faculty spread out among numerous projects, there was lots of opportunity for small-group discussions that could really dive into technical issues and possibly identify shared research interests.  At Google, the summit held “breakout sessions” on various topics; small mixed groups of faculty and Googlers spent an hour discussing specific topics of interest posed by Google.   I’ve already held a follow-up discussion with some Googlers around the topic of one such discussion, and can see some great research questions emerging.  Facebook held a single mixed “round table” (feasible given its small size) that went meta, discussing the question of how to enhance Facebook-faculty collaboration.   Also noteworthy at Facebook was my lucky dinner seating between two Facebookers that gave us time for lengthy discussion of some research questions.

These relatively short interactions gave me a sense of how much potential these summits have to foster interaction.  Working off them, here are some thoughts on how to fulfill that potential.

  1. Can the lectures.  Instead of presenting company research to the few physically-present faculty, record the talks and post the canned lectures so everyone, not just attendees, can see them.  Have summit attendees watch them in advance to prepare for the summit.
  2. Bipartite poster/demo sessions.   Copy Microsoft’s demo/poster session which lets faculty learn about lots of different projects happening at the company and engage in small focused dialogs on projects they’re interested in.  Then invert it: set up a mirror session where the faculty are the ones with the demos and posters and employees circulate to discover interesting connections.
  3. Bring the mountain to Mohammed.  In particular, a faculty poster/demo session is a much cheaper way for potentially thousands of employees to get a sense of faculty research, and a chance to influence it, than sending those thousands of employees to conferences.  Faculty seemed to significantly outnumber company attendees at these summits, suggesting a missed opportunity for interaction.  I know these companies are full of PhDs who enjoy talking about research.  Where were they?
  4. Mix things up.  Break up the knots of old-buddy faculty and get them talking to employees.  Enforce mixed seating at meals.  Use some of the company’s cool technology to decide which faculty should be meeting which employees, and make sure it happens.
  5. Questions not answers.  Most of the talks presented finished work.  If the work is done then there’s no collaboration opportunity.  It would be great instead to see presentations of research objectives, which might help flush out others with related objectives, or with tools or data that could help meet those objectives.
  6. Unconference planning.  I’ve attended a few “unconferences” such as Foo camp where attendees set the agenda communally after they arrive.  This ensures that the topics are what the attendees actually want to talk about (the planning process also helps identify shared interests).  And the inability to prepare in advance means less presentation and more productive discussion.

Admittedly, these suggestions are predicated on the assumption that faculty research might have something useful to offer to these companies.  Given the tremendous creative talent that these companies employ, that isn’t clear: perhaps our research is unimportant and the goal is simply to impress us into recommending these companies as good employment destinations for our students.  I’ll try to tackle that question in a separate post.



Congress, the NSF, and Social Science Research

For the past few weeks I’ve been following the Monkey Cage blog as it has followed the vote by the House of Representatives to prohibit the National Science Foundation (NSF) from funding political science research.   These days I tend to roll my eyes and feel helpless when Congress takes silly positions for political reasons (though I enjoy the irony of the House deciding that its own activity isn’t worthy of study).  But today an op-ed appeared in the Washington Post, which I expect to be more sensible/less political than Congress, arguing that the NSF should defund research not only in politics but in all of social science, on the grounds that it isn’t possible to design rigorous controlled experiments in the social sciences or draw objective conclusions from the results.  While I suppose I might benefit from the redirection of that money towards “hard” science like CS, I don’t want to let such a blatantly false claim stand unchallenged (the Monkey Cage also has a nice rebuttal).    Especially since, if it stands, a renewal of the congressional assault on research in human-computer interaction is surely around the corner.

So, to falsify the claim, I’ll just mention two fine examples of careful controlled studies in the social science of computer supported cooperative work.  Matthew Salganik and Duncan Watts ran a great controlled experiment showing how the popularity of music is substantially “self fulfilling”—music that is initially highly ranked is ultimately highly ranked by the whole community, even if the initial rankings are random.  At the conference on Computer Supported Cooperative Work, a source of plenty of such studies, Farzan, Pal, Kraut and Konstan used a controlled experiment to assess the impact of different kinds of socialization strategies on new entrants to a technical support forum.

These are just two examples of many; even one would serve to falsify the claims made in the Post’s op-ed.  I wonder if the author will be scientific/objective enough to retract his piece given the new data?

Oh well.  I suppose we should all write our congressmen before this goes any further.

To improve the CHI conference, would you share which talks you attended?

I’m having a great time at CHI (including my first time two-stepping today) but I strongly believe, as Jonathan Grudin asserted today, that we can make use of data to improve the conference.  I’ve already analyzed historical data that demonstrates that we can substantially reduce reviewer workload.  We’ve also created a way you can use current data to meet new people at CHI.  And if you’d be willing to share some of your own data, I’d like to try to use it to improve future conferences.

For the current conference, there’s a nice tool, created by Alireza Sahami, for meeting interesting new people at CHI this year.  If you visit the mobile version of the CHI program (which also works fine in a regular laptop browser) you can mark some papers you’ve attended or noted to read (just as you can with the CHI program apps—unfortunately, the data is not shared between them, so you’ll need to repeat).  Then, clicking the recommendations link and enabling recommendation will give you a list of users who are most similar to you in terms of the papers they’ve chosen to attend.  If you don’t know them, send an email and meet them to discuss your shared interests!
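For the curious, the matching idea can be sketched roughly as follows.  The tool’s actual algorithm isn’t described here, so this is just one plausible approach: rank other attendees by the Jaccard overlap between the sets of papers they marked.  All names and data below are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar(my_papers, others):
    """Rank other attendees by overlap with the papers I marked."""
    return sorted(others, key=lambda name: jaccard(my_papers, others[name]),
                  reverse=True)

# Hypothetical marked-paper data for two other attendees.
marked = {"alice": ["p1", "p2", "p5"], "bob": ["p3", "p4"]}
print(most_similar(["p1", "p2", "p3"], marked))  # alice ranks first
```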

For future conferences, I’d really like to investigate the talk attendance data to see if we can do a better job of scheduling the conference to avoid conflicts.  In a past life I did a lot of work in combinatorial optimization, and I have my eye on some algorithms that I believe would do an excellent job of automated low-conflict scheduling.  But to test this hypothesis (and convince the CHI chairs to give it a try for real) I need actual talk preference data from the attendees.  Right now, if you’re marking those interesting talks using the Android or iPhone apps, that data is locked inside the app.  Would you be willing to share it with me so I can experiment with it?
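As a taste of what such scheduling could look like, here is a simple greedy baseline sketch; it is not necessarily one of the algorithms alluded to above, and all names and data are hypothetical.  Given which attendees marked which talks, it assigns each talk to the parallel slot where it collides with the fewest of the attendees’ other choices.

```python
def conflict(prefs, t1, t2):
    """Number of attendees who marked both talks."""
    return sum(1 for wants in prefs.values() if t1 in wants and t2 in wants)

def schedule(prefs, talks, n_slots):
    """Greedily place talks into parallel slots, minimizing attendee conflicts."""
    slots = {s: [] for s in range(n_slots)}
    for talk in talks:
        # Pick the slot where this talk conflicts least with talks already
        # placed there; break ties toward the emptier slot.
        best = min(slots, key=lambda s: (
            sum(conflict(prefs, talk, t) for t in slots[s]), len(slots[s])))
        slots[best].append(talk)
    return slots

# Two attendees want talks a+b, one wants c+d: a/b and c/d should be separated.
prefs = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"c", "d"}}
print(schedule(prefs, ["a", "b", "c", "d"], 2))
```

A real scheduler would use integer programming or local search over the full preference matrix, but even this greedy pass keeps heavily co-marked talks out of the same slot.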

I still need to get IRB approval to use this (personally identifying) data, so I don’t want it yet.  Instead, would you please fill out this form so that I can contact you once I have set up a repository and a procedure for collecting the data?  In the meantime, don’t delete your app!  It’s the only place where your talk selections are stored.

For CHI 2012: Discussion Forums in the Document Margins

Would you like some feedback on your CHI paper?  We’ve set up a site to let people read and comment on it.

On Wednesday at CHI, we’ll be presenting our paper on nb, a discussion forum situated in the margins of the documents being discussed.  Its original intended usage was for discussion of classroom lecture notes, but we have discovered through our reading group that it is also quite useful for reading, commenting on, and discussing research papers.   With this in mind, we’re making nb available to the CHI 2012 conference community.

We’ve created a folder on nb for CHI papers; there, you can read and annotate/discuss papers that have been uploaded by their authors.  There are currently 4 CHI papers there.  The first, of course, is our own paper on nb.  We’d love your comments, especially if they arrive before we present on Wednesday morning!  If you want to read and discuss, just register on the site.  If you want to make your paper available for comments, please email your paper to and he’ll upload it (yes, this is clunky—the system is currently designed for a faculty member who is uploading their course content, without means for “mere participants” to do the same).  I’d like to suggest, since you are seeking comments on your paper, that you “pay it forward”—read and comment on one of the papers already there, in the hope that someone will then do the same for you.

The paper itself is focused on classroom deployment.  Nb has been used in about 60 classes at 5 universities, with several faculty choosing to use it repeatedly.  We study a particularly successful use of the tool, in a class at MIT where it acquired over 14,000 comments from 100 students.   Using both quantitative and interview data, we explain why situating discussions in the margins can work better than the typical separate-forum approach.

If you’re teaching a class, we’d love to help you use nb for it.  Contact us!