It may not be obvious from this blog, but I started out as an algorithms researcher and still work in the area. One of my favorite kinds of work is helping practitioners use algorithms to solve their real-world problems. Recently (more precisely, a couple of too-busy-to-blog months ago) I attended a workshop on Algorithms in the Field devoted specifically to that area, which prodded me into some action. I believe that often the algorithmic solution to a practical problem already exists, and the main challenge is connecting the practitioner with the problem to the algorithmicist aware of the solution. At the extreme, you can sometimes solve the whole problem with a five-minute conversation and a preexisting algorithms paper (that was my experience with this paper on natural language processing that we published at NAACL).
Too often when I ask a practitioner for such problems, they come back with a problem where they aren’t quite sure what it is they want to compute. Algorithms aren’t as directly useful in this “problem definition” stage. Instead, the kinds of problems most susceptible to algorithmic contributions are ones where the practitioner already knows exactly how to solve them with too much time/space/computational resources. Good algorithms can often substitute for these resources.
So if you happen to have a problem of that sort, please be in touch.
Among the other talks I attended at CHI 2011 was Justin Matejka et al.’s “Ambient Help”. The Ambient Help system is designed for complex desktop applications like AutoCAD or Photoshop, which tend to come with a steep learning curve, and which tend to require a continued learning process even from more experienced users who use the product frequently. Much of the system’s novelty comes from its unobtrusive nature; the system runs on a separate monitor from the application in use, and uses a context-sensitive search heuristic to display articles and tutorial videos that might be relevant to whatever the user is currently doing. If the user sees a resource of interest among the several suggested ones, it is easy to open or play a particular one. Conversely, it is trivial to ignore the help system entirely by simply looking away from the second monitor. Contrast this with MS Office “Clippy”…
The project includes a user study on AutoCAD, which shows the search heuristic to be quite successful in finding help resources that the user ends up pulling up for more details. For the creative, not time-constrained task, users visited 2.6 times as many interesting resources with the real-time Ambient Help system enabled as with a corresponding manual system (YouTube search, standard online documentation), without investing a larger portion of their time interacting with the help system.
There is little I like more than a fine cheese and fresh-baked bread. Still, to fill the rest of my day without expanding my waistline, I go for a mix of databases and human-computer interaction. That’s why I was excited to see several database-oriented papers presented at CHI. While many papers contained some amount of data, I’ll stick to the three that are unquestionably of interest to the databases community.
The first paper was for the social scientist in all of us. Amy Voida, Ellie Harmon, and Ban Al-Ani presented Homebrew Databases: Complexities of Everyday Information Management in Nonprofit Organizations. Nonprofits are arguably some of the most difficult database users to design for. They have minimal resources, rarely employ fulltime technical staff, and solve non-core problems as they show up. This practice leads to homebrew, just-functional-enough solutions to many data management problems. The authors provide an interesting qualitative study of how nonprofits manage volunteer demographic and contact information. They provide descriptions of the homebrewed, often fractured collections of data stored in several locations. Reading this paper, I couldn’t help but think of how perfectly these homebrewed databases resembled Franklin, Halevy, and Maier’s dataspaces.
Sean Kandel presented Wrangler, a project he’s been working on with Andreas Paepcke, Joe Hellerstein, and Jeff Heer. Wrangler lets users specify transformations on datasets by example. Each time a user shows Wrangler how to modify a record (or line of unstructured text), Wrangler updates its rank-ordered list of potential transformations that could have led to this modification. Wrangler borrows concepts such as interactive transformation languages from Vijayshankar Raman and Joe Hellerstein’s Potter’s Wheel. Its interface has a taste of David Huynh and Stefano Mazzocchi’s Refine as well as Huynh’s Potluck. Wrangler’s novelty comes in combining the interfaces and transformation languages with an inference and ranking engine. Since Wrangler is hosted, it is also capable of learning which transformations users prefer and improving its rankings over time!
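To make the by-example idea concrete, here is a minimal sketch of how a ranked list of candidate transformations might be narrowed as the user supplies examples. This is not Wrangler's actual engine: the candidate list, the `rank_transforms` function, and the `usage_counts` popularity signal are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical candidate transformations; a real system would have a far
# richer transformation language than these five string functions.
CANDIDATES: List[Tuple[str, Callable[[str], str]]] = [
    ("strip whitespace", str.strip),
    ("uppercase", str.upper),
    ("lowercase", str.lower),
    ("keep text before first comma", lambda s: s.split(",")[0]),
    ("strip, then title-case", lambda s: s.strip().title()),
]


def rank_transforms(
    examples: List[Tuple[str, str]],
    usage_counts: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Return the names of candidates that reproduce every (before, after)
    example, with the most frequently chosen candidates first."""
    usage_counts = usage_counts or {}
    consistent = [
        name
        for name, fn in CANDIDATES
        if all(fn(before) == after for before, after in examples)
    ]
    # sorted() is stable, so ties keep the order of CANDIDATES.
    return sorted(consistent, key=lambda name: -usage_counts.get(name, 0))


# One example leaves several explanations open...
print(rank_transforms([("alice", "alice")], {"lowercase": 5}))
# ...and a second example narrows the list.
print(rank_transforms([("alice", "alice"), ("a,b", "a")]))
```

Each additional example can only shrink the consistent set, and the assumed usage counts stand in for the kind of preference learning a hosted tool could do over time.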
The last slot goes to our own Eirik Bakke, who presented Related Worksheets along with David Karger and Rob Miller. Related worksheets make foreign key references a first-class citizen in the world of spreadsheets. Just as spreadsheets secretly made every office worker capable of maintaining a single-user, single-table relational database, Eirik has secretly enabled those workers to make references between spreadsheets without having to program. While adding foreign key references to a spreadsheet requires a simple user interface modification, its implications on how to display multi-valued cells in the spreadsheet are significant. Read the paper to see Eirik’s hierarchical solution to this problem!
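A small sketch may help show why a foreign key reference forces the multi-valued cell question. It is not the design from the paper: the two toy worksheets, their column names, and the `nest` and `render` helpers are assumptions made up for illustration, and indentation is just one possible hierarchical display.

```python
# Two toy "worksheets" as lists of rows (hypothetical data).
courses = [
    {"id": 1, "title": "Databases"},
    {"id": 2, "title": "HCI"},
]
enrollments = [
    {"student": "Ana", "course_id": 1},
    {"student": "Ben", "course_id": 1},
    {"student": "Cy", "course_id": 2},
]


def nest(parents, children, key, fk, into):
    """Return copies of the parent rows, each with a multi-valued cell
    (a list) holding the child rows whose foreign key points at it."""
    return [
        {**parent, into: [c for c in children if c[fk] == parent[key]]}
        for parent in parents
    ]


def render(rows, into, indent="  "):
    """Flatten nested rows into text lines, indenting the child rows
    under the parent row they reference."""
    lines = []
    for row in rows:
        lines.append(", ".join(f"{k}={v}" for k, v in row.items() if k != into))
        for child in row[into]:
            lines.append(indent + ", ".join(f"{k}={v}" for k, v in child.items()))
    return lines


nested = nest(courses, enrollments, key="id", fk="course_id", into="students")
print("\n".join(render(nested, into="students")))
```

The point of the sketch is that following one reference turns a flat cell into a list of rows, so the display has to decide how to show a table inside a table.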
Keep it up, data nerds! Soon we’ll be able to start a data community at CHI!
There’s a lot of discussion about the right way to evaluate and support systems research in SIGCHI. Maybe too much. (I’m allowed to say that because I contributed to it, right?) But for this to be a productive conversation, we need to tackle the other half: what makes for a bad systems paper?
I say bad paper, rather than bad research, because often this is about framing and not the actual work. My conversations at CHI and throughout the alt.chi process helped draw out some of the common killer problems that HCI systems papers run into. These are legitimate problems with a paper, and we need to own up to them if we want our work to be taken seriously.
Issue 1. My Contribution is the System
Pete Pirolli hit this one on the nose at the alt.chi presentation. Systems authors often frame the technological artifact they built as the entire contribution of the paper. The fact that I built a system, say one called ACRONYM, is largely immaterial. In a way, it’s part of the evaluation: ACRONYM is proof that the ideas can be instantiated. But what are the ideas driving the system design? In order to learn something from the paper, we need to focus on the ideas rather than the system when describing our contribution.
Issue 2. My Study Proves That This Is Unquestionably The Best
Many social scientists who I talked to complained that systems papers often overclaim their results based on a small study. If you read a CHI paper by one of your favorite social scientists, they are very good at clearly scoping what can and can’t be concluded from a study. CS has a way of always claiming that my ACRONYM system absolutely buries all the competition. If we are a little more careful in our claims, I think it will help many systems papers on the bubble.
This past week at CHI, our very own Michael Bernstein participated in a panel discussion about the role of replication and reproduction in the CHI community. Thanks to Max Wilson, the panel coordinator, I got the opportunity to log the event and live-tweeted the whole thing; here are my notes.
Max started things off with these comments:
- Replication is a cornerstone in some fields, in CS it’s often a benchmarking tool.
- HCI often suffers from a lack of generalizability, but replication to fix that problem can be very time consuming.
- We also aren’t entirely a science community – would you try to replicate art?
Wendy Mackay was the first invited panelist:
- CHI crosses disciplines, and so do attitudes about replication.
- We often draw from experimental psychology (start with a model, revise the model, and replicate things in between), as well as from ethnography (observations and re-observations).
- These approaches focus on developing theories or knowledge about the world, whereas design focuses on building artifacts.
- We also draw from engineering and computer science – engineering has repetition, but much of CS does not.
Harold Thimbleby followed:
- He promised to share a core science background (as opposed to Wendy’s psychology framing).
- “The only reason you’re in this room today is because you’ve got hope [...] to live [...] and hope for the future of CHI.” (maybe paraphrased a bit!)
- CHI hopes to change the world for the better. In order to do it with confidence, we often use statistics measuring our confidence.
- We get excited at conferences by ideas, and we go home and try to use those ideas. That’s replication.
- Those iterations cause evolution-like improvements of ideas and knowledge.
- Deliberate reproducibility is good science, and it can train young scientists and fix issues.
- “Non-reproducibility is cheating” – if we don’t make the process needed to reproduce work clear in papers, we fail as authors.
- In reality, we need to get people to use our ideas. We write papers to spread our ideas.
- “Sadly, most of what we publish isn’t reproducible.”
- A third of papers published in a machine learning journal weren’t reproducible (this was determined by a survey of authors in that journal).
- HT replicated this by asking three other journals and found the same thing – this is a problem in computer science, not just in HCI.
- We can look at post-war cargo cult examples as a parallel to our work – they built planes and other war paraphernalia hoping that it would result in cargo drops, but missed the point. Similarly, we often neglect to reproduce things at a useful level.
- We do have reasons for not being reproducible, including business ones. A study of different Casio calculator models saw different answers to arithmetic problems, which was obviously not something Casio wanted published.
- Being reproducible on consumer devices can be really detrimental to a business.
- “Go forth and reproduce [create new scientists] and be reproducible [with your work].”
Next up was Ed Chi, with the point of view of industry research:
- “There is more to replication than simply duplication.”
- Early contributions to the field came from computer scientists and cognitive psychologists.
- In a memo establishing HCI research at PARC, it was evident that there was a need to establish HCI as a science.
- The intellectual heritage of HCI comes from Vannevar Bush and JCR Licklider, augmenting cognition.
- Our background comes from psychology, where replication is the norm (echoing Wendy).
- Psychology teaches students early on to design good studies.
- In CHI 97, there was a browse-off – the hyperbolic browser won, but replication attempts showed no clear winner.
- Individual differences in subjects were overwhelming anything in the design of the browser, showing the value of replication as a tool to more fully understand what was happening.
- This first experiment at CHI 97 was just the beginning of something bigger, and that’s why replication was needed, and is still needed.
Michael spoke next on behalf of grad students everywhere:
- He couldn’t speak for everyone but used an “unassailable, extremely scientific data collection protocol” (this is facetious) and got responses from 93 students (his social network and student volunteers).
- 83% of grads hadn’t ever replicated a study, 62% said “hell no” they never would replicate a study or a system.
- One response said “I’m more creative than that”, another said “New studies confirming old studies have no chance of publication.”
- There’s a general perception that reviewers don’t feel that work is necessary, and that it isn’t novel.
- “The grad student must conform”, and so, since no one’s publishing replication work, there isn’t any more being published.
- He also solicited haikus – “Think analyzing / CMC is tough? Try it / reproducibly” and “Repeat to be sure / We stand on giant’s shoulders / But do so on faith.”
Dan Russell from Google, speaking with the experience of someone with access to large data sets:
- What CHI insights can we replicate?
- Replicating a measure should be straightforward, but it’s not in our very diverse community.
- The knowledge needed for replication sometimes gets left out of papers.
- Changing things slightly, such as in wording or font, can dramatically change the ability to reproduce work – so does a change on the web.
- DR was conducting a study about finding difficult-to-locate information online, and suddenly, everyone got WAY better… because someone had posted the answer online on a Q&A site! Changes that are out of our control online can dramatically affect reproduction.
- Google is kind of a Large Hadron Collider. We can’t reproduce the LHC studies without our own, so we must take them on faith. Likewise, we don’t all have access to Google’s huge data sets or user bases, and so we must take some of that on faith as well.
- “Ultimately, we are a faith-based community. And that’s the nature of science.”
NB that the panelists posted statements beforehand on replichi.org; look there for more detailed summaries.
There were several questions and comments that prompted discussion. I’ve gotten them down here as best I could. Apologies if I’ve misquoted or misattributed anything!
- Gary Olson, from UC Irvine – Wendy said we should replicate and extend. [...] Extension is critical.
- Wendy – “I of course agree. But there’s a disciplinary issue.” Some things are relatively easy, depending on what their intellectual heritage is; some can’t be done.
- Ed – We often place the responsibility of generalizability on the author. He or she must make that claim. In other fields, that burden falls on the reader.
- Sharoda Paul from PARC – We must address the interdisciplinary nature of CHI. How can we manage the expertise and backgrounds between reviewers?
- Ed – depending on the person, there can be a sense of “why should we waste our time on replication?” – but replication can heighten understanding.
- Ed – part of the goal of this panel is to change the between-reviewers issue.
- Harold – we should note that there are different types of reproducibility:
  - Replication work done to acquire skills and to learn.
  - Just redoing work (because of a failure to immerse oneself in literature), which is not publishable. (This is the bad kind of reproduction.)
  - Writing papers honestly to be reproduced.
  - Reproduction with an adaptation to a different area, or an extension on previous knowledge.
- Wendy – part of it may also be finding ways to publish more philosophical things. PC meetings are a place where things like this are discussed as well.
- Eric Baumer from Cornell – “Replication is not reproduction.” There are different kinds of replication; we should consider what replication means.
- Lorrie Cranor from CMU – SOUPS gets around paper length issues by including appendices with information for reproduction.
- Wendy – we should think about who will be reproducing the work as well – we should let people reproduce work in products, or in things that affect the real world.
- Wendy – of course there are IP issues, but this could be part of our long term goal. We don’t pursue just science, but world-changing innovations.
- Michael – Rebuilding systems is so, so hard. We often only have screenshots to go off of, and there might even be errors in the paper. Replication happening in Rob Miller’s HCI class led to the discovery of a constant being off by a factor of 10 in a noted paper.
- Harold – Papers can also be about inspiring, rather than being about reproduction… or they can be entirely open-sourced.
- Harold – we should be clear about how reproducible we intend things to be in our papers.
- Ed – paper limits come from the publishing model, but in the digital world, we need to now change the community standard.
- Question from an unknown person (sorry! let me know if it was you!) – When you replicate and find different results, what do we do? Some reviewers might be insulted. Do we reproduce things specifically to falsify others’ work?
- Michael – that feeling echoes grad student opinions, and it’s worsened by the assumption that if you find errant results, you messed up, especially if it’s work by an important researcher.
- Max – sometimes we reproduce things and it confirms surprising results though – the value of the content may change the value of reproduction.
- Wendy – the hope is that there are multiple reviewers, and this hopefully means that any controversy is viewed very clearly.
- Wendy – controversial findings like that are more interesting than others.
- Michael – Unfortunately, we don’t always know why, and that causes increased skepticism.
- Michael – It’s good when intro classes include replication of results. It can demystify things.
- Wendy – I have more faith in program committees than to believe that good papers would disappear if they’re controversial.
- Lora Oehlberg from Berkeley – Design research discusses failures as well as successes. Do we encourage people not to replicate pointless results, which could be considered failures?
- Replication of results can improve the quality of data.
- What’s the role of releasing code in systems work?
- Ed – “Ownership of code [and data] has been a way research territory is protected. Monetization might be the root of all evil.”
Panelists shared their final thoughts:
- Max – perhaps we need an alt.chi or similar session called repliCHI, a place for people to publish work like this.
- Wendy – that might be possible! “I think we should encourage students to replicate in coursework” and then publish like that.
- Harold – Think of how you can “build something that improves reproducibility” – we can change the models of publication this way.
- Ed – We must change the HCI curriculum. It doesn’t always [though there are notable cases where it does] include stuff drawn from psychology. We can always experiment in conferences.
- Michael – There are techniques to “replicate” systems quickly, like as part of a prototyping process, that can inform our design, and we shouldn’t neglect these.
- Dan – I almost always ask interns to reproduce results. Perfect reproductions are boring, but they’re almost never perfect, and then we learn something.
Recently the CHI workshop on Crowdsourcing and Human Computation got some press courtesy of Jim Giles and New Scientist. Near the end of the workshop, the working group on Future Directions and Community had some interesting suggestions that I’ll echo here.
Can we take some of the crowdsourcing tools and techniques we have developed as a community and put them to use in our own publishing and review processes?
The bigger question put to the group was: should crowdsourcing and crowd computing develop into their own disciplines, or continue to jump around between existing conferences in the ACM, IEEE and AAAI?
I’ve been attending the CHI conference in Vancouver this week, presenting some of my work on database user interfaces. It was interesting to attend Tuesday’s “Re-Engineering Health Care with Information Technology” panel and hear about what appears to be one of the biggest application areas for database UIs on the planet: Electronic Medical Records (EMRs). Ben Shneiderman referred to the thousands of different systems that are currently used for communications between and within health care institutions as a giant “Medical Internet” that indirectly serves more Americans (94%) than the regular Internet. US health care spending is currently far higher as a share of GDP (and relative to performance metrics such as life expectancy and infant mortality) than that of any other country in the world, and it is clear that effective IT use must be at least a part of a solution to this problem.
I took note of several interesting anecdotes from the panelists:
- In many cases today, EMRs actually disrupt the workflows of health care workers. A physician may log onto their computer system in the morning, browse through several poorly adapted views of patient records in order to find the information she needs for the day, and then write it down on paper. At the end of the day, she (or her assistant) returns to the computer to type in handwritten changes to the various records involved.
- Thomas Payne, MD, talked about the Computerized Patient Record System (CPRS) of the Veterans Health Administration. The CPRS has been recognized as an example of a highly successful large-scale EMR system in the US. We got to see a screenshot, and it’s actually a good old text-based DOS interface (or at least it used to be in 1997—fair enough).
- There are between 300 and 600 vendors of EMR systems in the US, and they differentiate themselves by each having a separate architecture and user interface. Thus, a physician who might work at one hospital for three days a week and another for two will need training in two completely different systems.
Although I’ve been going to CHI for a few years, I still feel like something of a foreigner, not certain which talks to attend. Many of my friends and colleagues probably have a much better idea than I of which talks are given by speakers I would like and which offer insights I would find particularly valuable. So I try to ask around, but I often get the information too late.
So I convinced my students, Michael Bernstein and Adam Marcus, to build a system to help me out. We connected our FeedMe recommender system (presented in a paper at CHI last year) to the CHI program presented in Danny Soroker’s Eventmap. As you build your own personal program of talks to attend, you can also recommend any you think I (or any of your other friends) will be interested in. I hope you will.
Eventmap already let you browse for talks you might attend (I suggest using the table view, which shows all the abstracts) and click on them to add them to your own schedule. Now you’ll also get a “recommend using Feedme” button. If you click it, you’ll be able to specify email addresses of friends who’ll be interested. FeedMe will take care of notifying them of your recommendation and incorporating it into their personal eventmaps: if they log into FeedMe, they’ll see a little green bubble over each talk that’s been recommended to them. A convenience of FeedMe is that after a little bit of practice, it will start to guess which of your friends you’re going to recommend a particular talk to, and let you do so with a single click instead of typing in email addresses.
FeedMe also works as a standalone system; you can use it to recommend any Google newsreader story, or any arbitrary webpage, to any of your friends. You can find details on the FeedMe site.
FeedMe reflects our interest in friendsourcing: getting your friends to help you in crowdsourcing workflows that rely on their knowledge of you. While I wouldn’t expect random crowds to do a very good job recommending information to me, I can hope that people who know me and my interests well can do a better job than any pure-computer (e.g. machine learning) system. So please, while you’re looking over the CHI program to plan your attendance, if you see a paper you think I’ll like, fire off a recommendation to me using FeedMe, and I’ll thank you for it (every FeedMe notification includes a one-click thank-you button). And if you’d like to receive some recommendations, tell your knowledgeable friends about FeedMe!
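This post does not say how the system learns which friends to suggest, so the following is only a guess at the general shape of such a feature, not a description of the real implementation. The `FriendSuggester` class, its word-overlap scoring, and the example addresses are all hypothetical.

```python
from collections import Counter, defaultdict


class FriendSuggester:
    """Toy recipient guesser: remembers the words in titles you have shared
    with each friend and scores new titles by word overlap."""

    def __init__(self):
        # friend -> counts of words seen in titles shared with that friend
        self.profiles = defaultdict(Counter)

    def record_share(self, friend, title):
        self.profiles[friend].update(title.lower().split())

    def suggest(self, title, top_n=1):
        words = title.lower().split()
        scores = {
            friend: sum(profile[w] for w in words)
            for friend, profile in self.profiles.items()
        }
        # Keep only friends with some overlap; break ties alphabetically.
        ranked = sorted(
            (f for f, s in scores.items() if s > 0),
            key=lambda f: (-scores[f], f),
        )
        return ranked[:top_n]


suggester = FriendSuggester()
suggester.record_share("alice@example.com", "Crowdsourcing user studies")
suggester.record_share("bob@example.com", "Spreadsheet database interfaces")
print(suggester.suggest("Database interfaces for nonprofits"))
```

A title with no overlap produces no suggestion, which matches the idea that one-click recommending only appears after a little practice.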
I attended the VISSW 2011 workshop last Sunday. It was fun, but a few of the papers exhibited a painfully familiar pattern: they put together a plausible-seeming user interface but didn’t evaluate it with a user study. I left frustrated, with no sense of whether the ideas of the interfaces would be good or bad to incorporate in my own work. With the system already implemented, other researchers are disincentivized from implementing it (it wouldn’t be novel), so they can’t evaluate it. Thus, if the original researchers don’t do the evaluation, nobody will. This is a not uncommon complaint in computer science: our field doesn’t seem committed to following through with evaluations of the ideas it invents and implements. Some faculty at Stanford have even created a course aimed at teaching students how to properly evaluate their research systems.
So here’s a proposal for improving the incentives a little bit. Change the submission requirements for conference papers: they have to contain the system description and the hypothesis to be tested, along with a detailed evaluation plan. Papers are then evaluated and accepted on the basis of a commitment to execute the evaluation plan (and update the paper with results) before the conference but after acceptance.
This approach would have several benefits.
- Researchers could defer the work of evaluation until their submission is accepted. Once it’s accepted, they have strong motivation to do the evaluation (else the paper cannot be presented). For work that turns out not to be publishable, no evaluation effort is wasted.
- The evaluations would take place after the submission deadline, meaning work on the system could continue right up to that deadline. This gives us something to do in the “dead space” between acceptance and presentation (which is forced upon us by the long lead time required for travel planning). The work presented at the conference would be “fresher”; the long lead time on conference submission would have less impact on the publication of timely results.
- This approach would also address the recently popularized problem of a bias toward positive results that may lead to incorrect claims of statistical significance in outcomes. If reviewers consider a paper that contains only the system and evaluation procedure, they will be forced to assess the paper purely on the grounds of whether the proposed system is interesting enough to be worthy of evaluation. If it is, then the paper should be accepted regardless of whether the outcome of that evaluation is positive or negative. If it is not, then the inclusion of a positive evaluation should not change the rejection decision.
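The third benefit can be illustrated with a small simulation. It is only a sketch under stated assumptions: every study measures the same modest true effect with unit-variance noise, "significant" means passing a one-sided z-test at the 5% level, and the `simulate` function and all of its parameters are hypothetical. It compares the average reported effect when every planned evaluation is published against the average when only significant results are.

```python
import random
import statistics


def simulate(true_effect=0.2, n_per_study=20, n_studies=2000, seed=0):
    """Return (mean estimate over all studies,
               mean estimate over studies that reached significance)."""
    rng = random.Random(seed)
    # One-sided z-test at alpha = 0.05 with known unit variance:
    # a study is "significant" if its sample mean exceeds 1.645 / sqrt(n).
    threshold = 1.645 / n_per_study ** 0.5
    estimates = [
        statistics.fmean(
            [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        )
        for _ in range(n_studies)
    ]
    # With the default parameters some studies always pass the threshold;
    # fmean would raise StatisticsError if this list were empty.
    significant = [e for e in estimates if e > threshold]
    return statistics.fmean(estimates), statistics.fmean(significant)


all_mean, significant_mean = simulate()
print(f"every evaluation published: mean effect {all_mean:.2f}")
print(f"only significant published: mean effect {significant_mean:.2f}")
```

Under these assumptions the average over all studies stays close to the true effect, while the significant-only average must exceed the significance threshold and therefore overstates it. Accepting papers before the outcome is known corresponds to the first case.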
Turning to logistical concerns, this approach means that the paper is not finalized until shortly before the conference (a couple weeks, to give reviewers a chance to confirm that the evaluation plan was followed). But as more conferences move towards electronic-only publication, this schedule becomes feasible. And this scheme wouldn’t cover e.g. multi-year longitudinal evaluations. But it would certainly cover a large number of the papers with short (inadequate?) user studies appearing in our HCI conferences.
Of course, there’s the simpler approach of requiring evaluations at submission. This meets the primary goal of having systems evaluated, but loses the three benefits I’ve outlined above: researchers invest energy evaluating systems that would be rejected independent of the evaluation; the evaluation work will be older/staler by the time of the conference; and the bias of reviewers toward accepting positive results would continue.