Talk: Community-based ontology development, alignment, and evaluation

Natasha Noy gave a talk at CSAIL with the above title.  She works with a group at Stanford.  The bioinformatics community in general couldn’t care less about cool computer science, but it is one of the few communities that have heavily adopted formal ontologies as a way to get their work done.  They have tons of data partitioned over many silos.  Biologists have adopted ontologies to provide canonical representations of scientific knowledge, or to annotate data so that others can make use of it.  Often it is not the authors who do the annotating, but curators or automatic tools.

There are now hundreds of ontologies with tens of thousands of terms.  However, development has always been a “cottage industry”: various groups develop their own ontologies, then publish them for use by others.  Is there a way to open ontology development up to the community?  That community might be just a few people or thousands.

As an example, the gene ontology (28K terms) has 3 full-time curators.  People from the community submit requests to an issue tracker for new terms, etc.  A new version is released daily.  In contrast, the NCI thesaurus (for cancer) has 20 full-time editors, with one lead editor who runs everything, and a slow cycle of “releases” with less community input.  Other ontologies work like typical open source projects, with 20-30 team members involved in active discussions.

Natasha’s group builds on Protege, a very old open source ontology editor that is now one of the most popular, with 120,000 registered users.  It has a very open plugin architecture, with dozens of plugins for visualization, import, export, NLP, and much more.  They have been working to augment Protege with support for collaboration.  It works in a distributed fashion (desktop and web clients).  It supports simultaneous editing, but also annotation, discussion, proposals, and voting in the context of the ontology.  There are many types of annotations (questions, comments, proposals) on any element of the ontology (classes, properties, instances).  While the tool handles most types of structured data, it is focused on taxonomic hierarchies, where characteristics get inherited down the hierarchy.
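The collaboration model above (typed notes attached to any element of a taxonomic hierarchy) can be sketched in a few lines.  This is an illustrative toy model, not Protege’s actual API; all class and method names here are made up.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Note:
    kind: str    # "question" | "comment" | "proposal" -- the annotation types mentioned above
    author: str
    text: str

@dataclass
class OntologyElement:
    """A class, property, or instance in the ontology (hypothetical sketch)."""
    name: str
    parent: Optional["OntologyElement"] = None
    notes: list = field(default_factory=list)

    def annotate(self, kind: str, author: str, text: str) -> None:
        # Any element can carry discussion, not just classes.
        self.notes.append(Note(kind, author, text))

    def ancestors(self) -> list:
        # Walk up the taxonomic hierarchy, down which characteristics are inherited.
        node, chain = self.parent, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

disease = OntologyElement("Disease")
cancer = OntologyElement("Cancer", parent=disease)
cancer.annotate("proposal", "curator1", "Split into subtypes by tissue of origin?")
print([a.name for a in cancer.ancestors()])  # -> ['Disease']
```

The point of the sketch is that the annotation lives *in the context of* the ontology element, so a proposal or vote is always anchored to the class it is about.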

They investigated the use of their tool for several tasks.  One is ontology evaluation: finding existing ontologies that might be useful for you.  One source of information for this is author-contributed metadata about the ontologies (domain, key classes and concepts, who the developer is, etc.).  Another is automatic tools that compute quality metrics, and a third is annotations by other users of the ontologies.
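As a concrete example of the kind of structural quality metrics an automatic tool might compute, here is a minimal sketch over a toy class hierarchy.  The metric names and the function are assumptions for illustration, not BioPortal’s actual metric suite.

```python
def hierarchy_metrics(children: dict) -> dict:
    """Compute simple structural metrics for a class hierarchy.

    children maps a class name to the list of its subclass names.
    """
    all_classes = set(children) | {c for subs in children.values() for c in subs}
    # Roots are classes that appear in no subclass list.
    roots = [c for c in all_classes
             if not any(c in subs for subs in children.values())]

    def depth(cls: str) -> int:
        subs = children.get(cls, [])
        return 1 + max((depth(s) for s in subs), default=0)

    return {
        "classes": len(all_classes),
        "roots": len(roots),
        "max_depth": max(depth(r) for r in roots),
    }

toy = {"Disease": ["Cancer", "Infection"], "Cancer": ["Leukemia"]}
print(hierarchy_metrics(toy))  # {'classes': 4, 'roots': 1, 'max_depth': 3}
```

Metrics like these are cheap to compute automatically, which is exactly why they complement, rather than replace, the subjective user annotations discussed next.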

This last source is important because some ontology metrics are subjective: a feature that is “good” in one setting can be awful in another.  An example is a high level of axiomatization, which is important for inference but creates clutter if you just want a description.  There is also the problem of crosscutting taxonomies: you might have two different ways of describing the same domain that form a “matrix” of non-overlapping hierarchies.  To address this sort of subjectivity, they allow users to record evaluations of ontologies.

These tools can be explored at their BioPortal web site, which hosts a large library of biomedical ontologies.  On that site, users can describe their ontology-based projects and list/review the ontologies they are using.  Reviewers give general reviews, usage information, problems encountered, coverage of key terms, major gaps, and issues with specific elements of the ontology.  The site aims to make ontology evaluation/creation a truly democratic process.  This is controversial: some argue that ontologies need a more rigorous editorial process (mirroring the current debate about open vs. traditional journal publication).

Another big task is mapping: connecting two ontologies by asserting that terms in two different ontologies “match”.  They aren’t trying to find mappings, but want to enable others to upload the mappings they have found.  Mappings can be created manually or uploaded in bulk (if computed by someone’s tool).  Mappings are themselves metadata, which can be annotated and discussed just like other data in the ontology.
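A mapping record of the kind described above might look like the following sketch.  The field names and the SKOS-style relation vocabulary are assumptions for illustration, not BioPortal’s actual mapping schema; the term identifiers are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    """A cross-ontology mapping, itself metadata that can be discussed."""
    source: tuple                 # (ontology id, term id)
    target: tuple
    relation: str = "exactMatch"  # e.g. a SKOS-style mapping relation
    provenance: str = "manual"    # "manual" or the name of the tool that computed it
    comments: list = field(default_factory=list)

def bulk_upload(pairs: list, tool_name: str) -> list:
    # Mappings computed by someone's tool, uploaded in bulk rather than one by one.
    return [Mapping(src, tgt, provenance=tool_name) for src, tgt in pairs]

uploaded = bulk_upload([(("ontA", "Cell"), ("ontB", "Cell"))], "my-matcher")
# Because the mapping is first-class metadata, it can be annotated like anything else:
uploaded[0].comments.append("Is this really an exact match, or only a close one?")
print(uploaded[0].provenance)  # my-matcher
```

Treating each mapping as an annotatable object is what lets the community debate individual correspondences instead of accepting a bulk upload wholesale.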

Of course a big question is whether people will use these tools.  Right now, many users are asking for these features and reporting lots of bugs—good signs of demand.

A lot of questions have now arisen that need some serious user studies—what are the dynamics of the social networks that form around collaborative ontologies?  What are the different types of users/editors?  What produces the most discussion/controversy?  Do these tools help or hinder collaboration?
