Structured Search

The major contrast between unstructured and structured search lies in the items we are trying to categorize and index. While an unstructured IR system concentrates on ``indexing'' across one dimension (language features), structured systems tend to index across multiple dimensions. Traditionally, structured search tools fall into the domain of database technology. A very good introduction to database technologies is [12].

What kind of information is structured? The easiest way to think of structured information is as something one could put in a table. For example, if someone had a large compact disk (CD) collection and wanted to be able to find certain CDs easily, they might create a database based on a number of different dimensions: CD title, performer's name, date released, date purchased, number of songs, song titles, etc. In setting up the database, the user would specify the dimensions and the data type of each dimension. Most databases provide a number of different data types for indexing. These can include strings, characters, real numbers, integers, and more esoteric multimedia and BLOB (binary large object) types.
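The CD collection can be sketched as a typed table. The following is a minimal illustration using Python's built-in sqlite3 module, not a description of any particular system discussed here. The table name `cds`, the column names, the `cover_art` BLOB column, and the sample rows are all hypothetical placeholders. Song titles are left out because a multi-valued dimension would normally need a second table.

```python
import sqlite3

# In-memory database; every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cds (
        title          TEXT,     -- string dimension
        performer      TEXT,     -- string dimension
        date_released  TEXT,     -- SQLite has no date type; ISO-8601 text
        date_purchased TEXT,
        num_songs      INTEGER,  -- integer dimension
        cover_art      BLOB      -- hypothetical binary (multimedia) dimension
    )
    """
)

# Placeholder rows, invented purely for illustration.
rows = [
    ("Album A", "Performer X", "1990-05-01", "1994-02-10", 12, None),
    ("Album B", "Performer Y", "1985-09-15", "1993-11-30", 9, None),
    ("Album C", "Performer X", "1992-03-20", "1995-01-05", 14, None),
]
conn.executemany("INSERT INTO cds VALUES (?, ?, ?, ?, ?, ?)", rows)

# Search across two dimensions at once: performer and number of songs.
cd_titles = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM cds "
        "WHERE performer = ? AND num_songs > ? ORDER BY title",
        ("Performer X", 10),
    )
]
print(cd_titles)
```

Each column plays the role of one dimension, and the declared type determines which comparisons make sense on it.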

Another way to look at structured search applications is in terms of metadata. Metadata, for our purposes, are information about information. That is, if the CD in the previous example is the ``information'' we are looking for, the dimensions by which we categorize the CD (title, performer, etc.) are the metadata. In a structured search, a user generally does not know anything about the data itself, but understands (or at least has access to) the metadata. Because databases understand data types, it is possible to formulate powerful queries on information other than text. An example of this type of search is, ``give me all the names of employees who make more than $30,000 a year.'' Equivalently, in SQL, the (S)tructured (Q)uery (L)anguage, this could be ``SELECT name FROM employees WHERE salary > 30000''. Notice that it is now possible for us to ask questions about numerical values. Symbols such as the > sign have a significance when applied to numerical dimensions.
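The salary query can be run as written against a small table. This is a sketch using Python's sqlite3 module. The `employees` schema follows the query in the text, while the three rows are invented placeholders. The `PRAGMA table_info` call is SQLite-specific and is shown only to illustrate that the metadata is itself available to the user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")

# Placeholder data, invented for illustration only.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 28000), ("Bob", 30000), ("Carol", 45000)],
)

# The user can inspect the metadata (column names and types)
# without knowing anything about the stored rows.
columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(employees)")
]

# The query from the text: '>' is meaningful because salary is numeric.
high_earners = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM employees WHERE salary > 30000"
    )
]
print(columns)
print(high_earners)
```

Note that the comparison is strict, so an employee earning exactly 30000 is not returned.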

While it is possible to test certain mathematical predicates, most databases are incapable of judging the semantic similarity of two items. For example, while they can tell you that ``1 < 2'' is true, they can't decide if ``MIT is near Harvard.'' Notable exceptions are systems such as VAGUE [34], which extend the notion of similarity by allowing additional domain and rule knowledge to be added to the database schema. Unfortunately, querying such a system is tedious and requires much interaction as the system attempts to understand what the user meant. If we look for an Italian restaurant near our house, the system will try to understand the concept of ``near'' and how important certain portions of the query are (i.e., is it more important that the restaurant is ``near'' us, or that it serves Italian food?).
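The gap between built-in predicates and similarity can be made concrete. The sketch below is not how VAGUE works. It only illustrates that a notion such as ``near'' has to be supplied from outside the database, here as a user-defined SQL function registered through sqlite3's `create_function`. The `places` table, the grid coordinates, the Euclidean distance measure, and the `NEAR_THRESHOLD` cutoff are all assumptions made for the example.

```python
import math
import sqlite3

NEAR_THRESHOLD = 2.0  # hypothetical cutoff, in arbitrary grid units


def near(x1, y1, x2, y2):
    """Return 1 if the two points lie within NEAR_THRESHOLD, else 0."""
    return int(math.hypot(x1 - x2, y1 - y2) <= NEAR_THRESHOLD)


conn = sqlite3.connect(":memory:")
# Domain knowledge added by the user: SQL has no built-in idea of 'near'.
conn.create_function("near", 4, near)

conn.execute("CREATE TABLE places (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [("A", 0.0, 0.0), ("B", 1.0, 1.0), ("C", 10.0, 10.0)],  # placeholders
)

# A built-in mathematical predicate needs no extra knowledge.
one_less_than_two = conn.execute("SELECT 1 < 2").fetchone()[0]

# 'Which places are near A?' only works through the registered function.
near_a = [
    name
    for (name,) in conn.execute(
        "SELECT p.name FROM places AS p, places AS a "
        "WHERE a.name = 'A' AND p.name <> 'A' "
        "AND near(p.x, p.y, a.x, a.y) ORDER BY p.name"
    )
]
print(one_less_than_two, near_a)
```

Even here the answer depends entirely on the chosen threshold and distance measure, and nothing in the sketch weighs ``near'' against other parts of a query, which is the harder problem described above.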

To take full advantage of database systems, a user has to have a very good understanding of the structure of the search space as well as the query language. On the other hand, if the user does understand both the structure of the data and the structure of the query language, information can be found in fewer iterations of searching.

An additional constraint of most database systems is their inability to extend past their original schema definitions. That is, once a user describes the dimensions of his data, these are more or less set in stone.
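A short sketch of this rigidity, again using Python's sqlite3 module with a hypothetical `cds` table and placeholder values: data that mentions a dimension outside the schema is rejected, and the new dimension becomes usable only after an explicit schema change. Behaviour in other database systems may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cds (title TEXT, performer TEXT)")
conn.execute("INSERT INTO cds VALUES (?, ?)", ("Album A", "Performer X"))

# Attempt to store a dimension ('genre') the schema never declared.
try:
    conn.execute(
        "INSERT INTO cds (title, performer, genre) VALUES (?, ?, ?)",
        ("Album B", "Performer Y", "Jazz"),
    )
    insert_rejected = False
except sqlite3.OperationalError:
    # SQLite refuses because the table has no such column.
    insert_rejected = True

# The dimension has to be added to the schema explicitly.
conn.execute("ALTER TABLE cds ADD COLUMN genre TEXT")

# Rows stored before the change have no value for the new dimension.
genres = [genre for (genre,) in conn.execute("SELECT genre FROM cds")]
print(insert_rejected, genres)
```

The explicit `ALTER TABLE` step is the ``more or less'' in the statement above: the schema can be changed, but only deliberately, and existing rows are left without a value for the new dimension.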

In summary:

Database systems are able to deal generically with many data types.
Databases are good at deciding equality between items,
but they have no sense of similarity at the semantic level.
Databases define constrained information spaces (schemas or dimensions).
