The HsArchive Service

To initially "introduce" a document into a user's Haystack, a basic service is needed to create the kernel of an HDM subgraph. From this kernel, Haystack services can begin assembling a representation of the document in the HDM. HsArchive serves in this capacity by instantiating the "kernel", which is composed of a bale.HaystackDocument and a needle.Location.
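The kernel described above can be pictured as two linked nodes. The following is a minimal sketch only: DocumentBale, LocationNeedle, attach, and createKernel are hypothetical stand-ins, since the actual bale.HaystackDocument and needle.Location interfaces are not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for needle.Location: it only records where the data lives.
class LocationNeedle {
    final String location;

    LocationNeedle(String location) {
        this.location = location;
    }
}

// Hypothetical stand-in for bale.HaystackDocument: a node that collects needles.
class DocumentBale {
    final List<Object> needles = new ArrayList<>();

    void attach(Object needle) {
        needles.add(needle);
    }
}

class KernelSketch {
    // Build the two-node kernel: a document bale linked to one location needle.
    static DocumentBale createKernel(String location) {
        DocumentBale doc = new DocumentBale();
        doc.attach(new LocationNeedle(location));
        return doc;
    }

    public static void main(String[] args) {
        DocumentBale doc = createKernel("file:/home/user/notes.txt");
        System.out.println("needles attached: " + doc.needles.size());
    }
}
```

Later services would extend the cluster by attaching further needles (body, checksum, file type) to the same bale.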

Once the bale.HaystackDocument and needle.Location have been created, the event-based service haystack.service.fetch.FetchService (or a more specialized fetch service) is informed of a new event in the HDM. The service then tries to retrieve the data that the needle.Location points to. When this data is obtained, a needle.Body is attached to the bale.HaystackDocument. The FetchService also calculates the MD5 checksum of the data and generates a checksum needle to attach to the bale.HaystackDocument.
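The checksum step can be illustrated with the standard java.security.MessageDigest class. This is a sketch of computing an MD5 digest over a fetched body, not the FetchService's actual code; the class name ChecksumSketch and the method md5Hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChecksumSketch {
    // Return the MD5 digest of the body as a 32-character lowercase hex string.
    static String md5Hex(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(body)); // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

Two bodies with the same bytes always yield the same string, which is what lets the checksum serve as one of the two uniqueness keys.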

HsArchive is invoked with a variety of options that indicate how the HDM cluster for the archived document should be constructed. A document's uniqueness is measured by two values: its location and its checksum. HsArchive maintains two tables: one mapping locations to the needle.Location objects with that location, and one mapping checksums to the needle.Body objects that produced that checksum value. When invoking the archive(...) method of HsArchive, a user may specify Haystack's behavior in each of the four possible cases: location and checksum both match, only one of the two matches (two cases), and neither matches. The user's specification of archiving behavior for a specific document is retained in a table.
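The two tables and the four cases can be sketched as a pair of maps and a classification function. All names here (ArchiveTablesSketch, MatchCase, record, classify) are hypothetical, and the sketch simplifies by storing one string identifier per key and by not checking whether a matching location and a matching checksum belong to the same document.

```java
import java.util.HashMap;
import java.util.Map;

// The four cases named in the text.
enum MatchCase { BOTH, LOCATION_ONLY, CHECKSUM_ONLY, NEITHER }

class ArchiveTablesSketch {
    // Location string -> identifier of a needle.Location with that location.
    final Map<String, String> byLocation = new HashMap<>();
    // Checksum string -> identifier of a needle.Body that produced it.
    final Map<String, String> byChecksum = new HashMap<>();

    // Remember an archived document under both of its uniqueness keys.
    void record(String location, String checksum, String locationId, String bodyId) {
        byLocation.put(location, locationId);
        byChecksum.put(checksum, bodyId);
    }

    // Decide which of the four cases a newly fetched document falls into.
    MatchCase classify(String location, String checksum) {
        boolean locationKnown = byLocation.containsKey(location);
        boolean checksumKnown = byChecksum.containsKey(checksum);
        if (locationKnown && checksumKnown) return MatchCase.BOTH;
        if (locationKnown) return MatchCase.LOCATION_ONLY;
        if (checksumKnown) return MatchCase.CHECKSUM_ONLY;
        return MatchCase.NEITHER;
    }

    public static void main(String[] args) {
        ArchiveTablesSketch tables = new ArchiveTablesSketch();
        tables.record("http://example.org/a", "aaa", "loc-1", "body-1");
        System.out.println(tables.classify("http://example.org/a", "bbb")); // prints LOCATION_ONLY
    }
}
```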

After the FetchService obtains the object and makes the appropriate connections, another event-driven service is invoked. This service, the HsFileChecker, awaits the appearance of a bale.HaystackDocument attached to a location and a checksum. The HsFileChecker looks up the user-specified archive options described above. It also checks whether the location and checksum are already in HsArchive's tables. If both the location and the checksum match, the implication is that the user asked for the same document to be archived twice. The default behavior is to remove the new HDM cluster and abort. However, the user may specify that a new (duplicate) cluster be created. The last case (neither matches) is similarly trivial: because the object is known to be entirely new, the cluster is left alone and archiving continues.

If the location matches but the checksum does not, the implication is that the document has changed. The default behavior is to connect the new needle.Body object to the old bale.HaystackDocument, in effect superseding the previous body. Alternatively, a user may specify that the HDM cluster continue to exist separately. Finally, if the checksum matches but the location does not, one of two things may have happened: the document was copied, or the document was moved. The two can be distinguished by checking whether the document is still present at the old location. The options available to the user in this case are to merge, supersede, or abort the archive.
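The per-case behavior described for the HsFileChecker amounts to a small decision table. The sketch below encodes only what the text states; the enum and method names are hypothetical, and because the text names no default for the checksum-only case, the sketch requires an explicit user choice there rather than inventing one.

```java
import java.util.EnumSet;
import java.util.Set;

enum CheckerCase { BOTH_MATCH, LOCATION_ONLY, CHECKSUM_ONLY, NEITHER }

enum CheckerAction { ABORT, DUPLICATE, SUPERSEDE, KEEP_SEPARATE, MERGE, CONTINUE }

class FileCheckerPolicySketch {
    // Actions the text lists as available in each case.
    static Set<CheckerAction> allowed(CheckerCase c) {
        switch (c) {
            case BOTH_MATCH:
                return EnumSet.of(CheckerAction.ABORT, CheckerAction.DUPLICATE);
            case LOCATION_ONLY:
                return EnumSet.of(CheckerAction.SUPERSEDE, CheckerAction.KEEP_SEPARATE);
            case CHECKSUM_ONLY:
                return EnumSet.of(CheckerAction.MERGE, CheckerAction.SUPERSEDE, CheckerAction.ABORT);
            default:
                return EnumSet.of(CheckerAction.CONTINUE);
        }
    }

    // Default named in the text, or null where the text names none.
    static CheckerAction defaultFor(CheckerCase c) {
        switch (c) {
            case BOTH_MATCH:
                return CheckerAction.ABORT;
            case LOCATION_ONLY:
                return CheckerAction.SUPERSEDE;
            case NEITHER:
                return CheckerAction.CONTINUE;
            default:
                return null; // CHECKSUM_ONLY: no default is stated
        }
    }

    // Use the user's choice when given, otherwise the default; reject anything not allowed.
    static CheckerAction resolve(CheckerCase c, CheckerAction userChoice) {
        CheckerAction action = (userChoice != null) ? userChoice : defaultFor(c);
        if (action == null || !allowed(c).contains(action)) {
            throw new IllegalArgumentException("no valid action for case " + c);
        }
        return action;
    }

    public static void main(String[] args) {
        System.out.println(resolve(CheckerCase.LOCATION_ONLY, null)); // prints SUPERSEDE
    }
}
```

Keeping the allowed set separate from the default makes it easy to reject a stored user option that does not apply to the case at hand.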

If the HDM cluster survives the HsFileChecker, other services may begin to act on it. The HsTypeGuesser can decide the type of the object and attach the appropriate needle.Filetype object. Once a type is guessed, field finders and textifiers can start their work on the cluster. A field finder parses the needle.Body of a file, finding relevant pieces of data corresponding to the document state previously discussed. As a simple example, an HsMailFieldFinder service can extract the to, from, subject, date, and id lines of an e-mail document and generate the appropriate needle.Mailfield objects. Finally, a textifier service specific to the type of document may extract the text from the document. This extraction does not need to be done immediately, and the textifier may leave a Promise in the needle.Text object. At the time of this writing, services have been implemented to extract text from PostScript and PDF files as well as some trivial ASCII-based file formats.
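A field finder for e-mail can be sketched as a scan over the header block of a message. This is an illustration rather than the HsMailFieldFinder implementation: the class and method names are hypothetical, the "id" line is assumed to be the Message-ID header, and folded (multi-line) header values are skipped rather than unfolded.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MailFieldSketch {
    // Header names mentioned in the text; "message-id" is an assumption for "id".
    private static final List<String> WANTED =
            Arrays.asList("to", "from", "subject", "date", "message-id");

    // Return the wanted header fields, keyed by lowercase header name.
    static Map<String, String> extract(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : message.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (Character.isWhitespace(line.charAt(0))) {
                continue; // continuation of a folded header; ignored in this sketch
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (WANTED.contains(name) && !fields.containsKey(name)) {
                fields.put(name, line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String message = "From: alice@example.org\n"
                + "To: bob@example.org\n"
                + "Subject: Hello\n"
                + "\n"
                + "Body text.\n";
        System.out.println(extract(message).get("subject")); // prints Hello
    }
}
```

In the HDM, each extracted entry would correspond to one needle.Mailfield attached to the document.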
