The Smart environment for retrieval system evaluation—advantages and problem areas
The Smart environment provides a test-bed for implementing and evaluating
a large number of different automatic search and retrieval processes. In this
chapter, the basic parameters underlying the Smart system design are briefly
outlined, and a comparison is made with the characteristics of more
conventional retrieval systems. The principal lessons learned from the Smart
experiments are described, and some of the methodological problems raised by the system design are outlined. Finally, some comments are included about the disadvantages inherent in working in the laboratory, and the insights that can be gained in such a situation.
15.1 Retrieval system environment

Automatic, or semi-automatic, information search and retrieval systems have now been in existence for some twenty years. In the early years, only small collections could be searched, and the search requests received from the user population would be accumulated for some period of time, or 'batched', before actually being processed, with the result that several weeks would normally elapse before answers could be obtained to a given query.
At the present time, the role and importance of information retrieval has greatly increased for two main reasons: the coverage of the searchable collections is now extensive and collection sizes may exceed several million documents; furthermore, the search results can now be obtained more or less instantaneously, using online procedures and computer terminal devices that provide interaction and communication between system and users. The large collection sizes make it plausible to the users that relevant information will in fact be retrieved as a result of a search operation, and the probability of obtaining the search output without delay creates a substantial user demand for the retrieval services. It is not surprising in these circumstances that several million search requests are currently submitted each year to a variety of automatic retrieval services.
* This study was supported in part by the National Science Foundation under grant DSI-77

While the operational retrieval environment has thus drastically changed over the last few years, the intellectual design of the retrieval operations has remained essentially unchanged for some decades. The following principal characteristics may be noted:
(a) documents are normally indexed manually, that is, subject indicators and content descriptions are manually assigned to the bibliographic items by subject experts and professional indexers;
(b) search statements are manually formulated by users or search intermediaries using one or more acceptable search terms and appropriate boolean connectives between the terms; subsequent reformulations and improvements in the query formulations are also carried out manually;
(c) the principal file search device is an auxiliary, so-called inverted directory which contains for each accepted content descriptor a list of the document references to which that term is assigned; the documents to be retrieved are then identified by comparing and merging the document reference lists corresponding to the various query terms;
(d) an 'exact match' retrieval strategy is carried out by retrieving all items whose content description exactly matches the term combination specified in the search request; normally, all retrieved items are considered by the system as being equally relevant to the user's needs, and no special method is provided for ranking the output items in presumed order of goodness for the user.
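The inverted-directory search of points (c) and (d) can be sketched in a few lines of Python. This is a hypothetical illustration, not an implementation of any particular operational system: the directory maps each descriptor to the set of documents it indexes, and a conjunctive query is answered by intersecting those document lists, with no ranking of the output.

```python
# Sketch of exact-match retrieval over an inverted directory (points c, d).
from collections import defaultdict

def build_inverted_directory(indexed_docs):
    """indexed_docs: {doc_id: set of assigned content descriptors}."""
    directory = defaultdict(set)
    for doc_id, terms in indexed_docs.items():
        for term in terms:
            directory[term].add(doc_id)
    return directory

def exact_match_search(directory, query_terms):
    """Retrieve every document carrying ALL query terms (boolean AND);
    all retrieved items are treated as equally relevant -- no ranking."""
    lists = [directory.get(t, set()) for t in query_terms]
    return set.intersection(*lists) if lists else set()

docs = {1: {"retrieval", "indexing"},
        2: {"retrieval", "evaluation"},
        3: {"indexing", "linguistics"}}
directory = build_inverted_directory(docs)
print(exact_match_search(directory, ["retrieval", "indexing"]))  # {1}
```

Note that the merge step operates only on the posting lists of the query terms, so the full collection is never scanned; this is precisely why the inverted directory is the standard file-search device.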
Enhancements are included in many of the modern search systems in the form of 'free text' manipulations allowing the user to choose arbitrary search terms, that is, natural language terms that are not controlled by any dictionary or authority list, leading to the retrieval of all documents whose stored texts (or text excerpts) contain a particular term combination included in the search request. But even in the free text search mode, inverted directories are created containing all the text words that could lead to the retrieval of a given document in the collection. Additional refinements in the search mode are available in some modern online environments in the form of dictionary and vocabulary displays leading to better query formulation capabilities.
However, the basic manual query formulation and exact match retrieval strategy based on inverted files is maintained in practically all operational retrieval situations.
When the work on the Smart retrieval experiments was initiated in the early 1960s, some attempts had been made at implementing so-called automatic indexing systems [1-4]. These consisted in using the computer to scan document texts, or text excerpts such as document abstracts, and in assigning as content descriptors words that occurred sufficiently frequently in a given text. The early retrieval experiments conducted with such automatic indexing products showed that a large number of the automatically chosen index terms would also have been assigned by manual indexers, and that, contrary to expectation, the automatic indexing products did not prove to be inadequate.
Moreover, it appeared that the rudimentary early automatic indexing products could be easily improved. Thus linguists led the way by pointing out that a number of linguistic processes were 'essential' for the generation of effective content identifiers characterizing natural language texts. Among the linguistic techniques of interest, the following were considered to be of particular importance:
(a) The use of hierarchical term arrangements, relating the content terms in a given subject area. With such preconstructed term hierarchies, the standard content descriptions can be 'expanded' by adding hierarchically superior (more general) terms as well as hierarchically inferior (more specific) terms to a given content description.
(b) The use of synonym dictionaries, or thesauri, in which each term is included in a class of synonymous, or related, terms. Using a thesaurus, each originally available term can be replaced by a complete class of related terms, thereby broadening the original content description.
(c) The utilization of syntactic analysis systems capable of specifying syntactic roles for each term and of forming complex content descriptions consisting of term phrases and large syntactic units. A syntactic analysis scheme makes it possible to supply specific content identifications and avoids confusion between composite terms such as 'blind Venetian' and 'Venetian blind'.
(d) The use of semantic analysis systems in which the syntactic units are supplemented by semantic roles attached to the entities making up a given content description. Semantic analysis systems utilize various kinds of knowledge extraneous to the documents, often specified by preconstructed 'semantic graphs' and other related constructs.
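Of the four techniques above, the thesaurus expansion of point (b) is the simplest to sketch. The following minimal example, with invented thesaurus classes, replaces each original term by its full class of related terms, broadening the content description exactly as described:

```python
# Minimal sketch of thesaurus-based term expansion (point b).
# The thesaurus classes below are invented for illustration only.
THESAURUS = {
    "ship": {"ship", "boat", "vessel"},
    "boat": {"ship", "boat", "vessel"},
    "retrieval": {"retrieval", "search", "lookup"},
}

def expand(terms):
    """Replace each term by its complete class of related terms;
    terms absent from the thesaurus are kept unchanged."""
    expanded = set()
    for t in terms:
        expanded |= THESAURUS.get(t, {t})
    return expanded

print(expand({"boat", "retrieval"}))
```

A hierarchical expansion (point a) would work the same way, except that the class attached to each term would be its ancestors and descendants in a preconstructed term hierarchy rather than a flat synonym class.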
The design of the original Smart system was then based on the premise that effective automatic indexing procedures could be built by incorporating into a content analysis system one or more of the foregoing language processing methods. Most of the required constructs such as the hierarchical term arrangements and the syntactically analysed text excerpts could be represented by
trees, and other constructs such as semantic graphs and thesauri are easily represented by graph structures. Well-known automatic procedures were also available for traversing and manipulating tree and graph structures [5]. The original Smart system was then designed to process natural language texts using these complex data structures.
To validate the linguistic analysis procedures it was necessary to compare the search results obtained by using term hierarchies and thesauri with other simpler systems based on the use of single, frequency-weighted terms extracted from the document texts. From the beginning, the Smart system thus contained an evaluation package based on the use of sample document and query collections and on the availability of full relevance assessments specifying the presumed relevance of each document with respect to each user query. This made it possible to compute for each processed query the recall and precision values measuring respectively the proportion of relevant items retrieved and the proportion of retrieved items that are relevant.
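Given full relevance assessments, the two measures just defined are straightforward to compute. The sketch below, a hypothetical illustration rather than the Smart evaluation package itself, computes recall and precision for a single processed query:

```python
# Recall and precision for one query, assuming full relevance assessments:
# recall    = proportion of relevant items that were retrieved,
# precision = proportion of retrieved items that are relevant.
def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# 2 of the 3 relevant items are retrieved; 2 of the 4 retrieved are relevant.
r, p = recall_precision(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(r, p)
```

In the full evaluation system these per-query values are of course averaged over many queries, but the per-query computation is exactly this simple once the relevance assessments are complete.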
The early tests in turn led to additional experiments and to the development of a full evaluation system for a large variety of search and retrieval procedures. These developments are described in more detail in the remainder of this study.
15.2 Basic Smart system assumptions and early results

In the Smart system each record, or document, is represented by a vector of terms, that is, D_i = (d_i1, d_i2, ..., d_it), where d_ij represents the weight or importance of term j for document D_i. By 'term' is meant some form of content identifier such as a word extracted from a document text, a word phrase, a thesaurus class, an entry from a term hierarchy, etc. A query Q_j can be similarly represented as Q_j = (q_j1, q_j2, ..., q_jt), and retrieval of a stored item can be made to depend on the magnitude of a global similarity coefficient s(D_i, Q_j). Specifically, whenever s(D_i, Q_j) >= T for some threshold T, D_i is retrieved in answer to Q_j. It should be noted that an exact match between any particular query and document terms is never required for retrieval of an item. Instead, the similarity measure s may be based on the composite similarities between the full query and document vectors.
Furthermore, since s(D_i, Q_j) represents a measure of closeness between D_i and Q_j, the output documents can be presented to the user population in ranked order of presumed relevance to the user, that is, in decreasing order of the corresponding s coefficients.
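The vector model just described can be sketched concretely. The text leaves the choice of similarity coefficient open; the example below assumes the cosine coefficient, one common choice, and ranks the output by decreasing s, so that no exact term match is ever required:

```python
# Sketch of vector-space retrieval: weighted term vectors, a similarity
# coefficient s(D, Q) -- here assumed to be the cosine measure -- and
# output ranked in decreasing order of s.
import math

def cosine(d, q):
    """d, q: {term: weight} vectors; returns s(D, Q) in [0, 1]."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = (math.sqrt(sum(w * w for w in d.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

def ranked_retrieval(docs, query):
    """Present all documents in decreasing order of s(D, Q)."""
    return sorted(docs, key=lambda i: cosine(docs[i], query), reverse=True)

docs = {"D1": {"smart": 2.0, "retrieval": 1.0},
        "D2": {"indexing": 1.0, "thesaurus": 1.0},
        "D3": {"retrieval": 2.0, "evaluation": 1.0}}
query = {"retrieval": 1.0, "evaluation": 1.0}
print(ranked_retrieval(docs, query))  # D3 ranks first
```

Note that D2, which shares no terms with the query, receives s = 0 and simply falls to the bottom of the ranking rather than being excluded by a boolean condition.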
The following assumptions are immediately implied by the vector representation:
(a) In principle, each term included in a given vector is as important as any other term (except for the possible distinction implied by a particular term weight assignment); that is, each term represents a particular dimension in the t-dimensional vector space defined by the t terms used to index the document collection.
(b) No relationships are defined between distinct terms; that is, the coordinate axes representing the distinct terms are assumed to be orthogonal.
(c) A document is represented by a particular position, and possibly by a given length, in the t-dimensional vector space. (In practice, it is often convenient to normalize all vectors to some given standard length.)

In examining the Smart system, it is necessary to consider also another principal characteristic of the experimental environment, namely the use of small sample collections of documents and user queries for test purposes.
Such a test environment makes it possible to carry out many different experiments at reasonable cost. Furthermore, a great many inconveniences inherent in the use of large operational collections are immediately eliminated. Thus full relevance assessments can be obtained from the user population for each document with respect to each query, leading to the generation of accurate recall-precision measures. The alternative would consist in using sampling techniques and obtaining relevance assessments for a portion of the document collection only. The use of sampling methods, however, introduces additional variables, and the evaluation results may then be subject to substantial fluctuations.
The small document environment used in the Smart experiments also renders unnecessary the choice of various parameter values which would otherwise be required to control the retrieval process. Because the documents are ranked at the output in decreasing order of query-document similarity, there is no need to choose a retrieval threshold to distinguish the retrieved from the non-retrieved items. Instead, recall-precision values can be computed for all possible retrieval thresholds—that is, after retrieving one, two, and eventually n documents in decreasing order of the similarity with the query—and the results can be plotted in a composite recall-precision graph. The experiments can then be carried out using a very small number of variable parameters such as collection size, number of queries, relevance assessments of documents with respect to queries, interpolation procedures for calculating precision values at fixed recall intervals, and methods for averaging the results over a number of different user queries [6]. The Smart experiments have thus come close to achieving the conditions often assumed for ideal retrieval test environments [7-9].
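The computation of recall-precision values at all possible retrieval thresholds can be sketched directly: one simply walks down the ranked output and records a (recall, precision) point after the top 1, 2, ..., n documents. The interpolation and averaging steps mentioned above are omitted from this illustration:

```python
# Recall-precision points at every retrieval threshold of a ranked output.
def recall_precision_curve(ranked_docs, relevant):
    """Return one (recall, precision) pair after retrieving the top
    1, 2, ..., n documents of the ranked list."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

curve = recall_precision_curve(["D3", "D1", "D2"], relevant={"D3", "D2"})
for recall, precision in curve:
    print(recall, precision)
```

Plotting these points for each query, interpolating precision at fixed recall intervals, and averaging over the query set yields the composite recall-precision graph described above.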
The artificial collection environment does, however, have implications for the conclusions derivable from the experiments. Thus it is difficult to obtain really believable efficiency (as opposed to effectiveness) criteria, such as response time, processing cost, and user effort needed to submit queries and to obtain results, because no obvious procedure is available for extrapolating these efficiency measures to large, operational retrieval situations. Furthermore, when a restricted number of user queries is used to evaluate retrieval effectiveness, the implicit assumption is that these queries and the corresponding users are representative of a general user population at large.