Although indexing makes it possible to access information from very large document collections, the
conversion from a document text to a list of weighted keywords does result in a loss of information.
Writing a document is an intentional process; a document is intended to convey a message. The
translation to a list of keywords retains the essential building blocks of the message, the terms
themselves, but the message(s) that the author intended cannot be accessed by the retrieval mechanism.
The effect of this loss of information may be lessened or worsened by the use of controlled
vocabularies, i.e. pre-defined sets of indexing terms, [Ing92, Chap 3]. However, the fact remains that when
we talk of representing the information content of documents we are only representing the components
of the message, not the message itself.
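The loss described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that indexing means lower-casing, splitting on whitespace, and weighting each term by its raw frequency; the function name index_terms and both example texts are hypothetical and not taken from any particular system.

```python
from collections import Counter

def index_terms(text):
    """Reduce a text to weighted keywords (assumed weight: raw term frequency)."""
    return Counter(text.lower().split())

# Two invented documents that convey different messages.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# Both texts reduce to the same weighted keyword list, so a retrieval
# mechanism working on the index alone cannot tell them apart.
same_index = index_terms(doc_a) == index_terms(doc_b)
print(same_index)
```

The components of each message survive indexing, but the difference between the two messages does not.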
The reduction of the document text into a series of keywords also transforms the task of an IR system
from retrieving information to retrieving objects that contain information. Some authors argue that
objects such as documents cannot be held to contain information as such; rather, information is a change
in a cognitive, or internal, state brought about by exposure to the contents of these objects. The
following early quote by Maron, [Mar64], illustrates this concern:
"...information is not a stuff contained in books as marbles might be contained in
a bag even though we sometimes speak of it in that way. It is, rather a
relationship. The impact of a given message on an individual is relative to what
he already knows, and of course, the same message could convey different
amounts of information to different receivers, depending on each one's internal
model or map."
The degradation of the document text, necessary for computation, and the subjectivity of relevance
result in a layer of indirection between the user and the documents. The goal of the IR system is to
bridge this gap between the user and potentially relevant material. Indexing techniques identify and
highlight potentially good indicators of relevant material, and retrieval techniques use these indicators
of relevance to select which documents to present to the user. How individual retrieval systems use
these indicators to retrieve documents is the topic of the next section.
2.2 Retrieval and feedback
Retrieval is the process of matching a representation of an information need, usually a user-supplied
query, to an indexed document representation. Queries are indexed in the same way as documents
and compared with each document's index to determine whether the document is likely to be relevant to the query.
How the indexed query is compared with the indexed document differentiates the major retrieval
models. In this section we shall briefly outline the four main models of retrieval: Boolean, vector space,
probabilistic, and logical, and describe the basic approaches to RF in each of the models.
2.2.1 Boolean model
The first operational IR model was the Boolean model, based on Boolean logic. In this model
queries are keywords combined, by the user, with the conjunctive (AND), disjunctive (OR) or negation
(NOT) operators. This is an exact match model: the system only retrieves those documents that exactly
match the user's query formula. For example, for the query `information AND retrieval AND system'
the system will return all documents that contain the three words `information', `retrieval' and `system',
whereas the query `information OR (retrieval AND system)' will return those documents that contain
the word `information' and those documents that contain both `retrieval' and `system'.
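The two example queries can be sketched with set operations over a small inverted index. This is a minimal illustration, not a description of any particular system: the four-document collection, the document ids, and the helper postings are all invented for the example.

```python
# Toy collection; document ids and texts are invented for illustration.
docs = {
    1: "information retrieval system",
    2: "information theory",
    3: "retrieval system design",
    4: "database system",
}

# Build an inverted index: term -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())

# information AND retrieval AND system
and_query = postings("information") & postings("retrieval") & postings("system")

# information OR (retrieval AND system)
or_query = postings("information") | (postings("retrieval") & postings("system"))

print(sorted(and_query))  # [1]
print(sorted(or_query))   # [1, 2, 3]
```

Each result is a plain set: a document either matches the query formula exactly or is not returned at all, and nothing in the result says which match is better.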
The Boolean model has been used in a large number of online public access catalogue (OPAC)
systems but has been shown to suffer from a number of difficulties. Firstly, traditional Boolean
systems do not use term weights and consequently return the complete set of documents that match the
query as an unordered set. This means that users may have to add or remove terms, or generate more
complex query expressions to reduce the set of retrieved documents to a manageable size. Willie and
Bruza, [WB95], argue that the problems with interacting with Boolean systems are not only a matter of
the formal query language but a conceptual problem: the Boolean model does not lend itself to
supporting how users think about searching and their individual search techniques. A further problem
with Boolean systems is that the order in which operators are applied may not be consistent across
systems, so that different systems may retrieve different documents for the same query,
[Borg96]. Nevertheless, Boolean systems remain popular with users, perhaps because of the explicit
control that these systems offer the user. Web search engines often allow Boolean-style
querying performed on an underlying best match model (see section 2.2.2).
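The operator-ordering problem mentioned above can be made concrete with a small sketch. The postings sets below are hypothetical, and the two readings shown (AND binding more tightly than OR, versus strict left-to-right evaluation) are only two plausible conventions, not a claim about how any specific system behaves.

```python
# Hypothetical postings lists (term -> ids of documents containing it).
postings = {
    "information": {1, 2},
    "retrieval": {1, 3},
    "system": {1, 3, 4},
}

# Query as typed, without brackets: information OR retrieval AND system

# Reading A: AND binds more tightly than OR.
and_first = postings["information"] | (postings["retrieval"] & postings["system"])

# Reading B: operators applied strictly left to right.
left_to_right = (postings["information"] | postings["retrieval"]) & postings["system"]

print(sorted(and_first))      # [1, 2, 3]
print(sorted(left_to_right))  # [1, 3]
```

The same unbracketed query retrieves document 2 under one reading and not under the other, which is why explicit brackets are the only portable way to write such a query.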