Harman [Har92c] suggests two possible methods for implementing RF on Boolean systems. The first is
to present the user with a list of possible new query terms. These can be chosen, for example, by the
term distribution in the relevant documents. This means selecting those terms that appear more often in
the relevant than non relevant documents and which would be useful to include in a new query. The
second approach is for the system to automatically modify Boolean queries. An example of the latter
type of query modification can be found in the system proposed by Khoo and Poo, [KP94], which is
intended to automatically modify both the terms and the Boolean connectives of queries based on the
documents marked relevant by a user.
An alternative to exact match systems, such as the Boolean model, are best match systems. These
systems use term weights, such as tf and idf, to rank documents in decreasing order of matching score
or estimation of relevance. The two most common best match models are the vector space model,
which orders documents in decreasing similarity of query and document, [Sal71], and the probabilistic
model, [RSJ76], which orders documents based on an estimate of the probability of relevance of a
document to a query. In section 2.2.2 we discuss the vector space model, in section 2.2.3 we discuss the
probabilistic model.
2.2.2 Vector space model
In the vector space model, a document is represented by a vector of n weights, where n is the number of
unique terms in the document collection. Figure 4 shows an example vector where xi is the weight
3
of
the ith term in document x if x contains the term, and 0 if the term is not present in x.
x = (x , x ,..., x )
1
2
n
Figure 4: Document vector
Queries are also represented as a vector of length n, and the similarity of the document vectors to a
query vector gives a retrieval score to each document, allowing comparison and ranking of documents.
A range of similarity measures exists to calculate this similarity, e.g. DICE, inner product, cosine
correlation, [VR79, Chap 3]. Equation 3 shows the cosine correlation, one of the more common vector
space matching functions.
n
(term
ik
qterm
jk
)
k =1
cos(doc
i
, query
j
) =
n
n
(term
ik
)
2
(qterm
k=1
jk
)
2
k=1
Equation 3: Cosine correlation between document doci and queryj
Unlike the Boolean model, which retrieves documents according to the query terms and query
connectives, in the best match models all documents that contain at least one query term will receive a
non zero score; the highest score going to documents that contain all the query terms. Documents that
contain only some of the query terms will be ranked according to the sum of the weights of the query
terms they contain. The documents that contain more query terms or contain query terms with a higher
discriminatory power (term weight) will be retrieved above those that contain fewer query terms or
query terms with lower weights. Similarity is then a function of term overlap between query and
document, and the weights assigned to the terms.
Rocchio [Roc71] is generally credited with the first formalisation of a RF technique, developed on the
vector space model. In [Roc71] he defines the problem of retrieval as that of defining an optimal query;
one that maximises the difference between the average vector of the relevant documents and the
average vector of the non relevant documents. As discussed in section 1, it may not always be possible
for a user to submit such an optimal query, so RF is required to bring the query vector closer to the
mean of the relevant documents, and further from the mean of the non relevant documents. This is
3
Some implementations of the vector space model use 1 if a term occurs in a document, 0 if it does not occur.
Most implementations will use some form of tf*idf weighting and some form of length normalisation will usually
be performed to avoid retrieval bias towards long documents.
6
<
New Page 1
UK Web Hosting