accomplished by the addition of query terms and by the reweighting of query terms to reflect their
utility in discriminating relevant from non-relevant documents.
Rocchio's original formula for defining a new query vector in the vector space model is as follows:
\[
Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i
\]
Equation 4: Rocchio's original formula for modifying a query
based on relevance information
where Q_0 = initial query vector, Q_1 = new query vector, n_1 = number of
relevant documents, n_2 = number of non-relevant documents, R_i = vector for the
ith relevant document, S_i = vector for the ith non-relevant document
The new query vector is the original query vector plus the terms that best differentiate the relevant
documents from the non-relevant documents. A modified query contains new terms (from the relevant
documents) and has new weights attached to the query terms. If the weight of a query term drops to
zero or below, it is removed from the query.
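The update in Equation 4 can be illustrated with a minimal sketch, assuming sparse vectors stored as term-to-weight dictionaries. The function name `rocchio_original` and the toy weights are hypothetical and are not taken from the SMART experiments.

```python
from collections import defaultdict


def rocchio_original(q0, relevant, non_relevant):
    """Sketch of Equation 4 on sparse term-weight dicts (illustrative names)."""
    q1 = defaultdict(float)
    for term, weight in q0.items():
        q1[term] += weight
    # Add the mean of the relevant document vectors.
    for doc in relevant:
        for term, weight in doc.items():
            q1[term] += weight / len(relevant)
    # Subtract the mean of the non-relevant document vectors.
    for doc in non_relevant:
        for term, weight in doc.items():
            q1[term] -= weight / len(non_relevant)
    # Terms whose weight drops to zero or below are removed from the query.
    return {term: weight for term, weight in q1.items() if weight > 0}


q0 = {"jaguar": 1.0, "speed": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0, "speed": 2.0},
                {"car": 1.0, "speed": 2.0}]
print(rocchio_original(q0, relevant, non_relevant))
# {'jaguar': 1.5, 'cat': 1.0}
```

In this toy case the new term "cat" enters the query, while "speed" and "car" end with negative weights and are dropped.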
This formula can be constrained further, e.g. by weighting the original query vector so that
the original query terms contribute more to the modified query than the new query terms, or by varying
the amount of feedback considered. A variation of this formula was tested experimentally with positive
results on the SMART retrieval system [Roc71]. The small size of the document collection used in
Rocchio's experiments meant that certain modifications had to be made to the formula. For example,
although Rocchio tried to keep the relevant and non-relevant feedback sets the same size, this
was not always possible. In addition, a term was only considered if it was one of the original query
terms or if it appeared in more relevant than non-relevant documents and in more than half the relevant
documents. These modifications highlight the recurring difficulty of aligning theory with experimental
practice.
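The term restriction described above can be sketched as a simple filter. This is one reading of the rule, assuming it means "original query term, or (in more relevant than non-relevant documents and in more than half the relevant documents)". The function name `eligible_terms` and the representation of documents as sets of terms are assumptions made for illustration.

```python
def eligible_terms(query_terms, relevant, non_relevant):
    """Return the terms allowed into the modified query (illustrative sketch).

    relevant and non_relevant are lists of term sets, one set per document.
    """
    candidates = set(query_terms)  # original query terms are always kept
    vocabulary = set().union(*relevant)
    for term in vocabulary:
        rel_count = sum(term in doc for doc in relevant)
        non_count = sum(term in doc for doc in non_relevant)
        # Keep a new term only if it occurs in more relevant than
        # non-relevant documents and in more than half the relevant ones.
        if rel_count > non_count and rel_count > len(relevant) / 2:
            candidates.add(term)
    return candidates


relevant = [{"jaguar", "cat"}, {"cat", "zoo"}, {"cat", "car"}]
non_relevant = [{"car"}, {"car", "zoo"}]
print(sorted(eligible_terms({"jaguar"}, relevant, non_relevant)))
# ['cat', 'jaguar']
```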
Ide [Ide71] extended the SMART relevance feedback experiments, examining different aspects of RF,
such as using only relevant documents for feedback, varying the number of documents used for RF, and
using non-relevant documents. She found that using only relevant documents for feedback or varying
the number of documents used at each iteration of feedback gave inconclusive or poor results.
Her third strategy was a variation of Rocchio's original formula, using only the first non-relevant
document found, s_i. The formula used by Ide is shown in Equation 5. This was compared against
Rocchio's original formula. Although this technique, the Ide dec-hi formula, did not improve results
greatly, it was more consistent, improving the performance of more queries.
\[
Q_1 = Q_0 + \sum_{i=1}^{n_r} r_i - s_i
\]
Equation 5: Ide dec-hi formula for modifying a query based on relevance information
where Q_0 = initial query vector, Q_1 = new query vector, n_r = number of relevant
documents, r_i = vector for the ith relevant document, s_i = vector for the first
non-relevant document
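A minimal sketch of Equation 5 follows, again assuming term-to-weight dictionaries. The name `ide_dec_hi` and the toy weights are hypothetical. The sketch returns the raw weights, negative ones included, since the text does not say how Ide handled them.

```python
def ide_dec_hi(q0, relevant, first_non_relevant=None):
    """Sketch of Equation 5 on sparse term-weight dicts (illustrative names)."""
    q1 = dict(q0)
    # Add every relevant document vector, unnormalised.
    for doc in relevant:
        for term, weight in doc.items():
            q1[term] = q1.get(term, 0.0) + weight
    # Subtract only the first non-relevant document found, if there is one.
    if first_non_relevant is not None:
        for term, weight in first_non_relevant.items():
            q1[term] = q1.get(term, 0.0) - weight
    return q1


q0 = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 2.0}, {"cat": 1.0}]
first_non_relevant = {"car": 1.0, "jaguar": 0.5}
print(ide_dec_hi(q0, relevant, first_non_relevant))
# {'jaguar': 1.5, 'cat': 3.0, 'car': -1.0}
```

Unlike Equation 4, the relevant vectors are summed rather than averaged, so the contribution of feedback grows with the number of relevant documents.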
A common modification to the vector space RF formulae, e.g. [IdS71], is to weight the relative
contribution of the original query, relevant and non-relevant documents to the RF process. In Equation
6, the α, β and γ values specify the degree of effect of each component on RF.
\[
Q_1 = \alpha \cdot Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i
\]
Equation 6: Rocchio modified relevance feedback formula
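The weighted form can be sketched as follows, assuming term-to-weight dictionaries and three scalar weights for the original query, the relevant documents and the non-relevant documents. The name `rocchio_weighted`, the parameter names and the weight values are illustrative assumptions, not values recommended by the cited work.

```python
def rocchio_weighted(q0, relevant, non_relevant, alpha, beta, gamma):
    """Sketch of the modified Rocchio formula (illustrative names and values)."""
    terms = set(q0)
    for doc in [*relevant, *non_relevant]:
        terms.update(doc)

    def mean_weight(docs, term):
        # Mean weight of a term over a feedback set; 0.0 for an empty set.
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    return {
        term: alpha * q0.get(term, 0.0)
        + beta * mean_weight(relevant, term)
        - gamma * mean_weight(non_relevant, term)
        for term in sorted(terms)
    }


q0 = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"cat": 1.0}]
non_relevant = [{"car": 2.0}]
print(rocchio_weighted(q0, relevant, non_relevant,
                       alpha=1.0, beta=0.5, gamma=0.25))
# {'car': -0.5, 'cat': 0.5, 'jaguar': 1.25}
```

Setting all three weights to 1 reduces this to the unweighted sum of Equation 4, before any removal of non-positive terms.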