normalisation of the relevant and non-relevant documents by the number of relevant/non-relevant
documents. Equation 15 shows the Ide regular formula.
$$Q_1 = Q_0 + \sum_{i=1}^{n_1} R_i - \sum_{i=1}^{n_2} S_i$$
Equation 15: Ide regular
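As an illustration of Equation 15, the following is a minimal sketch of the Ide regular update, assuming that queries and documents are stored as sparse dictionaries mapping terms to weights. The function name `ide_regular` and the sample vectors are hypothetical and are not taken from the original experiments.

```python
from collections import defaultdict


def ide_regular(query, relevant_docs, non_relevant_docs):
    """Sketch of Q1 = Q0 + sum(R_i) - sum(S_i) over sparse term-weight vectors."""
    new_query = defaultdict(float)
    # Start from the original query vector Q0.
    for term, weight in query.items():
        new_query[term] += weight
    # Add every relevant document vector R_i, with no normalisation.
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += weight
    # Subtract every non-relevant document vector S_i, with no normalisation.
    for doc in non_relevant_docs:
        for term, weight in doc.items():
            new_query[term] -= weight
    return dict(new_query)


# Hypothetical toy vectors, for illustration only.
q0 = {"retrieval": 1.0, "feedback": 1.0}
relevant = [
    {"retrieval": 1.0, "relevance": 1.0},
    {"feedback": 1.0, "relevance": 1.0},
]
non_relevant = [{"retrieval": 1.0, "hosting": 1.0}]

q1 = ide_regular(q0, relevant, non_relevant)
print(q1)
```

In this sketch, terms that occur only in non-relevant documents end up with negative weights. Whether such weights are kept or discarded is a design choice that the formula itself does not settle.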
Two of the other algorithms were modifications of F4. The first used the ratio $n_i/N$ [Rob86] to replace
the 0.5 correction factor introduced to cope with the case where no relevant documents were retrieved
(R = 0) or when no relevant documents contain an individual term (r = 0), Equation 16.
$$w_i = \log \frac{\left( r_i + \dfrac{n_i}{N} \right) \Big/ \left( R - r_i + 1 \right)}{\left( n_i - r_i + \dfrac{n_i}{N} \right) \Big/ \left( N - n_i - R + r_i + 1 \right)}$$
Equation 16: Modified F4 function using $n_i/N$
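The behaviour of this weight in the edge cases mentioned above (R = 0 or r = 0) can be checked numerically. The sketch below assumes that Equation 16 is read as the log of the ratio between (r_i + n_i/N)/(R - r_i + 1) and (n_i - r_i + n_i/N)/(N - n_i - R + r_i + 1). The function name `f4_modified_weight` and the sample counts are hypothetical.

```python
import math


def f4_modified_weight(r_i, n_i, R, N):
    """Sketch of the modified F4 weight, with n_i/N replacing the 0.5 correction.

    r_i: relevant documents containing term i
    n_i: documents in the collection containing term i
    R:   known relevant documents
    N:   documents in the collection
    """
    prior = n_i / N  # collection-based correction factor
    relevant_odds = (r_i + prior) / (R - r_i + 1)
    non_relevant_odds = (n_i - r_i + prior) / (N - n_i - R + r_i + 1)
    return math.log(relevant_odds / non_relevant_odds)


# No relevant documents retrieved at all (R = 0, r_i = 0): the weight stays finite.
w_no_feedback = f4_modified_weight(r_i=0, n_i=10, R=0, N=1000)

# Term present in all 5 known relevant documents.
w_strong = f4_modified_weight(r_i=5, n_i=10, R=5, N=1000)

print(w_no_feedback, w_strong)
```

Under this reading, a term with no relevance evidence receives a weight close to zero, while a term concentrated in the relevant documents receives a large positive weight.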
The second modified F4 scheme placed extra emphasis on terms that appeared in the query. Specifically,
this was achieved by assuming that a term's appearance in the query is equivalent to an occurrence in 3
relevant documents (i.e. $r_i = r_i + 3$, $R = R + 3$).
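The effect of this adjustment can be sketched as follows. The code assumes the standard F4 weight with the 0.5 correction factor as the base function, and assumes that only r_i and R are incremented for query terms while n_i and N are left unchanged; the text does not state how n_i and N are treated. The names `f4_weight` and `f4_query_boosted` are hypothetical.

```python
import math


def f4_weight(r_i, n_i, R, N):
    """Standard F4 relevance weight with the 0.5 correction factor."""
    factors = (
        r_i + 0.5,
        N - n_i - R + r_i + 0.5,
        n_i - r_i + 0.5,
        R - r_i + 0.5,
    )
    # Guard against inconsistent counts, which would make the log undefined.
    if min(factors) <= 0:
        raise ValueError("counts are inconsistent; a factor is non-positive")
    a, b, c, d = factors
    return math.log((a * b) / (c * d))


def f4_query_boosted(r_i, n_i, R, N, in_query, boost=3):
    """Treat presence in the query as `boost` extra relevant occurrences."""
    if in_query:
        # Assumption: only r_i and R change; n_i and N are left as they are.
        r_i, R = r_i + boost, R + boost
    return f4_weight(r_i, n_i, R, N)


plain = f4_weight(r_i=2, n_i=20, R=5, N=1000)
boosted = f4_query_boosted(r_i=2, n_i=20, R=5, N=1000, in_query=True)
print(plain, boosted)
```

One consequence of leaving n_i unchanged is that the adjusted r_i can exceed n_i for rare terms. This sketch raises an error in that case rather than guessing how the original scheme handled it.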
Salton and Buckley found that, for all collections except the NPL collection (see footnote 19), the models performed
fairly consistently with respect to each other, with the Ide dec-hi performing best overall. In general,
although the probabilistic model performed well, it did not quite reach the performance level set by the
vector space models. This was advantageous as the vector space Ide dec-hi RF technique is
computationally very efficient.
Salton and Buckley also provide some general guidelines based on predicting RF performance. For
example, short queries, on the whole, do better with RF than longer queries. Longer queries, or those
queries with more terms that appear in the relevant documents, will tend to achieve better initial
rankings. This means that there is greater potential improvement to be gained from RF on short initial
queries. For a similar reason, queries that do poorly on initial runs tend to obtain greater improvements
with RF than those with good initial retrieval runs.
Finally, domain-specific collections also perform better with RF than domain-independent collections.
This may be because it is easier to select good expansion terms from a domain-dependent collection, or
because the ambiguity of search terms is less significant.
As well as considering variations on the probabilistic and vector space models, Salton and Buckley
investigated weighting document terms (as opposed to binary weighting based on term
presence/absence in each document) and three variations on query expansion: no expansion (only
reweighting), full expansion by all the terms in the relevant documents, and partial expansion, adding
only some of the relevant terms to the query. For all collections, again except the NPL, weighting
document terms gives a considerable improvement in feedback, as does full expansion by all terms in
the relevant set (see footnote 20). Queries should be expanded by those terms that appear with the highest frequency in
the relevant documents rather than those with the highest feedback weight.
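A minimal sketch of partial expansion by frequency is given below. It assumes that "frequency" means the number of relevant documents in which a term occurs; the text could also be read as total occurrences. The function name `top_expansion_terms` and the toy documents are hypothetical.

```python
from collections import Counter


def top_expansion_terms(query_terms, relevant_docs, k):
    """Pick the k non-query terms occurring in the most relevant documents."""
    counts = Counter()
    for doc_terms in relevant_docs:
        # Count each term once per document (document frequency in the relevant set).
        counts.update(set(doc_terms))
    candidates = [(term, c) for term, c in counts.items() if term not in query_terms]
    # Highest frequency first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return [term for term, _ in candidates[:k]]


# Hypothetical toy data, for illustration only.
query = {"feedback"}
relevant = [
    ["feedback", "query", "expansion"],
    ["query", "expansion", "weight"],
    ["feedback", "query"],
]

expansion = top_expansion_terms(query, relevant, k=2)
print(expansion)
```

Setting `k` to the number of candidate terms corresponds to full expansion, and `k = 0` corresponds to reweighting only.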
Rocchio's original formula and the Ide dec-hi variant perform the joint function of modifying query
terms and query term weights. These and the other vector space RF techniques use the original
19 The NPL collection differed in a number of ways from the other collections investigated. It had much shorter
query and document vectors, and lower term frequency. For this collection, although the same relative ordering
was found between algorithms, binary document weighting was better than weighting document terms. This may
result in the vector space length normalisation procedure being ineffective for this collection.
20 Although full expansion is preferable, partial expansion also gives good results and can be used to reduce
storage. In larger collections than the ones tested here, partial expansion may actually perform better than full
expansion.