document term weights to calculate the new term weights for query terms. The probabilistic based F4
weights, on the other hand, are derived directly from the feedback process itself. The traditional
probabilistic version presented in section 2.2.3, however, ignores the frequency with which a term
appears in the query and in documents. This limitation is addressed in [RW94], which extends the model
to use these frequencies. Harman [Har92b] (see section 2.5.3) and Salton and Buckley [SB90] both showed
that query expansion and query term reweighting are essential to RF.
Salton and Buckley's experiments were carried out in an experimental setting. In such a setting,
especially with smaller test collections such as the CACM, Cranfield, and NPL, we can assume
complete relevance information; that is, we know all the relevant documents for a query. However, in a real
information seeking situation, users will not necessarily assess every retrieved document; often they
may assess only a small number of documents before trying RF. This could be significant, as a standard
assumption in operational systems is that all documents not explicitly marked relevant
should be treated as non-relevant. Sparck Jones [SJ79] ran a set of experiments to test how well the
probabilistic F4 weighting scheme performed with little relevance information, and demonstrated that
even very few relevance assessments, as few as one or two relevant documents, can still improve a
search over no term weighting.
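To make the effect of sparse relevance information concrete, the following is a minimal sketch of an F4-style weight with the usual 0.5 correction. It treats every document not assessed as relevant as non-relevant, as described above. The function name, the collection size and the term counts are illustrative assumptions, not values taken from [SJ79].

```python
import math

def f4_weight(r, R, n, N):
    """F4-style relevance weight with the 0.5 correction.

    r: assessed relevant documents that contain the term
    R: assessed relevant documents
    n: documents in the collection that contain the term
    N: documents in the collection
    Documents not marked relevant are counted as non-relevant.
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)

# Hypothetical collection: 1400 documents, the term occurs in 30 of them.
N, n = 1400, 30
no_feedback = f4_weight(0, 0, n, N)    # no assessments: an idf-like weight
one_relevant = f4_weight(1, 1, n, N)   # one relevant document, containing the term
two_relevant = f4_weight(2, 2, n, N)   # two relevant documents, both containing the term
```

In this hypothetical case the weight of the term rises with each relevant document that contains it, so even one or two assessments change the weight relative to the no-feedback case.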
2.5.3 Query expansion vs term reweighting
In [Har88, Har92b] Harman examined the relationship between query expansion and reweighting in the
probabilistic model. As the original probabilistic model did not incorporate the addition of new terms to
the query, it is important to make sure that the best possible terms are added. One obvious solution is to
add all terms in the relevant documents, but Harman hypothesised that improved performance could be
obtained by ranking these terms and adding only a number of them to the query. This raises two
questions, both examined in [Har88]: how to rank the terms, and how many terms to add to the query.
In [Har88] she examined six techniques for ranking terms, and demonstrated on the Cranfield 1400 test
collection that adding between 20 and 40 terms gave much better performance than adding all terms, with a
peak at around 20 terms. The best technique for ranking the terms was one that combined idf-like
information and the frequency of term occurrences in relevant documents.
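The rank-then-truncate idea can be sketched as follows. The scoring function here, idf multiplied by the total number of occurrences in the relevant documents, is an assumption chosen to illustrate a combination of idf-like and frequency information; it is not claimed to be the exact formula used in [Har88]. The function name, its parameters and the sample data are likewise hypothetical.

```python
import math
from collections import Counter

def rank_expansion_terms(relevant_docs, doc_freq, num_docs, query_terms, k=20):
    """Rank candidate terms from the relevant documents and return the top k.

    relevant_docs: list of token lists, one per relevant document
    doc_freq: mapping from term to number of collection documents containing it
    Score (an illustrative assumption): idf(t) * occurrences of t in relevant_docs.
    """
    occurrences = Counter()
    for doc in relevant_docs:
        occurrences.update(doc)
    scores = {
        term: math.log(num_docs / doc_freq[term]) * count
        for term, count in occurrences.items()
        if term not in query_terms  # only new terms are expansion candidates
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Hypothetical data: two relevant documents from a 1400-document collection.
relevant_docs = [["boundary", "layer", "flow", "the"],
                 ["boundary", "layer", "the", "the"]]
doc_freq = {"boundary": 50, "layer": 60, "flow": 300, "the": 1400}
expansion = rank_expansion_terms(relevant_docs, doc_freq, 1400, {"flow"}, k=2)
```

With these made-up counts, the very common term scores zero despite occurring most often, which is the behaviour the idf-like component is meant to provide.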
In [Har92b] she extended this work, on the same document collection, using a set of new algorithms for
term ranking, and reinforced the suggestion of adding around 20 terms to the query (see footnote 21). She also
explored the relationship between query expansion and term reweighting: query expansion and
reweighting of query terms gave increased performance, with the major benefit coming from the query
expansion component rather than reweighting. [Har92b] also explored a number of alternative methods
for ranking terms. The details of these new algorithms are not significant here, but what is important to
note is that, although the improvements of certain of these techniques were similar, the terms they
added to the query were not identical. This means that different algorithms may present different
documents to the user based on the same relevance assessments. One possible way to exploit this is to
combine methods for RF, as in section 3.4; an alternative is to allow the user to choose
which terms to add to the query, as discussed in section 5.
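The observation that different ranking algorithms add different terms can be checked with a simple comparison of their top-ranked terms. The sketch below is only an illustration: the function name and the two term rankings are hypothetical and do not come from [Har92b].

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Compare the top k terms proposed by two term-ranking algorithms.

    Returns the terms both would add and the terms only one would add.
    """
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = sorted(top_a & top_b)
    differing = sorted(top_a ^ top_b)
    return shared, differing

# Hypothetical rankings produced from the same relevance assessments.
ranking_a = ["boundary", "layer", "wing", "shock"]
ranking_b = ["layer", "boundary", "pressure", "heat"]
shared, differing = top_k_overlap(ranking_a, ranking_b, k=3)
```

The terms in `differing` are the ones that would lead the two expanded queries to retrieve different documents; they are also natural candidates to show to a user who is choosing expansion terms.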
In this section we have outlined the basic operations of IR systems and how RF is implemented in the
major retrieval models. In the remainder of this paper we shall discuss extensions to these models that
incorporate aspects such as changing information needs, alternative models and uses of relevance
feedback (section 3). We shall summarise the overall features of automatic RF in section 4 and turn to
the interactive aspects of RF in sections 5-7.
3 Extensions to RF
The three sections that follow all extend, rather than challenge, the RF techniques discussed previously.
In section 3.1 we outline approaches to incorporate relations between terms. In section 3.2 we describe
how to handle the fact that what a user finds relevant may change over time. In section 3.3 we discuss negative
RF: users making feedback decisions on what is not relevant to their needs. In section 3.4 we discuss
Footnote 21: Experiments carried out by Magennis and Van Rijsbergen [MvR97] indicate that the optimal number of
expansion terms for a test collection can vary between collections and query sets. Ruthven et al. [RLVR01]
showed that smaller-scale expansion, with more careful selection of expansion terms, can perform better than
larger-scale expansion.