recall precision figures. This method of evaluation is, then, biased somewhat towards queries that
have more relevance assessments or those that perform poorly during initial iterations. An
alternative, e.g. [SB90], is to only use the residual collection of both the rankings before and after
feedback. This means that the two rankings are directly comparable, but this method is really only
suitable for small numbers of feedback iterations; otherwise the number of relevant documents in
the residual collection can become relatively small and unrepresentative of the entire set of
relevant documents.
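
A minimal sketch of the residual collection comparison, under stated assumptions: the function names are hypothetical, rankings are plain lists of document identifiers, and average precision stands in for the recall precision figures discussed in the text. The top n documents of the initial ranking (those used for feedback) are removed from both rankings and from the relevant set before scoring.

```python
def residual_ranking(ranking, seen):
    """Drop documents already shown to the user (hypothetical helper)."""
    seen = set(seen)
    return [doc for doc in ranking if doc not in seen]


def average_precision(ranking, relevant):
    """Average precision; an assumed stand-in for recall precision figures."""
    relevant = set(relevant)
    if not relevant:
        # No relevant documents remain: the query cannot be scored
        # and would have to be dropped from the evaluation.
        return None
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def residual_evaluation(before, after, relevant, n):
    """Score both rankings on the residual collection only."""
    seen = before[:n]  # documents used to modify the query
    residual_relevant = set(relevant) - set(seen)
    return (
        average_precision(residual_ranking(before, seen), residual_relevant),
        average_precision(residual_ranking(after, seen), residual_relevant),
    )


before = ["d1", "d2", "d3", "d4", "d5", "d6"]
after = ["d1", "d4", "d6", "d2", "d3", "d5"]
scores = residual_evaluation(before, after, {"d1", "d4", "d6"}, n=2)
```

Because each iteration removes more documents, the residual relevant set shrinks; once it is empty the sketch returns None for that query, which mirrors the need to eliminate such queries.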
freezing. The method known as freezing is based on the rank position of documents and comes in
two forms: full freezing and modified freezing. In full freezing the rank positions of the top n
documents, the ones used to modify the query, are frozen. The remaining documents are
re-ranked and RP figures are calculated over the whole ranking. As the only documents to change
rank position are those below rank n (the top n being the ones used for RF), any change in RP happens as a result of the
change of rank position of the unseen relevant documents. There is, then, no ranking effect. In
modified freezing, the rank positions are frozen at the rank position of the last marked relevant
document.
The disadvantage of freezing approaches is that at each successive iteration of feedback a higher
proportion of relevant documents are frozen. This means that the frozen section of the ranking
contributes more to recall precision at later iterations of RF, so although RF may work better at
these later iterations, it can appear to be performing more poorly due to the higher contribution of
the frozen documents.
In the previous discussion on the residual method of evaluating feedback runs, we mentioned that
the residual collection method was forced to eliminate queries once all the relevant documents had
been found. For the freezing methods, once all the relevant documents have been found for a
query, recall precision figures can still be calculated. However, the recall precision figures will not
change once all the relevant documents have been frozen. Intuitively this seems correct: once we
have found all the relevant documents for a query, feedback does not improve or worsen retrieval
effectiveness.
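
The two freezing variants described above might be sketched as follows. This is an illustration under assumptions, not the procedure of any particular study: the function name and parameters are hypothetical, rankings are lists of document identifiers, and in modified freezing the cutoff is taken to be the position of the last relevant document within the top n (with nothing frozen if none of the top n is relevant).

```python
def freeze_ranking(initial, reranked, n, relevant=(), modified=False):
    """Build the ranking that is evaluated after one feedback iteration.

    initial  -- ranking before feedback
    reranked -- ranking produced by the modified query
    n        -- number of top documents used to modify the query
    relevant -- documents marked relevant (needed for modified freezing)
    """
    cutoff = n  # full freezing: the whole top n keeps its rank positions
    if modified:
        relevant = set(relevant)
        marked = [i for i, doc in enumerate(initial[:n]) if doc in relevant]
        # Assumption: freeze up to and including the last marked relevant
        # document; freeze nothing if no document in the top n is relevant.
        cutoff = marked[-1] + 1 if marked else 0
    frozen = initial[:cutoff]
    frozen_set = set(frozen)
    # Unfrozen documents follow in the order given by the modified query.
    return frozen + [doc for doc in reranked if doc not in frozen_set]


initial = ["a", "b", "c", "d", "e", "f"]
reranked = ["e", "a", "f", "c", "b", "d"]
full = freeze_ranking(initial, reranked, n=3)
partial = freeze_ranking(initial, reranked, n=3,
                         relevant={"a", "b", "e"}, modified=True)
```

Recall precision figures would then be calculated over the whole returned ranking, so only movement among the unfrozen documents can change them.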
test and control groups. In this technique, the document collection is randomly split into two
collections: the test group and the control group. Query modification is performed by RF on the
test group and the new query is then run against the control group. RP is calculated only on the
control group, so there is no ranking effect. Successive queries can be run against the control group
to assess modified queries on what can be regarded as a complete document collection, unlike the
residual ranking method. Unlike the freezing methods, all relevant documents in the control group
are free to move within the document ranking. This means that recall precision figures, before and
after query modification, are directly comparable.
The difficulty with this evaluation method is splitting the collection. It is easy to randomly split a
document collection (e.g. by putting all even-numbered documents in the test group and all
odd-numbered documents in the control group). However, a random split will not ensure that the
relevant documents are evenly split between the two collections. Neither will it ensure that the
relevant documents in the test group are representative of those in the control group. Other factors
such as document length or distribution of index terms may also be important to the RF method
being tested, and may not be equally split between the two collections.
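
The splitting problem can be illustrated with a small sketch. It assumes integer document numbers and uses hypothetical function names; the relevant set is invented purely for illustration. The split follows the even/odd example in the text, and a second helper counts how the relevant documents for one query fall across the two groups.

```python
def parity_split(doc_ids):
    """Even-numbered documents form the test group, odd-numbered the control group."""
    test = {d for d in doc_ids if d % 2 == 0}
    control = {d for d in doc_ids if d % 2 == 1}
    return test, control


def relevant_balance(test, control, relevant):
    """Count the relevant documents that land in each group (hypothetical check)."""
    relevant = set(relevant)
    return len(relevant & test), len(relevant & control)


doc_ids = range(1, 11)
relevant = {2, 4, 6, 7}  # illustrative relevance assessments for one query

test_group, control_group = parity_split(doc_ids)
balance = relevant_balance(test_group, control_group, relevant)
```

Here the two groups are the same size, yet three of the four relevant documents fall in the test group and only one in the control group. A check of this kind, run per query, would show whether a given split leaves enough relevant documents in the control group to evaluate against.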
Each of these methods has advantages and disadvantages but all are standard methods of assessing RF
algorithms. However, they only compare the performance of the algorithms in an idealised setting. For
example, it is usual to use the same number of documents per feedback iteration to modify the query. A
user, however, is unlikely to examine an identical number of documents per search iteration. Also, RF
experiments based on recall precision assume complete knowledge of the document collection: a fixed
set of relevant documents is known beforehand. In interactive searching this is also unrealistic, as what a
user finds relevant may change over time, e.g. [Kuh93, Ell89, SW99, Vak00a]. Additional methods are
required to test the effectiveness of RF algorithms in more realistic settings.
A final point regarding these measures of RF evaluation is that they may not be directly comparable:
each measure may appear to give different results depending on how the results are compared and on
what factors affect the retrieval. An example of this is given in Table 3 which shows the results of RF