on the same collection¹⁵ but evaluated using the three RF evaluation schemes. An initial document
ranking, for each query, was obtained using the idf weighting function, followed by four iterations of
RF, in which the top 6 expansion terms were added, based on an F4 ranking of expansion terms. 50 new
documents were used in each iteration of feedback. After feedback all query terms were weighted using
the idf weighting scheme and these values were used to score documents. Table 3 gives the percentage
change, over no feedback, after four iterations of feedback using each of the three evaluation
techniques.
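The term weighting and expansion step described above can be sketched as follows. This is an illustrative sketch only: the text does not give the formulas, so the code assumes the common forms of idf, log(N/n), and of the F4 relevance weight with the 0.5 correction. The function names and the `term_stats` structure are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Assumed idf form: log(N / n) for a term occurring in n of N documents."""
    return math.log(n_docs / doc_freq)

def f4(r, n, R, N):
    """Assumed F4 relevance weight with the 0.5 correction.

    r: known relevant documents containing the term
    n: documents containing the term
    R: known relevant documents for the query
    N: documents in the collection
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def top_expansion_terms(term_stats, R, N, k=6):
    """Rank candidate terms by F4 and return the k best.

    term_stats maps each candidate term to a pair (r, n).
    """
    ranked = sorted(term_stats,
                    key=lambda t: f4(*term_stats[t], R, N),
                    reverse=True)
    return ranked[:k]
```

In the experiment described here, k would be 6, and the selected terms would then be re-weighted with idf before documents are scored.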
AP 88                Full        Residual      Residual        Test and
                     freezing    collection    collection      control
                                 (removal)     (no removal)
%age increase over   +2.9%       -77.0%        -25.0%          +21.5%
no feedback

Table 3: Example RF evaluation
As can be seen from Table 3, the evaluation schemes differ markedly in how they describe the retrieval effectiveness
of the system. Full freezing (column 2) gives a small increase in the effectiveness of the system. The
test and control method gives a larger percentage increase in effectiveness (column 5). These two
approaches give different absolute performance figures (average precision) because they use different data to
calculate idf and F4 values, and because the collections they search do not contain identical terms. The test and control
method used two fewer queries (as all the relevant documents for these queries appeared in the test
collection), and several of the queries were expanded by terms that appeared in the test collection but
not the control collection¹⁶. These differences cause the different performance figures for the two
evaluation methods.
The residual collection method (column 3) gives a large drop in retrieval effectiveness. This is because
the residual collection method eliminates queries that have no relevant documents in the residual
section of the collection. This means that queries for which all relevant documents have been retrieved
in early iterations of feedback have been removed from the evaluation. The queries that are being used
to calculate average precision are the ones for which the system finds it difficult to retrieve the
remaining relevant documents¹⁷. If we do not remove queries when all relevant documents are found
and instead use the RP figures from the previous iteration, then we obtain the figure in column 4 for
residual collection. This is an attempt to soften the effect of removing queries that perform well. This
also shows a drop in retrieval effectiveness, but not so severe a drop as in column 3. The drop in
retrieval effectiveness is caused, again, by the queries for which the system finds it difficult
to retrieve all relevant documents.
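The difference between the two residual collection figures (columns 3 and 4) comes down to how exhausted queries are treated when averaging. A minimal sketch follows, assuming per-query precision figures are already available. The function name and the dictionary layout are hypothetical and are not taken from the experiment itself.

```python
def mean_residual_precision(current, previous, remaining_relevant,
                            remove_exhausted=True):
    """Average per-query precision over the residual collection.

    current, previous: dict query -> precision figure at this / the previous iteration
    remaining_relevant: dict query -> relevant documents left in the residual collection
    remove_exhausted: if True, drop queries with no remaining relevant documents
                      (the "removal" variant); if False, carry forward their figure
                      from the previous iteration (the "no removal" variant).
    """
    scores = []
    for query, n_left in remaining_relevant.items():
        if n_left > 0:
            scores.append(current[query])
        elif not remove_exhausted:
            scores.append(previous[query])
    return sum(scores) / len(scores) if scores else 0.0
```

With removal, only the hard queries contribute to the average. Without removal, the queries that were already answered well keep their earlier figures, which softens the drop.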
An alternative method of examining RF performance is to plot the average precision values at each
iteration of feedback, as in Figure 13. We can see that different methods give differently shaped graphs.
The freezing graph gives slight, but steady, increases in retrieval effectiveness at each iteration of
feedback. The test and control method gives a large initial increase followed by a decrease at the last
iteration of feedback. The two residual methods, however, give graphs that are very different from these but similar in shape to each other:
a large initial decrease followed by increases in performance at later iterations.
The graphs can be used to highlight interesting areas where RF is working well or where it is
operating poorly. However, as with recall and precision, the graphs can be misleading: all four lines
plotted in Figure 13 are evaluating the same feedback technique on the same collection. The point is
that the evaluation measures are calculating different aspects of feedback: freezing is measuring
cumulative effectiveness, residual collection is measuring the effectiveness of retrieving only the
remaining relevant documents and test and control is measuring the relative performance of the
modified queries produced at each iteration.
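The distinction between cumulative and residual measurement can be illustrated with a small sketch. It assumes the usual formulations of freezing (documents already seen keep their top ranks) and residual collection (seen documents are removed before evaluation), and it uses non-interpolated average precision as a stand-in for whatever averaging the experiment used. All function names are hypothetical. Test and control is omitted because it requires splitting the collection in two.

```python
def freezing_ranking(seen, new_ranking):
    """Keep already-seen documents frozen at the top, then append unseen ones."""
    seen_set = set(seen)
    return list(seen) + [d for d in new_ranking if d not in seen_set]

def residual_ranking(seen, new_ranking):
    """Drop already-seen documents; only the residual collection is ranked."""
    seen_set = set(seen)
    return [d for d in new_ranking if d not in seen_set]

def average_precision(ranking, relevant):
    """Non-interpolated average precision of a ranking (an assumed measure)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy example: two documents were shown for feedback, one of them relevant.
seen = ["d1", "d2"]
new_ranking = ["d3", "d1", "d4", "d2"]   # ranking produced by the modified query
relevant = {"d1", "d3"}

frozen_ap = average_precision(freezing_ranking(seen, new_ranking), relevant)
residual_ap = average_precision(residual_ranking(seen, new_ranking),
                                relevant - set(seen))
```

The same modified query yields two different figures here: the frozen ranking is credited for the relevant document found before feedback, while the residual ranking is judged only on the relevant document that remained to be found.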
¹⁵ AP (Associated Press) collection 1988.
¹⁶ This was also true for one of the original query terms.
¹⁷ The remaining queries may also include some queries that have a large number of relevant documents, but this
is unlikely to be the case in this test as 200 documents have been used for feedback whereas the queries have an
average of only 35 relevant documents per query.