[Plot: Precision (y-axis, 0-100) against Recall (x-axis, 0-100), with one curve each for System 1 and System 2]
Figure 12: Example recall precision graph
Figure 12 shows the results of the two systems for a different test collection. In Figure 12, the two lines
cross at 70% recall, so we can say that, averaged over the queries tested, System 1 was better than
System 2 at low recall levels (it was initially better at retrieving the relevant documents). On the other
hand, System 2 was better at high recall levels (a user looking for all the relevant documents will
find them sooner with System 2).
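The comparison of two recall-precision curves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document identifiers, the two rankings (`run_a`, `run_b`) and the chosen recall levels are invented, a single query is used rather than an average over queries, and the common "maximum precision at any recall at or above the level" interpolation rule is assumed rather than taken from the text.

```python
def interpolated_precision(ranking, relevant, levels):
    """Interpolated precision of one ranked list at the given recall levels."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Interpolation rule (an assumption): precision at level r is the best
    # precision observed at any recall >= r, or 0.0 if r is never reached.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Hypothetical data: four relevant documents and two made-up rankings.
relevant = {"a", "b", "c", "d"}
run_a = ["a", "b", "x", "y", "z", "w", "c", "d"]
run_b = ["x", "a", "y", "b", "c", "d", "z", "w"]
levels = [0.25, 0.5, 0.75, 1.0]

curve_a = interpolated_precision(run_a, relevant, levels)
curve_b = interpolated_precision(run_b, relevant, levels)

for level, pa, pb in zip(levels, curve_a, curve_b):
    better = "A" if pa > pb else "B" if pb > pa else "tie"
    print(f"recall {level:.2f}: A={pa:.2f} B={pb:.2f} better={better}")
```

With this invented data, run A has the higher precision at the 25% and 50% recall levels and run B at the 75% and 100% levels, so the two curves cross somewhere between 50% and 75% recall. A real comparison would average the curves over all test queries before looking for a crossover.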
Although these measures have been widely criticised for being capable of misrepresentation [FMS91],
for not reflecting the dynamic, situational and subjective nature of information seeking [BI97], and for
not reflecting users' evaluation criteria, e.g. [Su94], they have remained popular, standard measures for
assessing an IR system's performance.
However, as early as the early 1970s, Chang et al. [CCR71] demonstrated that the evaluation of RF
algorithms poses certain problems for recall and precision. Given that RF, as described here, attempts
to improve recall and precision by using information in marked relevant documents, one of the main
effects of RF is usually to push the known relevant documents (footnote 14) to the top of the
document ranking. This ranking effect will artificially improve RP figures for the new document
ranking simply by re-ranking the known relevant documents. What is not directly tested is how good
the RF technique is at improving retrieval of unseen relevant documents: the feedback effect. Chang et
al. [CCR71] investigated three alternatives, originally suggested by Ide and briefly outlined here, to
measure the effect of feedback on the unseen relevant documents:
residual ranking: in this technique, the documents which are used in RF are removed from the
collection before evaluation. This will include the relevant and some non-relevant documents.
After RF, the RP figures are calculated on the remaining (residual) collection. The advantage of
this method is that it only considers the effect of feedback on the unseen relevant documents, but
the main disadvantage is that the feedback results are not comparable with the original ranking.
This is because the residual collection has fewer documents, and fewer relevant documents, than
the original collection.
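A minimal sketch of residual ranking follows. Everything in it is hypothetical: the document identifiers, the two rankings, the set of documents assumed to have been assessed for RF, and the use of average precision as a single-number stand-in for the RP figures.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    relevant = set(relevant)
    if not relevant:
        return None  # undefined when no relevant documents remain
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def residual(ranking, relevant, used_in_rf):
    """Drop the documents used in RF from both the ranking and the relevant set."""
    used = set(used_in_rf)
    return [d for d in ranking if d not in used], set(relevant) - used


# Hypothetical query with four relevant documents.
relevant = {"d1", "d2", "d5", "d9"}
initial = ["d1", "d3", "d2", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
# Assumption: the user assessed the top four documents of the initial ranking.
used_in_rf = {"d1", "d3", "d2", "d4"}
# Made-up ranking returned after feedback.
after_rf = ["d1", "d2", "d5", "d3", "d6", "d9", "d4", "d7", "d8", "d10"]

full_before = average_precision(initial, relevant)
full_after = average_precision(after_rf, relevant)
res_ranking, res_relevant = residual(after_rf, relevant, used_in_rf)
residual_after = average_precision(res_ranking, res_relevant)

print(f"full collection, before RF: {full_before:.3f}")
print(f"full collection, after RF:  {full_after:.3f}")
print(f"residual collection, after RF: {residual_after:.3f}")
```

In this invented case the full-collection figure after RF includes the benefit of moving the already known documents d1 and d2 to the top, whereas the residual figure is computed over only the two unseen relevant documents in a six-document collection, which is why it cannot be set directly against the original ranking's figure. The `None` return marks a query whose relevant documents have all been removed.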
A further difficulty is that, at each successive iteration of feedback, RP figures may be based on
different numbers of queries. This arises because relevant documents are removed from the
collection. If all the relevant documents are removed for a query, then this query cannot be used in
subsequent iterations of feedback as there are no relevant documents upon which to calculate
Footnote 14: These are the relevant documents that are used for RF.