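\[
\begin{aligned}
F_1:\quad w_{x_i} &= \log \frac{P_q(x_i \mid \text{rel})}{P_q(x_i)}
  = \log \frac{r/R}{n/N} \\[6pt]
F_2:\quad w_{x_i} &= \log \frac{P_q(x_i \mid \text{rel})}{P_q(x_i \mid \overline{\text{rel}})}
  = \log \frac{r/R}{(n-r)/(N-R)} \\[6pt]
F_3:\quad w_{x_i} &= \log \frac{P_q(x_i \mid \text{rel}) \,/\, P_q(\bar{x}_i \mid \text{rel})}{P_q(x_i) \,/\, P_q(\bar{x}_i)}
  = \log \frac{r/(R-r)}{n/(N-n)} \\[6pt]
F_4:\quad w_{x_i} &= \log \frac{P_q(x_i \mid \text{rel}) \,/\, P_q(\bar{x}_i \mid \text{rel})}{P_q(x_i \mid \overline{\text{rel}}) \,/\, P_q(\bar{x}_i \mid \overline{\text{rel}})}
  = \log \frac{r/(R-r)}{(n-r)/(N-n-R+r)}
\end{aligned}
\]
Figure 5: Term weighting functions $F_1$ to $F_4$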
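The count-based forms of the four functions can be sketched in a few lines of Python. This is an illustration only: the function name `rsj_weights` and the example counts are hypothetical, and the counts are assumed to have their usual meaning (N documents in the collection, n containing the term, R known relevant documents, r of those containing the term). The sketch also assumes every count and difference is non-zero; it does nothing to guard against division by zero or the logarithm of zero.

```python
import math

def rsj_weights(N, n, R, r):
    """Return the F1..F4 weights of one term from document counts.

    N: documents in the collection
    n: documents containing the term
    R: known relevant documents
    r: known relevant documents containing the term
    """
    f1 = math.log((r / R) / (n / N))
    f2 = math.log((r / R) / ((n - r) / (N - R)))
    f3 = math.log((r / (R - r)) / (n / (N - n)))
    f4 = math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
    return {"F1": f1, "F2": f2, "F3": f3, "F4": f4}

# Hypothetical counts, chosen only to exercise the formulas.
weights = rsj_weights(N=1000, n=100, R=20, r=10)
for name, value in weights.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the term occurs in half of the known relevant documents but in only a tenth of the collection, so all four weights are positive.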
In [RSJ76], Robertson and Sparck Jones used the four term weighting schemes to carry out two sets of
experiments. The first set was based on retrospective weighting, which involves deriving optimal weights
to retrieve the relevant documents already found, i.e. the known relevant set. The second set of
experiments was based on predictive weighting, which uses the weights from the
retrospective stage to retrieve new documents. If the known relevant set is a representative sample of all
relevant documents, then predictive weighting should be better at retrieving unseen relevant documents
than the original term weights. Naturally, it is the latter, predictive, case that is mainly of interest, as RF
is intended to retrieve relevant documents that the user has not yet seen.
All functions outperformed both no relevance weighting and the idf function. $F_1$ and $F_2$ performed
within the same range, as did $F_3$ and $F_4$, with $F_3$ and $F_4$ outperforming $F_1$ and $F_2$, and
$F_4$ slightly outperforming $F_3$. This confirms Robertson and Sparck Jones' intuition that ordering
principle O2 is correct and that it is necessary to consider both presence and absence of query terms.
No conclusive evidence was provided to distinguish between the two versions of the independence
assumption; however, Robertson and Sparck Jones favoured the second assumption, I2, as the more realistic.
Given that the preferred weighting scheme is $F_4$, the odds function in Figure 6 (Equation 10a) can be
converted to that of Equation 10b by eliminating the division operators. By noting that
$P_q(\bar{x}_i \mid \text{rel}) = 1 - P_q(x_i \mid \text{rel})$ and
$P_q(\bar{x}_i \mid \overline{\text{rel}}) = 1 - P_q(x_i \mid \overline{\text{rel}})$, it is possible to
convert the representation of $F_4$ in Figure 6 to that in Equation 10c.
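\[
\begin{aligned}
w_{x_i} &= \log \frac{P_q(x_i \mid \text{rel}) \,/\, P_q(\bar{x}_i \mid \text{rel})}{P_q(x_i \mid \overline{\text{rel}}) \,/\, P_q(\bar{x}_i \mid \overline{\text{rel}})} && \text{(a)} \\[6pt]
        &= \log \frac{P_q(x_i \mid \text{rel}) \, P_q(\bar{x}_i \mid \overline{\text{rel}})}{P_q(x_i \mid \overline{\text{rel}}) \, P_q(\bar{x}_i \mid \text{rel})} && \text{(b)} \\[6pt]
        &= \log \frac{P_q(x_i \mid \text{rel}) \, \bigl(1 - P_q(x_i \mid \overline{\text{rel}})\bigr)}{P_q(x_i \mid \overline{\text{rel}}) \, \bigl(1 - P_q(x_i \mid \text{rel})\bigr)} && \text{(c)}
\end{aligned}
\]
Equation 10: Term weighting function based on a term's distribution in relevant and non-relevant documents,
where $w_{x_i}$ = the weight of term $x_i$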
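As a consistency check (a worked illustration, not part of the original derivation), the probabilities in Equation 10c can be replaced by count-based estimates. Assuming $P_q(x_i \mid \text{rel}) \approx r/R$ and $P_q(x_i \mid \overline{\text{rel}}) \approx (n-r)/(N-R)$, the same estimates that appear in $F_2$, Equation 10c reduces to the count form of $F_4$:

\[
\begin{aligned}
w_{x_i} &= \log \frac{\dfrac{r}{R}\left(1 - \dfrac{n-r}{N-R}\right)}{\dfrac{n-r}{N-R}\left(1 - \dfrac{r}{R}\right)}
         = \log \frac{\dfrac{r}{R} \cdot \dfrac{N-n-R+r}{N-R}}{\dfrac{n-r}{N-R} \cdot \dfrac{R-r}{R}} \\[8pt]
        &= \log \frac{r\,(N-n-R+r)}{(n-r)\,(R-r)}
         = \log \frac{r/(R-r)}{(n-r)/(N-n-R+r)}
\end{aligned}
\]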
This equation (Equation 10c), which expresses the $F_4$ function solely in terms of the presence of a
term in the relevant and non-relevant documents, can alternatively be represented as in Equation 11.
The probability of relevance of a document, then, is measured as the sum of the term weights of the
query terms in the document, i.e. the sum of the $F_4$ weights of each query term in the document.
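The scoring rule described above, summing the $F_4$ weights of the query terms that occur in a document, can be sketched as follows. The names `f4_weight`, `score` and `rank`, the term statistics and the toy documents are all hypothetical, introduced only for illustration. Documents are treated as plain sets of terms, and all counts are assumed to be non-zero.

```python
import math

def f4_weight(N, n, R, r):
    """F4 weight of a term from counts (assumes no zero counts)."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))

def score(document_terms, query_weights):
    """Sum the weights of the query terms present in the document."""
    return sum(w for term, w in query_weights.items() if term in document_terms)

def rank(documents, query_weights):
    """Return document ids ordered by decreasing score."""
    return sorted(documents,
                  key=lambda doc_id: score(documents[doc_id], query_weights),
                  reverse=True)

# Hypothetical per-term counts (N, n, R, r) for a two-term query.
term_counts = {
    "retrieval": (1000, 100, 20, 10),
    "feedback": (1000, 50, 20, 8),
}
query_weights = {term: f4_weight(*counts) for term, counts in term_counts.items()}

# Toy documents, each represented as the set of terms it contains.
docs = {
    "d1": {"retrieval", "feedback", "model"},
    "d2": {"feedback", "user"},
    "d3": {"retrieval"},
    "d4": {"index"},
}
ranking = rank(docs, query_weights)
print(ranking)
```

A document that contains none of the query terms scores zero under this rule, so it falls to the bottom of the ranking whenever the query term weights are positive.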
11