probability is estimated based only on the presence of query terms within a document or the presence
and absence of terms.
Ordering principle O1: probable relevance is based only on the presence of search terms in
documents.
Ordering principle O2: probable relevance is based both on the presence of search terms in
documents and on their absence from documents.
Four weighting schemes, F1 to F4, can be derived from the combination of the two variants of the
independence assumption and the two ordering principles (Table 1).
                         Independence     Independence
                         assumption I1    assumption I2
Ordering principle O1    F1               F2
Ordering principle O2    F3               F4
Table 1: Term weighting functions derived from the combination of independence
assumptions and ordering principles
In [RSJ76] each of these possible strategies was instantiated to give an actual method for weighting a
query term, summarised in Figure 5. The weighting methods themselves are based on a contingency
table, Table 2, which converts the probability values into values that can be calculated from term
occurrence information.
             rel      non-rel
x_i = 1      r        n - r            n
x_i = 0      R - r    N - n - R + r    N - n
             R        N - R
Table 2: Contingency table to calculate term weights
where r = the number of relevant documents containing term x_i
n = the number of documents containing term x_i
R = the number of relevant documents for query q
N = the number of documents in the collection
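
As an illustration of how the cells of Table 2 follow from the four counts, the sketch below fills in the table from r, n, R and N. The function name, the dictionary layout and the example counts are hypothetical and chosen only for this illustration; the totals in the last row and column are derived from the definitions above.

```python
def contingency_table(r, n, R, N):
    """Fill in the cells of Table 2 from the four counts.

    r: relevant documents containing the term
    n: documents containing the term
    R: relevant documents for the query
    N: documents in the collection
    The dictionary layout is an illustrative choice, not part of the source.
    """
    if not (0 <= r <= R <= N and r <= n <= N and n - r <= N - R):
        raise ValueError("counts are inconsistent")
    return {
        "present": {"rel": r, "non_rel": n - r, "total": n},
        "absent": {"rel": R - r, "non_rel": N - n - R + r, "total": N - n},
        "total": {"rel": R, "non_rel": N - R, "total": N},
    }


# Made-up counts: 100 documents, 4 relevant, term occurs in 10, 2 of them relevant.
table = contingency_table(r=2, n=10, R=4, N=100)
```

With these counts, the four inner cells (2, 8, 2 and 88) sum to N, which is a quick check that the table has been filled in consistently.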
Each of the four term weighting functions is a ratio of two proportions (see footnote 8):
F1 is the ratio of the proportion of relevant documents in which the query term t occurs (ordering
principle O1) to the proportion of all documents in which t occurs (independence assumption I1).
F2 is the ratio of the proportion of relevant documents in which the query term t occurs (ordering
principle O1) to the proportion of all non-relevant documents in which t occurs (independence
assumption I2).
F3 and F4 both use odds:
F3 is the ratio of `relevance odds' (the ratio of relevant documents containing term t to relevant
documents not containing t; ordering principle O2) to `collection odds' (the ratio of documents
containing t to documents not containing t; independence assumption I1).
F4 is the ratio of relevance odds (ordering principle O2) to `non-relevance odds' (the ratio of
non-relevant documents containing t to non-relevant documents not containing t;
independence assumption I2).
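
The four descriptions above can be sketched directly in terms of the counts r, n, R and N from Table 2. The function names and example counts are hypothetical. Taking the logarithm of each ratio is an assumption based on footnote 8, which mentions taking logs. The sketch applies no correction for zero cells, so it fails when any cell it divides by, or takes the log of, is zero.

```python
import math


def f1(r, n, R, N):
    # Proportion of relevant documents containing t
    # over proportion of all documents containing t (O1, I1).
    return math.log((r / R) / (n / N))


def f2(r, n, R, N):
    # Proportion of relevant documents containing t
    # over proportion of non-relevant documents containing t (O1, I2).
    return math.log((r / R) / ((n - r) / (N - R)))


def f3(r, n, R, N):
    # Relevance odds over collection odds (O2, I1).
    return math.log((r / (R - r)) / (n / (N - n)))


def f4(r, n, R, N):
    # Relevance odds over non-relevance odds (O2, I2).
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))


# Made-up counts: 100 documents, 4 relevant, term occurs in 10, 2 of them relevant.
counts = dict(r=2, n=10, R=4, N=100)
weights = [f(**counts) for f in (f1, f2, f3, f4)]
```

For these counts the underlying ratios are 5, 6, 9 and 11, so the weight grows as the scheme moves from F1 to F4.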
8. It may be the case, especially when using small samples, that some of the values in the weights are zero,
resulting in an error when taking logs. The solution is to add 0.5 to each cell in the numerator and denominator of
each function.
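
A minimal sketch of the 0.5 correction described in this footnote, shown for F4 only, where the numerator and denominator are built from the four inner cells of Table 2. The function name, the parameter k and the example counts are hypothetical, and the log form of the weight is an assumption based on the footnote's mention of taking logs.

```python
import math


def f4_smoothed(r, n, R, N, k=0.5):
    # Add k (0.5 by default) to each cell of the contingency table
    # before forming the two odds, so that no cell is zero.
    relevance_odds = (r + k) / (R - r + k)
    non_relevance_odds = (n - r + k) / (N - n - R + r + k)
    return math.log(relevance_odds / non_relevance_odds)


# Made-up small sample in which no relevant document contains the term (r = 0).
# Without the correction the relevance odds would be zero and the log undefined.
weight = f4_smoothed(r=0, n=10, R=4, N=100)
```

With the correction the weight for this term is finite and slightly negative, rather than an error.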