\[
w_i = \log \frac{p_i \, (1 - q_i)}{q_i \, (1 - p_i)}
\]
Equation 11: Term weighting function based on a term's distribution
in relevant and non-relevant documents
where $w_i$ = the weight of term $x_i$, $p_i = P_q(x_i \mid \mathrm{rel})$ and $q_i = P_q(x_i \mid \overline{\mathrm{rel}})$
The function in Equation 11 was examined as a basis for ranking terms for query expansion. Robertson
[Rob90] argued that a weighting function that ranks terms for matching (as in Equation 10) may not be
appropriate for term selection⁹. That is, the degree to which a term indicates relevant material
(matching) is not necessarily related to how much the term will improve retrieval effectiveness if added to
a query (term selection). For term selection, Robertson proposed the formula in Equation 12, which
provides a better estimate of how much a term will increase a search's effectiveness. Terms should be
chosen for expansion based on the value given by Equation 12 rather than the w value from Equation
11. Equation 12 incorporates the w value of a term but also takes into account the difference between the
term's distributions in the relevant and non-relevant documents.
\[
a_i = w_i \, (p_i - q_i)
\]
Equation 12: Formula for ranking expansion terms based on term i's distribution
in relevant and non-relevant documents
where $a_i$ = the value of term i for query expansion, $w_i$ = the weight of term i given by Equation 11,
$p_i = P_q(x_i \mid \mathrm{rel})$ and $q_i = P_q(x_i \mid \overline{\mathrm{rel}})$
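The distinction between matching and term selection can be made concrete with a small sketch. It computes both the matching weight (Equation 11) and the selection value (Equation 12) for two hypothetical terms. The probabilities, the term labels and the function names `term_weight` and `expansion_value` are invented for illustration and do not come from [Rob90].

```python
import math

def term_weight(p, q):
    """Equation 11: log odds ratio of a term in relevant vs non-relevant documents."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

def expansion_value(p, q):
    """Equation 12: the Equation 11 weight scaled by the difference p - q."""
    return term_weight(p, q) * (p - q)

# Hypothetical (p_i, q_i) pairs, chosen only for illustration.
terms = {"rare": (0.05, 0.001), "common": (0.60, 0.10)}

# Rank the same terms by each score, highest first.
by_weight = sorted(terms, key=lambda t: term_weight(*terms[t]), reverse=True)
by_value = sorted(terms, key=lambda t: expansion_value(*terms[t]), reverse=True)
```

With these made-up figures, the rare term has the higher matching weight, but the common term has the higher selection value because it occurs in far more of the relevant documents. The two rankings therefore disagree, which is the situation Robertson's argument addresses.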
The formula in Equation 12, with the appropriate substitutions for $p_i$ and $q_i$, becomes the term ranking
function in Equation 13. This allows Equation 12 to be calculated from the distribution of terms
within the relevant documents and the collection. It should be made clear here that, although the same
calculations take place at each iteration of RF (the weighting functions are identical even if the
values are not), theoretically different probabilities are being calculated at each iteration: the
distributions used to calculate $P_q(\mathrm{rel} \mid x)$ and $P_q(\overline{\mathrm{rel}} \mid x)$ are different at each iteration [VR86].
\[
a_i = \log \frac{r_i \, (N - n_i - R + r_i)}{(R - r_i)\,(n_i - r_i)} \cdot \left( \frac{r_i}{R} - \frac{n_i - r_i}{N - R} \right)
\]
Equation 13: Term expansion ranking function
where $r_i$ = the number of relevant documents containing term i
$n_i$ = the number of documents containing term i
R = the number of relevant documents for query q
N = the number of documents in the collection
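A minimal sketch of the ranking function, computed directly from document counts, is given below. The text only refers to "appropriate substitutions"; the sketch assumes the usual estimates $p_i = r_i / R$ and $q_i = (n_i - r_i) / (N - R)$. The function name `expansion_rank` and the counts in the example are hypothetical. No smoothing is applied, so the function is only defined when none of the counts inside the logarithm is zero.

```python
import math

def expansion_rank(r, n, R, N):
    """Term expansion ranking value (Equation 13) from document counts.

    r: relevant documents containing the term
    n: documents containing the term
    R: relevant documents for the query
    N: documents in the collection
    Requires 0 < r < R, r < n and N - n - R + r > 0 (no smoothing is applied).
    """
    p = r / R              # assumed estimate of p_i
    q = (n - r) / (N - R)  # assumed estimate of q_i
    weight = math.log((r * (N - n - R + r)) / ((R - r) * (n - r)))
    return weight * (p - q)

# Hypothetical counts: 8 of 10 relevant documents contain the term,
# which occurs in 50 of the 1000 documents in the collection.
score = expansion_rank(r=8, n=50, R=10, N=1000)
```

Under the assumed estimates, the logarithm here equals the Equation 11 weight, so the result agrees with evaluating Equation 12 on $p_i$ and $q_i$ directly.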
The F4 reweighting function calculates weights for terms based on their distribution in the relevant and
non-relevant documents. The probabilistic model is thus a retrieval model that is specifically designed
for RF. At the start of a search, of course, there is no relevance information with which to estimate the
probabilities in Equation 10. One standard solution to this problem is to use a weighting function that does not
depend on relevance information, such as idf. After an initial ranking of documents has been produced and
relevance information has been obtained, a function such as F4 can be used to provide improved term weights.
The use of idf can be derived by substituting appropriate values for r, R and n into the F4 weight in Figure
6.
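The link between F4 and idf can be checked numerically. The F4 weight itself is given in Figure 6, which is not reproduced in this passage, so the sketch below assumes the commonly cited form in which 0.5 is added to each entry of the log odds ratio. Under that assumption, setting r = R = 0 (no relevance information) leaves log((N - n + 0.5) / (n + 0.5)), which decreases as n grows, as an idf weight does. The function names and the collection size are hypothetical.

```python
import math

def f4_weight(r, n, R, N):
    """Assumed form of F4: log odds ratio with 0.5 added to each entry."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def idf_like(n, N):
    """What the assumed F4 form reduces to when r = R = 0."""
    return math.log((N - n + 0.5) / (n + 0.5))

N = 1000  # hypothetical collection size

# Compare the two weights for terms of differing document frequency n.
comparison = [(n, f4_weight(0, n, 0, N), idf_like(n, N)) for n in (5, 50, 400)]
```

Each tuple in `comparison` holds a document frequency followed by the two weights, which coincide when no relevance information is available.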
It is possible to treat the query as an additional, relevant document and use the F4 weight; however,
this turns into something very like an idf weight [RWH+93]. An alternative to this was proposed by
⁹ In [Rob86] Robertson also discussed the appropriateness of the 0.5 addition to the entries in the F4 calculation,
arguing that better estimations are more suitable for selecting new query terms.