2.2.3 Probabilistic model
In the probabilistic model, suggested by Maron and Kuhns [MK60] and developed by, amongst others,
Robertson and Sparck Jones [RSJ76] and Van Rijsbergen [VR79], documents and queries are also
viewed as vectors, but the vector space similarity measure is replaced by a probabilistic matching
function. The probabilistic model is based on estimating the probability that a document will be
relevant to a user, given a particular query. The higher this estimated probability, the more likely the
document is to be relevant to the user⁴. This is instantiated in the probabilistic ranking principle
[Rob77]:
If a reference retrieval system's response to each request is a ranking of the
documents in the collection in order of decreasing probability of relevance to the
user who submitted the request, where the probabilities are estimated as
accurately as possible on the basis of whatever data have been made available to
the system for this purpose, the overall effectiveness of the system to its user will
be the best that is obtainable on the basis of those data.
The estimated probability of relevance can be expressed as $P_q(\mathrm{rel} \mid x)$, the probability of relevance
given a document x and a query q. This probability can be used to decide whether or not to retrieve a
document: if $P_q(\mathrm{rel} \mid x) = 0$ then the probability of relevance given x is 0, and x should not be
retrieved⁵.
This can be refined by also considering the probability of non relevance given x and q,
$P_q(\overline{\mathrm{rel}} \mid x)$. If $P_q(\mathrm{rel} \mid x) > P_q(\overline{\mathrm{rel}} \mid x)$ then it can be asserted that the probability of relevance is greater than the
probability of non relevance and hence x should be retrieved⁶. Thresholds may also be used, i.e. the
difference between the probability of relevance and the probability of non relevance must be greater
than some threshold value before x is retrieved ($P_q(\mathrm{rel} \mid x) - P_q(\overline{\mathrm{rel}} \mid x) > \mathit{threshold}$). In this case
threshold is a value set by the user or system, in order to further restrict the retrieval function.
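The retrieval rules above can be sketched in a few lines of Python. This is only an illustration: the function name `should_retrieve` and its parameters are hypothetical, and the two probabilities are assumed to have been estimated already by some other component.

```python
def should_retrieve(p_rel: float, p_nonrel: float, threshold: float = 0.0) -> bool:
    """Decide whether to retrieve document x for query q.

    p_rel     -- estimated P_q(rel | x)
    p_nonrel  -- estimated P_q(non-rel | x)
    threshold -- hypothetical parameter: minimum margin by which relevance
                 must exceed non relevance (0.0 gives the plain comparison)
    """
    for p in (p_rel, p_nonrel):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Strict inequality: when the two probabilities are equal the document
    # is treated as non relevant and is not retrieved.
    return (p_rel - p_nonrel) > threshold
```

With the default threshold, `should_retrieve(0.7, 0.3)` retrieves the document, whereas a threshold of 0.5 rejects it because the margin between the two probabilities is only about 0.4. A document with a zero probability of relevance is never retrieved for any non-negative threshold.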
Having decided which documents to retrieve, the odds of relevance to non relevance, Equation 7, can
be used as a document ranking function: the higher the ratio of the probability of relevance to non
relevance, given x, then the more likely document x is to be relevant to a user.
$$\frac{P_q(\mathrm{rel} \mid x)}{P_q(\overline{\mathrm{rel}} \mid x)}$$
Equation 7: Odds of relevance to non relevance for document x and query q
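As a minimal sketch of ranking by the odds in Equation 7, the following Python assumes each retrieved document already has estimated probabilities of relevance and non relevance. The names `odds_of_relevance` and `rank_by_odds`, the dictionary layout, and the treatment of a zero denominator as infinite odds are all assumptions made for illustration.

```python
import math


def odds_of_relevance(p_rel: float, p_nonrel: float) -> float:
    """Return P_q(rel | x) / P_q(non-rel | x), the odds of Equation 7."""
    if p_nonrel == 0.0:
        # Assumption of this sketch: a document that cannot be non relevant
        # gets infinite odds, unless its relevance estimate is also zero.
        return math.inf if p_rel > 0.0 else 0.0
    return p_rel / p_nonrel


def rank_by_odds(estimates: dict) -> list:
    """Order document ids by decreasing odds of relevance.

    estimates maps a document id to a (p_rel, p_nonrel) pair.
    """
    return sorted(
        estimates,
        key=lambda doc_id: odds_of_relevance(*estimates[doc_id]),
        reverse=True,
    )


# Hypothetical estimates for three retrieved documents.
estimates = {"d1": (0.6, 0.4), "d2": (0.9, 0.1), "d3": (0.75, 0.25)}
ranking = rank_by_odds(estimates)  # d2 (odds 9), then d3 (odds 3), then d1 (odds 1.5)
```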
Bayes' theorem [Bay63] can be used to calculate $P_q(\mathrm{rel} \mid x)$ and $P_q(\overline{\mathrm{rel}} \mid x)$. Equation 8 demonstrates
this for the relevance case.
$$P_q(\mathrm{rel} \mid x) = \frac{P_q(x \mid \mathrm{rel})\, P_q(\mathrm{rel})}{P(x)}$$
Equation 8: Calculation of $P_q(\mathrm{rel} \mid x)$ through Bayesian inversion
where
$P_q(\mathrm{rel})$ is the prior probability that any document in the collection is relevant to q
$P_q(x \mid \mathrm{rel})$ is the probability of observing document x given relevance information
$P(x)$ is the probability of observing document x irrespective of relevance
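A small numerical sketch may help make Equation 8 concrete. The function `posterior_relevance` below is hypothetical, and it makes one assumption the text does not state: P(x) is expanded with the law of total probability over the relevant and non relevant cases, so it also needs an estimate of the probability of observing x given non relevance. The input values are invented purely for illustration.

```python
def posterior_relevance(p_x_given_rel: float, p_rel: float,
                        p_x_given_nonrel: float) -> float:
    """Compute P_q(rel | x) by Bayesian inversion (Equation 8).

    Assumption of this sketch: P(x) is obtained by total probability,
        P(x) = P(x | rel) * P(rel) + P(x | non-rel) * (1 - P(rel)).
    """
    p_x = p_x_given_rel * p_rel + p_x_given_nonrel * (1.0 - p_rel)
    if p_x == 0.0:
        raise ValueError("P(x) is zero, so the posterior is undefined")
    return p_x_given_rel * p_rel / p_x


# Invented values: 10% of the collection is relevant to q, and x is ten
# times more likely to be observed among relevant documents.
p = posterior_relevance(p_x_given_rel=0.2, p_rel=0.1, p_x_given_nonrel=0.02)
# Numerator: 0.2 * 0.1 = 0.02; P(x) = 0.02 + 0.02 * 0.9 = 0.038.
print(round(p, 3))
```

Under these invented figures the posterior is 0.02 / 0.038, roughly 0.526, even though the prior probability of relevance was only 0.1.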
⁴ The probabilistic model measures the probability of relevance, i.e. the probability that a document will be
relevant, not the degree of relevance as is sometimes suggested. A good discussion of the difference between these
two notions is found in [RB78].
⁵ In an operational system $P_q(\mathrm{rel} \mid x)$ will generally only equal 0 if x does not contain any query terms. This rule
then decides only to retrieve those documents that contain at least one query term.
⁶ In the case where the two probabilities are equal, it is arbitrarily decided that x is non relevant [VR79].