After Bayesian inversion and deletion of P(x) (which is identical for both the relevance and non
relevance case), the odds function from Equation 7 turns into Equation 9a.
The probability of relevance, $P_q(rel)$, and the probability of non relevance, $P_q(\overline{rel})$,
are identical for all x. That is, when we use the odds in Equation 7 to rank documents, the ranking
depends on the values of the probabilities $P_q(x \mid rel)$ and $P_q(x \mid \overline{rel})$, not on
the values $P_q(rel)$ and $P_q(\overline{rel})$. We can therefore eliminate these elements and arrive
at the odds in Equation 9b. This is then the odds of observing x given relevance or non relevance.
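$$\frac{P_q(x \mid rel)\,P_q(rel)}{P_q(x \mid \overline{rel})\,P_q(\overline{rel})} \quad \text{(a)} \qquad\qquad \frac{P_q(x \mid rel)}{P_q(x \mid \overline{rel})} \quad \text{(b)}$$
Equation 9: Odds of relevance, or non relevance, having observed document x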
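The step from form (a) to form (b) of Equation 9 can be checked numerically. The following is a minimal sketch, not part of the original model description: the class `Doc`, the functions `odds_9a`, `odds_9b` and `rank`, and all probability values are hypothetical and chosen only for illustration. It shows that multiplying every document's odds by the same prior odds leaves the ranking unchanged.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    p_x_given_rel: float     # P_q(x | rel), hypothetical value
    p_x_given_nonrel: float  # P_q(x | non relevance), hypothetical value


def odds_9a(doc: Doc, p_rel: float) -> float:
    """Form (a): likelihood ratio multiplied by the prior odds of relevance."""
    p_nonrel = 1.0 - p_rel
    return (doc.p_x_given_rel * p_rel) / (doc.p_x_given_nonrel * p_nonrel)


def odds_9b(doc: Doc) -> float:
    """Form (b): likelihood ratio only, with the priors dropped."""
    return doc.p_x_given_rel / doc.p_x_given_nonrel


def rank(docs, key):
    """Return document names sorted by decreasing score."""
    return [d.name for d in sorted(docs, key=key, reverse=True)]


# Illustrative values only; they are not estimated from any collection.
docs = [
    Doc("d1", 0.02, 0.010),
    Doc("d2", 0.05, 0.005),
    Doc("d3", 0.01, 0.020),
]
p_rel = 0.1  # assumed prior P_q(rel), identical for every document

ranking_a = rank(docs, lambda d: odds_9a(d, p_rel))
ranking_b = rank(docs, odds_9b)
print(ranking_a, ranking_b)
```

Because the prior odds are a positive constant for a given query, any other value of `p_rel` strictly between 0 and 1 would produce the same ordering.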
The odds in Equation 9 refer to the probability of relevance, and non relevance, after viewing the actual
document text rather than the vector representation of the document. That is, it measures the odds of
relevance to non relevance based on the content of the document and is independent of the document
representation. This means that the model can be used for many different types of document indexing
but it also means that Equation 9 must ultimately be expressed as a retrieval function based on the
specific document indexing technique used to represent the documents.
There are many probabilistic models based on the model outlined so far in this section. In the remainder
of this section we shall describe the transformation from Equation 9 to a function based on the term
based representation outlined in section 2.1. Specifically the discussion will be based on the
probabilistic model known as the Binary Independence Model, as this is the most traditional variant of
the overall probabilistic approach. This model was one of the first probabilistic models of IR, and will
be used as an example of how the theoretical model is transformed into an actual retrieval model.
Before converting Equation 9 into an equation that can be estimated based on the probability of
relevance and non relevance of the terms in document x, it is necessary to consider how the
probabilities of relevance and non relevance interact. In particular, two aspects of retrieval are
important: the independence of terms and what information is used to order documents.
The probabilistic model assumes that terms are distributed independently of other terms, that is, the
probability of seeing term t in a document is not affected by seeing term s in the same document. This is
a simplifying assumption that reduces the computational complexity of the model. However, it is
necessary to define over what sets the independence holds. Two versions of the independence
assumption were proposed in [RSJ76]. Both term independence assumptions assume that terms, query
terms in particular, are distributed independently in the set of relevant documents: the probability of a
term appearing in the relevant documents is not dependent on the probabilities of other terms appearing
in the relevant documents. The two assumptions differ in whether the relevant document set should be
distinguished from the whole document collection or only from the set of non relevant documents.
Independence assumption I1: The distribution of terms in relevant documents is independent and their
distribution in all documents is independent
Independence assumption I2: The distribution of terms in relevant documents is independent and their
distribution in irrelevant[7] documents is independent
These two versions of the independence assumption are important in distinguishing whether we should
measure the difference in the probability of a term's occurrence against the non relevant documents (I2)
or against its probability of occurrence in the collection as a whole (I1).
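To make the distinction concrete, the sketch below contrasts the two reference sets for a single term and shows how term independence lets the probability of a binary document vector be written as a product over terms. It is an illustration under stated assumptions, not the estimation procedure of [RSJ76]: the count names `N` (documents in the collection), `R` (relevant documents), `n` (documents containing the term) and `r` (relevant documents containing the term), the simple proportion estimates, the function names, and all numbers are hypothetical.

```python
from math import prod


def term_probabilities(N: int, R: int, n: int, r: int) -> dict:
    """Estimate, as simple proportions, the probability that a term occurs
    in a relevant document, a non relevant document, and any document."""
    return {
        "relevant": r / R,
        "non_relevant": (n - r) / (N - R),  # reference set used under I2
        "collection": n / N,                # reference set used under I1
    }


def likelihood(x, probs) -> float:
    """Probability of a binary term vector x, assuming independent terms:
    multiply p for each present term and (1 - p) for each absent term."""
    return prod(p if present else 1.0 - p for present, p in zip(x, probs))


# Hypothetical collection: 1000 documents, 20 of them relevant.
N, R = 1000, 20
counts = {"t1": (100, 10), "t2": (400, 8)}  # term -> (n, r)
est = {t: term_probabilities(N, R, n, r) for t, (n, r) in counts.items()}

# Ratio of a term's probability in relevant documents to each reference set.
ratio_i1 = {t: e["relevant"] / e["collection"] for t, e in est.items()}
ratio_i2 = {t: e["relevant"] / e["non_relevant"] for t, e in est.items()}

# Probability of a document containing t1 but not t2, given relevance.
p_rel_terms = [est["t1"]["relevant"], est["t2"]["relevant"]]
p_x_given_rel = likelihood((1, 0), p_rel_terms)
print(ratio_i1, ratio_i2, p_x_given_rel)
```

With these counts, t2 is equally likely in every set, so both ratios equal 1 and the term carries no evidence, while t1 scores slightly higher against the non relevant set than against the whole collection, because the collection also contains the relevant documents in which t1 is frequent.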
The probabilistic model ranks documents according to their probability of being relevant to a query:
the ordering principle. Two versions of this principle distinguish between the case where this
[7] The labels irrelevant and non relevant are treated as synonymous in this paper.