the combination of evidence in RF: combining multiple queries, retrieval algorithms or feedback
algorithms, and in section 3.5 we discuss pseudo RF: employing RF without the user's involvement.
3.1 Dependence between terms
The vector space and probabilistic models assume that terms are independent of each other, that is, the
presence of one term in a document does not alter the probability of seeing another term in the same
document. Although this simplifying assumption has facilitated the construction of successful retrieval
systems, it is not true. Words are related by use, for example in phrases, and their similarity of
occurrence in documents can reflect underlying semantic relations between terms.
Incorporating information on co-occurrence patterns of terms in documents may improve retrieval
effectiveness as indicated by the Association Hypothesis [VR79]:
If an index term is good at discriminating relevant from irrelevant documents
then any closely associated index term is also likely to be good at this.
Authors such as Spiegel and Bennet [SB64] suggested, as early as 1964, that dependency information of
this kind may be used to choose further search terms for query expansion. Not all query expansion
based on dependence information is used for RF; for example, we could use dependency information to
automatically expand initial queries in the absence of relevance information from the user. However,
three investigations of dependency information with an RF connection are outlined below.
Van Rijsbergen, Harper and Porter [VRHP81] proposed using a maximum spanning tree (MST) in
which each node represents a term and each link represents the association or similarity between the
two terms. The MST links each term to its most similar terms as measured by the association measure.
The association measure used in [VRHP81] was the Expected Mutual Information Measure (EMIM),
based on the probability distribution of the two terms. The MST can potentially be used in
many ways to expand a query. In [VRHP81] the most similar terms to the query terms (the ones directly
linked in the MST) are added to the query. The query and expansion terms in [VRHP81] are also
reweighted by a weight based on the F4 weight. On the whole, Van Rijsbergen et al. show that their
term dependence approach performs better than the F4 term independence weighting scheme. They also
demonstrate the relative robustness of the MST approach, in that, although the EMIM-based MST gives
superior results, alternative association measures do not give significantly different results.
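The MST expansion described above can be illustrated with a minimal sketch. It is not the implementation of [VRHP81]: it assumes that EMIM is estimated from simple presence/absence counts over a toy collection, that the tree is built with Kruskal's algorithm, and it omits the F4-based reweighting. The names emim, maximum_spanning_tree and expand_query, and the toy documents, are hypothetical.

```python
import math
from itertools import combinations


def emim(docs, a, b):
    """Expected mutual information between the presence/absence of terms a and b.

    Assumption: probabilities are estimated as raw document frequencies.
    """
    n = len(docs)
    score = 0.0
    for xa in (True, False):
        for xb in (True, False):
            joint = sum(1 for d in docs if (a in d) == xa and (b in d) == xb) / n
            pa = sum(1 for d in docs if (a in d) == xa) / n
            pb = sum(1 for d in docs if (b in d) == xb) / n
            if joint > 0:  # zero-probability cells contribute nothing
                score += joint * math.log(joint / (pa * pb))
    return score


def maximum_spanning_tree(terms, weight):
    """Kruskal's algorithm over all term pairs, heaviest edges first."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the component root, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    edges = sorted(combinations(sorted(terms), 2),
                   key=lambda e: weight(*e), reverse=True)
    tree = []
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two components, so keep it
            parent[ra] = rb
            tree.append((a, b))
    return tree


def expand_query(query, tree):
    """Add every term directly linked to a query term in the tree."""
    expanded = set(query)
    for a, b in tree:
        if a in query:
            expanded.add(b)
        if b in query:
            expanded.add(a)
    return expanded


# Toy collection (hypothetical): each document is a set of index terms.
docs = [
    {"retrieval", "feedback", "query"},
    {"retrieval", "feedback"},
    {"query", "expansion"},
    {"expansion", "tree"},
    {"tree", "query"},
]
terms = set().union(*docs)
tree = maximum_spanning_tree(terms, lambda a, b: emim(docs, a, b))
expanded = expand_query({"retrieval"}, tree)
```

In this toy collection "retrieval" and "feedback" always co-occur, so they are linked in the tree and "feedback" is added to a query containing "retrieval". Swapping the weight function for another association measure changes only the argument passed to maximum_spanning_tree.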
Smeaton and Van Rijsbergen [SVR83] investigated query expansion and term reweighting using term
dependence. Their investigation centred around three methods of query expansion: the MST approach
of Van Rijsbergen et al., a Nearest Neighbours (NN) approach (which added the terms that were statistically
most similar to a query term) and query expansion by a list of possible expansion terms from the
relevant documents. The third technique, expansion with terms from relevant documents, is similar to
the term independence approaches outlined in section 2. The results from these experiments were
largely negative. Query expansion via the MST generally degraded performance over the unexpanded
query, as did expansion via the NN or expansion terms chosen from the relevant documents. One
striking feature was that the performance degradation increased as the number of terms added to the
query increased. Smeaton and Van Rijsbergen point to the difficulty in estimating probabilities as the
main reason for this failure.
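The NN approach can be sketched as follows. The similarity measure used in [SVR83] is not specified here, so this sketch assumes the Dice coefficient over document co-occurrence; the names dice and nn_expand, the parameter k and the toy documents are hypothetical.

```python
def dice(docs, a, b):
    """Dice coefficient between terms a and b (an assumed similarity measure)."""
    na = sum(1 for d in docs if a in d)
    nb = sum(1 for d in docs if b in d)
    nab = sum(1 for d in docs if a in d and b in d)
    return 2 * nab / (na + nb) if na + nb else 0.0


def nn_expand(query, docs, k=1):
    """Add the k terms most similar to each query term."""
    vocab = set().union(*docs)
    expanded = set(query)
    for q in query:
        # Rank by descending similarity; ties are broken alphabetically.
        candidates = sorted(vocab - {q}, key=lambda t: (-dice(docs, q, t), t))
        expanded.update(candidates[:k])
    return expanded


# Toy collection (hypothetical): each document is a set of index terms.
docs = [
    {"retrieval", "feedback", "query"},
    {"retrieval", "feedback"},
    {"query", "expansion"},
    {"expansion", "tree"},
    {"tree", "query"},
]
one_neighbour = nn_expand({"retrieval"}, docs, k=1)
two_neighbours = nn_expand({"retrieval"}, docs, k=2)
```

The parameter k controls how many terms are added per query term, which is the quantity that the degradation reported above was observed to grow with.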
In [Bha92] Bhatia also presented a model of dependence trees for query expansion to incorporate user
specific information. Bhatia suggests that the dependence tree approach can be improved not only by
being more selective about which terms appear in the tree but also by weighting the links between elements
in the tree according to user preference. The claim is that although spanning trees can suggest
expansion terms based on statistical similarity, they do not suggest them based on conceptual similarity.
The solution presented is to elicit from the user what concepts are present in documents and how they
relate to each other (how similar or dissimilar they are). This can be used to develop a new spanning tree that
more accurately reflects the user's personal constructs based on concepts rather than explicitly
mentioned terms. A spanning, or dependence, tree would have to be constructed for each user, but the
argument is that it would better support the user's searching and choice of terms.