[Figure: the example text "Interactive query expansion modifies queries using terms from a user. Automatic query expansion expands queries automatically." shown at five indexing stages: (a) Document text, (b) Tokenisation, (c) Stopword removal, (d) Stemming, (e) Term weighting]
Figure 1: Indexing a document
Once the document text has been tokenised, it is necessary to decide which terms should be used to
represent the document. That is, we need to decide which descriptors are useful for the joint role of
describing the document's content and discriminating the document from the other documents in the
collection. Very high frequency terms, ones that appear in a high proportion of the documents in the
collection, tend not to be effective either in discriminating between documents or in representing
documents.
There are two main reasons for this. The first is that, for the majority of realistic user queries, the
number of documents that are relevant to a query is likely to be a small proportion of the collection. A
term that will be effective in separating the relevant documents from the non-relevant documents, then,
is likely to be a term that appears in a small number of documents. Therefore, high frequency terms are
likely to be poor at discriminating. The second reason is related to the notion of information content.
Terms that can appear in many contexts, such as prepositions, are not generally regarded as
content-bearing words; they do not define a topic or sub-topic of a document. The more documents in which a
term appears (the more contexts in which it is used), the less likely it is to be a content-bearing
term. Consequently, it is less likely to be one of the terms that contribute to the user's
relevance assessment. Hence, terms that appear in many documents are less likely to be the ones used
by a searcher to discriminate between relevant and non-relevant documents.
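The quantity this argument rests on, the proportion of documents in which a term appears, can be sketched in a few lines. This is a minimal illustration only: the function name document_frequency, the whitespace tokenisation and the four toy documents are assumptions made for the example, not part of any system described here.

```python
from collections import Counter


def document_frequency(docs):
    """Return, for each term, the fraction of documents that contain it."""
    df = Counter()
    for doc in docs:
        # Use a set so a term counts once per document, however often it occurs.
        df.update(set(doc.lower().split()))
    n = len(docs)
    return {term: count / n for term, count in df.items()}


# Hypothetical toy collection (assumed data, for illustration only).
docs = [
    "the hill walk in the north",
    "the query expansion of the system",
    "a walk in the park",
    "the user enters a query",
]
df = document_frequency(docs)

# List terms from most to least widespread across the collection.
for term, fraction in sorted(df.items(), key=lambda item: (-item[1], item[0])):
    print(f"{term:10s} {fraction:.2f}")
```

In this toy collection "the" appears in every document and so cannot separate any document from any other, whereas "hill" appears in only one and picks it out exactly.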
A common indexing stage is, then, to remove all terms which appear commonly in the document
collection and which will not aid retrieval of relevant material (stopword removal, Figure 1c). The
list of terms to be removed is known as a stop list; a stop list can be either generic, that is,
applicable to most collections, e.g. [VR79], or created specifically for an individual
collection. A term does not have to appear in the majority of documents to be considered a stop term.
For example, in [CRS+95] the removal of all terms that appeared in more than 5% of documents did
not significantly degrade retrieval performance in a standard IR system.
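Both kinds of stop list can be sketched as follows. This is a hedged illustration under stated assumptions: GENERIC_STOP_LIST is a tiny made-up list (not the list of [VR79]), the function names and the max_doc_fraction parameter are hypothetical, and the toy collection is invented. A setting of 0.05 would correspond to the 5% cut-off reported in [CRS+95]; the example uses 0.5 only because the collection has four documents.

```python
from collections import Counter

# Illustrative generic stop list (an assumption, NOT the [VR79] list).
GENERIC_STOP_LIST = {"a", "the", "of", "in", "to", "from", "using"}


def build_collection_stop_list(tokenised_docs, max_doc_fraction):
    """Collect terms that appear in more than max_doc_fraction of the documents."""
    df = Counter()
    for tokens in tokenised_docs:
        df.update(set(tokens))  # count each term once per document
    n = len(tokenised_docs)
    return {term for term, count in df.items() if count / n > max_doc_fraction}


def remove_stopwords(tokens, stop_list):
    """Drop every token found in the stop list, preserving the original order."""
    return [t for t in tokens if t not in stop_list]


# Hypothetical tokenised collection (assumed data, for illustration only).
docs = [
    "interactive query expansion modifies queries using terms from a user".split(),
    "automatic query expansion expands queries automatically".split(),
    "a user enters a query".split(),
    "hill walks".split(),
]

collection_stop_list = build_collection_stop_list(docs, max_doc_fraction=0.5)
stop_list = GENERIC_STOP_LIST | collection_stop_list
indexed = remove_stopwords(docs[0], stop_list)
print(sorted(collection_stop_list))
print(indexed)
```

Here "query" ends up on the collection-specific list because it occurs in three of the four documents, which shows how a term that is content-bearing in general can still be a poor discriminator within a particular collection.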
Terms may appear as linguistic variants of the same word. In the example in Figure 1, the terms
"queries" and "query" are the plural and singular of the same object, and the terms "expansion" and "expand"
refer fundamentally to the same activity. As most IR systems rely on functions that match terms (see
section 2.2) to retrieve documents, this variation in word use could cause problems for the user. For
example, if a user enters the query "hill walks", an IR system will retrieve all documents that contain
the term "walks" but not documents containing "hill walking", "hill walk" or "hill walker", any of which may
contain relevant information. To avoid the user having to instantiate every possible variation of each