(a) Document text: Interactive query expansion modifies queries using terms from a user. Automatic query expansion expands queries automatically.
(b) Tokenisation: interactive query expansion modifies queries using terms from a user automatic query expansion expands queries automatically
(c) Stopword removal: interactive query expansion modifies queries terms automatic query expansion expands queries automatically
(d) Stemming: interact queri expan modifi queri term automat queri expan expand queri automat
(e) Term weighting: automat 28, expan 28, expand 17, interact 17, modifi 17, queri 41, term 17
Figure 1: Indexing a document
Once the document text has been tokenised it is necessary to decide which terms should be used to 
represent the documents. That is, we need to decide which descriptors are useful for the joint role of 
describing the document's content and discriminating the document from the other documents in the 
collection. Very high frequency terms, ones that appear in a high proportion of the documents in the 
collection, tend not to be effective either in discriminating between documents or in representing 
documents.  
There are two main reasons for this. The first is that, for the majority of realistic user queries, the 
number of documents that are relevant to a query is likely to be a small proportion of the collection. A 
term that will be effective in separating the relevant documents from the non-relevant documents, then, 
is likely to be a term that appears in a small number of documents. Therefore high frequency terms are 
likely to be poor at discriminating. The second reason is related to the notion of information content. 
Terms that can appear in many contexts, such as prepositions, are not generally regarded as 
content-bearing words; they do not define a topic or sub-topic of a document. The more documents in 
which a term appears (the more contexts in which it is used), the less likely it is to be a 
content-bearing term. Consequently it is less likely that the term is one of those terms that contribute 
to the user's relevance assessment. Hence, terms that appear in many documents are less likely to be 
the ones used by a searcher to discriminate between relevant and non-relevant documents. 
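The first argument can be made concrete with a minimal sketch that computes, for each term, the proportion of documents in which it appears. The three-document collection and the function name document_proportion are illustrative assumptions, not taken from any system discussed here.

```python
from collections import Counter

def document_proportion(docs):
    """Map each term to the fraction of documents that contain it."""
    df = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs.
        df.update(set(doc.lower().split()))
    return {term: count / len(docs) for term, count in df.items()}

# Toy collection (hypothetical).
docs = [
    "the query is expanded by the user",
    "the system ranks the documents",
    "the stemmer conflates the variants",
]
props = document_proportion(docs)
print(props["the"])    # "the" occurs in every document
print(props["query"])  # "query" occurs in only one document
```

A term such as "the" cannot separate any document from the others, whereas a term found in a single document picks out exactly that document.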
A common indexing stage is, then, to remove all terms which appear commonly in the document 
collection and which will not aid retrieval of relevant material (Stopword removal, Figure 1c). The 
list of terms to be removed is known as a stop list; stop lists can either be generic lists, ones that can be 
applied to most collections, e.g. [VR79], or lists that are specifically created for an individual 
collection. A term does not have to appear in the majority of documents to be considered a stop term. 
For example, in [CRS+95] the removal of all terms that appeared in more than 5% of documents did 
not significantly degrade retrieval performance in a standard IR system.  
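The two kinds of stop list can be sketched as follows. This is a minimal illustration under stated assumptions: GENERIC_STOP_LIST is a tiny hypothetical list (not the list in [VR79]), and build_stop_list with its threshold parameter is an assumed name. The 5% figure reported in [CRS+95] would correspond to threshold=0.05 on a realistically sized collection; the toy collection below needs a much higher value.

```python
from collections import Counter

# Tiny hypothetical generic stop list; real lists are much longer.
GENERIC_STOP_LIST = {"a", "the", "of", "from", "by", "is"}

def build_stop_list(tokenised_docs, threshold):
    """Collection-specific stop list: terms in more than `threshold` of documents."""
    df = Counter()
    for tokens in tokenised_docs:
        df.update(set(tokens))  # document frequency, not term frequency
    n = len(tokenised_docs)
    return {term for term, count in df.items() if count / n > threshold}

def remove_stopwords(tokens, stop_list):
    """Drop every token that is on the stop list, preserving order."""
    return [t for t in tokens if t not in stop_list]

docs = [
    "query expansion from a user".split(),
    "automatic query expansion".split(),
    "stemming of query terms".split(),
    "term weighting by frequency".split(),
]
# On a toy collection a high threshold is needed; 0.5 is illustrative only.
specific = build_stop_list(docs, threshold=0.5)
print(specific)
print(remove_stopwords(docs[0], GENERIC_STOP_LIST | specific))
```

In this toy collection the term "query" appears in three of the four documents and so lands on the collection-specific list, even though it would not appear on a generic one.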
Terms may appear as linguistic variants of the same word, e.g. in the example in Figure 1, the terms 
"queries" and "query" are the plural and singular of the same object and the terms "expansion" and "expand" 
refer fundamentally to the same activity. As most IR systems rely on functions that match terms (see 
section 2.2) to retrieve documents, this variation in word use could cause problems for the user. For 
example, if a user enters a query "hill walks" then an IR system will retrieve all documents that contain 
the term "walks" but not documents containing "hill walking", "hill walk" or "hill walker", any of which may 
contain relevant information. To avoid the user having to instantiate every possible variation of each 