query term, many indexing systems reduce terms to their root variant, a process known as stemming 
[Por80] (Stemming, Figure 1d).²
The result of the indexing process so far is a list of low- to medium-frequency terms that represent the 
information content of the document and help discriminate it from other documents. This 
information can be stored in a file covering the whole document collection, known 
as an inverted file (Figure 2). In this file each line holds the information on one of the terms in the 
collection; in this example we have the term (automat), followed by a series of document identifiers. 
automat    1 2 3 ....
expan      1 4 6 ....
expansion  1 17 46 ....
...
Figure 2: Inverted file with no term weights 
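To make the structure of Figure 2 concrete, the following is a minimal sketch of how such an inverted file might be built in memory. The function name `build_inverted_file` and the toy collection are our own assumptions for illustration; the input is assumed to have already been through stopword removal and stemming.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted ids of the documents that contain it.

    docs: dict of document id -> list of stemmed, stopword-free terms.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # One entry per term, terms in alphabetical order as in Figure 2.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical toy collection, not the collection behind Figure 2.
docs = {
    1: ["automat", "expan", "expansion"],
    2: ["automat"],
    3: ["automat", "queri"],
    4: ["expan"],
}
inverted = build_inverted_file(docs)
for term, ids in inverted.items():
    print(term, *ids)
```

Each printed line has the same shape as a line of Figure 2: a term followed by the identifiers of the documents in which it occurs.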
The final stage in most IR indexing applications is to weight each term according to its importance, 
either in the collection, in the individual documents, or some combination of both (Term Weighting, 
Figure 1e). Two common weighting measures are inverse document frequency (idf) [SJ72] and term 
frequency (tf) [Har92a]. idf (sometimes referred to as inverse collection frequency) weights a 
term according to the inverse of its frequency in the document collection: the more documents in which 
the term appears, the lower the idf value it receives (Equation 1). The idf weighting function thus assigns 
high weights to terms that have high discriminatory power in the document collection. 
\[ \mathrm{idf}(t) = \ln \frac{N}{n} \]
Equation 1: Inverse document frequency 
where   N = number of documents in the collection 
        n = number of documents containing the term t 
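As an illustration of Equation 1, the sketch below computes idf for a few document frequencies. The collection size and the frequencies are hypothetical values chosen only to show how the weight falls as a term becomes more widespread; the input check is our own addition.

```python
import math

def idf(N, n):
    """Equation 1: natural log of collection size over document frequency."""
    if not 0 < n <= N:
        raise ValueError("n must satisfy 0 < n <= N")
    return math.log(N / n)

N = 1000                   # hypothetical collection size
rare = idf(N, 10)          # term in 10 documents: ln(100), about 4.605
common = idf(N, 500)       # term in half the collection: ln(2), about 0.693
everywhere = idf(N, N)     # term in every document: ln(1) = 0.0
print(round(rare, 3), round(common, 3), everywhere)
```

A term that appears in every document receives a weight of zero, which matches the intuition that it cannot discriminate between documents.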
Term frequency (tf) measures (see [Har92a] for an overview) assign larger weights to terms that 
appear more frequently within an individual document. Unlike the idf value, the tf value of a term 
depends on the document in which it appears (Equation 2): the weight grows with the number of 
occurrences of the term in the document and is normalised by the length of the document. 
\[ \mathrm{tf}_d(t) = \frac{\ln(\mathit{occs}_t)}{\ln(\mathit{length}_d)} \]
Equation 2: Term frequency 
where   length_d = the number of terms in document d 
        occs_t = number of occurrences of term t in document d 
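Equation 2 can be sketched in the same way. The counts below are hypothetical, and the input checks are our own addition: they guard against a zero denominator for one-term documents. Note that, with the formula as given here, a term occurring exactly once receives a weight of zero, since ln(1) = 0.

```python
import math

def tf(occs, length):
    """Equation 2: ln(occurrences of t in d) / ln(number of terms in d)."""
    if length < 2 or not 1 <= occs <= length:
        raise ValueError("need length >= 2 and 1 <= occs <= length")
    return math.log(occs) / math.log(length)

# Hypothetical counts, chosen only to show the behaviour of the formula.
once = tf(1, 100)          # ln(1) = 0, so a single occurrence scores 0.0
frequent = tf(10, 100)     # ln(10) / ln(100), about 0.5
saturated = tf(100, 100)   # every term in d is t, so the score is 1.0
```

Because both counts are taken through a logarithm, the weight rises slowly with repeated occurrences and stays between 0 and 1 for any document of two or more terms.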
Term weighting information can also be included in the inverted file; in Figure 3 we have the term 
(automat), its idf value (36), followed by a series of tuples of the form <document identifier, tf value>. 
automat    36  <1, 28> <2, 14> <3, 28> ....
expan      14  <1, 28> <4, 15> <6, 29> ....
expansion  11  <1, 17> ...
...
Figure 3: Inverted file with idf and tf weights 
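A weighted inverted file of the kind shown in Figure 3 might be assembled as in the sketch below, which applies Equations 1 and 2 to a hypothetical toy collection. The function name and the data are assumptions for illustration. The integer weights in Figure 3 suggest some scaling that is not described here, so this sketch keeps the raw floating-point values.

```python
import math
from collections import Counter

def build_weighted_inverted_file(docs):
    """Return {term: (idf, [(doc_id, tf), ...])} using Equations 1 and 2."""
    N = len(docs)
    postings = {}
    for doc_id in sorted(docs):
        terms = docs[doc_id]
        length = len(terms)
        for term, occs in Counter(terms).items():
            # Equation 2. A one-term document would divide by ln(1) = 0,
            # so this sketch gives it a tf of 0.0 (an assumption).
            tf = math.log(occs) / math.log(length) if length > 1 else 0.0
            postings.setdefault(term, []).append((doc_id, tf))
    # Equation 1: n is the number of documents in the term's posting list.
    return {
        term: (math.log(N / len(entries)), entries)
        for term, entries in sorted(postings.items())
    }

# Hypothetical toy collection of stemmed, stopword-free documents.
docs = {
    1: ["automat", "automat", "expan", "queri"],
    2: ["automat", "index"],
    3: ["expan", "expan", "expan", "expan"],
}
weighted = build_weighted_inverted_file(docs)
for term, (idf_value, entries) in weighted.items():
    print(term, round(idf_value, 2), entries)
```

Storing the idf value once per term and the tf value once per posting mirrors the layout of Figure 3: idf depends only on the collection, whereas tf depends on the individual document.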
Some kind of inverted file forms the main data structure of most IR systems, and its use means that 
the IR system can easily detect which documents contain which query terms. Stopword removal and 
stemming reduce the size of the inverted file and increase the efficiency of the system.  
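The lookup that the inverted file makes easy can be sketched as follows. The function name `matching_documents` is hypothetical, the postings reproduce only the visible part of Figure 2 (the elided entries are left out), and "queri" is an invented query term that does not occur in the file.

```python
def matching_documents(inverted_file, query_terms):
    """Return {doc_id: set of query terms found in that document}."""
    matches = {}
    for term in query_terms:
        # A term missing from the inverted file simply matches no document.
        for doc_id in inverted_file.get(term, []):
            matches.setdefault(doc_id, set()).add(term)
    return matches

# Postings follow the visible part of Figure 2; "...." elisions are omitted.
inverted_file = {
    "automat": [1, 2, 3],
    "expan": [1, 4, 6],
}
result = matching_documents(inverted_file, ["automat", "expan", "queri"])
for doc_id in sorted(result):
    print(doc_id, sorted(result[doc_id]))
```

Only the posting lists of the query terms are read, so the cost of the lookup depends on how many documents contain those terms and not on the size of the whole collection.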
                                                           
² We shall continue to refer to stemmed terms as terms for ease of description. 