query term, many indexing systems reduce terms to their root variant, a process known as stemming
[Por80] (Stemming, Figure 1d).²
The result of the indexing process, so far, is a list of low to medium frequency terms that represent the
information content of the document and help discriminate the document from other documents. This
information can be stored in a single file covering the whole document collection, known as an
inverted file (Figure 2). In this file each line holds the information for one of the terms in the
collection; in this example we have the term (automat), followed by the identifiers of the documents
that contain it.
automat    1 2 3 ....
expan      1 4 6 ....
expansion  1 17 46 ....
...
Figure 2: Inverted file with no term weights
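The structure in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_inverted_file and the toy collection are hypothetical, and the terms are assumed to have already been stopword-filtered and stemmed.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the sorted identifiers of the documents containing it.

    `docs` maps a document identifier to its list of index terms, assumed to
    be already stopword-filtered and stemmed.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sort terms and identifiers so the output resembles the lines of Figure 2.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical toy collection, not data from the paper.
docs = {
    1: ["automat", "expan", "queri"],
    2: ["automat", "index"],
    3: ["automat", "queri"],
    4: ["expan", "index"],
}

inverted = build_inverted_file(docs)
for term, ids in inverted.items():
    print(term, *ids)
```

Each printed line has the same shape as a line of Figure 2: a term followed by the identifiers of the documents in which it occurs.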
The final stage in most IR indexing applications is to weight each term according to its importance,
either in the collection, in the individual documents or some combination of both, (Term Weighting,
Figure 1e). Two common weighting measures are inverse document frequency (idf) [SJ72] and term
frequency (tf) [Har92a]. idf (sometimes referred to as inverse collection frequency) weights a
term according to the inverse of its frequency in the document collection: the more documents in which
the term appears, the lower the idf value it receives (Equation 1). The idf weighting function, then, assigns
high weights to terms that have a high discriminatory power in the document collection.
\[ \mathrm{idf}(t) = \ln \frac{N}{n} \]
Equation 1: Inverse document frequency
where N = number of documents in the collection
n = number of documents containing the term t
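As a small worked illustration of Equation 1, the following sketch computes idf for a rare term, a common term and a term that occurs in every document. The function name idf and the collection size of 1000 are hypothetical choices made only for this example.

```python
import math


def idf(n_docs_total, n_docs_with_term):
    """Equation 1: idf(t) = ln(N / n)."""
    if not 0 < n_docs_with_term <= n_docs_total:
        raise ValueError("n must satisfy 0 < n <= N")
    return math.log(n_docs_total / n_docs_with_term)


N = 1000                  # hypothetical collection size
rare = idf(N, 10)         # term found in 10 documents
common = idf(N, 900)      # term found in 900 documents
everywhere = idf(N, N)    # term found in every document

print(rare, common, everywhere)
```

The rare term receives the largest weight, and a term that appears in every document receives a weight of zero, since ln(N / N) = ln(1) = 0: such a term cannot discriminate between documents.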
Term frequency (tf) measures (see [Har92a] for an overview) assign larger weights to terms that
appear more frequently within an individual document. Unlike the idf value, the tf value of a term
depends on the document in which it appears (Equation 2).
\[ \mathrm{tf}_d(t) = \frac{\ln(\mathrm{occs}_t)}{\ln(\mathrm{length}_d)} \]
Equation 2: Term frequency
where length_d = the number of terms in document d
occs_t = number of occurrences of term t in document d
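Equation 2 can be illustrated with a short sketch. The function name tf and the example document are hypothetical. Note that, taking the formula as written, a term that occurs exactly once receives a weight of ln(1) = 0, and the ratio is undefined for a one-term document because ln(1) appears in the denominator; returning 0.0 in the undefined cases is an assumption made here, not something the text specifies.

```python
import math
from collections import Counter


def tf(term, doc_terms):
    """Equation 2: tf_d(t) = ln(occs_t) / ln(length_d)."""
    occs = Counter(doc_terms)[term]
    length = len(doc_terms)
    # Assumption: return 0.0 where the formula is undefined (term absent,
    # or a document with fewer than two terms so that ln(length_d) = 0).
    if occs == 0 or length < 2:
        return 0.0
    return math.log(occs) / math.log(length)


# Hypothetical document of 16 stemmed terms.
doc = ["automat"] * 4 + ["expan"] * 2 + ["queri"] + ["index"] * 9

for term in ("automat", "expan", "queri", "index"):
    print(term, round(tf(term, doc), 3))
```

In this toy document the most frequent term (index) receives the highest weight, while the term that occurs only once (queri) receives zero.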
Term weighting information can also be included in the inverted file; in Figure 3 we have the term
(automat), its idf value (36), followed by a series of tuples of the form <document identifier, tf value>.
automat    36  <1, 28> <2, 14> <3, 28> ....
expan      14  <1, 28> <4, 15> <6, 29> ....
expansion  11  <1, 17> ...
...
Figure 3: Inverted file with idf and tf weights
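A weighted inverted file of the kind shown in Figure 3 might be assembled as in the following sketch, which applies Equation 1 and Equation 2 to a toy collection. The function name build_weighted_inverted_file and the data are hypothetical. The weights are left as raw floating-point values; how the integer weights shown in Figure 3 were scaled is not stated, so no scaling is attempted here.

```python
import math
from collections import Counter


def build_weighted_inverted_file(docs):
    """Map each term to (idf, [(doc_id, tf), ...]) using Equations 1 and 2."""
    n_docs = len(docs)
    postings = {}
    for doc_id, terms in sorted(docs.items()):
        length = len(terms)
        for term, occs in Counter(terms).items():
            # Equation 2; assumption: weight 0.0 when ln(length_d) would be 0.
            weight = math.log(occs) / math.log(length) if length > 1 else 0.0
            postings.setdefault(term, []).append((doc_id, weight))
    # Equation 1: n is the number of documents in the term's posting list.
    return {
        term: (math.log(n_docs / len(entries)), entries)
        for term, entries in sorted(postings.items())
    }


# Hypothetical toy collection of stemmed terms.
docs = {
    1: ["automat", "automat", "expan", "queri"],
    2: ["automat", "index", "index", "index"],
    3: ["expan", "expan", "queri", "queri"],
}

weighted = build_weighted_inverted_file(docs)
for term, (idf_value, entries) in weighted.items():
    tuples = " ".join(f"<{doc_id}, {w:.2f}>" for doc_id, w in entries)
    print(f"{term} {idf_value:.2f} {tuples}")
```

The idf value is stored once per term because it depends only on the collection, whereas a tf value is stored per document, mirroring the layout of Figure 3.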
Some kind of inverted file will form the main data structure of most IR systems and its use means that
the IR system can easily detect which documents contain which query terms. Stopword removal and
stemming reduce the size of the inverted file and increase the efficiency of the system.
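To illustrate why the inverted file makes it easy to detect which documents contain which query terms, the sketch below performs a lookup over the posting lists for automat and expan from Figure 2. The function names and the query term retriev are hypothetical.

```python
def docs_per_term(inverted, query_terms):
    """Return, for each query term, the documents that contain it."""
    return {term: inverted.get(term, []) for term in query_terms}


def docs_containing_any(inverted, query_terms):
    """Return the sorted identifiers of documents containing any query term."""
    found = set()
    for term in query_terms:
        found.update(inverted.get(term, []))
    return sorted(found)


# Posting lists taken from Figure 2 (elided entries are left out).
inverted = {"automat": [1, 2, 3], "expan": [1, 4, 6]}

print(docs_per_term(inverted, ["expan", "retriev"]))
print(docs_containing_any(inverted, ["automat", "expan"]))
```

Each lookup touches only the lines of the file for the query terms, rather than scanning every document in the collection.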
² We shall continue to refer to stemmed terms as terms for ease of description.