Term Vector Theory and Keyword Weights

An Introductory Series on Term Vector Theory for Information Retrieval Students and Search Engine Marketers

Salton's Vector Space Model

IR systems assign weights to terms by considering

  1. local information from individual documents
  2. global information from the collection of documents

In addition, systems that assign weights to links use Web graph information to properly account for the degree of connectivity between documents.

In IR studies, the classic weighting scheme is the Salton Vector Space Model, commonly known as the "term vector model". This weighting scheme is given by

Equation 1: wi = tfi * log(D/dfi)

where

  • tfi = term frequency (term counts) or number of times a term i occurs in a document.
  • dfi = document frequency or number of documents containing term i
  • D = number of documents in the database.

Many models that extract term vectors from documents and queries are derived from Equation 1.
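As a minimal sketch, Equation 1 can be computed in a few lines of Python. The function name and the counts below are illustrative assumptions, and log base 10 is assumed; real IR systems define both tf and the log base differently.

```python
import math

def tfidf_weight(tf, df, D):
    """Term weight per Equation 1: wi = tfi * log(D / dfi).

    tf -- number of times term i occurs in the document
    df -- number of documents in the collection containing term i
    D  -- total number of documents in the collection
    Log base 10 is assumed here; systems vary on this choice.
    """
    return tf * math.log10(D / df)

# Hypothetical collection of D = 1000 documents; 10 of them contain
# the term, which occurs 3 times in the document being scored.
w = tfidf_weight(tf=3, df=10, D=1000)
print(w)  # 3 * log10(100) = 6.0
```

Note how the weight combines a local factor (tf) with a global factor (the log term): changing either the document or the collection changes the weight.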

Local Weights

Equation 1 shows that wi increases with tfi. This makes the model vulnerable to term repetition abuses (an adversarial practice known as keyword spamming). Given a query q

  1. for documents of equal lengths, those with more instances of q are favored during retrieval.
  2. for documents of different lengths, long documents are favored during retrieval since these tend to contain more instances of q.

Global Weights

In Equation 1 the log(D/dfi) term is known as the inverse document frequency (IDFi), a measure of the sheer volume of information associated with a term i within a set of documents. The ratio dfi/D is the probability of retrieving from D a document containing term i. In Equation 1 we simply invert this probability and take its log. The result is then premultiplied by tfi. Over the years, several modifications to Equation 1 have been proposed. The expression "a tf*idf model" is often reserved for a model using, or derived from, this equation.

Equation 1 shows that wi decreases as dfi increases [1 - 11]. For example, if in a 1000-document database only 10 documents contain the term "pet", the IDF for this term is IDF = log(1000/10) = 2. However, if only one document contains the term, IDF = log(1000/1) = 3.

Thus, terms which appear in too many documents (e.g., stopwords and other very frequent terms) receive a low weight, while uncommon terms which appear in few documents receive a high weight. This makes sense, since very common terms (e.g., "a", "the", "of", etc.) are not very useful for distinguishing a relevant document from a non-relevant one. Neither extreme is recommended in routine retrieval work. Terms with acceptable weights are those that are neither too common nor too rare; i.e., their term vectors are neither too far from nor too close to the query vector.

Note. In a vector space representation, when uncommon terms are found in documents and queries, the corresponding term vectors (document and query vectors) end up too close to each other. After scoring and sorting results, the system tends to rank these documents very high while returning few search results. This tells us that absolute ranking results derived from these term vectors are not always good discriminators of relevancy. In plain English, being #10 out of 5,000,000 results is not the same as being #1 out of 5 results.
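In Salton's model, closeness between vectors is conventionally measured with the cosine of the angle between them. The sketch below, with hypothetical weights, shows the situation described in the note: a query whose weight is concentrated on one rare term ends up nearly parallel to the lone matching document's vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-term vocabulary; the third term is rare and heavily
# weighted. Query and document vectors point in almost the same direction.
query = [0.0, 0.0, 3.0]
doc   = [0.1, 0.0, 2.9]
print(cosine(query, doc))  # very close to 1.0
```

A similarity near 1.0 puts the document at the top of the ranking, even though so few documents match that the ranking says little about relevancy.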

Keyword Density Values

From Equation 1 it is evident that keyword weights are affected by

  1. local term counts
  2. the sheer volume of documents in the database.

Therefore, the popular notion that term weights are or can be estimated with "keyword density values" is quite misleading.

Keyword density is defined as

Equation 2: KDi = tfi / Li

where, as given in Equation 1, tfi = number of times a term i occurs in a document, and Li = total number of terms in that document. That is, keyword density is just a local word ratio. This ratio expresses the "concentration" of terms in a document. Thus, the keyword density of a 500-word document that repeats the term "pet" 5 times is KD = 5/500 = 0.01 or 1%. Note that this value does not account for the contextuality (relative position) and distribution (relative dispersion) of terms in the document. These elements affect document relevancy and topic semantics.
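The "pet" example above reduces to a one-line computation; the sketch below reproduces it (the function name is an assumption of this illustration):

```python
def keyword_density(tf, length):
    """Equation 2: a purely local word ratio, KDi = tfi / Li.

    tf     -- number of times term i occurs in the document
    length -- total number of terms in the document
    """
    return tf / length

# A 500-word document repeating "pet" 5 times:
kd = keyword_density(tf=5, length=500)
print(kd)  # 0.01, i.e., 1%
```

Notice that, unlike Equation 1, nothing here refers to the collection: two documents in completely different databases can share the same density.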

Many search engine marketers (SEOs/SEMs) waste their time fine-tuning keyword density values with "Keyword Density Analyzer" tools. Some go to the extreme of computing localized values in page identifiers and descriptors (e.g., URLs, titles, paragraphs, etc.). Others propose keyword weighting schemes based on formulas created out of thin air. Still others claim that keeping documents within an "optimum" keyword density value affects the way search systems rank documents.

Keyword Density Failures

Equation 2 says nothing about the semantic weight of terms in relation to other terms, whether within a document or across a collection of documents. Frankly, SEOs/SEMs who spend their time adjusting keyword density values, going after keyword weight tricks or buying the latest "keyword density analyzer" are wasting their time and money.

According to Equation 2, a term k1 that is equally repeated in two different documents of the same length should have the same keyword density, regardless of document content or the nature of the database. However, if we assume that keyword density values are, or can be taken for, keyword weights, then we are

  1. not considering the sheer volume of information that the queried term retrieves.
  2. assigning term weights without regard for term relevancy.
  3. assigning weights without considering the nature of the queried database.

Points 1 - 3 contradict Salton's Model. According to Equation 1, term weights are not local word ratios disconnected from the queried database. Often, a term k1 equally repeated in two different documents of the same length (regardless of content) is weighted differently in the same queried database or in different databases.
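The contradiction can be made concrete with a sketch (hypothetical counts; log base 10 assumed, as elsewhere in this illustration). The same term count in two same-length documents yields one keyword density, yet two very different Equation 1 weights once dfi is taken into account:

```python
import math

def tfidf_weight(tf, df, D):
    # Equation 1: wi = tfi * log(D / dfi), log base 10 assumed
    return tf * math.log10(D / df)

tf, length = 5, 500          # same counts in both documents
kd = tf / length             # Equation 2: identical density, 0.01

# Same density, but the weights diverge with the database:
w_common = tfidf_weight(tf, df=500, D=1000)  # term in half the collection
w_rare   = tfidf_weight(tf, df=5,   D=1000)  # term in only 5 documents
print(kd)        # 0.01 in both cases
print(w_common)  # 5 * log10(2),   about 1.51
print(w_rare)    # 5 * log10(200), about 11.51
```

Keyword density cannot tell these two cases apart; a tf*idf weight can.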

Analyzing Illusions

If a search marketer wants to compute term weights, he/she may need to replicate the weighting scheme of the target system. But this is not an easy task, since:

  1. tf and IDF are defined differently across IR systems [1 - 11].
  2. if using Equation 1, he/she needs to know D, the total number of documents in the queried system, and dfi, the number of documents containing the queried term.
  3. the number of documents containing the queried term is not necessarily the same as the number of documents retrieved.
  4. IR systems and search engines do not publish their working schemes.
  5. the target system may not use Salton's Term Vector Model at all.
  6. the target system may use a variant of Salton's Term Vector Model combined with other scoring schemes (e.g., Google, Yahoo and MSN).

To conclude, keyword density values should not be mistaken for term weights. As local word ratios, they are not good discriminators of relevancy.

Acknowledgements

The author thanks the following authority sites for referencing this series of articles

Next: The Term Count Model
References
  1. Salton, G. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  2. Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.
  3. Ackerman, R. Vector Model Information Retrieval. 2003.
  4. Vector Model of Text Retrieval. School of Library and Information Science, University of Iowa.
  5. Larson, R. and Hearst, M. Term Weighting and Ranking Algorithms. SIMS 202: Information Organization and Retrieval, Lecture 17, University of California, Berkeley, 1998.
  6. Goffinet, L. and Noirhomme-Fraiture, M. Automatic Hypertext Link Generation based on Similarity Measures between Documents. Institut d'Informatique, FUNDP.
  7. Stata, R., Bharat, K. and Maghoul, F. The Term Vector Database: Fast Access to Indexing Terms for Web Pages. W3C Conference, 2000.
  8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J. Graph Structure in the Web. W3C Conference, 2000.
  9. Mukherjea, S. WTMS: A System for Collecting and Analyzing Topic-Specific Web Information. C&C Research Laboratories. W3C Conference, 2000.
  10. Henzinger, M. R., Heydon, A., Mitzenmacher, M. and Najork, M. On Near-Uniform URL Sampling. W3C Conference, 2000.
  11. Rafiei, D. and Mendelzon, A. O. What is this Page Known for? Computing Web Page Reputation. W3C Conference, 2000.