Term Vector Theory and Keyword Weights

An Introductory Series on Term Vector Theory for Information Retrieval Students and Search Engine Marketers

Salton's Vector Space Model

IR systems assign weights to terms by considering

  1. local information from individual documents
  2. global information from the collection of documents

In addition, systems that assign weights to links use Web graph information to properly account for the degree of connectivity between documents.

In IR studies, the classic weighting scheme is the Salton Vector Space Model, commonly known as the "term vector model". This weighting scheme is given by

Equation 1: wi = tfi * log(D/dfi)

where

  • tfi = term frequency (term counts) or number of times a term i occurs in a document.
  • dfi = document frequency or number of documents containing term i
  • D = number of documents in the database.

Many models that extract term vectors from documents and queries are derived from Equation 1.
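As a minimal sketch, Equation 1 can be computed in a few lines of Python. The function name and the counts below are illustrative assumptions, and log base 10 is assumed; real IR systems define both tf and the log base differently.

```python
import math

def tfidf_weight(tf, df, D):
    """Term weight per Equation 1: wi = tfi * log(D / dfi).

    tf -- number of times term i occurs in the document
    df -- number of documents in the collection containing term i
    D  -- total number of documents in the collection
    Log base 10 is assumed here; systems vary on this choice.
    """
    return tf * math.log10(D / df)

# Hypothetical collection of D = 1000 documents; 10 of them contain
# the term, which occurs 3 times in the document being scored.
w = tfidf_weight(tf=3, df=10, D=1000)
print(w)  # 3 * log10(100) = 6.0
```

Note how the weight combines a local factor (tf) with a global factor (the log term): changing either the document or the collection changes the weight.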

Local Weights

Equation 1 shows that wi increases with tfi. This makes the model vulnerable to term repetition abuses (an adversarial practice known as keyword spamming). Given a query q

  1. for documents of equal lengths, those with more instances of q are favored during retrieval.
  2. for documents of different lengths, long documents are favored during retrieval since these tend to contain more instances of q.

Global Weights

In Equation 1 the log(D/dfi) term is known as the inverse document frequency (IDFi), a measure of the sheer volume of information associated with a term i within a set of documents. The ratio dfi/D is the probability of retrieving from D a document containing term i. In Equation 1 we simply invert this probability and take its log. The result is then premultiplied by tfi. Over the years, several modifications to Equation 1 have been proposed. The expression "a tf*idf model" is often reserved for a model using, or derived from, this equation.

Equation 1 shows that wi decreases as dfi increases [1 - 11]. For example, if in a 1000-document database only 10 documents contain the term "pet", the IDF for this term is IDF = log(1000/10) = 2. However, if only one document contains the term, IDF = log(1000/1) = 3.

Thus, terms which appear in too many documents (e.g., stopwords and other very frequent terms) receive a low weight, while uncommon terms which appear in few documents receive a high weight. This makes sense, since very common terms (e.g., "a", "the", "of", etc.) are not very useful for distinguishing a relevant document from a non-relevant one. Neither extreme is recommended in routine retrieval work. Terms with acceptable weights are those that are neither too common nor too rare; i.e., their term vectors are neither too far from nor too close to the query vector.

Note. In a vector space representation, when uncommon terms are found in documents and queries, the corresponding term vectors (document and query vectors) end up too close to each other. After scoring and sorting results, the system tends to rank these documents very high while returning few search results. This tells us that absolute ranking results derived from these term vectors are not always good discriminators of relevancy. In plain English, being #10 out of 5,000,000 results is not the same as being #1 out of 5 results.
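In Salton's model, closeness between vectors is conventionally measured with the cosine of the angle between them. The sketch below, with hypothetical weights, shows the situation described in the note: a query whose weight is concentrated on one rare term ends up nearly parallel to the lone matching document's vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-term vocabulary; the third term is rare and heavily
# weighted. Query and document vectors point in almost the same direction.
query = [0.0, 0.0, 3.0]
doc   = [0.1, 0.0, 2.9]
print(cosine(query, doc))  # very close to 1.0
```

A similarity near 1.0 puts the document at the top of the ranking, even though so few documents match that the ranking says little about relevancy.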

Keyword Density Values

From Equation 1 it is evident that keyword weights are affected by

  1. local term counts
  2. the sheer volume of documents in the database.

Therefore, the popular notion that term weights are or can be estimated with "keyword density values" is quite misleading.

Keyword density is defined as

Equation 2: KDi = tfi / Li

where, as given in Equation 1, tfi = number of times a term i occurs in a document, and Li = total number of terms in that document. That is, keyword density is just a local word ratio. This ratio expresses the "concentration" of terms in a document. Thus, the keyword density of a 500-word document that repeats the term "pet" 5 times is KD = 5/500 = 0.01 or 1%. Note that this value does not account for the contextuality (relative position) and distribution (relative dispersion) of terms in the document. These elements affect document relevancy and topic semantics.
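The "pet" example above reduces to a one-line computation; the sketch below reproduces it (the function name is an assumption of this illustration):

```python
def keyword_density(tf, length):
    """Equation 2: a purely local word ratio, KDi = tfi / Li.

    tf     -- number of times term i occurs in the document
    length -- total number of terms in the document
    """
    return tf / length

# A 500-word document repeating "pet" 5 times:
kd = keyword_density(tf=5, length=500)
print(kd)  # 0.01, i.e., 1%
```

Notice that, unlike Equation 1, nothing here refers to the collection: two documents in completely different databases can share the same density.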

Many search engine marketers (SEOs/SEMs) waste their time fine-tuning keyword density values with "Keyword Density Analyzer" tools. Some go to the extreme of computing localized values in page identifiers and descriptors (e.g., URLs, titles, paragraphs, etc.). Others propose keyword weighting schemes based on formulas created out of thin air. Still others claim that keeping documents within an "optimum" keyword density value affects the way search systems rank documents.

Keyword Density Failures

Equation 2 says nothing about the semantic weight of terms in relation to other terms, whether within a document or across a collection of documents. Frankly, SEOs/SEMs who spend their time adjusting keyword density values, going after keyword weight tricks or buying the latest "keyword density analyzer" are wasting their time and money.

According to Equation 2, a term k1 that is equally repeated in two different documents of the same length should have the same keyword density, regardless of document content or the nature of the database. However, if we assume that keyword density values are, or can be taken for, keyword weights, then we are

  1. not considering the sheer volume of information that the queried term retrieves.
  2. assigning term weights without regard for term relevancy.
  3. assigning weights without considering the nature of the queried database.

Points 1 - 3 contradict Salton's Model. According to Equation 1, term weights are not local word ratios disconnected from the queried database. Often, a term k1 equally repeated in two different documents of the same length (regardless of content) is weighted differently in the same queried database or in different databases.
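The contradiction can be made concrete with a sketch (hypothetical counts; log base 10 assumed, as elsewhere in this illustration). The same term count in two same-length documents yields one keyword density, yet two very different Equation 1 weights once dfi is taken into account:

```python
import math

def tfidf_weight(tf, df, D):
    # Equation 1: wi = tfi * log(D / dfi), log base 10 assumed
    return tf * math.log10(D / df)

tf, length = 5, 500          # same counts in both documents
kd = tf / length             # Equation 2: identical density, 0.01

# Same density, but the weights diverge with the database:
w_common = tfidf_weight(tf, df=500, D=1000)  # term in half the collection
w_rare   = tfidf_weight(tf, df=5,   D=1000)  # term in only 5 documents
print(kd)        # 0.01 in both cases
print(w_common)  # 5 * log10(2),   about 1.51
print(w_rare)    # 5 * log10(200), about 11.51
```

Keyword density cannot tell these two cases apart; a tf*idf weight can.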

Analyzing Illusions

If a search marketer wants to compute term weights, he/she may need to replicate the weighting scheme of the target system. But this is not an easy task, since:

  1. tf and IDF are defined differently across IR systems [1 - 11].
  2. if using Equation 1, he/she needs to know D, the total number of documents in the queried system, and dfi, the number of documents containing the queried term.
  3. the number of documents containing the queried term is not necessarily the same as the number of documents retrieved.
  4. IR systems and search engines do not publish their working schemes.
  5. the target system may not use Salton's Term Vector Model at all.
  6. the target system may use a variant of Salton's Term Vector Model combined with other scoring schemes (e.g., Google, Yahoo and MSN).

To conclude, keyword density values should not be mistaken for term weights. As local word ratios, they are not good discriminators of relevancy.

Acknowledgements

The author thanks the following authority sites for referencing this series of articles

Next: The Term Count Model
References
  1. Salton, G. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  2. Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.
  3. Ackerman, R. Vector Model Information Retrieval. 2003.
  4. Vector Model of Text Retrieval. School of Library and Information Science, University of Iowa.
  5. Larson, R. and Hearst, M. Term Weighting and Ranking Algorithms. SIMS 202: Information Organization and Retrieval, Lecture 17, University of California, Berkeley, 1998.
  6. Goffinet, L. and Noirhomme-Fraiture, M. Automatic Hypertext Link Generation based on Similarity Measures between Documents. Institut d'Informatique, FUNDP.
  7. Stata, R., Bharat, K. and Maghoul, F. The Term Vector Database: Fast Access to Indexing Terms for Web Pages. W3C Conference, 2000.
  8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J. Graph Structure in the Web. W3C Conference, 2000.
  9. Mukherjea, S. WTMS: A System for Collecting and Analyzing Topic-Specific Web Information. C&C Research Laboratories. W3C Conference, 2000.
  10. Henzinger, M. R., Heydon, A., Mitzenmacher, M. and Najork, M. On Near-Uniform URL Sampling. W3C Conference, 2000.
  11. Rafiei, D. and Mendelzon, A. O. What is this Page Known for? Computing Web Page Reputation. W3C Conference, 2000.