Topics
Salton's Vector Space Model
Local Weights
Global Weights
Keyword Density Values
Keyword Density Failures
Analyzing Illusions
Salton's Vector Space Model
IR systems assign weights to terms by considering
- local information from individual documents
- global information from the collection of documents
In addition, systems that assign weights to links use Web graph information to properly account for the degree of connectivity between documents.
In IR studies, the classic weighting scheme is the Salton Vector Space Model, commonly known as the "term vector model". This weighting scheme is given by
Equation 1: Term Weight = $w_i = tf_i \cdot \log\left(\frac{D}{df_i}\right)$
where
- tfi = term frequency (term counts) or number of times a term i occurs in a document.
- dfi = document frequency or number of documents containing term i
- D = number of documents in the database.
Many models that extract term vectors from documents and queries are derived from Equation 1.
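As a minimal sketch of Equation 1, the weight can be computed directly from the three quantities defined above. The function name `term_weight`, its parameter names, and the sample numbers are illustrative assumptions; a base-10 logarithm is assumed, and real IR systems often define tf and IDF differently.

```python
import math


def term_weight(tf: int, df: int, num_docs: int) -> float:
    """Equation 1: w_i = tf_i * log10(D / df_i).

    tf       -- number of times term i occurs in the document
    df       -- number of documents containing term i
    num_docs -- D, the number of documents in the database
    """
    if tf < 0 or df <= 0 or df > num_docs:
        raise ValueError("expected tf >= 0 and 0 < df <= num_docs")
    return tf * math.log10(num_docs / df)


# Hypothetical values: the term occurs 3 times in a document and
# appears in 10 of the 1000 documents in the database.
weight = term_weight(tf=3, df=10, num_docs=1000)
print(weight)  # 3 * log10(100) = 6.0
```

A term that does not occur in the document (tf = 0) gets a weight of zero, as does a term that occurs in every document (df = D).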
Local Weights
Equation 1 shows that wi increases with tfi. This makes the model vulnerable to term repetition abuses (an adversarial practice known as keyword spamming). Given a query q,
- for documents of equal length, those with more instances of q are favored during retrieval.
- for documents of different lengths, long documents are favored during retrieval since these tend to contain more instances of q.
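The two cases above can be illustrated with a small sketch. The three sample documents, the whitespace tokenizer, and the database figures (1000 documents, 10 of them containing the term) are made-up assumptions used only to show how raw term counts drive the weight in Equation 1.

```python
import math


def term_count(text: str, term: str) -> int:
    """Count occurrences of `term` using a naive lowercase whitespace split."""
    return text.lower().split().count(term.lower())


def term_weight(tf: int, df: int, num_docs: int) -> float:
    """Equation 1 with a base-10 logarithm (an assumption of this sketch)."""
    return tf * math.log10(num_docs / df)


NUM_DOCS, DOC_FREQ = 1000, 10  # hypothetical database statistics

docs = {
    "honest": "my pet likes the park",                      # 5 words
    "stuffed": "pet pet pet pet park",                      # 5 words, term repeated
    "long": "my pet likes the park the pet sleeps and "
            "the pet eats and the pet plays",               # 16 words
}

weights = {}
for name, text in docs.items():
    tf = term_count(text, "pet")
    weights[name] = term_weight(tf, DOC_FREQ, NUM_DOCS)
    print(name, len(text.split()), tf, weights[name])
```

Under these assumptions the stuffed document outweighs the honest one of the same length, and the long document reaches the same weight as the stuffed one simply because it has more room for the term.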
Global Weights
In Equation 1, the log(D/dfi) term is known as the inverse document frequency (IDFi), a measure of the sheer volume of information associated with a term i within a set of documents. The ratio dfi/D is the probability that a document picked at random from the D documents in the database contains term i. In Equation 1 we simply invert this probability and take its log. The result is then premultiplied by tfi. Over the years, several modifications to Equation 1 have been proposed. The expression "a tf*idf model" is often reserved for a model that uses, or is derived from, this equation.
Equation 1 shows that wi decreases as dfi increases [1 - 11]. For example, using base-10 logarithms, if in a 1000-document database only 10 documents contain the term "pet", the IDF for this term is IDF = log(1000/10) = 2. However, if only one document contains the term, IDF = log(1000/1) = 3.
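The two IDF values in this example can be checked with a few lines of code. The helper name `idf` is an assumption of this sketch, and the base-10 logarithm is inferred from the numbers in the example.

```python
import math


def idf(df: int, num_docs: int) -> float:
    """Inverse document frequency: log10(D / df_i)."""
    if df <= 0 or df > num_docs:
        raise ValueError("expected 0 < df <= num_docs")
    return math.log10(num_docs / df)


# The 1000-document database from the example above.
for df in (1, 10, 100, 1000):
    print(df, idf(df, 1000))
```

The extra rows (df = 100 and df = 1000) are added only to show the trend: the IDF falls as more documents contain the term and reaches zero when every document contains it.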
Thus, terms that appear in too many documents (e.g., stopwords, very frequent terms) receive a low weight, while uncommon terms that appear in few documents receive a high weight. This makes sense since overly common terms (e.g., "a", "the", "of", etc.) are not very useful for distinguishing a relevant document from a non-relevant one. Neither extreme is recommended in routine retrieval work. Terms with acceptable weights are those that are neither too common nor too rare; i.e., their term vectors are neither too far from nor too close to the query vector.
Note. In a vector space representation, when uncommon terms are found in documents and queries, the corresponding term vectors (document and query vectors) end up too close to each other. After scoring and sorting results, the system tends to rank these documents very high while returning few search results. This tells us that absolute ranking results derived from these term vectors are not always good discriminators of relevancy. In plain English, being #10 out of 5,000,000 results is not the same as being #1 out of 5 results.
Keyword Density Values
From Equation 1 it is evident that keyword weights are affected by
- local term counts
- the sheer volume of documents in the database.
Therefore, the popular notion that term weights are or can be estimated with "keyword density values" is quite misleading.
Keyword density is defined as
Equation 2: Keyword Density = $KD_i = \frac{tf_i}{L_i}$
where, as in Equation 1, tfi = number of times a term i occurs in a document and Li = total number of terms in that document. That is, keyword density is just a local word ratio. This ratio expresses the "concentration" of terms in a document. Thus, the keyword density of a 500-word document that repeats the term "pet" 5 times is KD = 5/500 = 0.01 or 1%. Note that this value does not account for contextuality (relative position) and distribution (relative dispersion) of terms in the document. These elements affect document relevancy and topic semantics.
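A short sketch makes the limitation concrete. The function name `keyword_density` and the two synthetic 500-word documents are assumptions for illustration; the word "filler" stands in for any other text.

```python
def keyword_density(words: list[str], term: str) -> float:
    """Equation 2: occurrences of `term` divided by the total number of words."""
    if not words:
        raise ValueError("document has no words")
    return words.count(term) / len(words)


# Two hypothetical 500-word documents, each containing "pet" 5 times.
clustered = ["pet"] * 5 + ["filler"] * 495    # all occurrences at the very start
dispersed = (["pet"] + ["filler"] * 99) * 5   # one occurrence every 100 words

kd_clustered = keyword_density(clustered, "pet")
kd_dispersed = keyword_density(dispersed, "pet")
print(kd_clustered, kd_dispersed)  # 0.01 0.01
```

Both documents get the same 1% value even though the term is placed very differently in each, which is exactly the positional and distributional information the ratio discards.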
Many search engine marketers (SEOs/SEMs) waste their time fine-tuning keyword density values with "Keyword Density Analyzer" tools. Some go to the extreme of computing localized values in page identifiers and descriptors (e.g., URLs, titles, paragraphs, etc.). Others propose keyword weighting schemes based on formulas created out of thin air. Still others claim that keeping documents within an "optimum" keyword density value affects the way search systems rank documents.
Keyword Density Failures
Equation 2 says nothing about the semantic weight of terms in relation to other terms, within a document or collection of documents. Frankly, SEOs/SEMs who spend their time adjusting keyword density values, going after keyword weight tricks, or buying the latest "keyword density analyzer" are wasting their time and money.
According to Equation 2, a term k1 that is repeated equally often in two different documents of the same length should have the same keyword density, regardless of document content or database nature. However, if we assume that keyword density values are or can be taken for keyword weights, then we are
- not considering the sheer volume of information that the queried term retrieves.
- assigning term weights without regard for term relevancy.
- assigning weights without considering the nature of the queried database.
These three points contradict Salton's Model. According to Equation 1, term weights are not local word ratios disconnected from the queried database. Often, a term k1 repeated equally often in two different documents of the same length (regardless of content) is weighted differently in the same queried database or in different databases.
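The contrast between the two equations can be sketched numerically for the different-databases case. The database names and their (D, df) figures below are invented for illustration, and a base-10 logarithm is assumed.

```python
import math

# A term repeated 5 times in a 500-word document (hypothetical values).
tf, doc_length = 5, 500
density = tf / doc_length  # Equation 2: identical wherever the document is indexed

# Hypothetical databases: name -> (D, df) for the same term.
databases = {
    "pet_forum": (1000, 900),     # term appears in almost every document
    "general_news": (1000, 10),   # term appears in few documents
}

weights = {}
for name, (num_docs, df) in databases.items():
    # Equation 1: the weight depends on the database through D and df.
    weights[name] = tf * math.log10(num_docs / df)
    print(name, density, round(weights[name], 4))
```

With these made-up figures the keyword density stays at 1% in both databases, while the Equation 1 weight is roughly 0.23 in the first and 10 in the second.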
Analyzing Illusions
If a search marketer wants to compute term weights, he/she may need to replicate the weighting scheme of the target system. However, this is not an easy task since:
- tf and IDF are defined differently across IR systems [1 - 11].
- if using Equation 1, he/she needs to know D, the total number of documents in the queried system, and dfi, the number of documents containing the queried term.
- the number of documents containing the queried term is not necessarily the same as the number of documents retrieved.
- IR systems and search engines do not publish their working schemes.
- the target system may not use Salton's Term Vector Model at all.
- the target system may use a variant of Salton's Term Vector Model combined with other scoring schemes (e.g., Google, Yahoo and MSN).
To conclude, keyword density values should not be taken for term weights. As local word ratios, they are not good discriminators of relevancy.
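Salton n. Salton (surname)
connectivity n. connectivity, the state of being connected
equation n. equation; equality, balance
vulnerable a. easily hurt, fragile, sensitive
repetition n. repetition, recurrence, something repeated; copy, duplicate
abuse n. misuse, mistreatment; bad practice v. to misuse; to mistreat, to harm
adversary n. opponent, rival
adversarial a. of an adversary, opposing, antagonistic
spam n. canned spiced ham; junk email
spamming n. sending unsolicited messages (email, ads, news, articles) in bulk
sheer a. utter, complete; very great (in amount or volume)
discriminator n. one that distinguishes; discriminator
density n. denseness, concentration; (physics, chemistry) density
evident a. obvious, clear
account for to explain, to give the reason for
contextuality n. the quality of depending on context
dispersion n. scattering, spreading; (statistics) deviation, spread
out of thin air from nothing; appearing suddenly
optimum a. most suitable; most favorable
contradict v. to refute, to deny the truth of; to conflict with, to be inconsistent with
illusion n. false impression, fantasy, mistaken idea; false appearance
variant n. variant; variation; modified form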