Term Vector Theory and Keyword Weights

tiny_ala

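Topics

Salton's Vector Space Model
Local Weights
Global Weights
Keyword Density Values
Keyword Density Failures
Analyzing Illusions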

Salton's Vector Space Model

IR systems assign weights to terms by considering

local information from individual documents
global information from the collection of documents

In addition, systems that assign weights to links use Web graph information to properly account for the degree of connectivity between documents.

In IR studies, the classic weighting scheme is the Salton Vector Space Model, commonly known as the "term vector model". This weighting scheme is given by

Equation 1: Term Weight: $w_i = tf_i \cdot \log(D / df_i)$

where

tf_i = term frequency (term counts) or number of times a term i occurs in a document.
df_i = document frequency or number of documents containing term i
D = number of documents in the database.

Many models that extract term vectors from documents and queries are derived from Equation 1.
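As a minimal sketch of Equation 1, the Python function below computes w_i from the three quantities just defined. The function name `term_weight`, the parameter names, and the sample numbers are illustrative assumptions. The base-10 logarithm is also an assumption, chosen because it matches the IDF arithmetic used later in this article.

```python
import math


def term_weight(tf: int, df: int, num_docs: int) -> float:
    """Equation 1: w_i = tf_i * log10(D / df_i).

    tf       -- times the term occurs in the document (tf_i)
    df       -- number of documents containing the term (df_i)
    num_docs -- number of documents in the database (D)
    """
    if tf < 0:
        raise ValueError("tf must be non-negative")
    if not 0 < df <= num_docs:
        raise ValueError("df must satisfy 0 < df <= num_docs")
    return tf * math.log10(num_docs / df)


# Hypothetical numbers: a term occurring 5 times in a document,
# found in 10 of the 1000 documents in the database.
weight = term_weight(tf=5, df=10, num_docs=1000)
print(f"w_i = {weight:.3f}")
```

Note that a term present in every document (df_i = D) gets a weight of zero under this formula, however often it is repeated.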

Local Weights

Equation 1 shows that w_i increases with tf_i. This makes the model vulnerable to term repetition abuses (an adversarial practice known as keyword spamming). Given a query q:

for documents of equal length, those with more instances of q are favored during retrieval.
for documents of different lengths, long documents are favored during retrieval, since they tend to contain more instances of q.
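The first case can be illustrated with two toy documents of equal length. In this sketch the documents, the whitespace tokenizer, and the values chosen for D and df_i are all made-up assumptions; only the weighting formula comes from Equation 1 (with a base-10 logarithm assumed).

```python
import math


def term_weight(tf: int, df: int, num_docs: int) -> float:
    """Equation 1 with a base-10 logarithm (an assumption)."""
    return tf * math.log10(num_docs / df)


def term_count(text: str, term: str) -> int:
    """Count exact, case-insensitive matches of a single-word term."""
    return text.lower().split().count(term.lower())


NUM_DOCS = 1000  # assumed size of the database (D)
DOC_FREQ = 10    # assumed number of documents containing "pet" (df_i)

honest_doc = "a pet needs food water and care"  # 7 words, "pet" once
stuffed_doc = "pet pet pet pet pet pet pet"     # 7 words, "pet" 7 times

honest_weight = term_weight(term_count(honest_doc, "pet"), DOC_FREQ, NUM_DOCS)
stuffed_weight = term_weight(term_count(stuffed_doc, "pet"), DOC_FREQ, NUM_DOCS)

print(f"honest document:  w = {honest_weight:.2f}")
print(f"stuffed document: w = {stuffed_weight:.2f}")
```

Both documents have the same length and face the same database, yet the one that merely repeats the term receives the larger weight.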

Global Weights

In Equation 1 the log(D/df_i) term is known as the inverse document frequency (IDF_i), a measure of the sheer volume of information associated with a term i within a set of documents. The ratio df_i/D is the probability of retrieving, from the D documents in the database, a document containing term i. In Equation 1 we simply invert this probability and take its log. The result is then premultiplied by tf_i. Over the years, several modifications to Equation 1 have been proposed. The expression "a tf*idf model" is often reserved for a model using, or derived from, this equation.

Equation 1 shows that w_i decreases as df_i increases [1 - 11]. For example, if in a 1000-document database only 10 documents contain the term "pet", the IDF for this term (taking logs to base 10) is IDF = log(1000/10) = 2. However, if only one document contains the term, IDF = log(1000/1) = 3.
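The same arithmetic can be checked in a few lines of Python. This is only a sketch: the helper name `idf` is hypothetical, and base-10 logarithms are assumed because they reproduce the values in the example above. The extra df values (100 and 1000) are added purely for illustration.

```python
import math


def idf(df: int, num_docs: int) -> float:
    """Inverse document frequency: log10(D / df_i)."""
    return math.log10(num_docs / df)


NUM_DOCS = 1000  # D, as in the example above

# IDF for a term found in 1, 10, 100, and all 1000 documents.
idf_values = {df: idf(df, NUM_DOCS) for df in (1, 10, 100, 1000)}

for df, value in idf_values.items():
    print(f"df = {df:4d}  IDF = {value:.2f}")
```

The printed values fall steadily as df grows, reaching zero for a term that occurs in every document.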

Thus, terms which appear in too many documents (e.g., stopwords, very frequent terms) receive a low weight, while uncommon terms which appear in few documents receive a high weight. This makes sense, since very common terms (e.g., "a", "the", "of", etc.) are not very useful for distinguishing a relevant document from a non-relevant one. Neither extreme is recommended in routine retrieval work. Terms with acceptable weights are those that are neither too common nor too rare; i.e., their term vectors are neither too far from nor too close to the query vector.

Note. In a vector space representation, when uncommon terms are found in documents and queries, the corresponding term vectors (document and query vectors) end up very close to each other. After scoring and sorting results, the system tends to rank these documents very high while returning few search results. This tells us that absolute ranking results derived from these term vectors are not always good discriminators of relevancy. In plain English, being #10 out of 5,000,000 results is not the same as being #1 out of 5 results.

Keyword Density Values

From Equation 1 it is evident that keyword weights are affected by

local term counts
the sheer volume of documents in the database.

Therefore, the popular notion that term weights are or can be estimated with "keyword density values" is quite misleading.

Keyword density is defined as

Equation 2: Keyword Density: $KD = tf_i / L_i$

where, as in Equation 1, tf_i = number of times a term i occurs in a document, and L_i = total number of terms in that document. That is, keyword density is just a local word ratio. This ratio expresses the "concentration" of terms in a document. Thus, the keyword density of a 500-word document that repeats the term "pet" 5 times is KD = 5/500 = 0.01 or 1%. Note that this value does not account for contextuality (relative position) and distribution (relative dispersion) of terms in the document. These elements affect document relevancy and topic semantics.
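A short sketch makes the last point concrete. It computes keyword density as occurrences of a term divided by total terms, for two hypothetical 500-word documents that contain the same words in a different order. The function name `keyword_density`, the placeholder word "filler", and the fixed random seed are assumptions made for illustration.

```python
import random


def keyword_density(tokens: list[str], term: str) -> float:
    """Keyword density: occurrences of the term divided by total terms."""
    if not tokens:
        raise ValueError("document has no terms")
    return tokens.count(term) / len(tokens)


# Hypothetical 500-word document: "pet" 5 times, all at the very start.
clustered = ["pet"] * 5 + ["filler"] * 495

# The same words, with the occurrences of "pet" scattered at random.
scattered = clustered.copy()
random.Random(0).shuffle(scattered)

kd_clustered = keyword_density(clustered, "pet")
kd_scattered = keyword_density(scattered, "pet")

print(f"clustered: {kd_clustered:.2%}")
print(f"scattered: {kd_scattered:.2%}")
```

Because the ratio only counts words, both documents report the same density no matter where the term appears.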

Many search engine marketers (SEOs/SEMs) waste their time fine-tuning keyword density values with "Keyword Density Analyzer" tools. Some go to the extreme of computing localized values in page identifiers and descriptors (e.g., URLs, titles, paragraphs, etc.). Others propose keyword weighting schemes based on formulas created out of thin air. Still others claim that keeping documents within an "optimum" keyword density value affects the way search systems rank documents.

Keyword Density Failures

Equation 2 says nothing about the semantic weight of terms in relation to other terms, within a document or a collection of documents. Frankly, SEOs/SEMs who spend their time adjusting keyword density values, going after keyword weight tricks, or buying the latest "keyword density analyzer" are wasting their time and money.

According to Equation 2, a term k1 that is equally repeated in two different documents of the same length should have the same keyword density, regardless of document content or the nature of the database. However, if we assume that keyword density values are, or can be taken for, keyword weights, then we are

1. not considering the sheer volume of information that the queried term retrieves.
2. assigning term weights without regard for term relevancy.
3. assigning weights without considering the nature of the queried database.

Points 1 - 3 contradict Salton's Model. According to Equation 1, term weights are not local word ratios disconnected from the queried database. Often, a term k1 that is equally repeated in two different documents of the same length (regardless of content) is weighted differently in the same queried database or in different databases.
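The different-databases case can be sketched numerically. Below, one document (5 occurrences of a term in 500 words) is scored against two hypothetical databases. The database names and their D and df_i values are invented for illustration, and a base-10 logarithm is assumed for Equation 1.

```python
import math


def keyword_density(tf: int, length: int) -> float:
    """Keyword density: term count divided by document length."""
    return tf / length


def term_weight(tf: int, df: int, num_docs: int) -> float:
    """Equation 1 with a base-10 logarithm (an assumption)."""
    return tf * math.log10(num_docs / df)


TF, LENGTH = 5, 500  # the same document in both cases

# Hypothetical databases: name -> (D, df_i for the queried term)
databases = {
    "pet_forum": (1000, 500),     # term appears in half the documents
    "general_news": (1000, 10),   # term appears in 1% of the documents
}

density = keyword_density(TF, LENGTH)
weights = {
    name: term_weight(TF, df, num_docs)
    for name, (num_docs, df) in databases.items()
}

for name, weight in weights.items():
    print(f"{name:13s} density = {density:.2%}  weight = {weight:.2f}")
```

The density is identical in both rows, while the weight is several times larger in the database where the term is rare.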

Analyzing Illusions

If a search marketer wants to compute term weights, he/she may need to replicate the weighting scheme of the target system. This is not an easy task, since:

tf and IDF are defined differently across IR systems [1 - 11].
if using Equation 1, he/she needs to know D, the total number of documents in the queried system, and df_i, the number of documents containing the queried term.
the number of documents containing the queried term is not necessarily the same as the number of documents retrieved.
IR systems and search engines do not publish their working schemes.
the target system may not use Salton's Term Vector Model at all.
the target system may use a variant of Salton's Term Vector Model combined with other scoring schemes (e.g., Google, Yahoo and MSN).

To conclude, keyword density values should not be taken for term weights. As local word ratios, they are not good discriminators of relevancy.

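Salton n. Salton (surname)
connectivity n. the state of being connected
equation n. equation; equality; balance
vulnerable a. easily hurt; fragile; sensitive
repetition n. the act of repeating; something repeated; copy, duplicate
abuse n. misuse; mistreatment; bad practice v. to misuse; to mistreat, to harm
adversary n. opponent, rival
adversarial a. of an opponent; antagonistic
spam n. canned spiced pork; junk mail
spamming n. pushing information (mail, ads, news, articles); mass sending of unsolicited information
sheer a. utter, complete (as in "the sheer volume")
discriminator n. one that distinguishes; a discriminating device
density n. denseness, thickness; (physics, chemistry) density
evident a. obvious, clear
account for: to explain the reason for
contextuality n. the quality of depending on context
dispersion n. scattering, dispersal, spreading; deviation, difference
out of thin air: from nothing; appearing suddenly
optimum a. most suitable; most favorable
contradict v. to refute; to deny the truth of; to conflict with
illusion n. false impression; fantasy; mistaken idea; false appearance
variant n. variation; variety; alternative form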
