Translator's note: this is my first translation, and it was genuinely hard.
Term Vector Theory and Keyword Weights
词向量和关键字权重
An Introductory Series on Term Vector Theory for Information Retrieval Students and Search Engine Marketers
一系列介绍性的词向量相关的理论,针对信息提取的学生和搜索引擎市场人员。
Dr. E. Garcia
Mi Islita.com
Last Update: 10/27/06
Article 1 of the series Term Vector Theory and Keyword Weights
Topics
主题
Salton's Vector Space Model
Salton 的向量空间模型
Local Weights
局部权重
Global Weights
全局权重
Keyword Density Values
关键字密度值
Keyword Density Failures
关键字密度值的不足
Analyzing Illusions
分析器幻想
Acknowledgements
致谢
References
参照
Salton's Vector Space Model
Salton 的向量空间模型
IR systems assign weights to terms by considering:
信息提取系统赋予每个词元的权重通过考虑:
1. local information from individual documents
1.从个别的文档得到局部信息
2. global information from the collection of documents
2.从所有文档集得到全局信息
In addition, systems that assign weights to links use Web graph information to properly account for the degree of connectivity between documents.
另外,系统使用网络连通图形信息合适的计算出文档之间的联系程度,以此给链接赋权值。
In IR studies, the classic weighting scheme is the Salton Vector Space Model, commonly known as the "term vector model". This weighting scheme is given by
在信息提取研究里,标准的权值公式是Salton向量空间模型,常常也说是“词向量模型”。权值计算公式如下:
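Equation 1: Term Weight = $w_i = tf_i \cdot \log\left(\frac{D}{df_i}\right)$
等式一:词权重= $w_i = tf_i \cdot \log\left(\frac{D}{df_i}\right)$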
where
具体元素:
• tfi = term frequency (term counts) or number of times a term i occurs in a document.
• tfi =词频率(词统计)或者一个词在文档出现的次数
• dfi = document frequency or number of documents containing term i
• dfi =文档频率或者包含词的文档的数目
• D = number of documents in the database.
• D =在数据库里面文档的总数
Many models that extract term vectors from documents and queries are derived from Equation 1.
很多从文档和数据请求提取词向量的模型来自于等式一。
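As a minimal sketch of Equation 1, the snippet below computes term weights for a tiny, made-up collection. The function name `term_weights`, the whitespace tokenization, and the three sample documents are assumptions for illustration only; the base-10 logarithm follows the worked IDF values given in this article.

```python
import math
from collections import Counter


def term_weights(docs):
    """Apply Equation 1, w_i = tf_i * log10(D / df_i), to every document.

    `docs` is a list of token lists; returns one {term: weight} dict per document.
    """
    num_docs = len(docs)  # D
    doc_freq = Counter()  # df_i
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)  # tf_i
        weights.append({
            term: count * math.log10(num_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical three-document collection, tokenized by whitespace.
corpus = [
    "the pet shop sells pet food".split(),
    "the shop sells food".split(),
    "the pet sleeps".split(),
]

for doc_id, doc_weights in enumerate(term_weights(corpus)):
    print(doc_id, {term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy collection "the" occurs in every document, so its weight is zero everywhere, while "sleeps" (found in a single document) outweighs "pet" within the third document.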
Local Weights
局部权值
Equation 1 shows that wi increases with tfi. This makes the model vulnerable to term repetition abuse (an adversarial practice known as keyword spamming). Given a query q:
等式一表明wi 随着tfi 增大而增大。这样模型存在弱点就是词元的重复滥用(一个反例就是垃圾关键字的产生)。给一个查询 q
1. for documents of equal lengths, those with more instances of q are favored during retrieval.
1.对于相同长度的文档,那些出现更多q实例的文档在提取中受到更多关注。
2. for documents of different lengths, long documents are favored during retrieval since these tend to contain more instances of q.
2.对于长度不同的文档,提取中长文档受到更多的关注,因为这一些文档能包含更多的q实例。
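The effect of raw term counts on Equation 1 can be sketched as follows. The collection size, document frequency, and the two term counts are hypothetical numbers chosen only to show the trend; `term_weight` is an illustrative helper, not part of any real system.

```python
import math


def term_weight(tf, df, num_docs):
    """Equation 1 with a base-10 logarithm: w = tf * log10(D / df)."""
    return tf * math.log10(num_docs / df)


# Hypothetical collection statistics for the queried term.
NUM_DOCS = 1000
DOC_FREQ = 10

# Two documents of equal length; the second repeats the term ten times as often.
honest_weight = term_weight(tf=5, df=DOC_FREQ, num_docs=NUM_DOCS)
stuffed_weight = term_weight(tf=50, df=DOC_FREQ, num_docs=NUM_DOCS)

print(f"tf=5  -> weight {honest_weight:.1f}")
print(f"tf=50 -> weight {stuffed_weight:.1f}")
```

Because the weight is linear in tf, multiplying the number of repetitions by ten multiplies the weight by ten, which is exactly the opening that keyword spamming exploits.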
Global Weights
全局权重
In Equation 1 the log(D/dfi) term is known as the inverse document frequency (IDFi), a measure of the sheer volume of information associated with a term i within a set of documents. The ratio dfi/D is the probability of retrieving, from a collection of D documents, a document containing term i. In Equation 1 we simply invert this probability and take its log. The result is then premultiplied by tfi. Over the years, several modifications to Equation 1 have been proposed. The expression "a tf*idf model" is often reserved for a model that uses, or is derived from, this equation.
在等式一中,log(D/dfi) 元被称为“逆转文档频率”--一种量度标准,对于词元 I 在一个文档集合里信息联系度。通过dfi/D比率,这就表示从D提取一个文档包含词元i的可能性。在等式中我们简单的翻转了这个可能性并且对其取对数。其结果就是在前面乘上tfi.。几年来,出现一些对等式一的修改的建议。表达式“a tf*idf model”常常表示为一个使用或者源于这个等式的模型。
Equation 1 shows that wi decreases as dfi increases [1 - 11]. For example, using base-10 logarithms, if in a 1000-document database only 10 documents contain the term "pet", the IDF for this term is IDF = log(1000/10) = 2. However, if only one document contains the term, IDF = log(1000/1) = 3.
等式一表示wi 随着dfi 的增大而减小。举一个例子,如果的一个数量级为1000的文档数据库,只有10个文档包含词元“pet”,那么对于这个词元IDF就是IDF = log(1000/10) = 2,而如果只有一个文档包含这个词元,IDF = log(1000/1) = 3。
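The IDF arithmetic above can be checked with a few lines of code. This is a sketch that assumes a base-10 logarithm (the base that yields the values 2 and 3 in the example); the helper name `idf` and the extra document frequencies are illustrative.

```python
import math


def idf(df, num_docs):
    """Inverse document frequency: log10(D / df)."""
    return math.log10(num_docs / df)


NUM_DOCS = 1000  # size of the database in the example

# IDF falls as more documents contain the term.
for df in (1, 10, 100, 1000):
    print(f"df={df:>4}  IDF={idf(df, NUM_DOCS):.2f}")
```

A term found in all 1000 documents gets an IDF of zero, so it contributes nothing to the weight no matter how often it is repeated.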
Thus, terms that appear in too many documents (e.g., stopwords, very frequent terms) receive a low weight, while uncommon terms that appear in few documents receive a high weight. This makes sense, since overly common terms (e.g., "a", "the", "of") are not very useful for distinguishing a relevant document from a non-relevant one. Neither extreme is recommended in routine retrieval work. Terms with acceptable weights are those that are neither too common nor too rare; i.e., their term vectors are neither too far from nor too close to the query vector.
这样,在太多文档中出现的词元(比如,短词语,频率很高的词元)获得一个小的权值,而不常见的词元出现在少数文档获得高的权值。这样有意义的地方在于,因为太常见的词元(比如,"a", "the", "of", 等等)在辨别一个相关的文档和一个不相关的文档没有起到很大的作用。这两种极端在实际提取工作中是不推荐的(rutinary?)。可以接受的词元是那些不太常见也不太罕见的。他们的词元向量相对于查询向量不太远也不太近。
Note. In a vector space representation, when uncommon terms are found in documents and queries, the corresponding term vectors (document and query vectors) end up too close to each other. After scoring and sorting results, the system tends to rank these documents very high while returning few search results. This tells us that absolute ranking results derived from these term vectors are not always good discriminators of relevancy. In plain English, being #10 out of 5,000,000 results is not the same as being #1 out of 5 results.
注意。在一个向量空间表现上,当一个不常见的词元在文档和查询内容出现,对应的词元向量(文档和查询内容向量)表现的特别的接近。在评分和评分结果后系统趋向把这些文档评为很高的等级,同时返回很少的结果集。这就是说,绝对的源于这些词元向量评级的结果不一定是好相关性的差别。在简单的英文里,在5,000,000个结果里面排第10和在5个结果里面排第1是不一样的。
Keyword Density Values
关键字密度值
From Equation 1 it is evident that keyword weights are affected by
从等式一可以看出,关键字权值受以下影响:
1. local term counts
1. 局部词元数量
2. the sheer volume of documents in the database.
2. 在数据库中shear volume?的文档
Therefore, the popular notion that term weights are or can be estimated with "keyword density values" is quite misleading.
这样,常见的“词元权重是或者能判断关键字密度值”想法是误导的。
Keyword density is defined as
关键字密度定义为:
Equation 2: Keyword Density = $KD = \frac{tf_i}{L_i}$
等式二:关键字密度 = $KD = \frac{tf_i}{L_i}$
where, as in Equation 1, tfi = number of times a term i occurs in a document and Li = total number of terms in that document. That is, keyword density is just a local word ratio. This ratio expresses the "concentration" of a term in a document. Thus, the keyword density of a 500-word document that repeats the term "pet" 5 times is KD = 5/500 = 0.01 or 1%. Note that this value does not account for the contextuality (relative position) or distribution (relative dispersion) of terms in the document. These elements affect document relevancy and topic semantics.
就像是等式一,tfi =词元i在一个文档中出现的次数,Li = 一个文档中词元的所有数量。这个比率表示一个词元在一个文档的“密集度”。这样,在一个500词有5个重复的“pet”的文档里,关键词密度是KD =5/500 = 0.01 或者说是1%。注意这个值不能视为上下文相关的(相对于位置)和分配式的(相对于分散)词元在这个文档里。这些要素影响文档相关性和主题语义性。
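A small sketch makes the "local word ratio" point concrete. The function `keyword_density` and both 500-word documents are made up for illustration: one clusters the five occurrences of "pet" at the start, the other spreads them out, and Equation 2 cannot tell them apart.

```python
def keyword_density(term, tokens):
    """Equation 2: occurrences of `term` divided by the document length."""
    if not tokens:
        raise ValueError("document must contain at least one token")
    return tokens.count(term) / len(tokens)


# Hypothetical 500-word documents, each containing "pet" exactly 5 times.
clustered = ["pet"] * 5 + ["filler"] * 495
dispersed = (["pet"] + ["filler"] * 99) * 5

print(keyword_density("pet", clustered))
print(keyword_density("pet", dispersed))
```

Both documents yield a density of 1%, even though the position and dispersion of the term differ completely, and nothing about the surrounding collection enters the calculation.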
Many search engine marketers (SEOs/SEMs) waste their time fine-tuning keyword density values with "Keyword Density Analyzer" tools. Some go to the extreme of computing localized values in page identifiers and descriptors (e.g., URLs, titles, paragraphs). Others propose keyword weighting schemes based on formulas created out of thin air. Still others claim that keeping documents within an "optimum" keyword density value affects the way search systems rank documents.
很多搜索引擎市场人员(SEOs/SEMs)浪费时间在使用“关键字密度分析器”微调关键字密度值。有一些走向极端,计算局部性的值在页面识别符和说明词(比如,链接,标题,段落,等等)。其他人选择基于公式的关键字权值方案(out of thin air?).甚至有人宣称保持文档在一个“最适宜”的关键字密度值影响搜索系统对文档的评级。
Keyword Density Failures
关键字密度失败值?
Equation 2 tells us nothing about the semantic weight of terms in relation to other terms, within a document or collection of documents. Frankly, SEOs/SEMs who spend their time adjusting keyword density values, chasing keyword weight tricks, or buying the latest "keyword density analyzer" are wasting their time and money.
等式而可以看出在一个文档或者文档集里面,词元的语义权值和其他的词元没有任何的联系。简单地说,SEOs/SEMs 花费时间,通过关键字技巧或者买入最新的“关键字密度分析器”来修正关键字密度值是浪费他们的时间和金钱的。
According to Equation 2, a term k1 that is repeated equally often in two different documents of the same length should have the same keyword density, regardless of document content or database nature. However, if we assume that keyword density values are, or can be taken for, keyword weights, then we are:
参照等式二,词元k1在两个不同的相同长度的文档重复相等的次数,那么将有一样的关键字密度,不管是文档内容还是基于数据库的。但是,如果我们认为关键字密度值就是或者可以看是关键字权值,那么也可以这么做。
1. not considering the sheer volume of information that the queried term retrieves.
1. 不考虑shear volume?信息 查询词元提取?
2. assigning term weights without regard for term relevancy.
2. 假定词元权重不必考虑词元的关联。
3. assigning weights without considering the nature of the queried database.
3. 假定权重不必考虑the nature of the queried database. ?(查询数据库的属性)
Points 1 - 3 contradict Salton's Model. According to Equation 1, term weights are not local word ratios disconnected from the queried database. Often, a term k1 that is repeated equally often in two different documents of the same length (regardless of content) is weighted differently in the same queried database or in different databases.
观点1-3与Salton的模型矛盾。根据等式一,词元权重不是局部词比率,和查询数据库没有什么联系。时常的,在两个不同的相同长度的文档里(不考虑内容)重复相同次数的词元k1,在相同的或者不同的查询数据库里权重不一样。
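To illustrate the contrast between Equations 1 and 2, the sketch below holds the document fixed (5 occurrences in 500 words) and changes only the database. The three databases and their statistics are hypothetical, and the helper names are illustrative.

```python
import math


def keyword_density(tf, doc_length):
    """Equation 2: a purely local ratio."""
    return tf / doc_length


def term_weight(tf, df, num_docs):
    """Equation 1 with a base-10 logarithm."""
    return tf * math.log10(num_docs / df)


TF, DOC_LENGTH = 5, 500  # the same document in every scenario

# Hypothetical databases: only the collection statistics change.
databases = {
    "A": {"num_docs": 1000, "df": 10},
    "B": {"num_docs": 1000, "df": 1},
    "C": {"num_docs": 1_000_000, "df": 500_000},
}

density = keyword_density(TF, DOC_LENGTH)
weights = {
    name: term_weight(TF, stats["df"], stats["num_docs"])
    for name, stats in databases.items()
}

for name, weight in weights.items():
    print(f"database {name}: density={density:.2%}  weight={weight:.3f}")
```

The density stays at 1% in all three cases, while the Equation 1 weight ranges from roughly 1.5 to 15, so the density value carries no information about how the term would be weighted in a given database.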
Analyzing Illusions
分析器幻想
If a search marketer wants to compute term weights, he/she may need to replicate the weighting scheme of the target system. However, this is not an easy task, since:
如果一个搜索市场人员想计算词元权重,他或她可能需要去复制目标系统的权值方案。但是,这不是一个简单的工作,因为:
1. tf and IDF are defined differently across IR systems [1 - 11].
1. tf 和 IDF 在不同的信息提取系统定义是不一样的。
2. if using Equation 1, he/she needs to know D, the total number of documents in the queried system, and dfi, the number of documents containing the queried term.
2. 如果使用等式一,他或她需要知道D,查询系统的所有文档的总数,和dfi, 包含查询词元的文档数目。
3. the number of documents containing the queried term is not necessarily the same as the number of documents retrieved.
3. 包含词元的文档数目不完全需要和提取出来的文档数目一样。
4. IR systems or search engines do not publish their working schemes.
4. 信息提取系统或者搜索引擎没有公开他们的工作方案。
5. the target system may not use Salton's Term Vector Model at all.
5. 目标系统可能根本就不使用Salton的词元向量模型。
6. the target system may use a variant of Salton's Term Vector Model combined with other scoring schemes (e.g., Google, Yahoo and MSN).
6. 目标系统可能使用一个Salton词元向量模型的变型集合了其他的评分方案(比如Google, Yahoo 和 MSN)
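Point 1 in the list above can be illustrated with a sketch that scores the same term under different definitions of tf and IDF. The log-scaled tf and the smoothed IDF below are common textbook-style variants chosen as assumptions for this example; they are not claimed to be what any particular search engine uses, and the input numbers are hypothetical.

```python
import math


def raw_tf(tf):
    """Raw term count, as in Equation 1."""
    return float(tf)


def log_tf(tf):
    """Log-scaled term count, which dampens repetition."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def plain_idf(df, num_docs):
    """IDF as in Equation 1: log10(D / df)."""
    return math.log10(num_docs / df)


def smoothed_idf(df, num_docs):
    """A smoothed IDF variant that never reaches zero."""
    return math.log10((1 + num_docs) / (1 + df)) + 1


# Hypothetical inputs: the term occurs 5 times, in 10 of 1000 documents.
TF, DF, NUM_DOCS = 5, 10, 1000

results = {
    (tf_fn.__name__, idf_fn.__name__): tf_fn(TF) * idf_fn(DF, NUM_DOCS)
    for tf_fn in (raw_tf, log_tf)
    for idf_fn in (plain_idf, smoothed_idf)
}

for (tf_name, idf_name), weight in results.items():
    print(f"{tf_name:>6} x {idf_name:<12} -> {weight:.3f}")
```

The four combinations give four different weights for identical inputs, which is why a weight computed outside the target system says little about the weight computed inside it.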
To conclude, keyword density values should not be taken for term weights. As local word ratios, they are not good discriminators of relevancy.
最后总结,关键字密度值不能等同于词元权重。因为一个局部的词比率不能很好的辨识出关联度。
Acknowledgements
致谢
The author thanks the following authority sites for referencing this series of articles:
作者在这一系列的文章里参照了一下权威的网站,在此表示感谢。
• Forge.MySQL.com - MySQL Internals Algorithms, MySQL AB Corporation.
• MySQL Internals, Manual :: 4.7 Full-text Search, MySQL AB Corporation.
• University of San Francisco Personal Web Neighborhood Project, Uddhav, G. and Kien, D. See another version and powerpoint presentation. Dr. Garcia wants to thank the researchers for reproducing some of his formulas and tables for this project.
• Projet e-Quest Thesaurus et Questions Laboratoire didactique informatique of l'Ecole d'Ingenieurs de Geneve