tf-idf Text Classification
Introduction
tf-idf, which stands for term frequency-inverse document frequency, is used to compute a quantitative digest of any document, which can then be used to find similar documents, classify documents, and more.
This article will explain tf-idf, its variations, and the impact of these variations on the model output.
tf-idf is similar to Bag of Words (BoW), where documents are treated as a bag, or collection, of words/terms and converted to numerical form by counting the occurrences of every term. The whole idea is to assign a weight to each term occurring in the document.
tf-idf takes this one step further and also considers the relative importance of every term to a document within a collection of documents (normally called a corpus).
Wikipedia summarises it well:
term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
The following sample collection of documents, taken from Wikipedia, will be the corpus (of size 4) for this article.
Before further steps, we have to perform some text cleansing and pre-processing, such as removing special characters, removing stop words, and lemmatizing the words in the corpus. The text looks like this after these steps.
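A minimal sketch of such a pre-processing step is shown below. The stop-word list here is a hand-rolled illustrative subset, and lemmatization is omitted; a real pipeline would use a library such as NLTK or spaCy for both.

```python
import re

# Illustrative stop-word subset; real pipelines use a full list (e.g. from NLTK)
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to", "on"}

def preprocess(text):
    # Strip special characters, lowercase, and drop stop words
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(preprocess("Machine learning, a subfield of AI, is great!"))
# machine learning subfield ai great
```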
Calculations
tf, term frequency, is the simplest way of calculating weights. As the name suggests, the weight of a term t in document d is the number of occurrences of t in d, which is exactly the Bag of Words model.
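Raw term frequencies can be computed with a simple counter; the sentence below is made up for illustration.

```python
from collections import Counter

doc = "data science uses data to turn raw data into knowledge"
tf = Counter(doc.split())

print(tf["data"])     # 3
print(tf["science"])  # 1
```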
In tf there is no notion of importance. idf, inverse document frequency, is used to introduce an importance factor for each term. This is needed because some terms have little or no discriminating power: for example, in a collection of documents about machine learning, the term machine would occur in almost all the documents and so carries little discriminating relevance.
Document frequency, the number of documents in the corpus that contain the term t, is used to scale the weight (importance factor) of term t. The idf of a term t in the document collection is defined as,

idf(t) = log(N / df(t))
where,
N is the number of documents in the collection, and df(t) is the document frequency of term t.
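As a sketch, idf can be computed directly from this definition. The toy corpus below is hypothetical, constructed so that machine occurs in every document:

```python
import math
from collections import Counter

corpus = [
    "machine learning uses data",
    "machine translation of text",
    "machine vision and data",
    "machine intelligence research",
]
N = len(corpus)

# Document frequency: number of documents containing each term
df = Counter(term for doc in corpus for term in set(doc.split()))

idf = {term: math.log10(N / df[term]) for term in df}
print(round(idf["machine"], 3))  # 0.0   -> occurs in all 4 documents
print(round(idf["data"], 3))     # 0.301 -> occurs in 2 of 4 documents
```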
Different variations of tf and idf
tf-idf
tf-idf is the combination of tf and idf, giving a scaled version of the weight. The tf-idf of a term t present in a document d from a corpus of documents D is defined as,

tf-idf(t, d) = tf(t, d) × idf(t)
tf-idf is highest for a term t when it occurs many times within a small number of documents
tf-idf is lower for t when it occurs fewer times in a document, or occurs in many documents
tf-idf is lowest when t occurs in all the documents
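Combining the two pieces, a bare-bones tf-idf can be sketched as follows. The corpus and log base are illustrative; scikit-learn's TfidfVectorizer uses the natural log and a slightly different, smoothed formula.

```python
import math
from collections import Counter

corpus = [
    "machine learning uses data",
    "machine translation of text",
    "machine vision and data",
    "machine intelligence research",
]
N = len(corpus)
df = Counter(term for doc in corpus for term in set(doc.split()))

def tf_idf(term, doc):
    # Raw count in this document, scaled by the corpus-wide idf
    tf = doc.split().count(term)
    return tf * math.log10(N / df[term])

print(round(tf_idf("data", corpus[0]), 3))     # 0.301
print(round(tf_idf("machine", corpus[0]), 3))  # 0.0 -> machine is in every document
```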
We can see that the weight of data is greater than that of computer even though their term frequencies are the same for document 0. This is because of idf, as data occurs in a smaller number of documents.
Machine occurs in all the documents, hence it has a weight of 0. Depending on the use case, we can add 1 to idf to avoid getting 0 for such terms.
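A sketch of this shift: adding 1 to idf keeps terms that occur in every document from being zeroed out entirely (scikit-learn's TfidfVectorizer applies the same "+1", with the natural log, in its idf formula).

```python
import math

N = 4            # documents in the corpus
df_machine = 4   # 'machine' occurs in every document

idf_plain = math.log10(N / df_machine)        # 0.0 -> term is ignored
idf_shifted = 1 + math.log10(N / df_machine)  # 1.0 -> term keeps some weight

print(idf_plain, idf_shifted)
```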
Sublinear tf-scaling
It is not always true that multiple occurrences of a term in a document make that term more significant in proportion to the number of occurrences. Sublinear tf-scaling is a modification of term frequency which calculates the weight as follows,

wf(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0, and 0 otherwise
In this case tf-idf becomes,

wf-idf(t, d) = wf(t, d) × idf(t)
As intended, sublinear tf-scaling has scaled down the weight of the term algorithm, since it occurs multiple times (the maximum tf) in the first document.
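A sketch of the sublinear weighting function (base-10 log assumed here; scikit-learn's sublinear_tf option uses the natural log):

```python
import math

def sublinear_tf(tf):
    # Dampen raw counts: 1 occurrence -> 1.0, 10 -> 2.0, 100 -> 3.0
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(sublinear_tf(1), sublinear_tf(100))  # 1.0 3.0
```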
Maximum tf-normalization
This is another modification of term frequency, where the tf of every term occurring in a document is normalized by the maximum tf in that document.

ntf(t, d) = a + (1 - a) × tf(t, d) / tf_max(d)
where a is a smoothing term ranging between 0 and 1, generally set to 0.4.
Maximum tf-normalization handles the case where a long document has higher term-frequency values merely because of its length, with the same terms repeated again and again.
This approach falls short when a document has a term occurring an unusually high number of times.
High-frequency terms such as algorithm and computer are scaled down as they are normalized by the maximum frequency. Terms with zero frequency also carry some weight because of the smoothing term.
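A sketch of maximum tf-normalization with the smoothing term a (0.4 assumed, as above):

```python
def max_tf_norm(tf, tf_max, a=0.4):
    # a smooths the result; zero-frequency terms still get weight a
    return a + (1 - a) * tf / tf_max

print(max_tf_norm(5, 5))  # the most frequent term is scaled to 1
print(max_tf_norm(0, 5))  # an absent term keeps the smoothing weight a
```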
Normalization
We can normalize document vectors by either the L2 or the L1 norm. After L2 normalization, the L2 norm (the square root of the sum of squared elements) of every document vector will be 1; in this case the cosine similarity between any two document vectors is just the dot product of the vectors. In the case of L1 normalization, the sum of the absolute values of the elements of every document vector becomes 1.
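A sketch of both norms on a toy two-element vector (in scikit-learn this corresponds to the norm='l2' / norm='l1' parameter of TfidfVectorizer):

```python
import math

vec = [3.0, 4.0]

# L2: divide by the Euclidean length; the vector's L2 norm becomes 1
l2 = math.sqrt(sum(x * x for x in vec))
vec_l2 = [x / l2 for x in vec]
print(vec_l2)  # [0.6, 0.8]

# L1: divide by the sum of absolute values; that sum becomes 1
l1 = sum(abs(x) for x in vec)
vec_l1 = [x / l1 for x in vec]
print(sum(vec_l1))
```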
The code used in this article can be found at this link: https://github.com/varun21290/medium/blob/master/tfidf/tfidf.ipynb
Scikit-Learn provides most of these calculations out of the box; check the links in the references.
Translated from: https://medium.com/analytics-vidhya/the-quantitative-value-of-text-tf-idf-and-more-e3c7883f1df3