
This article introduces the use of TF-IDF for text classification, explaining how this statistical method quantifies the important terms in a text for information retrieval and natural language processing tasks. Understanding TF-IDF helps with text classification and information extraction.

tf-idf text classification

Introduction

tf-idf, which stands for term frequency-inverse document frequency, is used to calculate a quantitative digest of any document, which can further be used to find similar documents, classify documents, and so on.

This article will explain tf-idf, its variations, and the impact of these variations on the model output.

tf-idf is similar to Bag of Words (BoW), where documents are treated as a bag, or collection, of words/terms and converted to numerical form by counting the occurrences of every term. The whole idea is to assign a weight to each term occurring in the document.

tf-idf takes this one step further and also considers the relative importance of every term to a document within a collection (normally called a corpus) of documents.

Wikipedia summarises it well,

term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Following is a sample collection of documents taken from Wikipedia, which will serve as the corpus (of size 4) for this article.

[Figure: Document corpus]

Before going further, we have to perform some text cleaning and pre-processing, such as removing special characters, removing stop words, and lemmatizing the words in the corpus. The text looks like this after these steps.

[Figure: Processed text]
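As a rough illustration, a minimal pre-processing sketch might look like the following. This is an assumption about the pipeline rather than the author's exact notebook code, and the placeholder documents below stand in for the 4-document Wikipedia corpus above.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # lowercase, strip special characters and digits, drop stop words, lemmatize
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

# hypothetical placeholder documents standing in for the Wikipedia corpus
corpus = [
    "Machine learning is the study of computer algorithms.",
    "A computer program learns from data and experience.",
    "Data mining applies machine learning methods to data.",
    "An algorithm is a sequence of well-defined instructions.",
]
processed = [preprocess(doc) for doc in corpus]
```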

Calculations

tf, term frequency, is the simplest way of calculating weights. As the name suggests, the weight of a term t in document d is the number of occurrences of t in d, which is exactly the Bag of Words model.

[Figure: Term frequencies]
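As a sketch, continuing from the pre-processing snippet above, the raw term counts can be obtained with scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

count_vec = CountVectorizer()
tf_matrix = count_vec.fit_transform(processed)   # rows: documents, columns: terms

# each cell is tf(t, d): the raw count of term t in document d
tf_df = pd.DataFrame(tf_matrix.toarray(), columns=count_vec.get_feature_names_out())
print(tf_df)
```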

In tf there is no notion of importance. idf, inverse document frequency, is used to introduce an importance factor for each term. This is needed because some terms have little or no discriminating power; for example, in a collection of documents about machine learning, the term machine would appear in almost all the documents and so carries little discriminating relevance.

Document frequency, which is the number of documents in the corpus that contain the term t, is used to scale the weight (importance factor) of term t. The idf of a term t in the document collection is defined as,

idf(t) = log(N / df(t))

where,

N is the number of documents in the collection, and df(t) is the document frequency of term t.

[Figure: idf values]
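A sketch of the idf calculation from the count matrix above; note that scikit-learn's TfidfTransformer uses a smoothed variant by default, so its idf values differ slightly from this textbook definition:

```python
import numpy as np

N = tf_matrix.shape[0]                       # number of documents in the corpus
df = (tf_matrix.toarray() > 0).sum(axis=0)   # document frequency df(t) of each term
idf = np.log(N / df)                         # idf(t) = log(N / df(t))

print(dict(zip(count_vec.get_feature_names_out(), np.round(idf, 3))))
```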

Different variations of tf and idf

tf-idf,

tf-idf is the combination of tf and idf, which gives the scaled version of the weight. The tf-idf of a term t present in a document d, from a corpus of documents D, is defined as,

tf-idf(t, d) = tf(t, d) × idf(t)

tf-idf is highest for a term t when it occurs many times within a small number of documents

tf-idf is lower for a term t when it occurs fewer times in a document, or occurs in many documents

tf-idf is lowest when t occurs in all the documents

[Figure: tf-idf weights]

We can see that the weight of data is higher than that of computer, even though their term frequencies are the same in document 0. This is because of idf, as data occurs in a smaller number of documents.

Machine occurs in all the documents, hence it has a weight of 0. We can add 1 to the idf to avoid getting 0 for such terms, depending on the use case.
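A sketch of both routes, building on the matrices above: multiplying the counts by idf directly, and letting TfidfVectorizer do it (scikit-learn always adds 1 to its idf, which is exactly the trick mentioned above for avoiding zero weights):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# textbook version: tf-idf(t, d) = tf(t, d) * idf(t), broadcasting idf over every row
tfidf_manual = tf_matrix.toarray() * idf

# library version: smooth_idf=False, norm=None comes closest to the textbook formula,
# but scikit-learn still computes idf as log(N / df) + 1, so terms occurring in every
# document keep a small non-zero weight
tfidf_vec = TfidfVectorizer(smooth_idf=False, norm=None)
tfidf_matrix = tfidf_vec.fit_transform(processed)
```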

Sublinear tf-scaling,

It is not always true that multiple occurrences of a term in a document make that term more significant in proportion to the number of occurrences. Sublinear tf-scaling is a modification of term frequency, which calculates the weight as follows,

wf(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0, and 0 otherwise

In this case, tf-idf becomes,

wf-idf(t, d) = wf(t, d) × idf(t)

[Figure: wf-idf weights]

As intended, sublinear tf-scaling has scaled down the weight of the term algorithm, since it occurs multiple times (the maximum tf) in the first document.
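A sketch of sublinear scaling applied to the counts from above; TfidfVectorizer exposes the same idea through its sublinear_tf flag:

```python
counts = tf_matrix.toarray().astype(float)

# wf(t, d) = 1 + log(tf(t, d)) when tf > 0, else 0
with np.errstate(divide="ignore"):
    wf = np.where(counts > 0, 1.0 + np.log(counts), 0.0)
wf_idf = wf * idf

# equivalent scaling inside scikit-learn
wf_vec = TfidfVectorizer(sublinear_tf=True, smooth_idf=False, norm=None)
wf_idf_sklearn = wf_vec.fit_transform(processed)
```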

Maximum tf-normalization,

This is another modification of term frequency, where the tf of every term occurring in a document is normalized by the maximum tf in that document.

ntf(t, d) = a + (1 − a) × tf(t, d) / tf_max(d)

where a is the smoothing term ranging between 0 and 1, generally set to 0.4, and tf_max(d) is the largest term frequency in document d.

Maximum tf-normalization handles the case where a long document has higher term-frequency values just because of its length, with the same terms repeated again and again.

This approach falls short when a document has one term occurring an unusually high number of times.

[Figure: maximum tf-normalized weights]

High-frequency terms such as algorithm and computer are scaled down, since the frequencies are normalized by the maximum frequency. Terms with zero frequency also get some weight because of the smoothing term.
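A sketch of maximum tf-normalization with a = 0.4, written out by hand since scikit-learn does not expose this variant directly:

```python
a = 0.4
max_tf = counts.max(axis=1, keepdims=True)   # largest term count in each document

# ntf(t, d) = a + (1 - a) * tf(t, d) / tf_max(d)
ntf = a + (1 - a) * counts / max_tf
ntf_idf = ntf * idf
```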

Normalization,

We can normalize document vectors by either the L2 or the L1 norm. After L2 normalization, the L2 norm of every document vector (the square root of the sum of its squared elements) will be 1; in this case the cosine similarity between any two document vectors is just the dot product of the vectors. In the case of L1 normalization, the sum of the absolute values of the elements of every document vector becomes 1.
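A sketch of both norms, applied to the unnormalized tf-idf matrix from the earlier snippet; after L2 normalization, cosine similarity between documents reduces to a plain dot product:

```python
from sklearn.preprocessing import normalize

tfidf_l2 = normalize(tfidf_matrix, norm="l2")    # each row now has unit L2 norm
cosine_sim = (tfidf_l2 @ tfidf_l2.T).toarray()   # cosine similarity is just the dot product

tfidf_l1 = normalize(tfidf_matrix, norm="l1")    # absolute values of each row sum to 1
```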

The code used in this article can be found at this link: https://github.com/varun21290/medium/blob/master/tfidf/tfidf.ipynb

Scikit-Learn provides most of these calculations out of the box; check the links in the references.

References

https://nlp.stanford.edu/IR-book/

https://en.wikipedia.org/wiki/Tf%E2%80%93idf#

https://en.wikipedia.org/wiki/Machine_learning

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#

Translated from: https://medium.com/analytics-vidhya/the-quantitative-value-of-text-tf-idf-and-more-e3c7883f1df3
