How to process textual data using TF-IDF in Python

by Mayank Tripathi

Computers are good with numbers, but not so much with textual data. One of the most widely used techniques to process textual data is TF-IDF. In this article, we will learn how it works and what its features are.

From our intuition, we think that the words which appear more often should have a greater weight in textual data analysis, but that’s not always the case. Words such as “the”, “will”, and “you” (called stopwords) appear the most in a corpus of text, but are of very little significance. Instead, the rare words are the ones that actually help in distinguishing between the data, and carry more weight.

An introduction to TF-IDF

TF-IDF stands for “Term Frequency–Inverse Document Frequency”. First, we will learn what this term means mathematically.

Term Frequency (tf): gives us the frequency of the word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own tf.
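
Written as a formula, where n(t, d) is the number of times the term t appears in document d:

$$ tf(t, d) = \frac{n(t, d)}{\sum_{k} n(k, d)} $$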

Inverse Document Frequency (idf): used to calculate the weight of rare words across all documents in the corpus. The words that occur rarely in the corpus have a high IDF score. It is given by the equation below.
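
With N the total number of documents in the corpus and df(t) the number of documents that contain the term t:

$$ idf(t) = \log\left(\frac{N}{df(t)}\right) $$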

Combining these two, we come up with the TF-IDF score (w) for a word in a document in the corpus. It is the product of tf and idf:
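
$$ w_{t,d} = tf(t, d) \times \log\left(\frac{N}{df(t)}\right) $$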

Let’s take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which represent our corpus.
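
Counting words case-insensitively, each sentence contains 7 words, with “the” appearing twice. Using log base 10 for the IDF, the scores work out as follows:

Word     | TF (S1) | TF (S2) | IDF              | TF-IDF (S1) | TF-IDF (S2)
the      | 2/7     | 2/7     | log(2/2) = 0     | 0           | 0
car      | 1/7     | 0       | log(2/1) = 0.301 | 0.043       | 0
truck    | 0       | 1/7     | log(2/1) = 0.301 | 0           | 0.043
is       | 1/7     | 1/7     | log(2/2) = 0     | 0           | 0
driven   | 1/7     | 1/7     | log(2/2) = 0     | 0           | 0
on       | 1/7     | 1/7     | log(2/2) = 0     | 0           | 0
road     | 1/7     | 0       | log(2/1) = 0.301 | 0.043       | 0
highway  | 0       | 1/7     | log(2/1) = 0.301 | 0           | 0.043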

From the above table, we can see that the TF-IDF of common words was zero, which shows they are not significant. On the other hand, the TF-IDF of “car”, “truck”, “road”, and “highway” is non-zero. These words have more significance.

Using Python to calculate TF-IDF

Let’s now code TF-IDF in Python from scratch. After that, we will see how we can use sklearn to automate the process.

The function computeTF computes the TF score for each word in the corpus, by document.
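
A minimal sketch of such a function, assuming each document is represented as a dictionary of word counts (wordDict) together with its list of tokens (bagOfWords); the names here are illustrative:

def computeTF(wordDict, bagOfWords):
    # TF = (occurrences of the word in the document) /
    #      (total number of words in the document)
    tfDict = {}
    totalWords = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(totalWords)
    return tfDict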

The function computeIDF computes the IDF score of every word in the corpus.
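
Again a sketch, assuming every document dictionary shares the same vocabulary keys, with zero counts for words a document doesn’t contain:

import math

def computeIDF(documents):
    # IDF = log(N / df), where N is the number of documents in the
    # corpus and df is the number of documents containing the word
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, count in document.items():
            if count > 0:
                idfDict[word] += 1
    for word, df in idfDict.items():
        idfDict[word] = math.log10(N / float(df))
    return idfDict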

The function computeTFIDF below computes the TF-IDF score for each word, by multiplying the TF and IDF scores.
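
A sketch of that final step:

def computeTFIDF(tfBagOfWords, idfs):
    # TF-IDF = TF * IDF for every word in the document
    tfidf = {}
    for word, tfValue in tfBagOfWords.items():
        tfidf[word] = tfValue * idfs[word]
    return tfidf

Wiring the three sketches together for our two sentences might look like this:

documentA = "The car is driven on the road"
documentB = "The truck is driven on the highway"

bagOfWordsA = documentA.lower().split()
bagOfWordsB = documentB.lower().split()

# Build a count dictionary per document over the shared vocabulary
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1

idfs = computeIDF([numOfWordsA, numOfWordsB])
tfidfA = computeTFIDF(computeTF(numOfWordsA, bagOfWordsA), idfs)
tfidfB = computeTFIDF(computeTF(numOfWordsB, bagOfWordsB), idfs)
print(tfidfA)  # 'car' and 'road' come out around 0.043, all other words 0.0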

The output produced by the above code for our two documents is the same as what we manually calculated above in the table.

You can refer to this link for the complete implementation.

sklearn

Now we will see how we can implement this using sklearn in Python.

First, we will import TfidfVectorizer from sklearn.feature_extraction.text:
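
from sklearn.feature_extraction.text import TfidfVectorizer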

Now we will initialise the vectorizer and then call fit and transform on it to calculate the TF-IDF score for the text.
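
A sketch of that step, using the two sentences from our example as the corpus:

# Each sentence is one document in the corpus
documents = ["The car is driven on the road",
             "The truck is driven on the highway"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then returns
# the TF-IDF weighted document-term matrix as a sparse matrix
response = vectorizer.fit_transform(documents)
print(response)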

Under the hood, sklearn’s fit_transform executes the following fit and transform functions. These can be found in the official sklearn library on GitHub.

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)
        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            idf = np.log(float(n_samples) / df) + 1.0
            self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                                        n=n_features, format='csr')

        return self

    def transform(self, X, copy=True):
        """Transform a count matrix to a tf or tf-idf representation
        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.
        Returns
        -------
        vectors : sparse matrix, [n_samples, n_features]
        """
        if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.floating):
            # preserve float family dtype
            X = sp.csr_matrix(X, copy=copy)
        else:
            # convert counts or binary occurrences to floats
            X = sp.csr_matrix(X, dtype=np.float64, copy=copy)

        n_samples, n_features = X.shape

        if self.sublinear_tf:
            np.log(X.data, X.data)
            X.data += 1

        if self.use_idf:
            check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')

            expected_n_features = self._idf_diag.shape[0]
            if n_features != expected_n_features:
                raise ValueError("Input has n_features=%d while the model"
                                 " has been trained with n_features=%d" % (
                                     n_features, expected_n_features))
            # *= doesn't work
            X = X * self._idf_diag

        if self.norm:
            X = normalize(X, norm=self.norm, copy=False)

        return X

One thing to notice in the above code is that 1 is added to the logarithm when the IDF is computed (idf = log(n_samples / df) + 1.0), rather than the plain log being used. This ensures that words which appear in every document, and therefore have an IDF of zero, don’t get suppressed entirely. In addition, when smooth_idf is set, 1 is added to both the document frequency and n_samples, which is equivalent to pretending that one extra document containing every term was seen.

The output obtained is in the form of a sparse matrix, which is normalised (with the default L2 norm) to get the final result.
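
To inspect the scores, the sparse matrix can be converted to a dense table; a sketch (get_feature_names_out requires scikit-learn 1.0 or later, while older versions provide get_feature_names instead):

import pandas as pd

# Dense view of the normalised TF-IDF matrix, one row per document
df = pd.DataFrame(response.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df)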

Thus we saw how we can easily code TF-IDF in just 4 lines using sklearn. Now we understand how powerful TF-IDF is as a tool for processing textual data from a corpus. To learn more about sklearn’s TF-IDF, you can use this link.

Happy coding!

Thanks for reading this article. Be sure to share it if you find it helpful.

For more about programming, you can follow me, so that you get notified every time I come up with a new post.

Cheers!

Also, let’s get connected on Twitter, LinkedIn, GitHub and Facebook.

Originally published at https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
