Multi-Label Classification Using Bag-of-Words (BoW) and TF-IDF

This article is translated from "Multi-Label Classification Using Bag-of-Words (BoW) and TF-IDF". It shows how to perform multi-label classification in Python and explores Bag-of-Words (BoW) and TF-IDF for text representation.


1. Load the data

For this study, we use the Kaggle data from the Toxic Comment Classification Challenge. Let's load and inspect the data. This is a multilabel classification problem where comments are classified by level of toxicity: toxic / severe_toxic / obscene / threat / insult / identity_hate


import pandas as pd
data = pd.read_csv('train.csv')
print('Shape of the data: ', data.shape)
data.head()
[Image: snapshot of the dataset]
y_cols = list(data.columns[2:])
is_multilabel = (data[y_cols].sum(axis=1) > 1).sum()  # number of comments with more than one label
print('is_multilabel count: ', is_multilabel)
  • From the above data we can see that not all comments have a label.
  • This is multilabel data (each comment can have more than one label).
  • Add a label, 'non_toxic', for comments with no label.
  • Let's also check how balanced the classes are.
# Add a label, 'non_toxic', for comments with no label
data['non_toxic'] = 1 - data[y_cols].max(axis=1)
y_cols += ['non_toxic']

# Inspect the class balance
def get_class_weight(data):
    class_weight = {}
    for col in y_cols:
        # percentage of comments carrying this label
        class_weight[col] = round(data[col].sum() / data.shape[0] * 100, 2)
    return class_weight

class_weight = get_class_weight(data)
print('Total class weight: ', sum(class_weight.values()), '%\n\n', class_weight)

We can see that the data is highly imbalanced. Imbalanced data refers to classification problems where the classes are not represented equally; for example, about 89% of the comments fall under the newly created 'non_toxic' label.

Any given linear model will handle class imbalance badly if it uses squared loss for binary classification. We will not discuss techniques for tackling the imbalance problem in this project. Let's focus on preprocessing the text data before converting it to numeric data using BoW and tf-idf.

2. Split the dataset into train, validation and test

from sklearn.model_selection import train_test_split

# X_data holds the raw comment text and y_data the label columns (including 'non_toxic')
X_data = data['comment_text'].values
y_data = data[y_cols].values

X, X_test, y, y_test = train_test_split(X_data, y_data, test_size=0.2, train_size=0.8)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, train_size=0.75)
X_train[:1]

Let's take a closer look at one of the comments. Please note that the text will vary for you, since the dataset is split randomly. Use a seed in the split if you aim to reproduce the results; a minimal sketch follows the example below.

[Image: an example raw comment from the training set]
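For reproducibility, the split can be seeded by passing a fixed random_state to train_test_split; the sketch below simply repeats the split above with an arbitrary seed of 42.

from sklearn.model_selection import train_test_split

# Seeded version of the split above; random_state=42 is an arbitrary choice.
X, X_test, y, y_test = train_test_split(X_data, y_data, test_size=0.2,
                                        train_size=0.8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  train_size=0.75, random_state=42)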

3. Preprocess text

From the above example, we can see that the text requires preprocessing, i.e., converting it to the same case (lower case) and removing symbols, numbers and stop words before the text is converted into tokens. For preprocessing the text you will need to download specific libraries and resources.

import re
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
REPLACE_IP_ADDRESS = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')


def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.replace('\n', ' ').lower()    # lowercase text
    text = REPLACE_IP_ADDRESS.sub('', text)   # remove IP addresses
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)       # delete symbols listed in BAD_SYMBOLS_RE
    text = ' '.join([w for w in text.split() if w not in STOPWORDS])  # delete stopwords
    return text
[Image: the example comment after preprocessing]

4. Transform text to a vector

For machine learning models, textual data must be converted to numeric data. This can be done in various ways, such as BoW, tf-idf, word embeddings, etc. In this project, we will focus on BoW and tf-idf.

Bag-of-Words (BoW)

In the BoW model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.


- Build a dictionary of the top N most popular words by ranking

[Image: BoW representation of the two comments, 'hello world' and 'How are you']
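As a point of reference only (the rest of this post builds the BoW matrix by hand), scikit-learn's CountVectorizer produces the same kind of representation. A minimal sketch on the two example comments from the figure:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['hello world', 'How are you']
vectorizer = CountVectorizer()             # lowercases and tokenizes by default
bow = vectorizer.fit_transform(docs)       # sparse matrix of shape (2, vocabulary size)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print(vectorizer.get_feature_names_out())  # ['are' 'hello' 'how' 'world' 'you']
print(bow.toarray())                       # [[0 1 0 1 0]
                                           #  [1 0 1 0 1]]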

We will restrict ourselves to the N most popular words to limit the size of the matrix. Moreover, including rare words would only introduce sparsity without adding much information. For this project, let's work with the 10,000 most popular words.

# Dictionary of all words from the train corpus with their counts.
words_counts = {}
for comment in X_train:
    for word in comment.split():
        words_counts[word] = words_counts.get(word, 0) + 1

DICT_SIZE = 10000
POPULAR_WORDS = sorted(words_counts, key=words_counts.get, reverse=True)[:DICT_SIZE]
WORDS_TO_INDEX = {word: rank for rank, word in enumerate(POPULAR_WORDS)}
INDEX_TO_WORDS = {index: word for word, index in WORDS_TO_INDEX.items()}
ALL_WORDS = WORDS_TO_INDEX.keys()

# Let's take a look at the top 10 popular words
POPULAR_WORDS[:10]

- Build the BoW vectors

For each comment in the corpus, create a zero vector with N dimensions and, for each word found in the comment, increase the value at that word's index by 1. For example, if a word appears twice, that index in the vector gets the value 2.

For efficient storage, we will convert this vector into a sparse vector, one that leverages sparsity and stores only the nonzero entries.

from scipy import sparse as sp_sparse
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    for word in text.split(' '):
        if word in words_to_index:
            result_vector[words_to_index[word]] +=1
    return result_vector


X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
print('X_train shape ', X_train_mybag.shape, '\nX_val shape ', X_val_mybag.shape)

TF-IDF

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.


This method is an extension of Bag-of-Words in which a word's frequency in a document is weighted by its inverse document frequency, i.e., how rare the word is across the whole corpus. This down-weights, and thus penalizes, words that appear in too many documents.
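For intuition, tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) grows as fewer documents contain the term t. The sketch below hand-rolls the smoothed variant that scikit-learn's TfidfVectorizer uses by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, followed by L2 normalization of each row; the tiny two-document corpus is made up purely for illustration.

import numpy as np

docs = [['hello', 'world', 'world'], ['hello', 'there']]
vocab = sorted({w for doc in docs for w in doc})       # ['hello', 'there', 'world']
tf = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)
df = (tf > 0).sum(axis=0)                              # number of documents containing each term
idf = np.log((1 + len(docs)) / (1 + df)) + 1           # smoothed idf, as in TfidfVectorizer defaults
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each row
print(np.round(tfidf, 3))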

from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test — samples
        return TF-IDF vectorized representation of each sample and the vocabulary
    """
    # Create a TF-IDF vectorizer with a proper choice of parameters,
    # fit it on the train set, then transform the train, validation and test sets
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5,
                                       token_pattern=r'(\S+)')


    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_val_tfidf = tfidf_vectorizer.transform(X_val)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    return X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vectorizer.vocabulary_


X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

5. Multilabel Classification

We have prepared the dataset using two different techniques, BoW and tf-idf. We can run classifiers on both representations. Since this is a multilabel classification problem, we will use a simple logistic regression wrapped in OneVsRestClassifier.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
def train_classifier(X_train, y_train, C, regularisation):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit a LogisticRegression wrapped into OneVsRestClassifier.
    # Note: the default solver ('lbfgs') only supports the 'l2' penalty;
    # pass solver='liblinear' if you want to experiment with 'l1'.
    model = OneVsRestClassifier(
        LogisticRegression(penalty=regularisation, C=C, max_iter=10000)).fit(X_train, y_train)
    return model


classifier_mybag = train_classifier(X_train_mybag, y_train, C = 4, regularisation = 'l2')
classifier_tfidf = train_classifier(X_train_tfidf, y_train, C = 4, regularisation = 'l2')


y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag)
y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_mybag = classifier_mybag.decision_function(X_val_mybag)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)

You can experiment with different regularization techniques, L1 and L2, with different coefficients (e.g., C equal to 0.1, 1, 10, 100) until you are happy with the result; this is called hyperparameter tuning. It can be achieved by cross-validated grid search, random search or Bayesian optimisation; a minimal grid-search sketch follows below. We are not covering this topic in depth in this article. If you would like to learn more about it, please refer to this post.
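As an illustration only (the grid values and the f1_micro scoring choice are assumptions, not the setup used for the results in this post), a cross-validated grid search over C and the penalty could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# liblinear supports both the 'l1' and 'l2' penalties
base = OneVsRestClassifier(LogisticRegression(solver='liblinear', max_iter=10000))
param_grid = {
    'estimator__C': [0.1, 1, 10, 100],
    'estimator__penalty': ['l1', 'l2'],
}
search = GridSearchCV(base, param_grid, scoring='f1_micro', cv=3, n_jobs=-1)
search.fit(X_train_tfidf, y_train)
print(search.best_params_, search.best_score_)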

6. Evaluation

We will use metrics such as accuracy score and F1 score for evaluation.

  • Accuracy score: In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
  • F1 score: The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. F1 score = 2 * (precision * recall) / (precision + recall)

'F1 score micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.

'F1 score macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'F1 score weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall. The toy example below illustrates how subset accuracy and these F1 averages can differ.
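A minimal made-up multilabel example (3 samples, 3 labels) showing the metrics in action; the labels and predictions are invented purely for illustration:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],   # exact match
                   [0, 1, 1],   # one extra label, so not an exact match
                   [1, 0, 0]])  # one missing label, so not an exact match

print('Subset accuracy:', accuracy_score(y_true, y_pred))  # 1 of 3 samples matched exactly
print('F1 micro:   ', f1_score(y_true, y_pred, average='micro'))
print('F1 macro:   ', f1_score(y_true, y_pred, average='macro'))
print('F1 weighted:', f1_score(y_true, y_pred, average='weighted'))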

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
def print_evaluation_scores(y_test, predicted):
    # Subset accuracy: fraction of samples whose full label set is predicted exactly
    print('Accuracy: ', accuracy_score(y_test, predicted))
    print('F1-score macro: ', f1_score(y_test, predicted, average='macro'))
    print('F1-score micro: ', f1_score(y_test, predicted, average='micro'))
    print('F1-score weighted: ', f1_score(y_test, predicted, average='weighted'))
    # average_precision_score is ideally computed on decision_function scores
    # rather than on hard 0/1 predictions
    print('Precision macro: ', average_precision_score(y_test, predicted, average='macro'))
    print('Precision micro: ', average_precision_score(y_test, predicted, average='micro'))
    print('Precision weighted: ', average_precision_score(y_test, predicted, average='weighted'))
    
print('Bag-of-words\n')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
print('\nTfidf\n')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

The weighted and macro F1 scores, which account for the class imbalance, look good. Let's check the output: predicted labels versus actual labels. We will need to replace the one-hot encoded labels with the actual label names for interpretation. Let's run predictions with the tf-idf model.

test_predictions = classifier_tfidf.predict(X_test_tfidf)
def get_pred_labels(data, predictions):
    y_cols = list(data.columns[2:])
    y_label_dict={}
    for k,v in enumerate(y_cols):
        y_label_dict[k] = v


    test_predictions_labels = []
    for pred in predictions:
        label_pred = []
        for index, label in enumerate(list(pred)):
            if label != 0:
                label = y_label_dict[index]
            label_pred.append(label)
        test_predictions_labels.append(tuple([i for i in label_pred if i != 0]))
    return test_predictions_labels


test_pred_labels = get_pred_labels(data, test_predictions)
test_labels = get_pred_labels(data, y_test)

for i in range(90, 97):
    print('\ny_label: ', test_labels[i], '\ny_pred: ', test_pred_labels[i])

The results are not too bad, but they can be better. Please experiment with hyperparameter tuning and try different classifiers to check the performance of the model. Hope you enjoyed reading.

I have left the code for building a word cloud image below, in case you are interested.

Word Cloud with a Twitter mask

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# Note: scipy.misc.imread was removed in SciPy 1.2; on a recent SciPy,
# use imageio.imread('twitter.png') instead.
from scipy.misc import imread

comments_join = ' '.join(POPULAR_WORDS)
twitter_mask = imread('twitter.png', flatten=True)

wordcloud = WordCloud(
    stopwords=STOPWORDS,
    background_color='white',
    width=1800,
    height=1400,
    mask=twitter_mask
).generate(comments_join)

plt.figure(figsize=(12, 12), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.savefig('twitter_comments.png', dpi=300)
plt.show()

For the Jupyter notebook with the code, please click here.

Translated from: https://towardsdatascience.com/multi-label-classification-using-bag-of-words-bow-and-tf-idf-4f95858740e5
