TF-IDF Word Vectors and Bag of Words
This project uses traditional techniques such as Bag of Words and TF-IDF to represent the words of a corpus in numeric format for multilabel classification.
1. Load the data
For this study, we use the Kaggle data from the Toxic Comment Classification Challenge. Let's load and inspect the data. This is a multilabel classification problem where comments are classified by level of toxicity: toxic / severe_toxic / obscene / threat / insult / identity_hate.
import pandas as pd
data = pd.read_csv('train.csv')
print('Shape of the data: ', data.shape)
data.head()
![Image for post](https://miro.medium.com/max/9999/1*D5D-iuqP7N92qIS5Risd8w.png)
y_cols = list(data.columns[2:])
# Count comments that carry more than one label (.sum() on the boolean
# mask, not .count(), which would count every row)
is_multilabel = (data[y_cols].sum(axis=1) > 1).sum()
print('is_multilabel count: ', is_multilabel)
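The counting logic can be checked on a small hypothetical label matrix (the two column names below are illustrative):

```python
import pandas as pd

# Toy label matrix: row 0 carries two labels, rows 1 and 2 at most one
toy = pd.DataFrame({
    "toxic":  [1, 1, 0],
    "insult": [1, 0, 0],
})

# Sum the labels per row, then count rows with more than one active label.
# Note that .count() on the boolean mask would return 3 (all rows),
# regardless of how many are actually multilabel.
n_multilabel = (toy.sum(axis=1) > 1).sum()
print(n_multilabel)  # → 1
```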
- From the above data we can see that not all comments have a label.
- It is multilabel data (each comment can have more than one label).
- Add a label, ‘non_toxic’, for comments with no label.
- Let’s also explore how balanced the classes are.
# Add a label, ‘non_toxic’, for comments with no label
data['non_toxic'] = 1 - data[y_cols].max(axis=1)
y_cols += ['non_toxic']

# Inspect the class balance
def get_class_weight(data):
    class_weight = {}
    for col in y_cols:
        if col not in class_weight:
            class_weight[col] = round(data[data[col] == 1][col].sum() / data.shape[0] * 100, 2)
    return class_weight

class_weight = get_class_weight(data)
print('Total class weight: ', sum(class_weight.values()), '%\n\n', class_weight)
![Image for post](https://miro.medium.com/max/9999/1*23uPZSbEy66wUgLuCmjXXA.png)
We can see that the data is highly imbalanced. Imbalanced data refers to classification problems where the classes are not represented equally; for example, 89% of comments fall under the newly added ‘non_toxic’ label.
Any give