Multilabel Classification Using Bag of Words (BoW) and TF-IDF

This project uses two traditional techniques, Bag of Words (BoW) and tf-idf, to represent the words in a corpus in numeric form for multilabel classification.
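
To make the two representations concrete, here is a minimal pure-Python sketch on a made-up three-sentence corpus. The corpus, the whitespace tokenizer, and the helper names (`tokenize`, `bow_vector`, `idf`, `tfidf_vector`) are assumptions for illustration and are not taken from the original project. The idf here is the plain log(N / df) form; libraries such as scikit-learn apply a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not from the Kaggle dataset)
corpus = [
    "you are great",
    "you are awful awful",
    "great work",
]

def tokenize(text):
    # Naive whitespace tokenizer; real pipelines usually also strip punctuation
    return text.lower().split()

docs = [tokenize(doc) for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})

def bow_vector(tokens):
    # Bag of Words: raw count of each vocabulary term in the document
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def idf(term):
    # Plain (unsmoothed) inverse document frequency: log(N / df)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf_vector(tokens):
    # tf-idf: relative term frequency scaled by how rare the term is
    counts = Counter(tokens)
    return [counts[term] / len(tokens) * idf(term) for term in vocab]

bow = [bow_vector(doc) for doc in docs]
tfidf = [tfidf_vector(doc) for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

In the second sentence, "awful" appears twice and in no other document, so it gets both the highest count in the BoW vector and the highest tf-idf weight, while "you" and "are", which appear in two of the three documents, are weighted down.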

1. Load the data

For this study, we use the Kaggle dataset from the Toxic Comment Classification Challenge. Let's load and inspect the data. This is a multilabel classification problem in which comments are classified by type of toxicity: toxic / severe_toxic / obscene / threat / insult / identity_hate

import pandas as pd
data = pd.read_csv('train.csv')
print('Shape of the data: ', data.shape)
data.head()
[Figure: snapshot of the dataset]
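# Label columns: everything after the first two columns (id and comment text)
y_cols = list(data.columns[2:])
# Number of comments that carry more than one label
# (.sum() counts the True values; .count() would count every row)
is_multilabel = (data[y_cols].sum(axis=1) > 1).sum()
print('is_multilabel count: ', is_multilabel)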
  • From the above data we can see that not all comments have a label.
  • It is multilabel data (each comment can have more than one label).
  • Add a label, 'non_toxic', for comments with no label.
  • Let's also check how balanced the classes are.
# Add a label, 'non_toxic', for comments with no label
data['non_toxic'] = 1 - data[y_cols].max(axis=1)
y_cols += ['non_toxic']

# Inspect the class balance
def get_class_weight(data):
    # Percentage of comments that carry each label
    class_weight = {}
    for col in y_cols:
        class_weight[col] = round(data[col].sum() / data.shape[0] * 100, 2)
    return class_weight

class_weight = get_class_weight(data)
print('Total class weight: ', sum(class_weight.values()), '%\n\n', class_weight)
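
Because a comment can carry several labels at once, the per-label percentages need not add up to 100%. The sketch below reproduces the same steps with the standard library on a made-up four-comment label matrix; the rows, the two-label subset, and the variable names are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical toy label matrix: one dict per comment, 1 = label applies
rows = [
    {"toxic": 1, "insult": 1},  # two labels at once
    {"toxic": 1, "insult": 0},
    {"toxic": 0, "insult": 0},  # no label
    {"toxic": 0, "insult": 0},  # no label
]
labels = ["toxic", "insult"]

# non_toxic is 1 only when every other label is 0
for row in rows:
    row["non_toxic"] = 1 - max(row[label] for label in labels)

all_labels = labels + ["non_toxic"]

# Percentage of comments carrying each label
share = {
    label: round(sum(row[label] for row in rows) / len(rows) * 100, 2)
    for label in all_labels
}

# Comments with more than one of the original labels
multilabel_rows = sum(1 for row in rows if sum(row[l] for l in labels) > 1)

print(share)
print(sum(share.values()))
print(multilabel_rows)
```

Here the shares are 50% toxic, 25% insult, and 50% non_toxic, for a total of 125%: the one comment with two labels is counted under both.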

We can see that the data is highly imbalanced. Imbalanced data refers to a classification problem in which the classes are not represented equally; for example, 89% of the comments fall under the newly built 'non_toxic' label.
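
One way to see why this matters is to score a "model" that always predicts the majority class. The sketch below simplifies the problem to two classes (non_toxic versus any toxic label) and uses made-up counts that mirror the 89% figure above; the class name `toxic_any` and the sample size are assumptions for illustration only.

```python
# Hypothetical labels mirroring the roughly 89% / 11% split quoted above
n_total = 1000
n_non_toxic = 890
y_true = ["non_toxic"] * n_non_toxic + ["toxic_any"] * (n_total - n_non_toxic)

# A baseline that ignores its input and always predicts the majority class
y_pred = ["non_toxic"] * n_total

# Overall accuracy: fraction of comments predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall on the minority class: fraction of toxic comments actually caught
minority_hits = sum(t == p for t, p in zip(y_true, y_pred) if t == "toxic_any")
minority_recall = minority_hits / (n_total - n_non_toxic)

print(accuracy)
print(minority_recall)
```

The baseline reaches 89% accuracy while catching none of the toxic comments, which is why accuracy alone is a misleading metric on data this skewed.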

Any give
