TF-IDF Word Vectors and Bag of Words
This project uses traditional techniques such as Bag of Words and TF-IDF to represent the words of a corpus in numeric format for multilabel classification.
1. Load the data
For this study, we use the Kaggle data from the Toxic Comment Classification Challenge. Let's load and inspect the data. This is a multilabel classification problem where comments are classified by level of toxicity: toxic / severe_toxic / obscene / threat / insult / identity_hate.
import pandas as pd
data = pd.read_csv('train.csv')
print('Shape of the data: ', data.shape)
data.head()
![Image for post](https://miro.medium.com/max/9999/1*D5D-iuqP7N92qIS5Risd8w.png)
y_cols = list(data.columns[2:])
# Count comments that carry more than one label (.sum() on the boolean
# mask, not .count(), which would count every row)
is_multilabel = (data[y_cols].sum(axis=1) > 1).sum()
print('is_multilabel count: ', is_multilabel)
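The counting logic can be checked on a small hypothetical label matrix (the two column names below are illustrative):

```python
import pandas as pd

# Toy label matrix: row 0 carries two labels, rows 1 and 2 at most one
toy = pd.DataFrame({
    "toxic":  [1, 1, 0],
    "insult": [1, 0, 0],
})

# Sum the labels per row, then count rows with more than one active label.
# Note that .count() on the boolean mask would return 3 (all rows),
# regardless of how many are actually multilabel.
n_multilabel = (toy.sum(axis=1) > 1).sum()
print(n_multilabel)  # → 1
```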
- From the above data we can see that not all comments have a label.
- It is multilabel data (each comment can have more than one label).
- Add a label, ‘non_toxic’, for comments with no label.
- Let’s also explore how balanced the classes are.
# Add a label, ‘non_toxic’, for comments with no label
data['non_toxic'] = 1 - data[y_cols].max(axis=1)
y_cols += ['non_toxic']

# Inspect the class balance
def get_class_weight(data):
    class_weight = {}
    for col in y_cols:
        if col not in class_weight:
            class_weight[col] = round(data[data[col] == 1][col].sum() / data.shape[0] * 100, 2)
    return class_weight

class_weight = get_class_weight(data)
print('Total class weight: ', sum(class_weight.values()), '%\n\n', class_weight)
![Image for post](https://miro.medium.com/max/9999/1*23uPZSbEy66wUgLuCmjXXA.png)
We can see that the data is highly imbalanced. Imbalanced data refers to classification problems where the classes are not represented equally; for example, 89% of comments fall under the newly added ‘non_toxic’ label.
Any give