How to Do Multi-Label Text Classification with Scikit-learn

In an earlier post we covered the multi-label classification problem in machine learning in detail, along with several approaches to solving it.

In short, multi-label classification assigns a set of target labels to each sample. You can think of it as predicting multiple, non-mutually-exclusive properties of a data point: a 7-Eleven, for example, can be categorized both as a roadside convenience store and as a roadside snack shop. Among multi-label problems, multi-label text classification is especially common in practice, with applications such as tagging products on a shopping site or assigning a movie to one or more genres. This post shows how to solve multi-label text classification with Scikit-learn.
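To make this concrete, here is a minimal sketch, using made-up movie genres rather than this post's dataset, of what multi-label targets look like: one row of 0/1 indicators per sample, where any number of entries can be 1 at once:

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical samples: each movie can belong to several genres at once.
genres = [['action', 'comedy'], ['drama'], ['action', 'drama', 'thriller']]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)
print(mlb.classes_)  # ['action' 'comedy' 'drama' 'thriller']
print(y)
# [[1 1 0 0]
#  [0 0 1 0]
#  [1 0 1 1]]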

Problem Description

Many of us have been insulted or harassed online, and the problem does not disappear just because you close the website or put down your phone. Researchers at Google are currently building tools to study toxic comments on the web. In this post, I (Susan Li, the original author; translator's note) will build a multi-label model that can detect different types of toxicity, such as threats, obscenity, and insults, using supervised classifiers and text representations. A toxic comment may be a threat, obscene, an insult, identity hate, or any combination of these. All of our data comes from Kaggle.

Data Exploration

%matplotlib inline
import re
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

df = pd.read_csv("train 2.csv", encoding="ISO-8859-1")
df.head()

Number of comments per category:

df_toxic = df.drop(['id', 'comment_text'], axis=1)
counts = []
categories = list(df_toxic.columns.values)
for i in categories:
    counts.append((i, df_toxic[i].sum()))
df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
df_stats

df_stats.plot(x='category', y='number_of_comments', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of comments per category")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('category', fontsize=12)

Multiple Labels

How many comments have multiple labels?

rowsums = df.iloc[:, 2:].sum(axis=1)
x = rowsums.value_counts()

# plot
plt.figure(figsize=(8, 5))
ax = sns.barplot(x.index, x.values)  # on seaborn >= 0.12, pass keywords: sns.barplot(x=x.index, y=x.values)
plt.title("Multiple categories per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of categories', fontsize=12)

The vast majority of comments are not labelled at all.

print('Percentage of comments that are not labelled:')
print(len(df[(df['toxic']==0) & (df['severe_toxic']==0) & (df['obscene']==0) & (df['threat']==0) & (df['insult']==0) & (df['identity_hate']==0)]) / len(df))

Percentage of comments that are not labelled:

0.8983211235124177
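For reference, an equivalent and more compact way to compute the same figure (assuming the six label columns above) is a boolean mask over the per-row label sums:

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
print((df[label_cols].sum(axis=1) == 0).mean())  # fraction of rows with no label set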

The distribution of comment lengths (in characters):

lens = df.comment_text.str.len()
lens.hist(bins=np.arange(0, 5000, 50))

Most comments are under 500 characters long, with some outliers of up to 5,000 characters.

There are no missing comments in the comment_text column.

print('Number of missing comments in comment text:')
df['comment_text'].isnull().sum()

Number of missing comments in comment text:

0

A look at the first comment makes it clear that the text data needs cleaning.

df['comment_text'][0]

“Explanation\rWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27”

Data Preprocessing

Create a function to clean the text:

def clean_text(text):
    text = text.lower()
    # expand common English contractions
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r'\W', ' ', text)   # strip non-word characters
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = text.strip(' ')
    return text

Clean the comment_text column:

df['comment_text'] = df['comment_text'].map(lambda com: clean_text(com))
df['comment_text'][0]

‘explanation why the edits made under my username hardcore metallica fan were reverted they were not vandalisms just closure on some gas after i voted at new york dolls fac and please do not remove the template from the talk page since i am retired now 89 205 38 27’

Much better than before!

Split the data into a training set and a test set:

categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train, test = train_test_split(df, random_state=42, test_size=0.33, shuffle=True)
X_train = train.comment_text
X_test = test.comment_text
print(X_train.shape)
print(X_test.shape)

(106912,)
(52659,)

Training Classifiers

Pipelines

Scikit-learn provides a Pipeline class to help automate machine learning workflows. Pipelines are very common in machine learning systems, since there is a lot of data to manipulate and many data transformations to apply. Below, we use a Pipeline to train each classifier.
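As a minimal sketch of what a Pipeline does under the hood (variable names here are illustrative), the two snippets below are equivalent: fit() runs each transformer's fit_transform on the training data before fitting the final estimator, and predict() runs each transformer's transform before predicting.

# With a Pipeline:
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(X_train, train['toxic'])
pred = pipe.predict(X_test)

# Equivalent manual steps:
tfidf = TfidfVectorizer()
X_train_dtm = tfidf.fit_transform(X_train)  # learn vocabulary + IDF, then transform
X_test_dtm = tfidf.transform(X_test)        # reuse the fitted vocabulary
clf = MultinomialNB().fit(X_train_dtm, train['toxic'])
pred = clf.predict(X_test_dtm)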

The OneVsRest Multi-Label Strategy

This multi-label strategy applies a binary mask over the labels: each prediction comes back as an array of 0s and 1s marking which class labels apply to that input row.
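To see that 0/1 output directly, note that OneVsRestClassifier can also be fit on all six label columns at once, in which case predict() returns one row of six indicators per comment. A quick sketch (the printed values are illustrative); the rest of this post instead trains one binary classifier per category in a loop:

y_train = train[categories].values  # (n_samples, 6) binary indicator matrix
ovr_demo = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB())),
])
ovr_demo.fit(X_train, y_train)
print(ovr_demo.predict(X_test[:2]))  # e.g. [[0 0 0 0 0 0], [1 0 1 0 1 0]] (illustrative)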

Naive Bayes

The OneVsRest strategy can be used for multi-label learning, where a classifier is used to predict multiple labels for an instance. Naive Bayes handles multi-class problems out of the box, but since we have a multi-label problem, we wrap MultinomialNB in a OneVsRestClassifier.

# Define a pipeline combining a text feature extractor with a multi-label classifier
NB_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    NB_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = NB_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

... Processing toxic
Test accuracy is 0.9191401279933155
... Processing severe_toxic
Test accuracy is 0.9900112041626312
... Processing obscene
Test accuracy is 0.9514802787747584
... Processing threat
Test accuracy is 0.9971135038644866
... Processing insult
Test accuracy is 0.9517271501547694
... Processing identity_hate
Test accuracy is 0.9910556600011394

LinearSVC

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

... Processing toxic
Test accuracy is 0.9599498661197516
... Processing severe_toxic
Test accuracy is 0.9906948479842003
... Processing obscene
Test accuracy is 0.9789019920621356
... Processing threat
Test accuracy is 0.9974173455629617
... Processing insult
Test accuracy is 0.9712299891756395
... Processing identity_hate
Test accuracy is 0.9919861752027194

Logistic Regression

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    LogReg_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

... Processing toxic
Test accuracy is 0.9548415275641391
... Processing severe_toxic
Test accuracy is 0.9910556600011394
... Processing obscene
Test accuracy is 0.9761104464573956
... Processing threat
Test accuracy is 0.9973793653506523
... Processing insult
Test accuracy is 0.9687612753755294
... Processing identity_hate
Test accuracy is 0.991758293928863

The three classifiers produce broadly similar results. With that, we have built a strong baseline for the toxic-comment multi-label text classification problem.
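As a final usage sketch (the input comment below is made up): each training loop above refits the same pipeline object, so after a loop finishes, only the last category's model survives. To tag new comments across all six categories, keep one fitted copy per category, for example with sklearn.base.clone:

from sklearn.base import clone

fitted = {}
for category in categories:
    model = clone(SVC_pipeline)          # fresh, unfitted copy per category
    model.fit(X_train, train[category])
    fitted[category] = model

new_comments = [clean_text("you are a wonderful person")]  # hypothetical input
tags = [c for c in categories if fitted[c].predict(new_comments)[0] == 1]
print(tags)  # e.g. [] for a harmless comment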

The complete code for this project is available at:
