[Kaggle] Spam/Ham Email Classification with spaCy

Competition: https://www.kaggle.com/c/ds100fa19
Related posts:
[Kaggle] Spam/Ham Email Classification (RNN/GRU/LSTM)
[Kaggle] Spam/Ham Email Classification (BERT)

1. Importing Packages

import pandas as pd
import spacy
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

2. Data Preview

train.head(10)
train = train.fillna(" ")
test = test.fillna(" ")

Be sure to handle the NaN values here, otherwise spaCy will raise an error later; see:
spaCy error "gold.pyx in spacy.gold.GoldParse.init()" and its fix: https://michael.blog.csdn.net/article/details/109106806
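Before filling, the damage can be quantified with `isnull()`. A minimal sketch on a hypothetical two-row frame (the column names mirror train.csv):

```python
import pandas as pd

# hypothetical mini-frame standing in for train.csv
df = pd.DataFrame({"subject": ["Re: meeting", None],
                   "email": ["see you at 3", "click now"]})
print(df.isnull().sum())        # NaN count per column: subject 1, email 0
df = df.fillna(" ")             # same fix as above
print(df.isnull().sum().sum())  # 0: nothing left for spaCy to choke on
```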

3. Feature Combination

  • Concatenate the email subject and body, and build the labels
train['all'] = train['subject']+train['email']
train['label'] = [{"spam": bool(y), "ham": not bool(y)}
                  for y in train.spam.values]
train.head(10)

This label format is what spaCy's text categorizer expects: each example carries a dict mapping every category name to a boolean (or score), so both 'spam' and 'ham' appear explicitly.
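The two lines above can be tried end to end on a tiny stand-in frame (a hypothetical two-row sample with the same columns as train.csv):

```python
import pandas as pd

# hypothetical two-row sample with the same columns as train.csv
df = pd.DataFrame({"subject": ["Re: meeting ", "WIN CASH "],
                   "email": ["see you at 3", "click now"],
                   "spam": [0, 1]})
df['all'] = df['subject'] + df['email']
df['label'] = [{"spam": bool(y), "ham": not bool(y)} for y in df.spam.values]
print(df['all'][0])    # Re: meeting see you at 3
print(df['label'][1])  # {'spam': True, 'ham': False}
```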

  • Split into training and validation sets, using stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit
# help(StratifiedShuffleSplit)
splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, valid_idx in splt.split(train, train['spam']):
    # stratified by the second argument, train['spam']
    train_set = train.iloc[train_idx]
    valid_set = train.iloc[valid_idx]

# check the label distribution in each split
print(train_set['spam'].value_counts()/len(train_set))
print(valid_set['spam'].value_counts()/len(valid_set))

Output: the label distributions of the two splits are nearly identical

0    0.743636
1    0.256364
Name: spam, dtype: float64
0    0.743713
1    0.256287
Name: spam, dtype: float64
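What the stratified split guarantees can be checked on toy data — a hypothetical 80/20 label vector standing in for train['spam']:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy data with roughly the same 80/20 imbalance as the emails
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)
splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, valid_idx = next(splt.split(X, y))
# the 20-row validation split keeps the class ratio exactly: 16 zeros, 4 ones
print(len(valid_idx), (y[valid_idx] == 1).sum())
```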
  • Separate text and labels
train_text = train_set['all'].values
train_label = train_set['label']
valid_text = valid_set['all'].values
valid_label = valid_set['label']

# the labels need one more wrapper: a dict under the 'cats' key,
# which is the key spaCy's textcat component reads gold labels from
train_label = [{"cats": label} for label in train_label]
valid_label = [{"cats": label} for label in valid_label]

# pair texts with their annotations, then materialize as a list
train_data = list(zip(train_text, train_label))

test_text = (test['subject']+test['email']).values
print(train_label[0])

Output:

{'cats': {'spam': False, 'ham': True}}

4. Modeling

  • Create the model and the pipeline component
nlp = spacy.blank('en')  # create a blank English pipeline
email_cat = nlp.create_pipe('textcat',
#                             config={
#                                 "exclusive_classes": True,  # classes are mutually exclusive (binary task)
#                                 "architecture": "bow"       # bag-of-words classifier
#                             }
                           )
# the string 'textcat' is not arbitrary: it must be a built-in component name
# the config is optional; in spaCy v2 these options are documented under the
# TextCategorizer API
help(nlp.create_pipe)
  • Add the component to the pipeline
nlp.add_pipe(email_cat)
  • Add the labels
# note the order: 'ham' is column 0, 'spam' is column 1
email_cat.add_label('ham')
email_cat.add_label('spam')
  • Training
from spacy.util import minibatch
import random
def train(model, data, optimizer, batch_size=8):
    loss = {}
    random.seed(1)
    random.shuffle(data)  # shuffle the training pairs in place
    batches = minibatch(data, size=batch_size)  # split into minibatches
    for batch in batches:
        text, label = zip(*batch)
        model.update(text, label, sgd=optimizer, losses=loss)
    return loss
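The batching and unpacking inside the loop can be illustrated without spaCy; a minimal pure-Python stand-in for `spacy.util.minibatch` (assuming a fixed batch size, unlike spaCy's compounding option):

```python
# a minimal pure-Python stand-in for spacy.util.minibatch (fixed batch size)
def minibatch(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

data = list(zip(["a", "b", "c", "d", "e"], [0, 1, 0, 1, 0]))
batches = list(minibatch(data, size=2))
print(len(batches))             # 3 batches: sizes 2, 2, 1
text, label = zip(*batches[0])  # unpack one batch, as train() does
print(text, label)              # ('a', 'b') (0, 1)
```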
  • Prediction
def predict(model, text):
    docs = [model.tokenizer(txt) for txt in text]  # tokenize the raw texts first
    emailpred = model.get_pipe('textcat')
    score, _ = emailpred.predict(docs)
    pred_label = score.argmax(axis=1)  # highest-scoring column per doc
    return pred_label
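The `argmax` step maps score rows to class indices. Since the labels were added in the order ham, spam, column 0 is 'ham' and column 1 is 'spam'; a sketch with made-up scores:

```python
import numpy as np

# one row of scores per document, columns in the order the labels were added:
# column 0 = 'ham', column 1 = 'spam'
score = np.array([[0.9, 0.1],   # confident ham
                  [0.2, 0.8]])  # confident spam
pred_label = score.argmax(axis=1)
print(pred_label)  # [0 1] -> 0 = ham, 1 = spam
```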
  • Evaluation
def evaluate(model, text, label):
    pred = predict(model, text)
    true_class = [int(lab['cats']['spam']) for lab in label]
    correct = (pred == true_class)
    acc = sum(correct) / len(correct)  # accuracy
    return acc
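The comparison inside `evaluate` relies on NumPy broadcasting: an ndarray compared against a plain list yields an elementwise boolean array, so `sum/len` gives the accuracy directly. A sketch with made-up predictions:

```python
import numpy as np

# ndarray == list broadcasts elementwise, exactly as in evaluate()
pred = np.array([1, 0, 1, 1])
true_class = [1, 0, 0, 1]
correct = (pred == true_class)   # [ True  True False  True]
acc = sum(correct) / len(correct)
print(acc)  # 0.75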

5. Training

n = 20  # number of training epochs
opt = nlp.begin_training()  # initialize the weights and get the optimizer
for i in range(n):
    loss = train(nlp, train_data, opt)
    acc = evaluate(nlp, valid_text, valid_label)
    print(f"Loss: {loss['textcat']:.3f} \t Accuracy: {acc:.3f}")

Output:

Loss: 1.132 	 Accuracy: 0.941
Loss: 0.283 	 Accuracy: 0.988
Loss: 0.121 	 Accuracy: 0.993
Loss: 0.137 	 Accuracy: 0.993
Loss: 0.094 	 Accuracy: 0.982
Loss: 0.069 	 Accuracy: 0.995
Loss: 0.060 	 Accuracy: 0.990
Loss: 0.010 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.991
Loss: 0.004 	 Accuracy: 0.991
Loss: 0.308 	 Accuracy: 0.981
Loss: 0.158 	 Accuracy: 0.987
Loss: 0.014 	 Accuracy: 0.990
Loss: 0.007 	 Accuracy: 0.990
Loss: 0.043 	 Accuracy: 0.990

6. Prediction

pred = predict(nlp, test_text)

  • Write the submission file
output = pd.DataFrame({'id': test['id'], 'Class': pred})
output.to_csv("submission.csv", index=False)
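The submission frame itself is easy to sanity-check before uploading; a sketch with hypothetical predictions for three test rows:

```python
import pandas as pd

# hypothetical predictions for three test rows
output = pd.DataFrame({'id': [0, 1, 2], 'Class': [1, 0, 1]})
csv = output.to_csv(index=False)
print(csv.splitlines()[0])  # header row: id,Class
```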

The model achieves over 99% accuracy on the test set!


My CSDN blog: https://michael.blog.csdn.net/

