[Kaggle] Spam/Ham Email Classification (RNN/GRU/LSTM)

Competition: https://www.kaggle.com/c/ds100fa19
Related posts:
[Kaggle] Spam/Ham Email Classification (spacy)
[Kaggle] Spam/Ham Email Classification (BERT)

1. Loading the Data

  • Read the data; note that the test set has no labels
import pandas as pd
import numpy as np
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

  • Check the data for null cells
print(np.sum(np.array(train.isnull()==True), axis=0))
print(np.sum(np.array(test.isnull()==True), axis=0))

There are NaN cells; the per-column null counts are:

[0 6 0 0]
[0 1 0]
  • Fill them with fillna
train = train.fillna(" ")
test = test.fillna(" ")
print(np.sum(np.array(train.isnull()==True), axis=0))
print(np.sum(np.array(test.isnull()==True), axis=0))

After filling, every column sums to 0:

[0 0 0 0]
[0 0 0]
  • The label y has only two values: 0 = not spam (ham), 1 = spam
print(train['spam'].unique())
[0 1]
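Before modeling, it is also worth checking how balanced the two classes are, since plain accuracy can be misleading on a skewed spam/ham split. A minimal sketch of that check, using a toy label list invented for illustration in place of `train['spam']`:

```python
from collections import Counter

# Toy label list standing in for train['spam']; the real column
# would come from the DataFrame loaded above.
labels = [0, 0, 0, 1, 0, 1, 0, 0]

counts = Counter(labels)          # tally each class
total = len(labels)
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} samples ({n / total:.1%})")
```

If one class dominates heavily, metrics such as precision/recall or a class-weighted loss become more informative than accuracy alone.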

2. Text Processing

  • Concatenate the subject and the email body into a single text feature
X_train = train['subject'] + ' ' + train['email']
y_train = train['spam']
X_test = test['subject'] + ' ' + test['email']
  • Convert the texts into sequences of token ids
from keras.preprocessing.text import Tokenizer
max_words = 300
tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
# only assign ids to the 300 most frequent words; ignore the rest
tokenizer.fit_on_texts(list(X_train)+list(X_test)) # fit the tokenizer on all texts
X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)
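What the tokenizer does can be sketched in plain Python: rank words by corpus frequency, keep only the top `num_words`, and map each text to the ids of its surviving words, silently dropping out-of-vocabulary words (as Keras does when no `oov_token` is set). This is a simplified sketch — Keras's exact tie-breaking and its off-by-one handling of `num_words` differ slightly — and the toy corpus is invented for illustration:

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency and assign ids 1..num_words (0 is reserved for padding)."""
    freq = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in freq.most_common(num_words)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_sequences(texts, vocab):
    """Map each text to the ids of its in-vocabulary words; OOV words are dropped."""
    return [[vocab[w] for w in t.lower().split() if w in vocab] for t in texts]

corpus = ["free money now", "call me now", "free call"]
vocab = fit_vocab(corpus, num_words=4)
seqs = texts_to_sequences(corpus, vocab)
print(vocab)  # the 4 most frequent words, ids starting at 1
print(seqs)   # "me" falls outside the top 4 and is dropped
```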
  • Pad the id sequences so they all have the same length
# samples have different token lengths, so pad them
maxlen = 100
from keras.preprocessing import sequence
X_train_tokens_pad = sequence.pad_sequences(X_train_tokens, maxlen=maxlen,padding='post')
X_test_tokens_pad = sequence.pad_sequences(X_test_tokens, maxlen=maxlen,padding='post')
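The padding step itself is simple enough to sketch in plain Python: sequences longer than `maxlen` are truncated and shorter ones are filled with 0s at the end (`padding='post'`). One caveat: this sketch truncates from the end, whereas Keras's default `truncating='pre'` drops tokens from the front:

```python
def pad_sequences_post(seqs, maxlen, value=0):
    """Pad/truncate each id sequence to exactly maxlen, padding at the end with 0."""
    padded = []
    for s in seqs:
        s = s[:maxlen]                                   # truncate overlong sequences
        padded.append(s + [value] * (maxlen - len(s)))   # pad short ones with 0
    return padded

print(pad_sequences_post([[5, 3, 8], [1]], maxlen=4))  # [[5, 3, 8, 0], [1, 0, 0, 0]]
```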

3. Building the Model

embeddings_dim = 30 # word-embedding dimension
from keras.models import Model, Sequential
from keras.layers import Embedding, LSTM, GRU, SimpleRNN, Dense
model = Sequential()
model.add(Embedding(input_dim=max_words, # size of the vocabulary
                    output_dim=embeddings_dim, # embedding dimension
                    input_length=maxlen))
model.add(GRU(units=64)) # can be swapped for SimpleRNN or LSTM
model.add(Dense(units=1, activation='sigmoid'))
model.summary()

Model structure:

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 30)           9000      
_________________________________________________________________
gru (GRU)                    (None, 64)                18432     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 27,497
Trainable params: 27,497
Non-trainable params: 0
_________________________________________________________________
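The parameter counts in the summary can be verified by hand. The embedding holds `max_words × embeddings_dim` weights; the GRU, with Keras's default `reset_after=True` (which keeps two bias vectors per gate), has 3 gates, each with `units × (input_dim + units)` weights plus `2 × units` biases; the dense layer has `units` weights plus one bias:

```python
max_words, embeddings_dim, units = 300, 30, 64

embedding_params = max_words * embeddings_dim                     # 300 * 30
gru_params = 3 * (units * (embeddings_dim + units) + 2 * units)   # 3 gates, double bias
dense_params = units * 1 + 1                                      # weights + bias

print(embedding_params, gru_params, dense_params)  # 9000 18432 65
print(embedding_params + gru_params + dense_params)  # 27497, matching the summary
```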

4. Training

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']) # configure the model
history = model.fit(X_train_tokens_pad, y_train,
                    batch_size=128, epochs=10, validation_split=0.2)
model.save("email_cat_lstm.h5") # save the trained model
  • Plot the training curves
from matplotlib import pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.show()

5. Testing

pred_prob = model.predict(X_test_tokens_pad).squeeze()
pred_class = np.asarray(pred_prob > 0.5).astype(np.int32)
ids = test['id'] # avoid shadowing the built-in id()
output = pd.DataFrame({'id': ids, 'Class': pred_class})
output.to_csv("submission_gru.csv",  index=False)
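The thresholding line above turns each sigmoid probability into a hard 0/1 label. A plain-Python equivalent, with the probability values invented for illustration:

```python
def to_classes(probs, threshold=0.5):
    """Map sigmoid outputs to hard labels: 1 (spam) when strictly above the threshold, else 0 (ham)."""
    return [1 if p > threshold else 0 for p in probs]

print(to_classes([0.91, 0.04, 0.50, 0.73]))  # 0.50 is not strictly greater, so it maps to 0
```

The 0.5 cutoff is only the default choice; raising it trades recall for precision (fewer ham emails misfiled as spam), which matters when false positives are costly.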
  • Comparison of the three RNN variants:
Michael阿明