Creating an SMS Spam Classifier in Python

Introduction

I have always been fascinated by Gmail's spam detection system, which seems to judge effortlessly whether incoming emails are spam and therefore not worthy of our limited attention.

In this article, I try to recreate such a spam detection system, but for SMS messages. I will use a few different models and compare their performance.

The models are as follows:

  1. Multinomial Naive Bayes model (count vectorizer)
  2. Multinomial Naive Bayes model (TF-IDF vectorizer)
  3. Support Vector Classifier model
  4. Logistic Regression model with n-gram parameters
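As a rough illustration, the four models above could be set up as scikit-learn pipelines. This is only a sketch: the dictionary keys, the vectorizer paired with the SVC and Logistic Regression models, the `ngram_range=(1, 2)` setting, and `max_iter=1000` are all assumptions for illustration, not settings taken from the article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumed setup: the vectorizer paired with each classifier and every
# hyperparameter below are illustrative choices, not the article's.
models = {
    # Naive Bayes on raw token counts
    "nb_count": make_pipeline(CountVectorizer(), MultinomialNB()),
    # Naive Bayes on TF-IDF weighted counts
    "nb_tfidf": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    # Support vector classifier (TF-IDF features assumed)
    "svc": make_pipeline(TfidfVectorizer(), SVC()),
    # Logistic regression on unigrams and bigrams (range assumed)
    "logreg_ngrams": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    ),
}
```

Wrapping the vectorizer and classifier in one pipeline means each model can be fitted directly on raw message strings, which keeps the comparison between the four models uniform.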

Using a train-test split, each of the 4 models went through the same stages: vectorizing X_train, fitting the model on X_train and Y_train, making predictions, and generating the confusion matrix and the area under the receiver operating characteristic curve (AUC-ROC) for evaluation.
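The stages described above can be sketched for one model, Multinomial Naive Bayes with a count vectorizer. The messages below are a hypothetical toy corpus standing in for the real dataset, and the split size, `random_state`, and the use of `predict_proba` scores for the AUC are my assumptions rather than the article's exact choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not from the dataset): 1 = spam, 0 = ham.
spam = [
    "WINNER! Claim your free prize now",
    "Free entry to win cash, text YES",
    "Urgent: your account won a free reward",
    "Congratulations, you won a free phone",
    "Claim your cash prize today, reply now",
    "Win a free holiday, call now",
    "Free ringtones, text WIN to claim",
    "You have won cash, claim your prize",
]
ham = [
    "Are we still meeting for lunch today?",
    "Running late, see you in ten minutes",
    "Can you call me when you get home?",
    "Thanks for dinner last night",
    "I will pick up the kids after work",
    "Did you finish the report for tomorrow?",
    "See you at the gym later",
    "Happy birthday, hope you have a great day",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Train-test split; stratify so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Vectorize: learn the vocabulary on the training text only,
# then reuse it to transform the test text.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit the model on the vectorized training data.
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict hard labels and spam probabilities for the test set.
predictions = model.predict(X_test_vec)
spam_scores = model.predict_proba(X_test_vec)[:, 1]

# Evaluate: rows of the matrix are true classes, columns are predictions.
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
auc = roc_auc_score(y_test, spam_scores)
print(cm)
print(f"AUC-ROC: {auc:.3f}")
```

Fitting the vectorizer on the training split only matters because the test messages should not influence the vocabulary the model sees during training.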

The best-performing model was the Logistic Regression model, although it should be noted that all 4 models performed reasonably well at detecting spam messages (all AUC > 0.9).

The Data

The data was obtained from UCI's Machine Learning Repository; I have also uploaded the dataset I used to my GitHub repo. In total, the data set has 5571 rows and 2 columns: spamorham, which indicates each message's spam status, and the message's text. I found it quite funny how relatable the text is.

Definitions: "spam" refers to spam messages as they are commonly known; "ham" refers to non-spam messages.
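To make the two-column layout and the spam/ham labels concrete, here is a minimal sketch of loading such data with pandas and encoding the label as 0 or 1. The three rows are hypothetical examples rather than real rows from the dataset, and the tab-separated layout, the `text` column name, and the `is_spam` column are assumptions; only `spamorham` is named in the description above.

```python
import io

import pandas as pd

# Hypothetical sample rows in an assumed "label<TAB>message" layout.
raw = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tWINNER! Claim your free prize now, reply YES\n"
    "ham\tRunning late, see you in ten minutes\n"
)

# Read the two columns; "text" is an assumed name for the message column.
df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["spamorham", "text"]
)

# Encode the label numerically: spam -> 1, ham -> 0.
df["is_spam"] = (df["spamorham"] == "spam").astype(int)

print(df["spamorham"].value_counts())
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path, and the column names may need adjusting to match the file's actual header.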
