Create an SMS Spam Classifier in Python

weixin_26752765

Original article: https://medium.com/analytics-vidhya/create-a-sms-spam-classifier-in-python-b4b015f7404b

Introduction

I have always been fascinated by Gmail's spam detection system, which seems to judge effortlessly whether an incoming email is spam and therefore not worth our limited attention.

In this article, I try to recreate such a spam detection system for SMS messages. I will use a few different models and compare their performance.

The models are as follows:

Multinomial Naive Bayes Model (count tokenizer)
Multinomial Naive Bayes Model (tfidf tokenizer)
Support Vector Classifier Model
Logistic Regression Model with ngrams parameters
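As a rough illustration, the four models listed above could be set up as scikit-learn pipelines. This is only a sketch: the article does not state its hyperparameters, so the vectorizer used for the SVC, the kernel, the `ngram_range=(1, 2)` setting, `max_iter`, and the dictionary keys are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_models():
    """Return the four candidate models as a name -> pipeline dict.

    All hyperparameters are illustrative assumptions, not the article's settings.
    """
    return {
        # Raw word counts feeding a multinomial Naive Bayes classifier.
        "nb_count": make_pipeline(CountVectorizer(), MultinomialNB()),
        # TF-IDF weighted counts feeding the same classifier.
        "nb_tfidf": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        # Assumption: TF-IDF features and a linear kernel for the SVC.
        "svc": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
        # ngram_range=(1, 2) adds word bigrams to the unigram features.
        "logreg_ngrams": make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        ),
    }


models = build_models()
for name, pipeline in models.items():
    print(name, [step_name for step_name, _ in pipeline.steps])
```

Wrapping each vectorizer and classifier in one pipeline means the vectorizer is fitted only on whatever data is passed to `fit`, which helps avoid leaking test-set vocabulary into training.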

Using a train-test split, each of the 4 models went through the same stages: vectorizing X_train, fitting the model on X_train and Y_train, making predictions, and generating a confusion matrix and the area under the receiver operating characteristic curve (AUC-ROC) for evaluation.
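The stages described above can be sketched for one model as follows. The messages here are made-up stand-ins rather than rows from the real dataset, the split ratio and random seed are assumptions, and the AUC is computed from predicted probabilities, which may differ from how the article computed it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in messages (1 = spam, 0 = ham); not taken from the dataset.
texts = [
    "WINNER! Claim your free prize now", "Free entry to win cash, text now",
    "Urgent: your account won a reward", "Call now for a free ringtone",
    "You have won a guaranteed prize", "Cheap loans, reply now to claim",
    "Free tickets, text WIN to claim", "Claim your cash reward today",
    "Are we still meeting for lunch?", "I'll call you when I get home",
    "Can you pick up some milk?", "See you at the gym later",
    "Running late, be there in ten", "Did you finish the report?",
    "Happy birthday, have a great day", "Let me know when you are free",
]
labels = [1] * 8 + [0] * 8

# Stage 1: train-test split (stratified so both classes appear in the test set).
X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Stage 2: vectorize. Fit the vocabulary on X_train only, then reuse it on X_test.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Stage 3: fit the model on the vectorized X_train and Y_train.
model = MultinomialNB()
model.fit(X_train_vec, Y_train)

# Stage 4: predict hard labels and spam probabilities for the test set.
Y_pred = model.predict(X_test_vec)
Y_score = model.predict_proba(X_test_vec)[:, 1]

# Stage 5: evaluate with a confusion matrix and AUC-ROC.
cm = confusion_matrix(Y_test, Y_pred, labels=[0, 1])
auc = roc_auc_score(Y_test, Y_score)
print(cm)
print(f"AUC-ROC: {auc:.3f}")
```

With `labels=[0, 1]`, the rows of the confusion matrix are the true classes (ham, then spam) and the columns are the predicted classes in the same order.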

The best-performing model was the Logistic Regression Model, although all 4 models performed reasonably well at detecting spam messages (all AUC > 0.9).

Photo by Hannes Johnson on Unsplash

The Data

The data was obtained from UCI's Machine Learning Repository; I have also uploaded the dataset I used to my GitHub repo. In total, the dataset has 5571 rows and 2 columns: spamorham, which indicates each message's spam status, and the message's text. I found it amusing how relatable the texts are.

Definitions: "spam" refers to spam messages as they are commonly known; "ham" refers to non-spam messages.
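A minimal sketch of loading such a file with pandas is shown below. It assumes a tab-separated file with no header row, a label first and the message second; the copy in the author's repo may be laid out differently. The column name `text`, the derived `is_spam` column, and the helper `load_sms_data` are hypothetical names introduced for this example, and the two sample rows are made up.

```python
import csv
import io

import pandas as pd


def load_sms_data(source):
    """Read a label/message file into a DataFrame.

    `source` may be a file path or a file-like object. The tab separator and
    the missing header row are assumptions about the file layout.
    """
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["spamorham", "text"],  # "text" is a hypothetical column name
        quoting=csv.QUOTE_NONE,       # treat quote characters in messages literally
    )
    # Encode the label numerically: spam -> 1, ham -> 0.
    df["is_spam"] = (df["spamorham"] == "spam").astype(int)
    return df


# Two made-up rows standing in for the real file.
sample = io.StringIO(
    "ham\tSee you at lunch?\n"
    "spam\tWINNER! Claim your free prize now\n"
)
df = load_sms_data(sample)
print(df.shape)
print(df["spamorham"].value_counts())
```

For the real file, the same function would be called with a path in place of the in-memory sample, and `df.shape` gives a quick check against the row and column counts quoted above.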
