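Spark MLlib example: logistic regression applied to binary classification. Spam filtering serves as the example here, where every email falls into one of two classes: spam or not spam.
1. Spam classification uses two classes from Spark MLlib:
1) HashingTF: builds term frequency feature vectors from text data.
2) LogisticRegressionWithSGD: implements logistic regression using stochastic gradient descent (SGD).
2. Original training dataset
Spam examples: spam.txt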
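To make the two steps above concrete, here is a minimal plain-Python sketch of the underlying ideas: the hashing trick for term frequency vectors, and logistic regression trained by per-sample stochastic gradient descent. It is not Spark code and does not reproduce MLlib's internals. The function names (`hashing_tf`, `train_logistic_sgd`, `predict`), the `crc32` hash, the vector size, the step size, the epoch count, and the two non-spam sentences are all assumptions made for illustration; the two spam sentences are taken from spam.txt.

```python
import math
import zlib

NUM_FEATURES = 1000  # assumed vector size, chosen only for this sketch


def hashing_tf(text, num_features=NUM_FEATURES):
    """Map text to a fixed-length term frequency vector via the hashing trick."""
    vec = [0.0] * num_features
    for word in text.lower().split():
        # crc32 is a stand-in hash for this sketch, not the hash Spark uses
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(samples, num_features=NUM_FEATURES, step=0.1, epochs=100):
    """Fit weights with per-sample SGD; samples is a list of (label, vector)."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for label, x in samples:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = p - label  # gradient of the log loss w.r.t. the score
            for i, xi in enumerate(x):
                if xi:
                    weights[i] -= step * error * xi
    return weights


def predict(weights, text):
    """Return 1 (spam) if the predicted probability is at least 0.5, else 0."""
    x = hashing_tf(text, len(weights))
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Label 1 = spam (sentences from spam.txt), label 0 = non-spam (hypothetical).
spam = [
    "Get Viagra real cheap! Send money right away to ...",
    "THIS IS NOT A SCAM! Send money and get access to awesome stuff "
    "really cheap and never have to ...",
]
normal = [
    "Hi, are we still meeting for lunch tomorrow at noon?",
    "Please review the attached project report before Friday",
]
training = [(1, hashing_tf(t)) for t in spam] + [(0, hashing_tf(t)) for t in normal]
weights = train_logistic_sgd(training)

for text in spam + normal:
    print(predict(weights, text), text[:40])
```

In this sketch every word is hashed straight into a bucket, so no vocabulary has to be built or stored, at the cost of occasional collisions between unrelated words. A larger vector size reduces collisions but makes each vector longer.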
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really cheap and never have to ...
Non-spam examples: normal.txt