Using Spark to Build a Logistic Regression Model to Help Helen Find a Boyfriend

Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents


About the author: Allen (风雪夜归子), a machine learning engineer who enjoys digging into Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to get in touch. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents




Suppose Helen has been using an online dating site to look for a suitable partner. Although the site recommends different candidates, she has not found anyone she likes. After some reflection, she realized that the people she has dated fall into three types:

□ people she didn't like
□ people of average charm
□ people of great charm

Despite having discovered this pattern, Helen still cannot sort the site's recommended matches into the right category. She feels she could date people of average charm on weekdays, while on weekends she would rather spend time with people of great charm. Helen hopes our classification algorithm can help her assign each match to the proper category. She has also collected some information that the dating site does not record, which she believes is helpful for categorizing matches.



Helen has been collecting dating data for some time. The data are stored in the text file datingTestSet, one sample per line, 1000 lines in total. Each sample has the following three features:
□ frequent flyer miles earned per year
□ percentage of time spent playing video games
□ liters of ice cream consumed per week


file_content = sc.textFile('/Users/youwei.tan/Desktop/datingTestSet.txt')  # load the raw text file
df = file_content.map(lambda x: x.split('\t'))  # split each tab-separated line into a list of fields
df.take(2)  # peek at the first two records


The output is as follows:


[[u'40920', u'8.326976', u'0.953952', u'largeDoses'],
 [u'14488', u'7.153469', u'1.673904', u'smallDoses']]



Next, convert the dataset into a DataFrame, as follows:


dataset = sqlContext.createDataFrame(df, ['Mileage', 'Gametime', 'Icecream', 'label'])
dataset.show(5, False)
dataset.printSchema()


The output is as follows:

+--------+---------+--------+----------+
|Mileage |Gametime |Icecream|label     |
+--------+---------+--------+----------+
|40920   |8.326976 |0.953952|largeDoses|
|14488   |7.153469 |1.673904|smallDoses|
|26052   |1.441871 |0.805124|didntLike |
|75136   |13.147394|0.428964|didntLike |
|38344   |1.669788 |0.134296|didntLike |
+--------+---------+--------+----------+
only showing top 5 rows


root
 |-- Mileage: string (nullable = true)
 |-- Gametime: string (nullable = true)
 |-- Icecream: string (nullable = true)
 |-- label: string (nullable = true)


Build an index dictionary for the label column, so that the string-valued label can be converted into a numeric label.






# collect the distinct label strings, then assign each one an integer index
label_set = dataset.map(lambda x: x[3]).distinct().collect()
label_dict = {key: i for i, key in enumerate(label_set)}
label_dict


Output:


{u'didntLike': 0, u'largeDoses': 1, u'smallDoses': 2}
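
Incidentally, Spark ML ships a StringIndexer transformer that performs this string-to-index mapping directly on a DataFrame; note it assigns indices by descending label frequency, so the exact numbers may differ from the hand-built dictionary. A minimal sketch:


from pyspark.ml.feature import StringIndexer

# fit the indexer on the label column and append the numeric index as a new column
indexer = StringIndexer(inputCol='label', outputCol='labelIndex')
indexer.fit(dataset).transform(dataset).select('label', 'labelIndex').show(3)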



The columns of the dataset we have so far are all strings and need to be converted to numeric types. The code is as follows:


# convert the three feature columns to numbers and map the label string to its index
data = dataset.map(lambda x: [int(x[0]), float(x[1]), float(x[2]), label_dict[x[3]]])
data = sqlContext.createDataFrame(data, ['Mileage', 'Gametime', 'Icecream', 'label'])
data.show(5, False)
data.printSchema()


Output:


+--------+---------+--------+-----+
|Mileage |Gametime |Icecream|label|
+--------+---------+--------+-----+
|40920   |8.326976 |0.953952|1    |
|14488   |7.153469 |1.673904|2    |
|26052   |1.441871 |0.805124|0    |
|75136   |13.147394|0.428964|0    |
|38344   |1.669788 |0.134296|0    |
+--------+---------+--------+-----+
only showing top 5 rows


root
 |-- Mileage: long (nullable = true)
 |-- Gametime: double (nullable = true)
 |-- Icecream: double (nullable = true)
 |-- label: long (nullable = true)
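
The same conversion can also be done without dropping down to RDD transformations, using the DataFrame API's cast plus a small UDF for the label column; a sketch, reusing the label_dict built above (data_alt is a hypothetical name):


from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# map each label string to its integer index; cast the feature columns to numeric types
label_udf = udf(lambda s: label_dict[s], IntegerType())
data_alt = dataset.select(dataset['Mileage'].cast('int').alias('Mileage'),
                          dataset['Gametime'].cast('double').alias('Gametime'),
                          dataset['Icecream'].cast('double').alias('Icecream'),
                          label_udf(dataset['label']).alias('label'))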

The dataset is now in the shape we need, so the next step is to build the model. Before fitting, I first standardize the features, then reduce the dimensionality with principal component analysis (PCA), and finally classify and predict probabilities with a logistic regression model. The code is as follows:




from __future__ import print_function

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import PCA, StandardScaler
from pyspark.mllib.linalg import Vectors


# Merge class 2 into class 1, i.e. Helen's impression of a match is either
# "has charm" or "has no charm". We merge the classes because
# pyspark.ml.classification.LogisticRegression currently supports binary
# classification only.
feature_data = data.map(lambda x: (Vectors.dense([x[i] for i in range(0, 3)]),
                                   float(1 if x[3] == 2 else x[3])))
feature_data = sqlContext.createDataFrame(feature_data, ['features', 'labels'])

# hold out 30% of the samples for testing, with a fixed seed for reproducibility
train_data, test_data = feature_data.randomSplit([0.7, 0.3], 6)

scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withStd=True, withMean=False)
pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
lr = LogisticRegression(maxIter=10, featuresCol='pcaFeatures', labelCol='labels')

# chain standardization, PCA, and logistic regression into a single pipeline
pipeline = Pipeline(stages=[scaler, pca, lr])

model = pipeline.fit(train_data)
results = model.transform(test_data)

results.select('probability', 'prediction').show(truncate=False)


The output is as follows:

+----------------------------------------+----------+
|probability                             |prediction|
+----------------------------------------+----------+
|[0.22285193760551922,0.7771480623944808]|1.0       |
|[0.19145196324973038,0.8085480367502696]|1.0       |
|[0.25815968118089555,0.7418403188191045]|1.0       |
|[0.1904557879847662,0.8095442120152337] |1.0       |
|[0.23649048307318044,0.7635095169268196]|1.0       |
|[0.19581773456064858,0.8041822654393515]|1.0       |
|[0.17595295700627253,0.8240470429937274]|1.0       |
|[0.2693008979176928,0.7306991020823073] |1.0       |
|[0.19489995345665115,0.8051000465433488]|1.0       |
|[0.2790706794240234,0.7209293205759766] |1.0       |
|[0.2074274685125254,0.7925725314874746] |1.0       |
|[0.2225838179162865,0.7774161820837134] |1.0       |
|[0.23520083542636305,0.764799164573637] |1.0       |
|[0.16390109775004727,0.8360989022499528]|1.0       |
|[0.2032817412585787,0.7967182587414213] |1.0       |
|[0.22397459472064782,0.7760254052793522]|1.0       |
|[0.1987896145632484,0.8012103854367516] |1.0       |
|[0.18503543175783838,0.8149645682421617]|1.0       |
|[0.30849060803324585,0.6915093919667542]|1.0       |
|[0.2472540013472057,0.7527459986527943] |1.0       |
+----------------------------------------+----------+
only showing top 20 rows
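
As a quick threshold-free sanity check, pyspark.ml.evaluation.BinaryClassificationEvaluator can score these results directly; a minimal sketch, assuming the rawPrediction column that LogisticRegression adds by default:


from pyspark.ml.evaluation import BinaryClassificationEvaluator

# the default metric for this evaluator is area under the ROC curve (areaUnderROC)
evaluator = BinaryClassificationEvaluator(labelCol='labels')
print(evaluator.evaluate(results))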



Finally, a simple evaluation of the model, with the following code:


from pyspark.mllib.evaluation import MulticlassMetrics

# pair each prediction with its true label -- MulticlassMetrics expects
# an RDD of (prediction, label) tuples
predictionAndLabels = results.select('prediction', 'labels').map(lambda x: (x[0], x[1]))
metrics = MulticlassMetrics(predictionAndLabels)
metrics.confusionMatrix().toArray()


Output: a 2x2 array in which rows are the true classes and columns are the predicted classes, so the diagonal counts correct classifications and the off-diagonal entries count misclassifications; the exact counts depend on the random split.
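
Beyond the confusion matrix, MulticlassMetrics also exposes aggregate scores; a short sketch (the weighted variants average per-class scores by class frequency):


# overall precision, plus class-frequency-weighted precision and recall
print(metrics.precision())
print(metrics.weightedPrecision)
print(metrics.weightedRecall)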



