Python Spark MLlib: SVM (Support Vector Machine)

Latest recommended article published 2022-03-11 07:27:00

SanFanCSgo


Category: Spark Python Machine Learning and Big Data Practice
Tags: Python Spark, Spark MLlib, SVM

Article link: https://blog.csdn.net/weixin_40170902/article/details/82659383


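Data preparation

As in the decision tree classification article, the experiment uses the StumbleUpon Evergreen dataset.

Start IPython Notebook in local mode
cd ~/pythonwork/ipynotebook
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=local[*] pyspark
Import and transform the data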

## Define the data path (sc is the SparkContext created by the pyspark shell)
global Path
if sc.master[:5]=="local":
    Path="file:/home/yyf/pythonwork/PythonProject/"
else:
    Path="hdfs://master:9000/user/yyf/"
## Read train.tsv
print("Loading data...")
rawDataWithHeader = sc.textFile(Path+"data/train.tsv")
## Take the first line (the header)
header = rawDataWithHeader.first()
## Drop the header row (field names) and keep only the data rows
rawData = rawDataWithHeader.filter(lambda x:x!=header)
## Remove the double quotes
rData = rawData.map(lambda x:x.replace("\"",""))
## Split each line on tabs
lines = rData.map(lambda x: x.split("\t"))
print("Total: "+str(lines.count())+" records")
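To see what these transformations do to a single record, the same steps can be run on a small in-memory sample without Spark. This is only a sketch: the sample lines below are invented for illustration and are not rows from train.tsv, and the list comprehensions stand in for the RDD operations named in the comments.

```python
# Hypothetical sample in the same shape as train.tsv: a header line followed
# by tab-separated, double-quoted fields (values invented for illustration).
raw_with_header = [
    '"url"\t"urlid"\t"boilerplate"\t"alchemy_category"\t"label"',
    '"http://example.com/a"\t"1"\t"text"\t"business"\t"0"',
    '"http://example.com/b"\t"2"\t"text"\t"?"\t"1"',
]

header = raw_with_header[0]                          # like rdd.first()
raw = [x for x in raw_with_header if x != header]    # like rdd.filter(...)
cleaned = [x.replace('"', "") for x in raw]          # strip the double quotes
sample_lines = [x.split("\t") for x in cleaned]      # split each line on tabs

print(len(sample_lines))    # number of data rows
print(sample_lines[0])      # first parsed record
```

After these steps each record is a list of plain strings, which is the form the feature-processing code expects.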

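Data preprocessing

1. Processing the features

In both train.tsv and test.tsv, the field at index 3 (counting from 0) is alchemy_category, the page category. It is a categorical feature, so it has to be converted to numeric features with one-hot encoding. The main steps are:

(1) Build the categoriesMap dictionary, whose keys are the page category names and whose values are integers (the index of each category name); every category name maps to one index.
(2) Look up the index of each alchemy_category value in categoriesMap; for example, the index categoryIdx of business is 2.
(3) Given categoryIdx=2, one-hot encode it into a list, categoryFeatures List, of length 14 (the total number of page categories); categoryIdx=2 corresponds to [0,0,1,0,0,0,0,0,0,0,0,0,0,0].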

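Building the categoriesMap page-category dictionary

categoriesMap = lines.map(lambda fields: fields[3]).distinct().zipWithIndex().collectAsMap()

Here lines.map() processes every row of the data read earlier, .map(lambda fields: fields[3]) reads the field at index 3, .distinct() keeps only the unique values, .zipWithIndex() numbers those unique values, and .collectAsMap() returns the result as a Python dict.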
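For intuition, the effect of that chain can be reproduced on a plain Python list. This is only an analogy with assumed sample values: the names `categories`, `categories_map` and `one_hot` are hypothetical, and in Spark the index assigned to each category after distinct() is not guaranteed to follow any particular order, so the concrete numbers may differ.

```python
# Hypothetical category values standing in for field 3 of each row.
categories = ["business", "recreation", "?", "business", "sports", "?"]

# distinct(): keep one copy of each value (here in first-seen order).
distinct_categories = list(dict.fromkeys(categories))

# zipWithIndex() + collectAsMap(): pair each value with a running index
# and collect the pairs into a dict {category: index}.
categories_map = {name: idx for idx, name in enumerate(distinct_categories)}

def one_hot(category, mapping):
    """Return a 0/1 list with a single 1 at the category's index."""
    vec = [0] * len(mapping)
    vec[mapping[category]] = 1
    return vec

print(categories_map)
print(one_hot("?", categories_map))
```

The encoded list is as long as the dictionary, which is why the real dataset yields a list of length 14.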

Converting each alchemy_category value into the categoryFeatures List

## One-hot encode a single alchemy_category value
## Look up its index in categoriesMap
import numpy as np
categoryIdx = categoriesMap[lines.first()[3]]
OneHot = np.zeros(len(categoriesMap))
OneHot[categoryIdx] = 1
print(OneHot)

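The numeric features in fields 4 through 25 are read as strings, so they are converted with float(); missing values, marked "?", are simply replaced with 0.

The whole feature-processing step can be wrapped in a function: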

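import numpy as np

def convert(v):
    """Convert a numeric field to float; the missing-value marker "?" becomes 0."""
    return (0.0 if v=="?" else float(v))

def process_features(line, categoriesMap, featureEnd):
    """Build the feature vector for one row.

    line is the list of fields, categoriesMap is the page-category dict, and
    featureEnd is the exclusive end index of the feature slice (the call
    below passes len(r)-1, which leaves out the last field).
    """
    ## One-hot encode the alchemy_category page category
    categoryIdx = categoriesMap[line[3]]
    OneHot = np.zeros(len(categoriesMap))
    OneHot[categoryIdx] = 1
    ## Convert the numeric features
    numericalFeatures = [convert(value) for value in line[4:featureEnd]]
    # Return the concatenated feature vector
    return np.concatenate((OneHot, numericalFeatures))


## Process the features to produce featureRDD
featureRDD = lines.map(lambda r: process_features(r, categoriesMap, len(r)-1))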

2. Standardizing the data

Standardize the numeric features:

## Standardization
from pyspark.mllib.feature import StandardScaler   # import the standardization module

## Standardize featureRDD
stdScaler = StandardScaler(withMean=True, withStd=True).fit(featureRDD)  # fit a scaler model on featureRDD
ScalerFeatureRDD = stdScaler.transform(featureRDD)
ScalerFeatureRDD.first()
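As a rough illustration of what standardization with withMean=True and withStd=True amounts to, each column has its mean subtracted and is then divided by its standard deviation. The sketch below does this for one column using only the standard library. The sample values and the `standardize` helper are invented for illustration, and the use of the sample (n-1) standard deviation is an assumption based on how the MLlib documentation describes the scaler.

```python
import statistics

def standardize(column):
    """Return (x - mean) / sample_std for each x in column."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)   # sample standard deviation (n - 1)
    return [(x - mean) / std for x in column]

# Hypothetical values of one numeric feature across four rows.
column = [2.0, 4.0, 4.0, 6.0]
scaled = standardize(column)
print(scaled)
```

After scaling, the column has mean 0 and unit sample standard deviation, which keeps features with large raw ranges from dominating a margin-based model such as SVM.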

Inspect the standardized feature values:
[Image: screenshot of the standardized features]

3. Processing the label to build LabeledPoint data

To process the label (the last column of train.tsv), simply convert the string to a float:

## Process the label
def
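The listing above breaks off at `def`. A minimal sketch of the label step, assuming the label is the last field of each train.tsv row, might look like the following. The function name `extract_label` and the sample rows are hypothetical and are not the author's code. Plain tuples are used so the sketch runs without Spark; in MLlib each pair would normally be wrapped as `LabeledPoint(label, features)` from `pyspark.mllib.regression`.

```python
def extract_label(fields):
    """Return the label (assumed to be the last field of a row) as a float."""
    return float(fields[-1])

# Hypothetical parsed rows: [url, urlid, boilerplate, category, feature, label]
rows = [
    ["http://example.com/a", "1", "text", "business", "0.5", "0"],
    ["http://example.com/b", "2", "text", "?", "1.5", "1"],
]

# Pair each label with a placeholder feature list taken from the numeric
# fields. In MLlib this pair would be wrapped as LabeledPoint(label, features).
labeled = [(extract_label(r), [float(v) for v in r[4:-1]]) for r in rows]
print(labeled)
```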
