PySpark data type conversion: PySpark study notes - model-building workflow

Latest recommended article published 2023-07-28 20:38:28

weixin_39612653


Article tags: pyspark data type conversion

Article link: https://blog.csdn.net/weixin_39612653/article/details/111621733

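This article is a set of study notes on data type conversion and model building with PySpark. It focuses on how to convert data to numeric types, apply one-hot encoding to categorical variables, and create a pipeline for data preprocessing. It also stresses that data processing must be completed before splitting the data into train and test sets, so that StringIndexer does not produce different indices for the two. Finally, it outlines choosing a model, setting evaluation metrics, hyperparameter tuning, and cross-validation.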

Summary generated by CSDN using intelligent technology

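I have been using PySpark at work recently, so I studied it at home and put together these notes.

If you find them useful, a like would be appreciated. Thanks~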

Data preparation

Data used for modeling in Spark must be numeric.

(1) Plain type conversion

# convert to numeric type
# withColumn returns a new DataFrame, so assign the result back
data = data.withColumn("oldCol", data.oldCol.cast("integer"))

(2) Handling categorical variables - one-hot encoding

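from pyspark.ml.feature import StringIndexer, OneHotEncoder

# create StringIndexer: map each category string in "A" to a numeric index
A_indexer = StringIndexer(inputCol="A", outputCol="A_index")
# create OneHotEncoder: turn the numeric index into a one-hot vector
A_encoder = OneHotEncoder(inputCol="A_index", outputCol="A_fact")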
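By default, StringIndexer orders labels by descending frequency, which is why fitting it separately on train and test data can give the same category different indices. The plain-Python sketch below only mimics that ordering to make the effect visible. The helper `fit_label_index` and the sample labels are hypothetical, and it is not the Spark implementation.

```python
from collections import Counter


def fit_label_index(values):
    """Mimic frequency-descending indexing: the most frequent label gets index 0."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


# Hypothetical splits whose label frequencies differ
train = ["red", "red", "red", "blue", "blue", "green"]
test = ["green", "green", "green", "blue", "blue", "red"]

index_from_train = fit_label_index(train)
index_from_test = fit_label_index(test)

print(index_from_train)
print(index_from_test)
```

Here "red" is the most frequent label in one split and the least frequent in the other, so the two mappings disagree. Fitting the indexer once and reusing that single fitted mapping for both splits avoids the mismatch.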

(3) Assemble all the columns

VectorAssembler(inputCols = ["a
