Spark ML: a Python method for restoring vectorized features to their original columns

When building models with the algorithm APIs in Spark ML, we often run into the problem that once features have been assembled into a vector, they can no longer be restored to their original single-column form. To solve this, a Python class was developed that can be chained into a pipeline together with Spark's native ML stages and saved and loaded as part of a model. The full implementation is shown in the example code below:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import DoubleType
from pyspark import keyword_only
from pyspark.ml import Pipeline, PipelineModel, Model
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.param import TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCols, Param, Params
from pyspark.ml.regression import LinearRegression
from pyspark.ml.util import Identifiable, MLReadable, MLWritable
from sparktorch import PysparkPipelineWrapper
from sparktorch.pipeline_util import PysparkReaderWriter

spark = SparkSession.builder \
        .enableHiveSupport() \
        .getOrCreate()
df = spark.read.table('hive_table_name')

class SplitCol(Model, HasInputCol, HasOutputCols, PysparkReaderWriter, MLReadable, MLWritable, Identifiable):
    """Pipeline stage that splits a vector column back into individual DoubleType columns."""

    kepCol = Param(Params._dummy(), "kepCol",
                   "keep the original columns and write the split values to new 'scaled_' columns",
                   typeConverter=TypeConverters.toBoolean)

    @keyword_only
    def __init__(self, inputCol=None, outputCols=None, kepCol=False):
        super().__init__()
        self._setDefault(inputCol=None, outputCols=None, kepCol=False)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCols=None, kepCol=False):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        out_cols = self.getOutputCols()
        in_col = self.getInputCol()
        kep_col = self.getOrDefault(self.kepCol)

        if not kep_col:
            # Overwrite the original columns with the values unpacked from the vector.
            for col in out_cols:
                df: DataFrame = df.drop(col)
            new_features = out_cols
        else:
            # Keep the original columns; the unpacked values go into new 'scaled_' columns.
            new_features = ['scaled_' + col for col in out_cols]

        # Extend the schema with one DoubleType column per vector entry.
        schema = df.schema
        cols = df.columns
        for col in new_features:
            schema = schema.add(col, DoubleType(), True)

        # Expand each row: the existing column values followed by the unpacked vector values.
        df: DataFrame = spark.createDataFrame(
            df.rdd.map(lambda row: [row[i] for i in cols] + row[in_col].toArray().tolist()),
            schema)

        # Drop the vector columns that are no longer needed
        # (the assembler output column name is assumed to be 'assemble_feature').
        df: DataFrame = df.drop(in_col)
        df: DataFrame = df.drop('assemble_feature')
        return df
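
# Illustration (not from the original post): with inputCol='scaled_feature',
# outputCols=['col1', 'col2', 'col3'] and kepCol=False, _transform drops those
# three columns plus the 'scaled_feature' and 'assemble_feature' vector columns,
# and re-adds 'col1'..'col3' as DoubleType columns filled with the values
# unpacked from the scaled vector; every other column (e.g. the label) passes
# through unchanged. With kepCol=True the originals are kept and the unpacked
# values land in new 'scaled_col1'..'scaled_col3' columns instead.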
feature_list = ['col1', 'col2', 'col3']

# 1. Assemble the raw feature columns into a single vector column.
vector_assembler = VectorAssembler(inputCols=feature_list, outputCol='assemble_feature')
# 2. Scale the assembled vector to [0, 1].
scaler = MinMaxScaler(min=0, max=1, inputCol='assemble_feature', outputCol='scaled_feature')
# 3. Split the scaled vector back into the original single columns.
split_model = SplitCol(inputCol='scaled_feature', outputCols=feature_list, kepCol=False)
# 4. Re-assemble the scaled single columns for the regression stage.
vectored = VectorAssembler(inputCols=feature_list, outputCol='assemble_feature')
# 5. Train a linear regression on the scaled features.
lr = LinearRegression(featuresCol='assemble_feature', labelCol='medv', predictionCol='prediction',
                      maxIter=2, regParam=0)

p = Pipeline(stages=[vector_assembler, scaler, split_model, vectored, lr]).fit(df)
p.write().overwrite().save('hdfs_path')

# Reload the saved pipeline; PysparkPipelineWrapper.unwrap restores the custom SplitCol stage.
pm1 = PysparkPipelineWrapper.unwrap(PipelineModel.load('hdfs_path'))
data = pm1.transform(df)
data.show()
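
As an aside: on Spark 3.0 and later, a vector column can also be split back into individual columns without going through the RDD API, using pyspark.ml.functions.vector_to_array. The sketch below is not part of the original pipeline; it assumes a hypothetical DataFrame scaled_df that already contains the 'scaled_feature' column (e.g. the output of the MinMaxScaler stage) and the same feature_list as above:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Unpack the vector into an array column, then pull out one double per feature name.
# 'scaled_df' is an assumed name for the DataFrame holding the MinMaxScaler output.
arr = scaled_df.withColumn('features_arr', vector_to_array('scaled_feature'))
keep = [c for c in arr.columns
        if c not in feature_list + ['assemble_feature', 'scaled_feature', 'features_arr']]
restored = arr.select(
    *keep,
    *[F.col('features_arr')[i].alias(name) for i, name in enumerate(feature_list)]
)
restored.show()

This mirrors the kepCol=False behaviour of SplitCol (the original columns are replaced by the scaled values) while staying entirely in the DataFrame API.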

