Running a survival model in PySpark

Reference: https://blog.csdn.net/luoganttcc/article/details/80618940
Official pyspark.ml.regression documentation:
http://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/regression.html

Spark ML: RDD to DataFrame (Python version):
https://blog.csdn.net/chenguangchun1993/article/details/78810955

Machine learning tools: https://www.jianshu.com/p/b81680a52cd7

from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("dataFrame") \
    .getOrCreate()

# Each row is (label, censor, features); censor == 1.0 means the event
# occurred (uncensored), 0.0 means the observation is censored.
training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.3, 0.6]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")

model = aft.fit(training)

# Print the coefficients, intercept and scale parameter for AFT survival regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
print("Scale: " + str(model.scale))
model.transform(training).show(truncate=False)

Coefficients: [-0.4963044110531165,0.19845217252922842]
Intercept: 2.638089896305634
Scale: 1.5472363533632303
+-----+------+--------------+-----------------+---------------------------------------+
|label|censor|features      |prediction       |quantiles                              |
+-----+------+--------------+-----------------+---------------------------------------+
|1.218|1.0   |[1.56,-0.605] |5.718985621018952|[1.1603229908059516,4.995460583406753] |
|2.949|0.0   |[0.346,2.158] |18.07678210850554|[3.6675919944963185,15.789837303662035]|
|3.627|0.0   |[1.38,0.231]  |7.381908879359964|[1.4977129086101577,6.448002719505493] |
|0.273|1.0   |[0.52,1.151]  |13.57771781488451|[2.754778414791513,11.859962351993202] |
|4.199|0.0   |[0.795,-0.226]|9.013087597344812|[1.828662187733188,7.8728164067854856] |
+-----+------+--------------+-----------------+---------------------------------------+
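The prediction column is the AFT point estimate exp(intercept + coefficients · features). A minimal sanity check, reusing the fitted model from above (predict and predictQuantiles are the single-vector helpers on AFTSurvivalRegressionModel):

import math
from pyspark.ml.linalg import Vectors

x = Vectors.dense(1.560, -0.605)
# Reproduce the first prediction row (5.7189...) by hand.
print(math.exp(model.intercept + float(model.coefficients.dot(x))))
# The fitted model can also score a single feature vector directly.
print(model.predict(x))            # same value as above
print(model.predictQuantiles(x))   # quantiles at probabilities [0.3, 0.6]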

---------------------

Using pandas DataFrames with Spark DataFrames: https://blog.csdn.net/lsshlsw/article/details/79814645?utm_source=copy

import xgboost as xgb
import pandas as pd
import numpy as np

# Load the model
bst = xgb.Booster()
bst.load_model("xxx.model")

# Feature list
var_list = [...]

# Score one record
def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    x1 = pd.DataFrame(np.array(x).reshape(1, feature_count), columns=var_list)
    # Feature transformations (user-defined)
    y1 = transformFun(x1)

    test_x = xgb.DMatrix(y1.drop(['mobile', 'mobile_md5'], axis=1),
                         missing=float('nan'))
    y1['score'] = bst.predict(test_x, ntree_limit=ntree_limit)
    y2 = y1[['mobile', 'mobile_md5', 'score']]
    return {'mobile': str(y2['mobile'][0]),
            'mobile_md5': str(y2['mobile_md5'][0]),
            'score': float(y2['score'][0])}

# Score every row of the Spark DataFrame and convert the result back to a DataFrame
scored = df.rdd.map(lambda x: cal_xgb_score(x, var_list, ntree_limit=304)).toDF()
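Note that the per-row map above captures bst in the closure, so Spark must pickle the Booster on the driver and ship it with every task, which can fail on some XGBoost versions. A common alternative (a sketch, not from the original post; it assumes "xxx.model" is readable on every executor, e.g. shipped with --files) is to load the model once per partition with mapPartitions:

def score_partition(rows, var_list, ntree_limit=50):
    # Load the model once per partition instead of shipping it from the driver.
    booster = xgb.Booster()
    booster.load_model("xxx.model")
    for x in rows:
        x1 = pd.DataFrame(np.array(x).reshape(1, len(var_list)), columns=var_list)
        y1 = transformFun(x1)  # same user-defined feature transformation as above
        test_x = xgb.DMatrix(y1.drop(['mobile', 'mobile_md5'], axis=1),
                             missing=float('nan'))
        y1['score'] = booster.predict(test_x, ntree_limit=ntree_limit)
        yield {'mobile': str(y1['mobile'][0]),
               'mobile_md5': str(y1['mobile_md5'][0]),
               'score': float(y1['score'][0])}

scored = df.rdd.mapPartitions(
    lambda rows: score_partition(rows, var_list, ntree_limit=304)).toDF()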

---------------------

This section comes from breeze_lsw's CSDN blog; full text: https://blog.csdn.net/lsshlsw/article/details/79814645?utm_source=copy

Prefer Spark DataFrames over pandas DataFrames.
PySpark has its own DataFrame, similar to pandas.DataFrame, so use it directly: operations on a PySpark DataFrame execute distributed across the cluster, whereas a pandas.DataFrame runs on a single machine.
https://tio.cloud.tencent.com/gitbook/doc/pyspark.html
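For example, a typical pandas groupby/mean translates directly into the distributed DataFrame API (a minimal sketch with hypothetical column names):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataFrame").getOrCreate()
sdf = spark.createDataFrame(
    [("bj", 10.0), ("bj", 30.0), ("sh", 20.0)], ["city", "amount"])

# pandas (single machine): pdf.groupby('city')['amount'].mean()
# PySpark (distributed across the cluster):
sdf.groupBy("city").agg(F.avg("amount").alias("avg_amount")).show()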

# record -> other record
def process_fn(record):
    # your processing logic, for example:
    # import numpy as np
    # x = np.array(record, dtype=np.int32)
    # ...
    return record

# record -> True or False
def judge_fn(record):
    # return True to keep the record, False to drop it
    return True

processed = rdd.map(process_fn).map(lambda x: x[1:3])
filtered = processed.filter(judge_fn)
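A quick end-to-end run of this pattern on a toy RDD (a sketch reusing the SparkSession created earlier; x[1:3] keeps fields 1 and 2 of each record):

rdd = spark.sparkContext.parallelize([(1, 10, 20, 30), (2, 40, 50, 60)])
processed = rdd.map(process_fn).map(lambda x: x[1:3])
filtered = processed.filter(judge_fn)
print(filtered.collect())  # [(10, 20), (40, 50)] with the stub functions above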

Big Data Processing with Apache Spark, Part 5: Spark machine learning data pipelines
https://juejin.im/entry/589d688e0ce46300562c635f

In my view, PyODPS is Python on Alibaba Cloud. Note that the qualifier "on Alibaba Cloud" must not be dropped, because PyODPS is not a single-machine Python!
https://yq.aliyun.com/articles/292672

PySpark documentation in Chinese: http://ifeve.com/spark-sql-dataframes/

Spark DataFrame ETL tutorial:
https://hk.saowen.com/a/37d6243147b39b36a6ac61d55b743e8b25ca05667e3a32a2bd34896e17f48b66
