Reference: https://blog.csdn.net/luoganttcc/article/details/80618940
Official pyspark.ml.regression documentation:
http://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/regression.html
Spark ML: RDD to DataFrame (Python version):
https://blog.csdn.net/chenguangchun1993/article/details/78810955
Machine learning tools: https://www.jianshu.com/p/b81680a52cd7
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("dataFrame") \
    .getOrCreate()

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])

quantileProbabilities = [0.3, 0.6]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)

# Print the coefficients, intercept and scale parameter for AFT survival regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
print("Scale: " + str(model.scale))
model.transform(training).show(truncate=False)
Coefficients: [-0.4963044110531165,0.19845217252922842]
Intercept: 2.638089896305634
Scale: 1.5472363533632303
+-----+------+--------------+-----------------+---------------------------------------+
|label|censor|features |prediction |quantiles |
+-----+------+--------------+-----------------+---------------------------------------+
|1.218|1.0 |[1.56,-0.605] |5.718985621018952|[1.1603229908059516,4.995460583406753] |
|2.949|0.0 |[0.346,2.158] |18.07678210850554|[3.6675919944963185,15.789837303662035]|
|3.627|0.0 |[1.38,0.231] |7.381908879359964|[1.4977129086101577,6.448002719505493] |
|0.273|1.0 |[0.52,1.151] |13.57771781488451|[2.754778414791513,11.859962351993202] |
|4.199|0.0 |[0.795,-0.226]|9.013087597344812|[1.828662187733188,7.8728164067854856] |
+-----+------+--------------+-----------------+---------------------------------------+
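As a sanity check on the output above: Spark's AFT model assumes a Weibull distribution, so the `prediction` column is just `exp(intercept + features·coefficients)`, and each quantile at probability p is the prediction scaled by `(-log(1-p))^scale`. Verifying the first row with numpy, using the printed model parameters:

```python
import numpy as np

# Values printed by the fitted AFT model above
coefficients = np.array([-0.4963044110531165, 0.19845217252922842])
intercept = 2.638089896305634
scale = 1.5472363533632303

# First training row: features = [1.560, -0.605]
x = np.array([1.560, -0.605])

# AFT (Weibull) prediction: exp(intercept + x . beta)
prediction = np.exp(intercept + x.dot(coefficients))
print(prediction)  # ~5.71898..., matching the first row of the table

# Quantile at probability p: prediction * (-log(1 - p))^scale
q30 = prediction * (-np.log(1 - 0.3)) ** scale
print(q30)  # ~1.16032..., matching the first entry of the quantiles column
```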
---------------------
Using a pandas DataFrame inside a Spark DataFrame job: https://blog.csdn.net/lsshlsw/article/details/79814645?utm_source=copy
import xgboost as xgb
import pandas as pd
import numpy as np

# Load the model
bst = xgb.Booster()
bst.load_model("xxx.model")

# Feature list
var_list = [...]

# Compute the score for one row
def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    x1 = pd.DataFrame(np.array(x).reshape(1, feature_count), columns=var_list)
    # user-defined feature transformations
    y1 = transformFun(x1)
    test_x = xgb.DMatrix(y1.drop(['mobile', 'mobile_md5'], axis=1), missing=float('nan'))
    y1['score'] = bst.predict(test_x, ntree_limit=ntree_limit)
    y2 = y1[['mobile', 'mobile_md5', 'score']]
    return {'mobile': str(y2['mobile'][0]), 'mobile_md5': str(y2['mobile_md5'][0]), 'score': float(y2['score'][0])}

# Score each row; note that toDF() must come before any write, since RDDs have no .write attribute
df.rdd.map(lambda x: cal_xgb_score(x, var_list, ntree_limit=304)).toDF()
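The per-row pattern above can be exercised locally before wiring it into `rdd.map`. Below is a minimal sketch with a stub scoring function standing in for the real xgboost Booster (`stub_predict`, `cal_score`, and the sample row are all hypothetical; no Spark or xgboost required):

```python
import numpy as np
import pandas as pd

# Stub standing in for the loaded xgboost Booster: the real bst.predict()
# would be called exactly where stub_predict is called.
def stub_predict(features_df):
    # pretend score: mean of the numeric feature columns
    return features_df.mean(axis=1).to_numpy()

def cal_score(x, var_list):
    """Turn one Spark Row (here: a plain tuple) into a scored dict."""
    x1 = pd.DataFrame(np.array(x, dtype=object).reshape(1, len(var_list)),
                      columns=var_list)
    # drop the id columns before building the feature matrix
    features = x1.drop(['mobile', 'mobile_md5'], axis=1).astype(float)
    score = stub_predict(features)
    return {'mobile': str(x1['mobile'][0]),
            'mobile_md5': str(x1['mobile_md5'][0]),
            'score': float(score[0])}

row = ('13800000000', 'abc123', 0.5, 1.5)
var_list = ['mobile', 'mobile_md5', 'f1', 'f2']
print(cal_score(row, var_list))  # score is (0.5 + 1.5) / 2 = 1.0
```

Returning a plain dict per row, as the original code does, is what lets `toDF()` infer column names when the mapped RDD is converted back to a DataFrame.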
---------------------
This section is from breeze_lsw's CSDN blog; full text: https://blog.csdn.net/lsshlsw/article/details/79814645?utm_source=copy
Prefer Spark DataFrames over pandas DataFrames
PySpark has its own DataFrame, analogous to pandas.DataFrame, so use the PySpark DataFrame directly: operations on a PySpark DataFrame execute distributed across the cluster, whereas a pandas.DataFrame runs on a single machine.
https://tio.cloud.tencent.com/gitbook/doc/pyspark.html
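To make the contrast concrete, here is the same aggregation in both worlds (a sketch; the Spark half assumes an existing SparkSession named `spark` and is shown in comments since it needs a cluster or local Spark runtime):

```python
import pandas as pd

# Single-machine pandas: all data lives in driver memory
pdf = pd.DataFrame({'city': ['bj', 'sh', 'bj'], 'amount': [10, 20, 30]})
totals = pdf.groupby('city')['amount'].sum()
print(totals['bj'])  # 40

# Equivalent PySpark DataFrame code, executed distributed on executors:
# sdf = spark.createDataFrame(pdf)
# sdf.groupBy('city').sum('amount').show()
```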
# record -> other record
def process_fn(record):
    # your process logic
    # for example:
    # import numpy as np
    # x = np.array(record, dtype=np.int32)
    # ...
    return record

# record -> True or False
def judge_fn(record):
    # return True or False
    return True

processed = rdd.map(process_fn).map(lambda x: x[1:3])
filtered = processed.filter(judge_fn)
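The map/filter function contracts above can be checked locally with plain Python lists before plugging them into an RDD (the example bodies for `process_fn` and `judge_fn` here are illustrative assumptions, not from the original):

```python
# Local sketch of the rdd.map(...).map(...).filter(...) pipeline above,
# using list comprehensions instead of an RDD; the contracts are identical.
def process_fn(record):
    # example transformation: double every field
    return tuple(v * 2 for v in record)

def judge_fn(record):
    # example predicate: keep records whose first field exceeds 5
    return record[0] > 5

data = [(1, 2, 3, 4), (5, 6, 7, 8)]
processed = [process_fn(r)[1:3] for r in data]   # .map(process_fn).map(lambda x: x[1:3])
filtered = [r for r in processed if judge_fn(r)]  # .filter(judge_fn)
print(filtered)  # [(12, 14)]
```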
Big Data Processing with Apache Spark, Part 5: Spark ML data pipelines:
https://juejin.im/entry/589d688e0ce46300562c635f
As I see it, PyODPS is Python on Alibaba Cloud. Note that the qualifier "on Alibaba Cloud" must not be dropped, because PyODPS is not the same as single-machine Python!:
https://yq.aliyun.com/articles/292672
PySpark documentation (Chinese): http://ifeve.com/spark-sql-dataframes/
Spark DataFrame ETL tutorial:
https://hk.saowen.com/a/37d6243147b39b36a6ac61d55b743e8b25ca05667e3a32a2bd34896e17f48b66