Using XGBoost on PySpark


XGBoost is one of the most widely used models in the machine-learning industry, but unlike RF and some other algorithms, Spark has no built-in implementation of it, so we need to set it up ourselves. It isn't hard, though.

1. Preliminaries

First, download the two jar files xgboost4j-spark-0.72.jar and xgboost4j-0.72.jar from the links below. Then download sparkxgb.zip, which contains the PySpark wrapper code that calls the jars and sets up the parameters.

xgboost4j

xgboost4j-spark

XGBoost python wrapper

We'll use bank.csv as an example to show how to run XGBoost on Spark.

dataset: https://www.kaggle.com/janiobachmann/bank-marketing-dataset

Import packages:

import numpy as np
import pandas as pd
import os
import re
from sklearn import metrics
import matplotlib.pyplot as plt

# the jars must be on the JVM classpath before Spark starts,
# so set PYSPARK_SUBMIT_ARGS before any pyspark import
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar pyspark-shell'
import findspark
findspark.init()

import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler  # Spark 2.x name; renamed OneHotEncoder in Spark 3
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = SparkSession\
        .builder\
        .appName("PySpark XGBOOST")\
        .master("local[*]")\
        .getOrCreate()

from pyspark.sql.types import *
spark.sparkContext.addPyFile("sparkxgb.zip")
from sparkxgb import XGBoostEstimator
import pyspark.sql.functions as F
import pyspark.sql.types as T

Load the file (I adapted this part from an S3 version and haven't tested it, so if you're running locally you may need to adjust the path):

# inferSchema makes Spark parse numeric columns as int/double
# (the dtype checks below depend on this)
df_all = spark\
  .read\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .csv("bank.csv")

Spark doesn't accept column names containing dots, so we sanitize the column names here to avoid errors.

# replace dots and whitespace with underscores, and drop brace/paren characters
tran_tab = str.maketrans({x: None for x in list('{()}')})
df_all = df_all.toDF(*(re.sub(r'[\.\s]+', '_', c).translate(tran_tab) for c in df_all.columns))

# fill missing numeric values with 0
df_all = df_all.na.fill(0)
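As a quick sanity check, you can confirm that the numeric columns were actually inferred before building the pipeline (plain DataFrame calls, nothing xgboost-specific):

```python
# verify column types and a few rows
df_all.printSchema()
df_all.show(5)
```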

2. Data processing

When training a model in PySpark, we build a Pipeline; the stages defined in the pipeline specify the order of operations.

Categorical variables need a StringIndexer transformation, optionally followed by a OneHotEncoder; numerical variables can go straight into the assembler stage.

Transform the string variables:

# drop columns we won't use
unused_col = ['day', 'month']
df_all = df_all.select([c for c in df_all.columns if c not in unused_col])
numeric_features = [t[0] for t in df_all.dtypes if t[1] == 'int']
cols = df_all.columns

# string columns: everything non-numeric except the label
string_col = [t[0] for t in df_all.dtypes if t[1] != 'int']
string_col = [x for x in string_col if x != 'deposit']

# one StringIndexer and one OneHotEncoderEstimator per string column
indexers = []
encoders = []
for s in string_col:
    indexer = StringIndexer()\
        .setInputCol(s)\
        .setOutputCol(s + 'Index')\
        .setHandleInvalid("keep")
    indexers.append(indexer)
    encoders.append(OneHotEncoderEstimator(
        inputCols=[indexer.getOutputCol()],
        outputCols=[s + 'classVec']))


# assemble everything into a single 'features' column
# (note: both the index and the one-hot vector of each string column are included)
feature_col = [s + 'Index' for s in string_col]
feature_col.extend([s + 'classVec' for s in string_col])
feature_col.extend(numeric_features)

vectorAssembler = VectorAssembler()\
  .setInputCols(feature_col)\
  .setOutputCol("features")

# index the label column
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')


# define xgboost
xgboost = XGBoostEstimator(
    featuresCol="features", 
    labelCol="label", 
    predictionCol="prediction"
)

3. Define the pipeline and add the previous operations as stages

feat_stage = indexers + encoders
feat_stage.extend([vectorAssembler, label_stringIdx, xgboost])
xgb_pipeline = Pipeline().setStages(feat_stage)

# split train & test
trainDF, testDF = df_all.randomSplit([0.8, 0.2], seed=24)
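You can optionally verify the split sizes with plain DataFrame counts:

```python
# sanity-check the 80/20 split
print(trainDF.count(), testDF.count())
```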

Take a look at the pipeline and its stages, as shown below.
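A minimal way to do that, using the standard `Pipeline.getStages()` call:

```python
# print the stages in execution order
for i, stage in enumerate(xgb_pipeline.getStages()):
    print(i, stage)
```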

4. Model training and testing

# train the model
model = xgb_pipeline.fit(trainDF)

# predict on the held-out set
pre = model.transform(testDF)\
      .select(col("label"), col('probabilities'), col("prediction"))

# convert to a pandas DataFrame and unpack the probability vector
cx = pre.toPandas()
cx["probabilities"] = cx["probabilities"].apply(lambda x: x.values)
cx[['prob_0', 'prob_1']] = pd.DataFrame(cx.probabilities.tolist(), index=cx.index)
cx = cx[["label", 'prob_1']].sort_values(by=['prob_1'], ascending=False)

Inspect the results:
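For example, peek at the top of the scored frame:

```python
# highest predicted probabilities first
print(cx.head(10))
```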

5. Evaluate results

# AUC on the held-out test set
metrics.roc_auc_score(cx.label, cx.prob_1)

# plot the ROC curve
y_pred_proba = cx.prob_1
fpr, tpr, _ = metrics.roc_curve(cx.label, y_pred_proba)
auc = metrics.roc_auc_score(cx.label, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
plt.legend(loc=4)
plt.show()

AUC: 0.8788617014295159

ROC curve: (plot generated by the code above)

I didn't do much hyperparameter tuning, so there's room to tune further. One way is to open xgboost.py inside sparkxgb.zip and adjust max_depth and the other parameters there.
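Depending on your version of the wrapper, XGBoostEstimator may also accept parameters directly in its constructor. This is only a sketch: the parameter names below (eta, maxDepth, numRound) are assumptions taken from common sparkxgb wrappers, so check xgboost.py in your sparkxgb.zip for the actual names.

```python
# sketch only: eta / maxDepth / numRound are assumed param names;
# verify against xgboost.py inside sparkxgb.zip
xgboost_tuned = XGBoostEstimator(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    eta=0.1,        # learning rate
    maxDepth=6,     # maximum tree depth
    numRound=100,   # number of boosting rounds
)
```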
