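Using LightGBM with PySpark: a tutorial
mmlspark official documentation
Installation tends to run into problems with Spark versions, jar versions, and similar mismatches. The examples below were tested and confirmed to work.
The installation packages used in this tutorial can be downloaded here:
Link: https://pan.baidu.com/s/1Y_QlWn9gOZLSggNnVO8bGA
Extraction code: 77o7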
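Spark 2.4
Offline installation
- Download the jars: download lightgbmxxxx.jar and mmlspark.jar and upload them to the jars directory under the Spark installation path.
- Download the zip: download mmlspark.zip into lib/python3.7/site-packages/ under the Anaconda installation path and unzip it there.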
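The two offline steps above can also be scripted. The following is only a minimal sketch: the function name `install_offline` and all of its arguments are hypothetical, and the actual paths depend on where Spark and Anaconda are installed on your machine.

```python
import shutil
import zipfile
from pathlib import Path


def install_offline(jar_files, spark_jars_dir, mmlspark_zip, site_packages_dir):
    """Copy jars into Spark's jars directory and unpack the Python package.

    All four arguments are placeholders; pass the real locations for your
    own Spark and Anaconda installations. Both target directories must
    already exist.
    """
    spark_jars_dir = Path(spark_jars_dir)
    site_packages_dir = Path(site_packages_dir)

    copied = []
    for jar in jar_files:
        # shutil.copy returns the path of the newly created file.
        copied.append(Path(shutil.copy(jar, spark_jars_dir)))

    # Unpack mmlspark.zip so that `import mmlspark` resolves from site-packages.
    with zipfile.ZipFile(mmlspark_zip) as archive:
        archive.extractall(site_packages_dir)

    return copied
```

After running it, starting `pyspark` and executing `import mmlspark` is a quick way to check that the Python side of the installation is visible.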
Spark 3.1
Online installation
- Run one of the following directly on the command line:
spark-submit --master yarn --deploy-mode cluster --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven xxx.py
pyspark --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven
Both commands download the required jars online. Once the download finishes, you can test the spark-lightgbm code.
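The same package coordinates can, in principle, be supplied through Spark configuration properties instead of command-line flags. The sketch below collects them in a plain dictionary. `spark.jars.packages` and `spark.jars.repositories` are the standard Spark properties behind `--packages` and `--repositories`; the helper names `mmlspark_conf` and `build_session` are hypothetical. Settings made on the builder are only expected to take effect when no JVM is running yet (for example, a plain `python xxx.py` launch in client mode), so treat this as an assumption to verify on your own cluster.

```python
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"


def mmlspark_conf(package=MMLSPARK_PACKAGE, repository=MMLSPARK_REPO):
    """Return the Spark properties equivalent to --packages / --repositories."""
    return {
        "spark.jars.packages": package,
        "spark.jars.repositories": repository,
    }


def build_session(app_name="lgbm_test"):
    """Create a SparkSession with the mmlspark package settings applied."""
    # Imported lazily so the configuration helper works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in mmlspark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the properties in a dictionary makes it easy to print or log them before the session is created.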
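Offline installation
Some production environments cannot reach the public internet and need an offline installation. The steps are as follows:
- Download the jars: download lgbm_jars.tar from the Baidu Netdisk link and extract it.
- Put the extracted jars on HDFS; this tutorial uses /tmp/lgm/jars. These are the same jars that the online installation downloads, 24 in total. Then run the following command: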
# The same comma-separated jar list is passed to every option below.
JARS="hdfs:///tmp/lgm/jars/com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar,\
hdfs:///tmp/lgm/jars/org.scalactic_scalactic_2.12-3.0.5.jar,\
hdfs:///tmp/lgm/jars/io.spray_spray-json_2.12-1.3.2.jar,\
hdfs:///tmp/lgm/jars/com.microsoft.cntk_cntk-2.4.jar,\
hdfs:///tmp/lgm/jars/org.openpnp_opencv-3.2.0-1.jar,\
hdfs:///tmp/lgm/jars/com.jcraft_jsch-0.1.54.jar,\
hdfs:///tmp/lgm/jars/com.microsoft.cognitiveservices.speech_client-sdk-1.14.0.jar,\
hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpclient-4.5.6.jar,\
hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpmime-4.5.6.jar,\
hdfs:///tmp/lgm/jars/com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar,\
hdfs:///tmp/lgm/jars/com.github.vowpalwabbit_vw-jni-8.9.1.jar,\
hdfs:///tmp/lgm/jars/com.linkedin.isolation-forest_isolation-forest_3.0.0_2.12-1.0.1.jar,\
hdfs:///tmp/lgm/jars/org.scala-lang_scala-reflect-2.12.4.jar,\
hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpcore-4.4.10.jar,\
hdfs:///tmp/lgm/jars/commons-logging_commons-logging-1.2.jar,\
hdfs:///tmp/lgm/jars/commons-codec_commons-codec-1.10.jar,\
hdfs:///tmp/lgm/jars/com.chuusai_shapeless_2.12-2.3.2.jar,\
hdfs:///tmp/lgm/jars/org.apache.spark_spark-avro_2.12-3.0.0.jar,\
hdfs:///tmp/lgm/jars/org.testng_testng-6.8.8.jar,\
hdfs:///tmp/lgm/jars/org.typelevel_macro-compat_2.12-1.1.1.jar,\
hdfs:///tmp/lgm/jars/org.spark-project.spark_unused-1.0.0.jar,\
hdfs:///tmp/lgm/jars/org.beanshell_bsh-2.0b4.jar,\
hdfs:///tmp/lgm/jars/com.beust_jcommander-1.27.jar"
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.repl.local.jars=$JARS" \
  --conf "spark.yarn.dist.jars=$JARS" \
  --conf "spark.yarn.dist.pyFiles=$JARS" \
  --jars "$JARS" \
  --conf "spark.submit.pyFiles=$JARS" \
  /root/lgbm_test.py
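The command above passes the same comma-separated jar list to five different options, which is easy to get wrong when editing by hand. As a rough sketch, the argument list could be generated from the jar file names instead. The function name `build_submit_command` is hypothetical; the option names and the default HDFS directory are taken from the command above.

```python
def build_submit_command(jar_names, script, hdfs_dir="hdfs:///tmp/lgm/jars"):
    """Return the spark-submit argument list for the offline installation."""
    # One comma-separated string of full HDFS paths, reused for every option.
    jar_list = ",".join(f"{hdfs_dir}/{name}" for name in jar_names)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", f"spark.repl.local.jars={jar_list}",
        "--conf", f"spark.yarn.dist.jars={jar_list}",
        "--conf", f"spark.yarn.dist.pyFiles={jar_list}",
        "--jars", jar_list,
        "--conf", f"spark.submit.pyFiles={jar_list}",
        script,
    ]


if __name__ == "__main__":
    # Only three of the jars are listed here to keep the example short.
    jars = [
        "com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar",
        "com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar",
        "io.spray_spray-json_2.12-1.3.2.jar",
    ]
    print(" ".join(build_submit_command(jars, "/root/lgbm_test.py")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` as-is, which avoids shell quoting problems with the long jar string.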
Test code
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier, LightGBMClassificationModel

# Random sample data. randint's upper bound is exclusive, so use 2 to get both labels 0 and 1.
data = np.random.rand(1000, 2)
label = np.random.randint(0, 2, 1000)
df = pd.DataFrame(data, columns=['a', 'b'])
df['label'] = label
# The random data above is replaced here by a small fixed data set.
df = pd.DataFrame([[1,2,1],[3,4,1],[2,4,1],[3,9,1],[2,2,0],[1,1,0],[1,-1,0]], columns=['a','b','label'])

spark = SparkSession.builder.appName("dd").getOrCreate()
df = spark.createDataFrame(df)

# Combine the feature columns into a single vector column.
assembler = VectorAssembler(inputCols=['a','b'], outputCol='features')
df = assembler.transform(df)
model_df = df.select(['features','label'])

# Train a binary classifier and predict on the training data.
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="label")
lgm = model.fit(model_df)
p = lgm.transform(model_df)

# Save the fitted model and print its feature importances.
lgm.write().overwrite().save('/lgm')
feature_importances = lgm.getFeatureImportances()
print(feature_importances)

# Load the saved model back and predict again.
model1 = LightGBMClassificationModel.load('/lgm')
p = model1.transform(model_df)
p.show()