A tutorial on using LightGBM with PySpark

MMLSpark official documentation
Installation tends to run into all kinds of problems with Spark versions, jar versions, and so on; the setups below have been tried and verified to work.
The installation packages used in this tutorial can be downloaded here:
Link: https://pan.baidu.com/s/1Y_QlWn9gOZLSggNnVO8bGA
Extraction code: 77o7

Spark 2.4

Offline installation

  1. Download the jar packages: download lightgbmxxxx.jar and mmlspark.jar, and upload them to <Spark installation path>/jars.
  2. Download the zip: download mmlspark.zip to <Anaconda installation path>/lib/python3.7/site-packages/ and unzip it there. A quick import check is sketched below.
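
Once the jars and the unpacked mmlspark package are in place, a minimal smoke test (assuming pyspark is launched against the patched installation) is to import the estimator:

# run inside a pyspark shell; an ImportError here means mmlspark.zip was not
# unpacked into site-packages, while a JVM class-not-found error at fit() time
# means the jars are missing from <Spark installation path>/jars
from mmlspark.lightgbm import LightGBMClassifier
print(LightGBMClassifier)  # should print the class without raising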

Spark 3.1

Online installation

  1. Run directly on the command line:
spark-submit --master yarn --deploy-mode cluster --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven   xxx.py
pyspark --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven

The commands above pull the required jars from the online repository; once the download finishes you can run the spark-lightgbm test code. The same packages can also be requested from inside a script when building the SparkSession, as sketched below.
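
As an alternative to passing --packages on the command line, the same coordinates can go through the standard spark.jars.packages and spark.jars.repositories configs. A minimal sketch, assuming the cluster can reach the MMLSpark Maven repository (the app name is illustrative):

from pyspark.sql import SparkSession

# request the mmlspark package at session build time instead of via --packages;
# spark.jars.packages / spark.jars.repositories are standard Spark configs
spark = (SparkSession.builder
         .appName("lgbm-online")
         .config("spark.jars.packages",
                 "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT")
         .config("spark.jars.repositories",
                 "https://mmlspark.azureedge.net/maven")
         .getOrCreate())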

Offline installation

Some production environments cannot reach the public internet and need an offline install. The steps are as follows:
Download the jars: get lgbm_jars.tar from the Baidu netdisk link above and extract it.

  1. Put the extracted jars on an HDFS path; this tutorial uses /tmp/lgm/jars. These are exactly the jars the online install downloads, 24 in total. Then run the following command (an equivalent in-script SparkSession configuration is sketched after it):

# the same jar list is reused for every config, so define it once
JARS=hdfs:///tmp/lgm/jars/com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar,hdfs:///tmp/lgm/jars/org.scalactic_scalactic_2.12-3.0.5.jar,hdfs:///tmp/lgm/jars/io.spray_spray-json_2.12-1.3.2.jar,hdfs:///tmp/lgm/jars/com.microsoft.cntk_cntk-2.4.jar,hdfs:///tmp/lgm/jars/org.openpnp_opencv-3.2.0-1.jar,hdfs:///tmp/lgm/jars/com.jcraft_jsch-0.1.54.jar,hdfs:///tmp/lgm/jars/com.microsoft.cognitiveservices.speech_client-sdk-1.14.0.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpclient-4.5.6.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpmime-4.5.6.jar,hdfs:///tmp/lgm/jars/com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar,hdfs:///tmp/lgm/jars/com.github.vowpalwabbit_vw-jni-8.9.1.jar,hdfs:///tmp/lgm/jars/com.linkedin.isolation-forest_isolation-forest_3.0.0_2.12-1.0.1.jar,hdfs:///tmp/lgm/jars/org.scala-lang_scala-reflect-2.12.4.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpcore-4.4.10.jar,hdfs:///tmp/lgm/jars/commons-logging_commons-logging-1.2.jar,hdfs:///tmp/lgm/jars/commons-codec_commons-codec-1.10.jar,hdfs:///tmp/lgm/jars/com.chuusai_shapeless_2.12-2.3.2.jar,hdfs:///tmp/lgm/jars/org.apache.spark_spark-avro_2.12-3.0.0.jar,hdfs:///tmp/lgm/jars/org.testng_testng-6.8.8.jar,hdfs:///tmp/lgm/jars/org.typelevel_macro-compat_2.12-1.1.1.jar,hdfs:///tmp/lgm/jars/org.spark-project.spark_unused-1.0.0.jar,hdfs:///tmp/lgm/jars/org.beanshell_bsh-2.0b4.jar,hdfs:///tmp/lgm/jars/com.beust_jcommander-1.27.jar

spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.repl.local.jars=$JARS" \
  --conf "spark.yarn.dist.jars=$JARS" \
  --conf "spark.yarn.dist.pyFiles=$JARS" \
  --jars "$JARS" \
  --conf "spark.submit.pyFiles=$JARS" \
  /root/lgbm_test.py
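
The jar list can also be supplied from inside a PySpark script through the spark.jars and spark.submit.pyFiles configs. This is only a sketch under the assumption that the jars were uploaded to /tmp/lgm/jars as above; note that builder configs take effect only if they are set before the driver JVM starts, so the spark-submit form above remains the safer option in cluster mode:

from pyspark.sql import SparkSession

# comma-separated HDFS paths of the offline jars (same list as in the
# spark-submit command above; abbreviated here for readability)
jars = ",".join([
    "hdfs:///tmp/lgm/jars/com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar",
    "hdfs:///tmp/lgm/jars/com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar",
    # ... the remaining jars from the list above ...
])

spark = (SparkSession.builder
         .appName("lgbm-offline")
         .config("spark.jars", jars)            # ship the jars to driver and executors
         .config("spark.submit.pyFiles", jars)  # expose the mmlspark Python package
         .getOrCreate())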

Test code

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier, LightGBMClassificationModel
import pandas as pd

# a small hand-labeled dataset: two numeric features and a binary label
df = pd.DataFrame([[1, 2, 1], [3, 4, 1], [2, 4, 1], [3, 9, 1],
                   [2, 2, 0], [1, 1, 0], [1, -1, 0]],
                  columns=['a', 'b', 'label'])
# (random data works too; note np.random.randint's upper bound is exclusive,
# so 0/1 labels require np.random.randint(0, 2, n), not randint(0, 1, n))

spark = SparkSession.builder.appName("dd").getOrCreate()
df = spark.createDataFrame(df)

# assemble the feature columns into the single vector column the model expects
assembler = VectorAssembler(inputCols=['a', 'b'], outputCol='features')
model_df = assembler.transform(df).select(['features', 'label'])

# train a binary classifier and score the training set
# (LightGBMRegressor lives in the same module for regression tasks)
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="label")
lgm = model.fit(model_df)
p = lgm.transform(model_df)

# persist the fitted model and print the per-feature importances
lgm.write().overwrite().save('/lgm')
feature_importances = lgm.getFeatureImportances()
print(feature_importances)

# reload the saved model and verify it still scores
model1 = LightGBMClassificationModel.load('/lgm')
p = model1.transform(model_df)
p.show()
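
To sanity-check the scored output, the standard Spark ML evaluator can be run on p. A minimal sketch, assuming the classifier emits the conventional rawPrediction column (as Spark ML classifiers do):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# AUC over the scored DataFrame `p` produced by the test code above;
# rawPredictionCol="rawPrediction" is the usual Spark ML default and is assumed here
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          labelCol="label",
                                          metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(p))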