A tutorial on using LightGBM with PySpark

MMLSpark official documentation
Installation tends to run into all kinds of problems with Spark versions, jar versions, and so on; the setups below have been tried and verified to work.
The installation packages used in this tutorial can be downloaded here:
Link: https://pan.baidu.com/s/1Y_QlWn9gOZLSggNnVO8bGA
Extraction code: 77o7

Spark 2.4

Offline installation

  1. Download the jar packages: download lightgbmxxxx.jar and mmlspark.jar, and upload them to <Spark installation path>/jars.
  2. Download the zip: download mmlspark.zip to <Anaconda installation path>/lib/python3.7/site-packages/ and unzip it there. A quick import check is sketched below.
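
Once the jars and the unpacked mmlspark package are in place, a minimal smoke test (assuming pyspark is launched against the patched installation) is to import the estimator:

# run inside a pyspark shell; an ImportError here means mmlspark.zip was not
# unpacked into site-packages, while a JVM class-not-found error at fit() time
# means the jars are missing from <Spark installation path>/jars
from mmlspark.lightgbm import LightGBMClassifier
print(LightGBMClassifier)  # should print the class without raising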

Spark 3.1

Online installation

  1. Run directly on the command line:
spark-submit --master yarn --deploy-mode cluster --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven   xxx.py
pyspark --packages com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT --repositories=https://mmlspark.azureedge.net/maven

The commands above pull the required jars from the online repository; once the download finishes you can run the spark-lightgbm test code. The same packages can also be requested from inside a script when building the SparkSession, as sketched below.
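
As an alternative to passing --packages on the command line, the same coordinates can go through the standard spark.jars.packages and spark.jars.repositories configs. A minimal sketch, assuming the cluster can reach the MMLSpark Maven repository (the app name is illustrative):

from pyspark.sql import SparkSession

# request the mmlspark package at session build time instead of via --packages;
# spark.jars.packages / spark.jars.repositories are standard Spark configs
spark = (SparkSession.builder
         .appName("lgbm-online")
         .config("spark.jars.packages",
                 "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-49-659b7743-SNAPSHOT")
         .config("spark.jars.repositories",
                 "https://mmlspark.azureedge.net/maven")
         .getOrCreate())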

Offline installation

Some production environments cannot reach the public internet and need an offline install. The steps are as follows:
Download the jars: get lgbm_jars.tar from the Baidu netdisk link above and extract it.

  1. Put the extracted jars on an HDFS path; this tutorial uses /tmp/lgm/jars. These are exactly the jars the online install downloads, 24 in total. Then run the following command (an equivalent in-script SparkSession configuration is sketched after it):

# the same jar list is reused for every config, so define it once
JARS=hdfs:///tmp/lgm/jars/com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar,hdfs:///tmp/lgm/jars/org.scalactic_scalactic_2.12-3.0.5.jar,hdfs:///tmp/lgm/jars/io.spray_spray-json_2.12-1.3.2.jar,hdfs:///tmp/lgm/jars/com.microsoft.cntk_cntk-2.4.jar,hdfs:///tmp/lgm/jars/org.openpnp_opencv-3.2.0-1.jar,hdfs:///tmp/lgm/jars/com.jcraft_jsch-0.1.54.jar,hdfs:///tmp/lgm/jars/com.microsoft.cognitiveservices.speech_client-sdk-1.14.0.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpclient-4.5.6.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpmime-4.5.6.jar,hdfs:///tmp/lgm/jars/com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar,hdfs:///tmp/lgm/jars/com.github.vowpalwabbit_vw-jni-8.9.1.jar,hdfs:///tmp/lgm/jars/com.linkedin.isolation-forest_isolation-forest_3.0.0_2.12-1.0.1.jar,hdfs:///tmp/lgm/jars/org.scala-lang_scala-reflect-2.12.4.jar,hdfs:///tmp/lgm/jars/org.apache.httpcomponents_httpcore-4.4.10.jar,hdfs:///tmp/lgm/jars/commons-logging_commons-logging-1.2.jar,hdfs:///tmp/lgm/jars/commons-codec_commons-codec-1.10.jar,hdfs:///tmp/lgm/jars/com.chuusai_shapeless_2.12-2.3.2.jar,hdfs:///tmp/lgm/jars/org.apache.spark_spark-avro_2.12-3.0.0.jar,hdfs:///tmp/lgm/jars/org.testng_testng-6.8.8.jar,hdfs:///tmp/lgm/jars/org.typelevel_macro-compat_2.12-1.1.1.jar,hdfs:///tmp/lgm/jars/org.spark-project.spark_unused-1.0.0.jar,hdfs:///tmp/lgm/jars/org.beanshell_bsh-2.0b4.jar,hdfs:///tmp/lgm/jars/com.beust_jcommander-1.27.jar

spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.repl.local.jars=$JARS" \
  --conf "spark.yarn.dist.jars=$JARS" \
  --conf "spark.yarn.dist.pyFiles=$JARS" \
  --jars "$JARS" \
  --conf "spark.submit.pyFiles=$JARS" \
  /root/lgbm_test.py
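
The jar list can also be supplied from inside a PySpark script through the spark.jars and spark.submit.pyFiles configs. This is only a sketch under the assumption that the jars were uploaded to /tmp/lgm/jars as above; note that builder configs take effect only if they are set before the driver JVM starts, so the spark-submit form above remains the safer option in cluster mode:

from pyspark.sql import SparkSession

# comma-separated HDFS paths of the offline jars (same list as in the
# spark-submit command above; abbreviated here for readability)
jars = ",".join([
    "hdfs:///tmp/lgm/jars/com.microsoft.ml.spark_mmlspark_2.12-1.0.0-rc3-49-659b7743-SNAPSHOT.jar",
    "hdfs:///tmp/lgm/jars/com.microsoft.ml.lightgbm_lightgbmlib-3.2.100.jar",
    # ... the remaining jars from the list above ...
])

spark = (SparkSession.builder
         .appName("lgbm-offline")
         .config("spark.jars", jars)            # ship the jars to driver and executors
         .config("spark.submit.pyFiles", jars)  # expose the mmlspark Python package
         .getOrCreate())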

Test code

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier, LightGBMClassificationModel
import pandas as pd

# a small hand-labeled dataset: two numeric features and a binary label
df = pd.DataFrame([[1, 2, 1], [3, 4, 1], [2, 4, 1], [3, 9, 1],
                   [2, 2, 0], [1, 1, 0], [1, -1, 0]],
                  columns=['a', 'b', 'label'])
# (random data works too; note np.random.randint's upper bound is exclusive,
# so 0/1 labels require np.random.randint(0, 2, n), not randint(0, 1, n))

spark = SparkSession.builder.appName("dd").getOrCreate()
df = spark.createDataFrame(df)

# assemble the feature columns into the single vector column the model expects
assembler = VectorAssembler(inputCols=['a', 'b'], outputCol='features')
model_df = assembler.transform(df).select(['features', 'label'])

# train a binary classifier and score the training set
# (LightGBMRegressor lives in the same module for regression tasks)
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="label")
lgm = model.fit(model_df)
p = lgm.transform(model_df)

# persist the fitted model and print the per-feature importances
lgm.write().overwrite().save('/lgm')
feature_importances = lgm.getFeatureImportances()
print(feature_importances)

# reload the saved model and verify it still scores
model1 = LightGBMClassificationModel.load('/lgm')
p = model1.transform(model_df)
p.show()
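
To sanity-check the scored output, the standard Spark ML evaluator can be run on p. A minimal sketch, assuming the classifier emits the conventional rawPrediction column (as Spark ML classifiers do):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# AUC over the scored DataFrame `p` produced by the test code above;
# rawPredictionCol="rawPrediction" is the usual Spark ML default and is assumed here
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          labelCol="label",
                                          metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(p))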