How to Call Python Scripts from a Spark Scala/Java Application


Huawei Cloud Technology Essentials (华为云技术精粹)



Tags: cloud computing, Huawei Cloud

Article link: https://blog.csdn.net/HWCloudDeveloper/article/details/123631732


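This article explains in detail how to use PythonRunner to call Python scripts from a Spark Scala program; the same approach applies to Java programs. Communication between the JVM and Python is implemented through Py4J. The article covers the calling method, the parameter settings, and the execution strategies for different environments, including setting pythonExec, setting environment variables, and uploading Python files. It also gives concrete example run commands and notes how to specify the Python execution environment in cluster mode.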


This article describes how to call a Python script from a Spark Scala program. The process for a Spark Java program is largely the same.

1. PythonRunner

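For programs that run on the JVM (that is, Scala and Java programs), Spark provides the PythonRunner class. Calling PythonRunner's main method is all it takes to invoke a Python script from a Scala or Java program. Internally, PythonRunner is built on py4j: it constructs a GatewayServer instance so that the Python program can communicate with the JVM over a local network socket.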
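To make the call to PythonRunner's main method more concrete, here is a minimal Java sketch of how the argument array could be assembled. It assumes the argument layout used by `org.apache.spark.deploy.PythonRunner.main` in Spark's source: the script path first, then a comma-separated list of extra Python files, then the script's own arguments. Check this against the Spark version you use. The class name `PythonRunnerArgs`, the `build` helper, and all paths and arguments are hypothetical. The call to Spark is left as a comment so that the snippet runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PythonRunnerArgs {
    /**
     * Builds the argument array in the order PythonRunner.main is assumed to read it:
     * [0] script path, [1] comma-separated extra .py/.zip files, [2..] script arguments.
     */
    static String[] build(String script, List<String> pyFiles, String... scriptArgs) {
        List<String> args = new ArrayList<>();
        args.add(script);
        // String.join yields an empty string when pyFiles is empty
        args.add(String.join(",", pyFiles));
        args.addAll(Arrays.asList(scriptArgs));
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // Hypothetical paths and arguments, for illustration only
        String[] args = build(
            "/tmp/job.py",
            Arrays.asList("/tmp/util.py", "/tmp/deps.zip"),
            "--date", "2022-03-21");
        System.out.println(String.join(" ", args));

        // With Spark on the classpath, the call itself would be (assumption, not run here):
        // org.apache.spark.deploy.PythonRunner.main(args);
    }
}
```

Keeping the argument assembly in a separate helper makes the ordering easy to check in isolation before the script is launched through Spark.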

    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    // Java system properties and such
    val localhost = InetAddress.getLoopbackAddress()
    val gatewayServer = new py4j.GatewayServer.GatewayServerBuilder()
      .authToken(secret)
      .javaPort(0)
      .javaAddress(localhost)
      .callbackClient(py4j.GatewayServer.DEFAULT_PYTHON_PORT, localhost, secret)
      .build()
    val thread = new Thread(new Runnable() {
      override def run(): Unit = Utils.logUncaughtExceptions {
        gatewayServer.start()
      }
    })
    thread.setName("py4j-gateway-init")
    thread.setDaemon(true)
    thread.start()

    // Wait until the gateway server has started, so that we know which port is it bound to.
    // `gatewayServer.start()` will start a new thread and run the server code there, after
    // initializing the socket, so the thread started above will end as soon as t