
Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster

02/28/2020

In this article

Learn how to use Apache Livy, the Apache Spark REST API, to submit remote jobs to an Azure HDInsight Spark cluster. For detailed documentation, see Apache Livy.

You can use Livy to run interactive Spark shells or to submit batch jobs to be run on Spark. This article covers using Livy to submit batch jobs. The snippets in this article use cURL to make REST API calls to the Livy Spark endpoint.

Prerequisites

An Apache Spark cluster on HDInsight.

Submit an Apache Livy Spark batch job

Before you submit a batch job, you must upload the application jar to the cluster storage associated with the cluster. You can use AzCopy, a command-line utility, to do so; various other clients can also upload the data.
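
For example, with AzCopy v10 you could upload the jar with a command along these lines (a minimal sketch; the storage account, container, and file names are placeholders reused from the example below, and the command assumes you've already authorized AzCopy, for instance with azcopy login or a SAS token):

azcopy copy "C:\Temp\SparkSimpleTest.jar" "https://mystorageaccount.blob.core.windows.net/mycontainer/data/SparkSimpleTest.jar"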

curl -k --user "admin:password" -v -H "Content-Type: application/json" -X POST -d '{ "file":"", "className":"" }' 'https://.azurehdinsight.net/livy/batches' -H "X-Requested-By: admin"

Examples

If the jar file is on the cluster storage (WASBS):

curl -k --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST -d '{ "file":"wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/SparkSimpleTest.jar", "className":"com.microsoft.spark.test.SimpleFile" }' "https://mysparkcluster.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"

If you want to pass the jar filename and the class name as part of an input file (in this example, input.txt):

curl -k --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://mysparkcluster.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"

Get information on Livy Spark batches running on the cluster

Syntax:

curl -k --user "admin:password" -v -X GET "https://.azurehdinsight.net/livy/batches"

Examples

If you want to retrieve all the Livy Spark batches running on the cluster:

curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.net/livy/batches"

If you want to retrieve a specific batch with a given batch ID:

curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.net/livy/batches/{batchId}"

Delete a Livy Spark batch job

curl -k --user "admin:mypassword1!" -v -X DELETE "https://.azurehdinsight.net/livy/batches/{batchId}"

Example

Deleting a batch job with batch ID 5:

curl -k --user "admin:mypassword1!" -v -X DELETE "https://mysparkcluster.azurehdinsight.net/livy/batches/5"

Livy Spark and high availability

Livy provides high availability for Spark jobs running on the cluster. Here are a couple of examples.

If the Livy service goes down after you've submitted a job remotely to a Spark cluster, the job continues to run in the background. When Livy is back up, it restores the status of the job and reports it back.

Jupyter Notebooks for HDInsight are powered by Livy in the backend. If a notebook is running a Spark job and the Livy service gets restarted, the notebook continues to run the code cells.

Show me an example

In this section, we look at examples that use Livy Spark to submit a batch job, monitor its progress, and then delete it. The application used in this example is the one developed in the article Create a standalone Scala application and run it on an HDInsight Spark cluster. The steps here assume:

You've already copied the application jar to the storage account associated with the cluster.

You have cURL installed on the computer where you're trying these steps.

Perform the following steps:

For ease of use, set environment variables. This example is based on a Windows environment; revise the variables as needed for your environment. Replace CLUSTERNAME and PASSWORD with the appropriate values.

set clustername=CLUSTERNAME

set password=PASSWORD
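
If you're working from a bash shell instead, a hypothetical equivalent is shown below; the later snippets would then reference $clustername and $password rather than %clustername% and %password%.

export clustername=CLUSTERNAME
export password='PASSWORD'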

Verify that Livy Spark is running on the cluster by getting a list of running batches. If you're running a job using Livy for the first time, the output should return zero batches.

curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.net/livy/batches"

You should get an output similar to the following snippet:

< HTTP/1.1 200 OK

< Content-Type: application/json; charset=UTF-8

< Server: Microsoft-IIS/8.5

< X-Powered-By: ARR/2.5

< X-Powered-By: ASP.NET

< Date: Fri, 20 Nov 2015 23:47:53 GMT

< Content-Length: 34

<

{"from":0,"total":0,"sessions":[]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact

Notice how the last line in the output says total:0, which indicates that no batches are running.

Let's now submit a batch job. The following snippet uses an input file (input.txt) to pass the jar name and the class name as parameters. If you're running these steps from a Windows computer, using an input file is the recommended approach.

curl -k --user "admin:%password%" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://%clustername%.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"

The parameters in the file input.txt are defined as follows:

{ "file":"wasbs:///example/jars/SparkSimpleApp.jar", "className":"com.microsoft.spark.example.WasbIOTest" }

You should see an output similar to the following snippet:

< HTTP/1.1 201 Created

< Content-Type: application/json; charset=UTF-8

< Location: /0

< Server: Microsoft-IIS/8.5

< X-Powered-By: ARR/2.5

< X-Powered-By: ASP.NET

< Date: Fri, 20 Nov 2015 23:51:30 GMT

< Content-Length: 36

<

{"id":0,"state":"starting","log":[]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact

Notice how the last line of the output says state:starting. It also says id:0. Here, 0 is the batch ID.

You can now retrieve the status of this specific batch using the batch ID.

curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.net/livy/batches/0"

You should see an output similar to the following snippet:

< HTTP/1.1 200 OK

< Content-Type: application/json; charset=UTF-8

< Server: Microsoft-IIS/8.5

< X-Powered-By: ARR/2.5

< X-Powered-By: ASP.NET

< Date: Fri, 20 Nov 2015 23:54:42 GMT

< Content-Length: 509

<

{"id":0,"state":"success","log":["\t diagnostics: N/A","\t ApplicationMaster host: 10.0.0.4","\t ApplicationMaster RPC port: 0","\t queue: default","\t start time: 1448063505350","\t final status: SUCCEEDED","\t tracking URL: http://myspar.lpel.jx.internal.cloudapp.net:8088/proxy/application_1447984474852_0002/","\t user: root","15/11/20 23:52:47 INFO Utils: Shutdown hook called","15/11/20 23:52:47 INFO Utils: Deleting directory /tmp/spark-b72cd2bf-280b-4c57-8ceb-9e3e69ac7d0c"]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact

The output now shows state:success, which means that the job completed successfully.
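
In practice, you might poll the batch until it reaches a terminal state instead of checking it manually. A minimal sketch, assuming a bash shell with curl and jq available and the same cluster name, password, and batch ID 0 used above:

# Poll the batch every 10 seconds until it reaches a terminal state.
while true; do
  state=$(curl -s -k --user "admin:$password" "https://$clustername.azurehdinsight.net/livy/batches/0" | jq -r '.state')
  echo "Batch state: $state"
  case "$state" in
    success|dead|killed) break ;;
  esac
  sleep 10
done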

If you want, you can now delete the batch.

curl -k --user "admin:%password%" -v -X DELETE "https://%clustername%.azurehdinsight.net/livy/batches/0"

You should see an output similar to the following snippet:

< HTTP/1.1 200 OK

< Content-Type: application/json; charset=UTF-8

< Server: Microsoft-IIS/8.5

< X-Powered-By: ARR/2.5

< X-Powered-By: ASP.NET

< Date: Sat, 21 Nov 2015 18:51:54 GMT

< Content-Length: 17

<

{"msg":"deleted"}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact

The last line of the output shows that the batch was successfully deleted. Deleting a job while it's running also kills the job. If you delete a job that has completed, successfully or otherwise, the job information is deleted completely.

Updates to Livy configuration starting with HDInsight 3.5

HDInsight 3.5 clusters and later disable, by default, the use of local file paths to access sample data files or jars. Use the wasbs:// path instead to access jars or sample data files from the cluster.
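
For example, in the request body prefer the second form below (the path is illustrative):

"file":"/example/jars/SparkSimpleApp.jar"            (local path, blocked by default on HDInsight 3.5 and later)
"file":"wasbs:///example/jars/SparkSimpleApp.jar"    (storage path, allowed)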

Submitting Livy jobs for a cluster within an Azure virtual network

If you connect to an HDInsight Spark cluster from within an Azure virtual network, you can connect directly to Livy on the cluster. In that case, the URL for the Livy endpoint is http://<IP address of the headnode>:8998/batches. Here, 8998 is the port on which Livy runs on the cluster headnode. For more information on accessing services on non-public ports, see Ports used by Apache Hadoop services on HDInsight.
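
For example, a batch submission made from inside the virtual network might look like the following sketch (10.0.0.11 stands in for the headnode's private IP address, and whether the X-Requested-By header is required depends on the cluster's Livy CSRF-protection setting):

curl -H "Content-Type: application/json" -H "X-Requested-By: admin" -X POST --data '{ "file":"wasbs:///example/jars/SparkSimpleApp.jar", "className":"com.microsoft.spark.example.WasbIOTest" }' "http://10.0.0.11:8998/batches"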

Next steps
