Spark Hive local debugging & submitting jobs to a YARN cluster

This post describes how to debug locally and how to submit jobs to the YARN cluster when Spark is deployed under YARN on a remote server.

First, copy core-site.xml, hive-site.xml, and yarn-site.xml from the cluster into the project's resources directory and into target/classes.
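To quickly confirm that these files are actually visible on the runtime classpath, a check like the following can be dropped into the program (a minimal sketch; only the three file names above come from this post, the rest is illustrative):

    // Each config file should resolve to a URL on the classpath;
    // a null here means it was not copied into resources / target/classes.
    Seq("core-site.xml", "hive-site.xml", "yarn-site.xml").foreach { name =>
      println(s"$name -> ${getClass.getClassLoader.getResource(name)}")
    }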

Local debugging

    System.setProperty("HADOOP_USER_NAME", "user")  // replace "user" with a username that has the required permissions

    val spark = SparkSession
      .builder()
      .enableHiveSupport()
      // local mode
      .master("local[*]")
      .appName("name")
      // point to the remote data warehouse
      .config("spark.sql.warehouse.dir", "hdfs://xxxx")
      // Without this, only metadata queries such as "show tables" work; any select keeps
      // failing with "Connection timed out: no further information"
      .config("dfs.client.use.datanode.hostname", "true")
      .getOrCreate()

    // TODO: your logic and operations
    spark.sql("show tables").show()
    spark.sql("select * from table1").show()

    // TODO: shut down
    spark.close()

Submitting a job to YARN

  1. Write the code:
    val sparkConf = new SparkConf()
      .setMaster("yarn")  // run on YARN
      .setAppName("SparkSQL")

    val spark = SparkSession
      .builder()
      .enableHiveSupport()
      .config(sparkConf)
      .config("spark.sql.warehouse.dir", "hdfs://xxxx")  // warehouse location
      .getOrCreate()

    // TODO: your logic and operations
    val df: DataFrame = spark.sql("select * from table1")
    df.write.format("csv").save("hdfs://hdp:8020/output/result.csv")  // save the result (see the note below)

    // TODO: shut down
    spark.close()
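    A note on the save above: Spark writes the target path as a directory of part files, not a single result.csv. If one part file with a header is preferred, something along these lines can be used instead (a sketch; coalesce(1) pushes all rows through a single task, so only do this for small result sets, and the output path here is illustrative):

    // Still produces a directory, but containing a single part-*.csv with a header row
    df.coalesce(1)
      .write
      .option("header", "true")
      .mode("overwrite")
      .format("csv")
      .save("hdfs://hdp:8020/output/result_csv")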
  2. Once the code is written, make absolutely sure to build the project before packaging with Maven.
  3. Package with Maven by clicking package.
    Decide whether the dependencies need to be bundled into the jar; if so, add the following to pom.xml:
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

With the dependencies bundled, the jar becomes much larger, so decide based on your needs; if the cluster already ships Spark and Hadoop, those dependencies can typically be given provided scope so they are not packaged.
  4. Upload the jar to the cluster.
  5. Copy the fully qualified class name.
  6. Submit the job to the YARN cluster:

spark-submit \
--class <fully.qualified.ClassName> \
--master yarn \
--deploy-mode cluster \
./xxxx/yyyy.jar

Add parameters to the code and to the spark-submit command according to your needs.
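For example, instead of hard-coding the table name and output path, they can be passed in as program arguments (a sketch; the object name SparkSQLJob and the two-argument layout are only an illustration, everything else mirrors the code above):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object SparkSQLJob {
      def main(args: Array[String]): Unit = {
        // expected arguments: <table name> <output path>
        val Array(tableName, outputPath) = args

        val sparkConf = new SparkConf()
          .setMaster("yarn")
          .setAppName("SparkSQL")

        val spark = SparkSession
          .builder()
          .enableHiveSupport()
          .config(sparkConf)
          .config("spark.sql.warehouse.dir", "hdfs://xxxx")
          .getOrCreate()

        spark.sql(s"select * from $tableName")
          .write.format("csv").save(outputPath)

        spark.close()
      }
    }

On the submit side, the program arguments simply go after the jar path in the spark-submit command, and resource flags such as --num-executors or --executor-memory can be added before the jar as needed.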
