Setting up a local Spark development environment (Maven + Scala + Java)

Development tools and software versions

IDEA      2019.2
Java      1.8
Scala     2.11.12 (bundled inside the Spark distribution; the project SDK installed below is 2.12.8)
Spark     2.4.3
Hadoop    2.7.7
Windows   Windows 10 Professional, 64-bit
CentOS    7.5

 

Deploying Spark and Hadoop in local mode

1) Download Spark and Hadoop

For Spark, choose the pre-built package, i.e. an already-compiled distribution:

http://mirror.bit.edu.cn/apache/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

For Hadoop, be sure to download the version that matches the Spark build:

http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz

 

2) Download Scala

Spark 2.4.3 supports Scala 2.12, so we download version 2.12.8:

https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.msi

Double-click the installer, choose D:\software as the installation directory, and keep clicking Next until the installation completes.

 

3) Run WinRAR as administrator and extract
hadoop-2.7.7.tar.gz and spark-2.4.3-bin-hadoop2.7.tgz into E:\bigdata\

This produces the following two directories:

E:\bigdata\hadoop-2.7.7

E:\bigdata\spark-2.4.3-bin-hadoop2.7

 

 

4) Configure environment variables

Create the SPARK_HOME, HADOOP_HOME, and SCALA_HOME environment variables:

SPARK_HOME  = E:\bigdata\spark-2.4.3-bin-hadoop2.7
HADOOP_HOME = E:\bigdata\hadoop-2.7.7
SCALA_HOME  = D:\software\scala

 

5) Configure the PATH variable

Add the Scala, Hadoop, and Spark bin directories as new entries in PATH:

%SCALA_HOME%\bin
%SCALA_HOME%\jre\bin
%HADOOP_HOME%\bin
%SPARK_HOME%\bin

 

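
A quick way to confirm that a newly opened terminal actually sees these variables (this check is an addition of mine, not part of the original walkthrough) is to read them from a Scala REPL, for example the scala command just installed:

// Paste at the scala> prompt; prints each variable, or "<not set>" if it is missing
Seq("SPARK_HOME", "HADOOP_HOME", "SCALA_HOME").foreach { name =>
  println(s"$name = ${sys.env.getOrElse(name, "<not set>")}")
}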

 

 

 

 

 

Testing that Spark local mode works

Run the spark-shell command; if you see output like the following, the setup succeeded:

 

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://...
Spark context available as 'sc' (master = local[*], app id = local-1560831166803).
Spark session available as 'spark'.
Welcome to Spark version 2.4.3
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
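
As an extra sanity check (my own addition, not shown in the screenshot above), you can run a tiny job right at the scala> prompt; summing 1 to 100 should return 5050:

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050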

 

 

1) Fixing the winutils error

By default, starting spark-shell on Windows reports an error like the following:

C:\Users\3golden> spark-shell
Missing Python executable 'python', defaulting to 'E:\bigdata\spark-2.4.3-bin-hadoop2.7\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
19/06/18 12:12:37 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        ... (stack trace truncated)
19/06/18 12:12:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

 

 

To fix this, copy the files from the archive below into Hadoop's bin directory:

 

E:\bigdata\hadoop-2.7.7\bin\hadoop.dll
E:\bigdata\hadoop-2.7.7\bin\winutils.exe

 

The archive is attached below:

 

<<winutil.rar>>
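
If you later run Spark programs from IDEA rather than from a terminal, a commonly used alternative to relying on the HADOOP_HOME environment variable (my suggestion, not covered in the original text) is to point Hadoop at the unpacked directory from code, before the SparkSession is created:

// Assumes the directory layout used in this article, with winutils.exe already copied into its bin folder.
// Must run before any Hadoop/Spark classes are initialized.
System.setProperty("hadoop.home.dir", "E:\\bigdata\\hadoop-2.7.7")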

 

Re-run spark-shell:

 

(spark-shell now starts without the winutils error and shows the same welcome banner as above, this time with app id local-1560832265811.)

 

 

Then open http://localhost:4040/jobs/ to reach the Spark web UI.

 

(Spark web UI at localhost:4040: the Jobs page for Spark 2.4.3, showing the Jobs, Stages, Storage, Environment and Executors tabs, User: 3golden, Scheduling Mode: FIFO.)

 

 

 

 

Configuring IDEA for local Spark development

 

1) Download and install IDEA

Download and install IDEA; the Community edition is sufficient.

IDEA official site:

https://www.jetbrains.com/idea/

 

 


 

 

 

Download the IDEA Community edition:

 

(Download page: the Community edition, free and open source, for JVM and Android development.)

 

 

2) Install IDEA to a local directory (details omitted)

3) Install the Scala plugin

Open IDEA and go to Configure -> Plugins.

 


 

 

Search for the Scala plugin and click Install. When the installation finishes, the status changes to Installed.

 


 

When installation completes, restart IDEA.

4) Configure the project's Scala and Java SDKs

Open Project Structure.


 

 

 

Then set each project's Java version to 1.8.

 


 

 

Then add Scala support (the Scala SDK) for each project:

 

(Project Structure: under Global Libraries, a scala-sdk-2.12.8 library whose compiler classpath and standard library point to D:\software\scala\lib, including scala-library.jar and scala-parser-combinators_2.12-1.0.7.jar.)

 

 

 

5) Create a new Spark development project

Step 1: click Create New Project.

 


 

Select a Maven project, then click Next.

 


 

Fill in the project information and click Next.

 


 

Choose the project location and click Next; the project is now created.

 


 

Next, we need to add Scala support to the project.

 

6) Add the Scala SDK to the project

 

Step 1: right-click the project and choose Add Framework Support.

 


 

 

Step 2: check Scala, then click OK.

 

(Add Frameworks Support dialog: check Scala, pick the Scala SDK library under "Use library", then click OK.)

 

 

The project view now shows Scala support:

 

 

(Project view: demo01 now lists the Scala SDK library alongside the JDK 1.8 entry.)

 

 

Under the main directory, create a directory named scala at the same level as java.

 

(Project view: src/main now contains the new scala directory alongside java and resources.)

 

 

Right-click the scala directory, choose Mark Directory as, and mark it as a source folder (Sources Root).

 


 

 

7) Writing code in the project

Add the Spark dependency to the POM file:

 


 

The exact content to insert is:

 

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"

         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

 

    <groupId>bigdata_study</groupId>

    <artifactId>demo01</artifactId>

    <version>1.0-SNAPSHOT</version>

 

    <dependencies>

        <dependency><!-- Spark dependency -->

            <groupId>org.apache.spark</groupId>

            <artifactId>spark-sql_2.12</artifactId>

            <version>2.4.3</version>

<!--            <scope>provided</scope>-->

        </dependency>

    </dependencies>

 

</project>

 

Next, wait for Maven to finish downloading all the dependencies, then create a new Scala object and run it.


 

The code is as follows:

package com.bd.test1

import org.apache.spark.sql.SparkSession

object Test1 {
  def main(args: Array[String]): Unit = {
    // Run Spark locally, using all available cores
    val spark = SparkSession.builder.appName("SimpleApplication").master("local[*]").getOrCreate()
    // Sum the elements of a small RDD; prints 6
    val rdd1 = spark.sparkContext.parallelize(List(1, 2, 3))
    print(rdd1.reduce(_ + _))
    spark.stop()
  }
}

 

 

Then run it directly; it executes successfully:

 

(IDEA run output: the eight tasks of stage 0 finish on the local executor, DAGScheduler reports "Job 0 finished: reduce at Test1.scala:7, took 0.976056 s", the result 6 is printed among the INFO logs, and the Spark web UI is stopped when spark.stop() is called.)

 

 

 

Note: if you want to deploy this code to a cluster, you must make two changes (see the sketch after this list):

1) Change the scope of the Maven Spark dependency to provided.

2) Do not call the master(...) method in the code, because the master is specified when you run spark-submit.
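
For reference, a minimal sketch of what the cluster-ready version of Test1 might look like after these two changes (the master URL is then passed via spark-submit --master instead of being hard-coded):

package com.bd.test1

import org.apache.spark.sql.SparkSession

object Test1 {
  def main(args: Array[String]): Unit = {
    // No .master(...) call here: spark-submit decides where the job runs
    val spark = SparkSession.builder.appName("SimpleApplication").getOrCreate()
    val rdd1 = spark.sparkContext.parallelize(List(1, 2, 3))
    print(rdd1.reduce(_ + _))
    spark.stop()
  }
}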
