IDEA is an excellent IDE. Before setting up a local Spark environment in IDEA, you need Scala and Java configured first (see my earlier articles for the details). Below I introduce two ways to set up the Spark environment.
Method 1: Modify pom.xml and add Maven dependencies (strongly recommended)
(1) First, create a Maven project: File → New → Project
(2) Check that the JDK is configured correctly, then create the project
(3) Set the project location, Name (ArtifactId), GroupId, etc.
(4) Under the project's src/main directory, create a new folder named scala and press Enter
(5) Mark the scala folder as a source folder: right-click it → Mark Directory as → Sources Root
(6) Edit pom.xml to add the required Spark dependencies, then click Enable Auto-Import in the lower-right corner and wait for the download. (If this is the first time the dependencies are imported, the download can take a long time because the default repository is hosted overseas; adding a mirror, such as the Aliyun repository used in the appendix pom, speeds it up considerably.) A progress bar appears at the bottom of the window, and a green check mark appears in the upper-right corner when the download finishes.
Note: while the download is in progress, parts of the code may be highlighted in red. This does not indicate an error; the red marks disappear once the download completes.
The Scala and Spark versions must correspond; check the Spark website or the Maven repository for the right pairing. For example, spark-core_2.12 is built against Scala 2.12, so the project's Scala SDK should also be a 2.12.x version.
Spark's Maven repository: https://mvnrepository.com/artifact/org.apache.spark/spark-core
The <build> section below contains the build configuration and plugins; you can copy it as-is. It may differ from what other people use, which does not matter.
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.4</version>
</dependency>
</dependencies>
<build>
<finalName>WordCount</finalName>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
(7) Create a Scala file: choose the Object kind and give it a name
(8) Test the program
Note: if the program uses anything Hadoop-related, such as writing files to HDFS, you will run into the exception below. This happens because winutils.exe is missing when Spark or Hadoop programs are debugged on Windows. winutils.exe is the Hadoop helper needed on Windows; it contains the basic utilities required to debug Hadoop and Spark on a Windows system.
(Error: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.)
Solution:
Download the matching version of winutils.exe (see https://github.com/cdarlint/winutils for the different versions; I used 2.7.3), then extract it to any directory; putting it under Hadoop's bin directory is recommended.
Option ①: add a new system environment variable named HADOOP_HOME whose value is the directory you extracted to. Do not include the bin directory in the path, because Spark reads HADOOP_HOME and appends bin itself. (This option is recommended.)
Option ②: or add one line of code to the program (here I downloaded hadoop-common-2.2.0-bin-master, which contains winutils.exe):
System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-common-2.2.0-bin-master")
import org.apache.spark.{SparkConf, SparkContext}

object Hello {
  def main(args: Array[String]): Unit = {
    // Option ②: point Spark at the directory containing winutils.exe (uncomment if HADOOP_HOME is not set)
    //System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-common-2.2.0-bin-master")
    // Run locally, using as many worker threads as there are logical cores
    val config = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val sc = new SparkContext(config)
    // Read every file in the "in" directory (relative to the project root)
    val lines = sc.textFile("in")
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Pair each word with a count of 1
    val wordToOne = words.map((_, 1))
    // Sum the counts per word
    val wordToSum = wordToOne.reduceByKey(_ + _)
    // Collect the results to the driver and print them
    val result = wordToSum.collect()
    result.foreach(println)
    // Method one
  }
}
(9) Test successful
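Since the pom above also pulls in spark-sql, that dependency can be exercised with a small DataFrame version of the same word count. This is only a sketch, assuming the same input directory named in:

import org.apache.spark.sql.SparkSession

object HelloSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WordCountSQL")
      .getOrCreate()
    import spark.implicits._
    // Read the "in" directory as a Dataset[String], split each line into words, and count per word
    val counts = spark.read.textFile("in")
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()
    counts.show()
    spark.stop()
  }
}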
Method 2: Import the jars from the Spark installation directory
First create a Maven project (same as in Method 1)
(1) Click File → Project Structure
(2) Click Global Libraries, click "+", and add the Spark jars
(3) Click Java
(4) Locate the jars directory of the Spark installation and click OK
(5) Click OK
(6) The new jars appear; click Apply in the lower-right corner
(7) Add Scala framework support
(8) Click Enable Auto-Import
(9) Create a Scala file and test the program; the result is the same as in Method 1. A quick way to confirm the imported jars are picked up is shown below.
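As a minimal sanity check (a sketch, not part of the original steps), printing the Spark version shows which Spark jars are actually on the classpath:

import org.apache.spark.{SparkConf, SparkContext}

object VersionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("VersionCheck"))
    // Prints the version of the Spark jars found on the classpath
    println(sc.version)
    sc.stop()
  }
}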
Appendix: pom.xml with GeoSpark included
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>spark_test01</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.3.4</spark.version>
<scala.binary.version>2.11</scala.binary.version>
<geospark.version>1.3.1</geospark.version>
</properties>
<repositories>
<repository>
<id>aliyunmaven</id>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.datasyslab</groupId>
<artifactId>geospark</artifactId>
<version>${geospark.version}</version>
</dependency>
<dependency>
<groupId>org.datasyslab</groupId>
<artifactId>geospark-sql_2.3</artifactId>
<version>${geospark.version}</version>
</dependency>
<dependency>
<groupId>org.datasyslab</groupId>
<artifactId>geospark-viz_2.3</artifactId>
<version>${geospark.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.8</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
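With this pom in place, a minimal GeoSpark smoke test could look like the sketch below. It assumes GeoSpark 1.3.1's GeoSparkSQLRegistrator and GeoSparkKryoRegistrator classes and the registered ST_Point SQL function; treat it as an illustration rather than a verified program.

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

object GeoSparkHello {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("GeoSparkTest")
      // GeoSpark recommends Kryo serialization together with its own registrator
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
      .getOrCreate()

    // Register GeoSpark's ST_ functions on this SparkSession
    GeoSparkSQLRegistrator.registerAll(spark)

    // A trivial spatial query: build a point and show it
    spark.sql("SELECT ST_Point(1.0, 2.0) AS pt").show()

    spark.stop()
  }
}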