Configure Maven and Scala, and run your first Spark project
Creating a Spark project in IDEA
1. Install Maven and Scala
Download the installers first, then:
Install Maven
- Download and extract to D:\maven
- Set the environment variable MAVEN_HOME to D:\maven
- Append to the Path environment variable:
%MAVEN_HOME%\bin;
- In cmd, run
mvn -v
to check whether the installation succeeded
Install Scala
- Run the downloaded .msi file; apart from choosing the install path, just click Next through the wizard
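After it finishes, you can verify the installation from cmd the same way as Maven:
scala -version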
2. Configure Maven and Scala in IDEA
Maven:
- Create a mvn_repository directory on the D: drive to hold downloaded jars
- In IDEA:
File -> Settings -> search for "maven"
- Fill in the three fields:
Maven home path: the Maven install path
User settings file: the full path to conf\settings.xml inside the Maven directory
Local repository: the mvn_repository directory you created
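The local repository can also be pinned inside settings.xml itself, so it survives IDE reinstalls. A minimal sketch, assuming the directory created above:
<!-- conf\settings.xml: top-level element pointing Maven at the local repository -->
<localRepository>D:\mvn_repository</localRepository>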
Scala:
IDEA: File -> Project Structure -> Global Libraries -> click "+" and select Scala SDK
3. Create a Maven Spark project
File -> New -> Project -> Maven -> follow the wizard
- Create a scala directory under main and mark it as Sources Root
- Add the dependencies to the pom file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>SparkClickhouseTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <!-- full Scala version for the library, binary version for Spark artifact suffixes -->
        <scala.version>2.11.12</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <spark.version>2.3.2</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>ru.yandex.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>0.2.4</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <!-- bind Scala compilation to Maven's compile phase -->
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
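After saving the pom, reload the Maven project in IDEA; the dependencies should download into mvn_repository. You can also verify the build from cmd:
mvn clean compile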
Now write a simple WordCount:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // run locally on all available cores
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("aa")
    val sc: SparkContext = new SparkContext(conf)
    // textFile already yields one record per line, so there is no need to split on \n
    val rdd: RDD[String] = sc.textFile("hdfs://192.168.1.11:8020/cluster.sh")
    rdd.flatMap(_.split("[ \t/\"]")) // split on spaces, tabs, slashes and quotes in one pass
      .filter(_.nonEmpty)            // drop empty tokens left by consecutive delimiters
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(println)
    sc.stop()
  }
}
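foreach(println) prints the counts in no particular order. If you would rather see the most frequent words first, a small variation (a sketch, reading the same sample file as above) sorts by count and brings only the top 10 to the driver:

import org.apache.spark.{SparkConf, SparkContext}

object topWords {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("topWords"))
    sc.textFile("hdfs://192.168.1.11:8020/cluster.sh")
      .flatMap(_.split("[ \t/\"]"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false) // sort by count, descending
      .take(10)                        // only the top 10 reach the driver
      .foreach(println)
    sc.stop()
  }
}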
4. The project errors out when it is not connected to a Linux environment
Running Spark locally on Windows fails because Hadoop looks for its native Windows binaries (winutils). Fix:
- Extract Hadoop and configure the HADOOP_HOME and Path environment variables:
- HADOOP_HOME:
D:\hadoop
- Path:
%HADOOP_HOME%\bin
- Replace Hadoop's bin directory with the matching-version bin directory from winutils (the entire directory)
- Then copy hadoop.dll from that bin directory into
C:\Windows\System32
- Restart IDEA; if that doesn't help, restart the machine
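If you'd rather not change system environment variables, the same hint can be passed programmatically; a sketch, assuming the D:\hadoop path from above, placed at the very top of main before the SparkContext is created:

// Alternative to the HADOOP_HOME environment variable (must run before Spark initializes)
System.setProperty("hadoop.home.dir", "D:\\hadoop")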
Keep the console from printing noise
Rename log4j.properties.template in Spark's conf directory to log4j.properties, drop it into the Maven project's resources directory, and change the following line:
#log4j.rootCategory=INFO, console
log4j.rootCategory=ERROR, console
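Alternatively, the log level can be set per application in code, right after the SparkContext is created, with no log4j file at all:

// Show only ERROR-level log output for this application
sc.setLogLevel("ERROR")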