Developing a Spark Application in IDEA
1. Building the project with Maven
Creating a project with Maven won't be covered in detail here, since there are plenty of tutorials online. When creating the project, choose a plain Scala project; the GAV (groupId/artifactId/version) coordinates are up to you.
2. Dependencies in the pom file
Add the scala-library, spark-core, and hadoop-client dependencies, as well as the Cloudera repository. The repository is especially important, since the CDH build of hadoop-client is not in Maven Central.
<!-- Pin the Scala and Spark versions in one place to simplify future upgrades -->
<properties>
  <scala.version>2.11.8</scala.version>
  <spark.version>2.2.1</spark.version>
  <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
</properties>
<!-- Cloudera repository, needed to resolve the CDH build of hadoop-client -->
<repositories>
  <repository>
    <id>cloudera</id>
    <name>cloudera</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
  </repository>
</repositories>
<dependencies>
  <!-- Scala standard library -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <!-- Spark core (the _2.11 suffix must match scala.version) -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <!-- Hadoop client, matching the CDH cluster version -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
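Note that Maven does not compile Scala sources out of the box, so you will typically also need a Scala compiler plugin in the <build> section. A minimal sketch using scala-maven-plugin (the plugin version here is an assumption; pick whichever recent release works for you):
<build>
  <plugins>
    <!-- Compiles src/main/scala; without this, mvn package ignores the Scala sources -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.2</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>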
3. Writing a simple WordCount program
package com.ruozedata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // appName and master are deliberately left unset; spark-submit supplies them
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)

    // args(0) is the input path, e.g. a file on HDFS
    val textFile = sc.textFile(args(0))

    // Split each line on tabs, emit (word, 1) pairs, and sum the counts per word
    val wc = textFile.flatMap(line => line.split("\t").map((_, 1)))
      .reduceByKey(_ + _)

    wc.collect().foreach(println)
    sc.stop()
  }
}
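If you want to run or debug the job directly inside IDEA before packaging it, you can hard-code a local master and app name on the SparkConf (a minimal sketch; remove this before submitting to a cluster so that spark-submit stays in control of these settings):
// For local debugging in the IDE only
val sparkConf = new SparkConf()
  .setAppName("WordCountApp") // name shown in the Spark UI
  .setMaster("local[2]")      // run locally with 2 threads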
4. Package, upload to the server, and test
Packaging is straightforward and not covered here; my earlier blog posts describe it in detail.
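For reference, the jar is built with the standard Maven lifecycle (the jar file name follows your artifactId and version):
mvn clean package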
# Upload the test data to HDFS
hadoop fs -put /opt/scripts/shell_test/test.txt /user/hdfs/
hadoop fs -text /user/hdfs/test.txt
hello world hello
world welcome hello
# Submit the Spark job
spark-submit \
--class com.ruozedata.spark.core.WordCountApp \
--master local[2] \
/opt/scripts/shell_test/spark-train-1.0.jar \
hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt
# Output
2019-05-07 17:43:22 INFO DAGScheduler:54 - Job 0 finished: collect at WordCountApp.scala:17, took 1.189968 s
(hello,3)
(welcome,1)
(world,2)
2019-05-07 17:43:22 INFO AbstractConnector:318 - Stopped Spark@36ac8a63{HTTP/1.1,[http/1.1]}{0.0.0.0:4052}
spark-submit parameters explained:
--class: the fully qualified main class to run inside the jar
--master: the run mode (here local[2], i.e. run locally with 2 threads)
/opt/scripts/shell_test/spark-train-1.0.jar: the location of the jar on the server
hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt: the program argument, i.e. the path of the file whose words we want to count
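As an illustration, assuming the client machine has the cluster's Hadoop configuration available, the same jar could be submitted to a YARN cluster just by changing the master (a sketch; queues and resource settings are deployment-specific):
spark-submit \
--class com.ruozedata.spark.core.WordCountApp \
--master yarn \
/opt/scripts/shell_test/spark-train-1.0.jar \
hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt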
5. Extending the code
Extend the program to read multiple input files, sort the counts in descending order, and save the result to HDFS instead of printing it.
package com.ruozedata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSortApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)

    // args(0) can be a glob such as /user/hdfs/input/*.txt, so multiple files are read
    val textFile = sc.textFile(args(0))

    // Split the input, emit (word, 1) pairs, and sum the counts per word
    val wc = textFile.flatMap(line => line.split("\t").map((_, 1)))
      .reduceByKey(_ + _)

    // Sort by count descending: swap to (count, word), sortByKey(false), swap back
    val sortedWc = wc.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))

    // args(1) is the output directory on HDFS; it must not already exist
    sortedWc.saveAsTextFile(args(1))
    sc.stop()
  }
}
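As an aside, the swap/sortByKey/swap pattern above can be written more concisely with RDD.sortBy, which sorts by an arbitrary key function; this is equivalent:
// Equivalent descending sort on the count, without the double swap
val sortedWc = wc.sortBy(_._2, ascending = false)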
Then package, upload, and test again.
# Since the code now takes an output directory as a second argument (and the main class changed), the spark-submit command changes accordingly
spark-submit \
--class com.ruozedata.spark.core.WordCountSortApp \
--master local[2] \
/opt/scripts/shell_test/spark-train-1.0.jar \
hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/input/*.txt hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/output
# Then check the result at the corresponding HDFS path
hadoop fs -text /user/hdfs/output/*
(hello,12)
(world,8)
(welcome,4)
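One caveat: saveAsTextFile throws an error if the output directory already exists, so delete it before re-running the job:
hadoop fs -rm -r /user/hdfs/output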
6. Running the code in spark-shell
So far we have packaged the code and submitted it with spark-submit; the same code can also be run interactively in spark-shell, where a SparkContext is already available as sc.
// Load the input data from HDFS
scala> val textFile = sc.textFile("hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/input/*.txt")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/input/*.txt MapPartitionsRDD[10] at textFile at <console>:24
// Run the word count
scala> val wc = textFile.flatMap(line => line.split("\t")
| .map((_,1))).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:26
scala> wc.collect
2019-05-07 18:19:53 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
res2: Array[(String, Int)] = Array((hello,12), (welcome,4), (world,8))
// Sort the counts in descending order
scala> val sortedWc = wc.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1))
sortedWc: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[17] at map at <console>:25
scala> sortedWc.collect
res3: Array[(String, Int)] = Array((hello,12), (world,8), (welcome,4))
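The whole pipeline can also be typed as a single chain in the shell; this is the same computation as above, written in one expression (using the sortBy shorthand mentioned earlier):
scala> sc.textFile("hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/input/*.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).collect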