1. Windows Development Environment Setup and Installation
Download and install IntelliJ IDEA; free installation guides are easy to find with a quick web search.
2. Creating and Configuring a Maven Project in IDEA
1) Configure Maven
2) Create a new Project
3) Select the Maven archetype
4) Set the project name
5) Specify the Maven installation path
6) Generate the Maven project
7) Select the Scala version
Select the project and press "F4" to open the Project Structure dialog.
8) Create the java and scala source directories and a new class
At this point the right-click menu offers no option for creating a new class, so the following settings are needed first:
Then create a new package (com.zimo.spark here, or any name you prefer) and a new Scala Object, as sketched below:
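A minimal skeleton for that Scala Object might look like the following; the package and object names simply follow the example used in this article, and the snippet only serves as a sanity check that the scala directory is recognized as a source root:

package com.zimo.spark

// Minimal object used to verify that the scala source directory
// compiles and runs before any Spark code is added.
object Test {
  def main(args: Array[String]): Unit = {
    println("Scala project is set up correctly")
  }
}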
9) Edit the pom.xml file
a) Reference link 1
b) Reference link 2
<properties>
    <hadoop.version>2.6.0</hadoop.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.2.0</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- spark-sql provides SparkSession, which the WordCount program below uses -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
3. Developing a Spark Application and Testing It Locally
1) Write the WordCount program in IDEA
package com.zimo.spark

import org.apache.spark.sql.SparkSession

/**
  *
  * @author Zimo
  * @date 2019/4/17
  *
  */
object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local")
      .appName("HdfsTest")
      .getOrCreate()

    val filePath = "D://test.txt"

    // Approach 1: RDD API
    // val rdd = spark.sparkContext.textFile(filePath)
    // val lines = rdd.flatMap(x => x.split(" "))
    //   .map(x => (x, 1))
    //   .reduceByKey((a, b) => a + b)
    //   .collect()
    //   .toList
    // println(lines)

    // Approach 2: Dataset API
    import spark.implicits._
    val dataSet = spark.read.textFile(filePath)
      .flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .groupBy("_1")   // "_1" is the word column of the (word, 1) tuples
      .count()
    dataSet.show()

    /* Output of spark.read.textFile(filePath).show(), for reference:
    +--------------------+
    |               value|
    +--------------------+
    |    doop storm spark|
    |   hbase spark flume|
    |     spark rdd spark|
    |hdfs mapreduce spark|
    |      hive hdfs solr|
    |   spark flink storm|
    |      hbase storm es|
    +--------------------+
    */

    spark.stop()
  }
}
Create a test file named test.txt on the local disk:
doop storm spark
hbase spark flume
spark rdd spark
hdfs mapreduce spark
hive hdfs solr
spark flink storm
hbase storm es
4. Local Testing
The following error is reported:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/04/17 17:17:42 INFO SparkContext: Running Spark version 2.2.0
19/04/17 17:17:44 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.zimo.spark.Test$.main(Test.scala:17)
at com.zimo.spark.Test.main(Test.scala)
Solution 1:
Solution 2:
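The error means the SparkSession was built without a master URL. A minimal sketch of the fix, assuming the intent of the two solutions above is to supply that URL: either add the VM option -Dspark.master=local[*] to the IDEA run configuration (SparkConf picks up system properties starting with "spark."), or set the master explicitly when building the session:

    // Fix sketch: set the master explicitly in code.
    // (Alternatively, leave the code untouched and add the VM option
    //  -Dspark.master=local[*] to the IDEA run configuration.)
    val spark = SparkSession
      .builder
      .master("local[*]")   // run locally, using all available cores
      .appName("HdfsTest")
      .getOrCreate()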
The local test now succeeds:
5. Packaging the Spark Application
1) Modify the code
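A sketch of what the modified program might look like, assuming the change is to stop hard-coding the master and the local file path and to read the input path from args(0) instead; the spark-submit command further below passes the HDFS path as the first argument and sets the master via --master:

package com.zimo.spark

import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    // The master URL is no longer hard-coded; spark-submit supplies it
    // via --master. The input path is taken from the first argument.
    val spark = SparkSession
      .builder
      .appName("HdfsTest")
      .getOrCreate()

    val filePath = args(0)

    import spark.implicits._
    spark.read.textFile(filePath)
      .flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .groupBy("_1")
      .count()
      .show()

    spark.stop()
  }
}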
2) Compile and package the project
Once this finishes, the packaged jar can be found in the project's output directory on the local disk.
3) Submit the job with spark-submit
First upload the jar to the server.
Then start the Hadoop cluster and upload the test file to HDFS:
bin/hdfs dfs -mkdir -p /user/datas
bin/hdfs dfs -put /opt/datas/test.txt /user/datas
Run:
[kfk@bigdata-pro02 spark-2.2.0-bin]$ bin/spark-submit --master local[2] /opt/jars/Spark.jar hdfs://bigdata-pro01.kfk.com:9000/user/datas/test.txt
The result of the run:
That wraps up the main content of this section. These notes document my own learning process, and I hope they offer some useful guidance. If you found them helpful, please leave a like; if not, please bear with me, and do point out any mistakes. Follow the blog to get updates as soon as they are published. Reposting is also welcome, but the original URL must be credited prominently; the author reserves the right of final interpretation. Thanks!