Please contact me privately before reposting. Unauthorized reproduction is prohibited.
1. Spark Installation
Install the whole cluster with Ambari; see the "Ambari and HDP cluster installation and configuration" tutorial.
2. How Spark Works
After a job is submitted to the Spark cluster, Spark reads data from a storage system such as HDFS and builds an RDD. The RDD is split into partitions that are processed by the individual nodes; each node keeps its results in memory and can hand them on to the next node for further processing. The final result is written back to a storage system such as HDFS, MySQL, or HBase.
RDD: Resilient Distributed Dataset
An RDD is a distributed collection of objects. Essentially it is a read-only, partitioned collection of records: an RDD can be split into multiple partitions, each partition being a fragment of the dataset, and the different partitions of one RDD can be stored on different nodes of the cluster, so computation can run on those nodes in parallel.
RDDs provide a rich set of operations for common data processing, divided into "actions" and "transformations". Actions run a computation and produce output, while transformations describe the dependencies between RDDs. The key difference is that transformations (such as map, filter, groupBy, join) take an RDD and return an RDD, whereas actions (such as count, collect) take an RDD and return something that is not an RDD (a value or a result).
The typical execution flow of an RDD is:
1. An RDD is created from an external data source (or from an in-memory collection);
2. The RDD goes through a series of "transformations", each producing a new RDD that is consumed by the next transformation;
3. The last RDD is processed by an "action", and the result is written to an external data source.
RDDs use lazy evaluation: the real computation only happens at the "action". For all the "transformations" before it, Spark merely records the base datasets they were applied to and the lineage of the RDDs they produce (i.e. the dependencies between them), without triggering any actual computation, as the sketch below illustrates.
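A minimal Java sketch of this lazy behavior, assuming an existing JavaSparkContext named sc and the HDFS path used later in this post; nothing is computed until the action on the last line runs:
JavaRDD<String> lines = sc.textFile("hdfs://node:8020/words.txt"); // only the lineage is recorded
JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());  // transformation: still no computation
long n = nonEmpty.count();                                         // action: the whole chain executes now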
RDD characteristics:
- High fault tolerance
- Intermediate results are persisted in memory
- The stored data can be Java objects, avoiding unnecessary serialization and deserialization overhead
3. Spark Hello World
Local mode (Java)
- Create a Maven project spark-note, add the Spark dependency, and create a TestSpark.java class
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.jp</groupId>
<artifactId>spark-note</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>spark-note</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>2.2.0</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
- Create a words.txt file
hello java
hello scala
hello python
hi C++
hi android
- Write the code
package com.jp.spark;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
/**
* Hello world!
*
*/
public class TestSpark {
public static void main( String[] args ){
//set up the configuration
SparkConf conf = new SparkConf()
.setAppName("WordCount")
.setMaster("local");//run locally
//create the JavaSparkContext - the entry point to Spark functionality
JavaSparkContext sc = new JavaSparkContext(conf);
//create the initial RDD from the input source
JavaRDD<String> lines = sc.textFile("D:words.txt");//read the file
//============== computation ================//
//flatMap operator - split lines into words
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
private static final long serialVersionUID = 1L;
public Iterator<String> call(String line) throws Exception {
return Arrays.asList(line.split(" ")).iterator();
}
});
//map each word to (word, 1)
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String word) throws Exception {
return new Tuple2<String, Integer>(word,1);
}
});
//count occurrences of each word by key
JavaPairRDD<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
private static final long serialVersionUID = 1L;
public Integer call(Integer v1, Integer v2) throws Exception {
return v1+v2;
}
});
//action - triggers the computation
wordsCount.foreach(new VoidFunction<Tuple2<String,Integer>>() {
private static final long serialVersionUID = 1L;
public void call(Tuple2<String, Integer> wordCount) throws Exception {
System.out.println(wordCount._1+" appeared " +wordCount._2+" times");
}
});
sc.close();
}
}
- Execution result (expected output shown below)
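With the words.txt above, the console output should look roughly like the following (the ordering of the lines may differ between runs):
hello appeared 3 times
hi appeared 2 times
java appeared 1 times
scala appeared 1 times
python appeared 1 times
C++ appeared 1 times
android appeared 1 times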
- Error handling
This error can be ignored: there is no Hadoop installation in local mode, so the warning is expected and does not affect execution; once the program is submitted to the Spark cluster it no longer appears. To get rid of it in local mode, install winutils.exe locally; see "Debugging Hadoop on Windows".
Spark cluster mode (Java) - single-machine cluster, host node, IP 192.168.1.64
- Upload the Windows file words.txt to Linux
- Upload words.txt to HDFS
- Check that the file was uploaded successfully (see the commands below)
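A typical way to do the two steps above from the shell, assuming words.txt was copied to /home/spark on the Linux host (the paths are only an example):
[root@node ~]# hdfs dfs -put /home/spark/words.txt /
[root@node ~]# hdfs dfs -ls /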
- Modify part of the code
SparkConf conf = new SparkConf()
.setAppName("WordCount");
//.setMaster("local");//commented out
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("hdfs://node:8020/words.txt");//read the file from HDFS
// hdfs://node:8020 -> found in the HDFS core-site configuration in Ambari
- Add packaging plugins to pom.xml
<!-- location: right after </dependencies> -->
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>com.jp.spark.TestSpark</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<!-- location: right before </project> -->
<mainClass>com.jp.spark.TestSpark</mainClass> is the program entry point.
- Update the Maven project
- Package it
- Packaging succeeded
spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar: the jar that bundles all dependencies (a typical build command is shown below)
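A typical command-line build, assuming it is run from the project root (IDE users can run the equivalent Maven goals instead):
mvn clean package
The assembled jar is then written to the target/ directory as spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar.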
- Upload spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar to the Spark cluster's master host
- Submit the job
[root@node spark]# spark-submit --class com.jp.spark.TestSpark --num-executors 1 --executor-cores 1 /home/spark/spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Note: the cluster used in this tutorial is a single-machine, pseudo-distributed cluster. For a truly distributed cluster (several servers or virtual machines), add the parameter --master spark://192.168.1.64:7077, where the IP address is that of the cluster's master node; the full command is sketched below. Adding this parameter on a pseudo-distributed cluster will trigger an out-of-memory error, as Spark is fairly memory-hungry.
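A sketch of the submit command for a real distributed cluster, assuming the same jar path and master IP as above:
[root@node spark]# spark-submit --master spark://192.168.1.64:7077 --class com.jp.spark.TestSpark --num-executors 1 --executor-cores 1 /home/spark/spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar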
Job progress can be monitored at http://192.168.1.64:4040
Result
spark-shell mode (Scala)
- Start spark-shell (startup is fairly slow)
[root@node ~]# spark-shell
- Write the Scala word-count program
scala> val lines = sc.textFile("hdfs://node:8020/words.txt")
scala> val words = lines.flatMap(line => line.split(" "))
scala> val pairs = words.map(word => (word,1))
scala> val wordCounts = pairs.reduceByKey(_+_)
scala> wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times"))
Word-count execution flow
4. Spark Architecture
* application: the Spark program written by the user;
* driver: the control process - runs on the cluster host from which the job is submitted;
* master: handles resource scheduling and allocation - a process on the cluster's master node;
* worker: starts executors - a process on each cluster worker node;
* executor: runs tasks and stores data - a process on a cluster worker node;
* task: a unit of work running on an executor - a thread started by the executor (a configuration sketch follows);
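As a rough sketch of how an application requests these resources, the following SparkConf settings (the values are only examples and assume a YARN-managed cluster such as the Ambari/HDP one used here) ask for two executors with one core each:
SparkConf conf = new SparkConf()
        .setAppName("ArchitectureDemo")
        .set("spark.executor.instances", "2") // how many executor processes to start on the workers
        .set("spark.executor.cores", "1")     // task slots (threads) per executor
        .set("spark.executor.memory", "1g");  // memory per executor process
Each task then runs as a thread inside one of those executor processes, which is why the per-executor core count bounds the number of concurrent tasks.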
5. Creating RDDs
- Ways to create an RDD
* Call SparkContext's parallelize method to create an RDD from an in-program collection
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
public class ParallelizeCollection {
public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("collection")
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<Integer> numbers = Arrays.asList(1,2,3,4,5,6,7);
JavaRDD<Integer> numberRdd = sc.parallelize(numbers);
int sum = numberRdd.reduce(new Function2<Integer, Integer, Integer>() {
private static final long serialVersionUID = 1L;
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1+v2;
}
});
sc.close();
System.out.println(sum);
}
}
* Read an external dataset, such as a local file, HDFS, HBase, Cassandra, Amazon S3, etc.
JavaRDD<String> lines = sc.textFile("D:words.txt");
JavaRDD<String> lines = sc.textFile("hdfs://node:8020/words.txt");
6. RDD Operations: transformations and actions
transformation: computes on an RDD and produces a new RDD - it only defines and records the computation
- map
Maps each element of the RDD to a new element through the function passed to map.
Input and output partitions correspond one to one: there are as many output partitions as input partitions.
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
JavaRDD<Integer> numerRDD = sc.parallelize(numbers);
//Function<Integer, Integer>: input type, return type
JavaRDD<Integer> numAddRDD = numerRDD.map(new Function<Integer, Integer>() {
private static final long serialVersionUID = 1L;
@Override
public Integer call(Integer v1) throws Exception {
return v1*2;
}
});
numAddRDD.foreach(new VoidFunction<Integer>() {
private static final long serialVersionUID = 1L;
@Override
public void call(Integer t) throws Exception {
System.out.println(t);
}
});
sc.close();
- filter: returns a new dataset made of the original elements for which the function passed to filter returns true
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
JavaRDD<Integer> numerRDD = sc.parallelize(numbers);
//Function<Integer, Boolean>: input type, return type
JavaRDD<Integer> numAddRDD = numerRDD.filter(new Function<Integer, Boolean>() {
@Override
public Boolean call(Integer v1) throws Exception {
return v1 % 2 == 0;
}
}) ;
numAddRDD.foreach(new VoidFunction<Integer>() {
private static final long serialVersionUID = 1L;
@Override
public void call(Integer t) throws Exception {
System.out.println(t);
}
});
sc.close();
- flatMap: splits each element of the RDD into multiple elements using the given function
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<String> numbers = Arrays.asList("hello world","hello java","hello scala","hi java");
JavaRDD<String> numerRDD = sc.parallelize(numbers);
//FlatMapFunction<String, String>: input type, output element type
JavaRDD<String> numAddRDD = numerRDD.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String t) throws Exception {
return Arrays.asList(t.split(" ")).iterator();
}
});
numAddRDD.foreach(new VoidFunction<String>() {
@Override
public void call(String t) throws Exception {
System.out.println(t);
}
});
sc.close();
- groupByKey: groups a key-value (pair) RDD by key
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Tuple2<String, Integer>> scoresList = Arrays.asList(
new Tuple2<String, Integer>("A",92),
new Tuple2<String, Integer>("B",82),
new Tuple2<String, Integer>("A",72),
new Tuple2<String, Integer>("B",52),
new Tuple2<String, Integer>("A",98));
JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoresList);
JavaPairRDD<String,Iterable<Integer>> groupScores = scores.groupByKey();
groupScores.foreach(new VoidFunction<Tuple2<String,Iterable<Integer>>>() {
@Override
public void call(Tuple2<String, Iterable<Integer>> t) throws Exception {
System.out.println("name:"+t._1);
Iterator<Integer> iterator = t._2.iterator();
while(iterator.hasNext()){
System.out.println(iterator.next());
}
System.out.println("===============================");
}
});
sc.close();
- reduceByKey: merges the values of each key in a pair RDD using the given function
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Tuple2<String, Integer>> scoresList = Arrays.asList(
new Tuple2<String, Integer>("A",92),
new Tuple2<String, Integer>("B",82),
new Tuple2<String, Integer>("A",72),
new Tuple2<String, Integer>("B",52),
new Tuple2<String, Integer>("A",98));
JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoresList);
JavaPairRDD<String,Integer> groupScores = scores.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1+v2;
}
});
groupScores.foreach(new VoidFunction<Tuple2<String,Integer>>() {
@Override
public void call(Tuple2<String, Integer> t) throws Exception {
System.out.println(t._1+":"+t._2);
}
});
sc.close();
- sortByKey: sorts a pair RDD by key
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Tuple2<Integer, String>> scoresList = Arrays.asList(
new Tuple2<Integer, String>(92,"A"),
new Tuple2<Integer, String>(72,"B"),
new Tuple2<Integer, String>(56,"C"),
new Tuple2<Integer, String>(100,"D"),
new Tuple2<Integer, String>(98,"E"),
new Tuple2<Integer, String>(62,"F")
);
JavaPairRDD<Integer,String> scores = sc.parallelizePairs(scoresList);
JavaPairRDD<Integer,String> groupScores = scores.sortByKey();//ascending by default
groupScores.foreach(new VoidFunction<Tuple2<Integer,String>>() {
@Override
public void call(Tuple2<Integer, String> t) throws Exception {
System.out.println(t._2+" : "+t._1);
}
});
sc.close();
- join: performs an inner join of two RDDs by key
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Tuple2<Integer, String>> studentList = Arrays.asList(
new Tuple2<Integer, String>(101,"A"),
new Tuple2<Integer, String>(102,"B"),
new Tuple2<Integer, String>(103,"C"),
new Tuple2<Integer, String>(104,"D"),
new Tuple2<Integer, String>(105,"E"),
new Tuple2<Integer, String>(106,"F")
);
List<Tuple2<Integer, Integer>> stuScores = Arrays.asList(
new Tuple2<Integer, Integer>(101,92),
new Tuple2<Integer, Integer>(102,72),
new Tuple2<Integer, Integer>(103,56),
new Tuple2<Integer, Integer>(104,100),
new Tuple2<Integer, Integer>(105,98),
new Tuple2<Integer, Integer>(106,62)
);
JavaPairRDD<Integer,String> students = sc.parallelizePairs(studentList);
JavaPairRDD<Integer,Integer> scores = sc.parallelizePairs(stuScores);
JavaPairRDD<Integer, Tuple2<String, Integer>> stu_score = students.join(scores);
stu_score.foreach(new VoidFunction<Tuple2<Integer,Tuple2<String,Integer>>>() {
@Override
public void call(Tuple2<Integer, Tuple2<String, Integer>> t) throws Exception {
System.out.println("student id:"+t._1);
System.out.println("student name:"+t._2._1);
System.out.println("student score:"+t._2._2);
System.out.println("=================================");
}
});
sc.close();
- cogroup: for two pair RDDs, gathers the elements sharing the same key in each RDD into separate collections
SparkConf conf = new SparkConf()
.setMaster("local")
.setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the data
List<Tuple2<Integer, String>> studentList = Arrays.asList(
new Tuple2<Integer, String>(101,"A"),
new Tuple2<Integer, String>(102,"B"),
new Tuple2<Integer, String>(103,"C")
);
List<Tuple2<Integer, Integer>> stuScores = Arrays.asList(
new Tuple2<Integer, Integer>(101,92),
new Tuple2<Integer, Integer>(102,72),
new Tuple2<Integer, Integer>(103,56),
new Tuple2<Integer, Integer>(103,96),
new Tuple2<Integer, Integer>(101,96)
);
JavaPairRDD<Integer,String> students = sc.parallelizePairs(studentList);
JavaPairRDD<Integer,Integer> scores = sc.parallelizePairs(stuScores);
JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> stu_score = students.cogroup(scores);
stu_score.foreach(new VoidFunction<Tuple2<Integer,Tuple2<Iterable<String>,Iterable<Integer>>>>() {
@Override
public void call(Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> t) throws Exception {
System.out.println("student id:"+t._1);
System.out.println("student name:"+t._2._1);
System.out.println("student score:"+t._2._2);
System.out.println("=================================");
}
});
sc.close();
action: performs the final computation or processing on an RDD, such as iterating over it, reduce, saving data, or returning a result - it triggers the actual computation; a minimal sketch follows.
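A minimal Java sketch of a few common actions, assuming an existing JavaSparkContext named sc (the output path is only an example):
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
long count = nums.count();              // count the elements
List<Integer> all = nums.collect();     // bring every element back to the driver
int sum = nums.reduce((a, b) -> a + b); // aggregate the elements with a function
nums.saveAsTextFile("hdfs://node:8020/nums-out"); // save the RDD to external storage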
To be continued…