Spark Cluster Word Count - Spark Tutorial

This article describes how to run a word count on a Spark cluster, covering Spark installation, how Spark works, and the execution process and characteristics of RDDs. A simple `wordCount` example walks through creating a Spark project, writing the code, and submitting it to the cluster for execution. RDD operations such as map, filter, flatMap, and groupByKey are also covered, along with an initial look at Spark's architecture.

Please contact the author before reposting. Unauthorized reproduction is prohibited.

1. Spark Installation

Install the whole cluster with Ambari; see the tutorial on Ambari and HDP cluster installation and configuration.

2. How Spark Works

[Figure: Spark workflow]

After a job is submitted to the Spark cluster, Spark reads data from a storage system such as HDFS and creates an RDD. The RDD is split into partitions that are processed by the cluster nodes; each node keeps the result of its processing in memory and can pass it on to the next stage for further processing. Finally, the end result is written back to a storage system such as HDFS, MySQL, or HBase.

RDD: Resilient Distributed Dataset
An RDD is a distributed collection of objects; in essence it is a read-only, partitioned collection of records. Each RDD can be divided into multiple partitions, each of which is a fragment of the dataset, and different partitions of the same RDD can be stored on different nodes of the cluster, so the RDD can be processed in parallel across those nodes.
RDDs offer a rich set of operations for common data processing, split into two types: actions and transformations. Actions run a computation and produce output; transformations define the dependencies between RDDs. The key difference is that transformations (e.g. map, filter, groupBy, join) take an RDD and return a new RDD, whereas actions (e.g. count, collect) take an RDD and return something that is not an RDD, i.e. a value or a result.

The typical execution flow of an RDD is as follows:

1. An RDD is created by reading an external data source (or an in-memory collection);
2. The RDD goes through a series of transformations, each producing a new RDD for the next transformation to use;
3. The final RDD is processed by an action and the result is written to an external data source.


RDDs are lazily evaluated: during execution (see Figure 2.2), the real computation only happens when an action is invoked on the RDD. For all transformations before that action, Spark merely records the base datasets they were applied to and the lineage of the RDDs they produce, i.e. the dependencies between them, without triggering any actual computation.
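
A minimal Java sketch of this behavior (assuming the Java 8 Spark API used later in this tutorial and an illustrative input path): the transformation below only records lineage, and nothing is read or computed until the action count() runs.

SparkConf conf = new SparkConf().setAppName("LazyDemo").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("D:/words.txt");              //nothing is read yet
JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty()); //transformation: only lineage is recorded
long n = nonEmpty.count();                                        //action: the whole computation runs here
System.out.println("non-empty lines: " + n);
sc.close();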

[Figure 2.2]

RDD characteristics:

  • High fault tolerance
  • Intermediate results can be persisted in memory (see the sketch after this list)
  • The data stored can be Java objects, which avoids unnecessary serialization and deserialization overhead
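
In-memory persistence is opt-in per RDD. A hedged sketch (paths are illustrative, imports as in the examples below): cache() marks an RDD so that, after the first action computes it, its partitions stay in memory and later actions reuse them instead of recomputing the lineage.

JavaRDD<String> lines = sc.textFile("D:/words.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
words.cache();                              //keep the computed partitions in memory
long total = words.count();                 //first action: computes and caches
long distinct = words.distinct().count();   //second action: reuses the cached data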

3. Spark Hello World

Local mode (Java)
  • Create a Maven project spark-note, add the Spark dependency, and create a TestSpark.java class

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.jp</groupId>
  <artifactId>spark-note</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>spark-note</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spark.version>2.2.0</spark.version>
  </properties>

  <dependencies>
    <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
    </dependency>
    <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
     </dependency>
  </dependencies>
</project>
  • Create a words.txt file
hello java
hello scala
hello python
hi C++
hi android
  • Write the code
package com.jp.spark;

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

/**
 * Hello world!
 *
 */
public class TestSpark {
    public static void main( String[] args ){
	//set up the Spark configuration
	SparkConf conf = new SparkConf()
	                 .setAppName("WordCount")
	                 .setMaster("local");//本地运行

	//create the JavaSparkContext - the entry point to Spark functionality
	JavaSparkContext sc = new JavaSparkContext(conf);

	//create the initial RDD from the input source
	JavaRDD<String> lines = sc.textFile("D:words.txt");//read the input file
	
	//============== computation ==============//
	//flatMap operator - split each line into words
	JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
		private static final long serialVersionUID = 1L;
		public Iterator<String> call(String line) throws Exception {
			return Arrays.asList(line.split(" ")).iterator();
		}
	});

	//map each word to a (word, 1) pair
	JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
		private static final long serialVersionUID = 1L;
		public Tuple2<String, Integer> call(String word) throws Exception {
			return new Tuple2<String, Integer>(word,1);
		}
	});
	
	//count the occurrences of each word by key
	JavaPairRDD<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
		private static final long serialVersionUID = 1L;
		public Integer call(Integer v1, Integer v2) throws Exception {
			return v1+v2;
		}
	});

	//action - triggers the computation
	wordsCount.foreach(new VoidFunction<Tuple2<String,Integer>>() {
		private static final long serialVersionUID = 1L;
		public void call(Tuple2<String, Integer> wordCount) throws Exception {
			System.out.println(wordCount._1+" appeared " +wordCount._2+" times");
		}
	});
	sc.close();
    }
}
  • Execution result: each word is printed with the number of times it appears

  • Error handling


This error can be ignored: in local mode there is no Hadoop installation, so the warning is expected and does not affect execution; it does not appear once the program is submitted to the Spark cluster. To get rid of it in local mode, install winutils.exe locally; see the guide on debugging Hadoop on Windows.
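
A commonly used workaround (a hedged sketch; the path below is only an example) is to point hadoop.home.dir at a directory containing bin\winutils.exe before the SparkContext is created:

//Assumes winutils.exe has been placed under D:/hadoop/bin (example path)
System.setProperty("hadoop.home.dir", "D:/hadoop");
SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);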


Spark cluster mode (Java) - single-machine cluster, hostname node, IP 192.168.1.64
  • Upload the Windows file words.txt to Linux

  • Upload words.txt to HDFS

  • Check that the file was uploaded successfully

  • Modify part of the code
SparkConf conf = new SparkConf()
		 .setAppName("WordCount");
		  //.setMaster("local");//commented out for cluster mode
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("hdfs://node:8020/words.txt");//read the file from HDFS

// hdfs://node:8020 -> taken from the HDFS core-site configuration, visible in Ambari
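
In cluster mode the output of foreach is printed on the executors rather than in the submitting console, so a common variant (a sketch, with an illustrative output path) is to save the counts back to HDFS instead of printing them:

//Replace the foreach(...) action with a save back to HDFS.
//The output directory is illustrative and must not already exist.
wordsCount.saveAsTextFile("hdfs://node:8020/output/wordcount");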
  • Add the packaging plugins to pom.xml
<!-- place after </dependencies> -->
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.jp.spark.TestSpark</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
<!-- place before </project> -->

<mainClass>com.jp.spark.TestSpark</mainClass> is the program entry point.
  • Update the Maven project

  • Package the project

  • Packaging succeeded

spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar: the jar that includes all dependencies
  • Upload spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar to the Spark cluster's master node

  • Submit the job for execution
[root@node spark]# spark-submit --class com.jp.spark.TestSpark --num-executors 1  --executor-cores 1 /home/spark/spark-note-0.0.1-SNAPSHOT-jar-with-dependencies.jar 

Note: the cluster used in this tutorial is a single-machine, pseudo-distributed cluster. On a truly distributed cluster (several servers or virtual machines), add the parameter --master spark://192.168.1.64:7077, where the IP address is that of the cluster's master node. Adding this parameter in the pseudo-distributed setup causes an out-of-memory error, as Spark is quite memory hungry.

Job progress can be monitored at http://192.168.1.64:4040


Result: each word and its count appear in the job output


spark-shell mode (Scala)
  • Start spark-shell (startup is fairly slow)
[root@node ~]# spark-shell

  • Write the Scala wordCount program
scala> val lines = sc.textFile("hdfs://node:8020/words.txt")
scala> val words = lines.flatMap(line => line.split(" "))
scala> val pairs = words.map(word => (word,1))
scala> val wordCounts = pairs.reduceByKey(_+_)
scala> wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times"))


[Figure: wordCount execution process]

4. Spark Architecture

* application: the Spark application written by the user;
* driver: the control process - runs on the cluster host from which the job is submitted;
* master: resource scheduling and allocation - a process on the cluster's master node;
* worker: starts executors - a process on each cluster worker node;
* executor: runs tasks and stores data - a process on a cluster worker node (see the configuration sketch below);
* task: a unit of work running on an executor - a thread started by the executor;
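
On a YARN-managed cluster such as the Ambari/HDP setup used here, the number of executor processes and the task threads per executor can also be set through configuration; a hedged sketch with illustrative values:

SparkConf conf = new SparkConf()
        .setAppName("WordCount")
        .set("spark.executor.instances", "1")  //number of executor processes (like --num-executors)
        .set("spark.executor.cores", "1")      //task threads per executor (like --executor-cores)
        .set("spark.executor.memory", "1g");   //memory per executor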


5. Creating RDDs

  • Ways to create an RDD
* Call SparkContext's parallelize method to create an RDD from an in-program collection
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ParallelizeCollection {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf()
				        .setAppName("collection")
				        .setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);
		List<Integer> numbers = Arrays.asList(1,2,3,4,5,6,7);
		JavaRDD<Integer> numberRdd = sc.parallelize(numbers);
		int sum = numberRdd.reduce(new Function2<Integer, Integer, Integer>() {

			private static final long serialVersionUID = 1L;
			@Override
			public Integer call(Integer v1, Integer v2) throws Exception {
				return v1+v2;
			}
		});
		sc.close();
		System.out.println(sum);
	}
}

* Read an external dataset, such as a local file, HDFS, HBase, Cassandra, or Amazon S3 (see the partitioning note after these examples)
JavaRDD<String> lines = sc.textFile("D:words.txt");
JavaRDD<String> lines = sc.textFile("hdfs://node:8020/words.txt");

6. RDD Operations: transformations and actions

transformation: processes an RDD and produces a new RDD; it only defines and records the computation
  • map

Each element of the RDD is mapped to a new element by the function passed to map.

Input and output partitions correspond one to one: the output RDD has as many partitions as the input RDD.

SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
JavaRDD<Integer> numerRDD = sc.parallelize(numbers);

//Function<Integer, Integer>: input type, return type
JavaRDD<Integer> numAddRDD = numerRDD.map(new Function<Integer, Integer>() {
	private static final long serialVersionUID = 1L;
	@Override
	public Integer call(Integer v1) throws Exception {
		return v1*2;
	}
}); 

numAddRDD.foreach(new VoidFunction<Integer>() {
	private static final long serialVersionUID = 1L;
	@Override
	public void call(Integer t) throws Exception {
		System.out.println(t);
	}
});

sc.close();
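
Since the project targets Java 8, the same map example can be written more compactly with lambdas; this sketch is equivalent to the anonymous classes above:

JavaRDD<Integer> numberRDD = sc.parallelize(Arrays.asList(1,2,3,4,5,6));
JavaRDD<Integer> doubled = numberRDD.map(v -> v * 2);    //same as the Function above
doubled.foreach(t -> System.out.println(t));             //same as the VoidFunction above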

  • filter: returns a new dataset made of the original elements for which the function passed to filter returns true
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
JavaRDD<Integer> numerRDD = sc.parallelize(numbers);

//Function<Integer, Boolean>: input type, return type
JavaRDD<Integer> numAddRDD = numerRDD.filter(new Function<Integer, Boolean>() {

	@Override
	public Boolean call(Integer v1) throws Exception {
		
		return v1 % 2 == 0;
	}
}) ;

numAddRDD.foreach(new VoidFunction<Integer>() {
	private static final long serialVersionUID = 1L;
	@Override
	public void call(Integer t) throws Exception {
		System.out.println(t);
	}
});

sc.close();

  • flatMap: uses the given function to break one RDD element into multiple elements
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<String> numbers = Arrays.asList("hello world","hello java","hello scala","hi java");
JavaRDD<String> numerRDD = sc.parallelize(numbers);

//FlatMapFunction<String, String>: input type, output element type
JavaRDD<String> numAddRDD = numerRDD.flatMap(new FlatMapFunction<String, String>() {
	@Override
	public Iterator<String> call(String t) throws Exception {
		return Arrays.asList(t.split(" ")).iterator();
	}
});

numAddRDD.foreach(new VoidFunction<String>() {
	@Override
	public void call(String t) throws Exception {
		System.out.println(t);
	}
});

sc.close();

  • groupByKey: groups a key-value (pair) RDD by key
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Tuple2<String, Integer>> scoresList = Arrays.asList(
				new Tuple2<String, Integer>("A",92),
				new Tuple2<String, Integer>("B",82),
				new Tuple2<String, Integer>("A",72),
				new Tuple2<String, Integer>("B",52),
				new Tuple2<String, Integer>("A",98));

JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoresList);

JavaPairRDD<String,Iterable<Integer>> groupScores = scores.groupByKey();
groupScores.foreach(new VoidFunction<Tuple2<String,Iterable<Integer>>>() {

	@Override
	public void call(Tuple2<String, Iterable<Integer>> t) throws Exception {
		System.out.println("name:"+t._1);
		Iterator<Integer> iterator = t._2.iterator();
		while(iterator.hasNext()){
			System.out.println(iterator.next());
		}
		System.out.println("===============================");
	}
});
sc.close();

  • reduceByKey: merges the values of each key in a pair RDD with the given function
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Tuple2<String, Integer>> scoresList = Arrays.asList(
				new Tuple2<String, Integer>("A",92),
				new Tuple2<String, Integer>("B",82),
				new Tuple2<String, Integer>("A",72),
				new Tuple2<String, Integer>("B",52),
				new Tuple2<String, Integer>("A",98));

JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoresList);

JavaPairRDD<String,Integer> groupScores = scores.reduceByKey(new Function2<Integer, Integer, Integer>() {
	
	@Override
	public Integer call(Integer v1, Integer v2) throws Exception {
		return v1+v2;
	}
});
groupScores.foreach(new VoidFunction<Tuple2<String,Integer>>() {
	@Override
	public void call(Tuple2<String, Integer> t) throws Exception {
		System.out.println(t._1+":"+t._2);
	}
});
sc.close();
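
Unlike groupByKey, reduceByKey combines values per key on each partition before the shuffle, so it is usually preferred for aggregations. A sketch that builds on the scores pair RDD above to compute the average value per key:

//pair each value with a count of 1, sum both fields per key, then divide
JavaPairRDD<String, Tuple2<Integer, Integer>> withCount =
        scores.mapValues(v -> new Tuple2<Integer, Integer>(v, 1));
JavaPairRDD<String, Tuple2<Integer, Integer>> sums =
        withCount.reduceByKey((a, b) -> new Tuple2<Integer, Integer>(a._1 + b._1, a._2 + b._2));
sums.foreach(t -> System.out.println(t._1 + " avg: " + (double) t._2._1 / t._2._2));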

  • sortByKey: sorts a pair RDD by key
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Tuple2<Integer, String>> scoresList = Arrays.asList(
				new Tuple2<Integer, String>(92,"A"),
				new Tuple2<Integer, String>(72,"B"),
				new Tuple2<Integer, String>(56,"C"),
				new Tuple2<Integer, String>(100,"D"),
				new Tuple2<Integer, String>(98,"E"),
				new Tuple2<Integer, String>(62,"F")
							);

JavaPairRDD<Integer,String> scores = sc.parallelizePairs(scoresList);

JavaPairRDD<Integer,String> groupScores = scores.sortByKey();//ascending by default
groupScores.foreach(new VoidFunction<Tuple2<Integer,String>>() {
	@Override
	public void call(Tuple2<Integer, String> t) throws Exception {
		System.out.println(t._2+" : "+t._1);
	}

});
sc.close();
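
To sort in descending order instead, pass false to sortByKey; a sketch against the same scores pair RDD:

//sortByKey(false) sorts the keys in descending order
JavaPairRDD<Integer, String> descending = scores.sortByKey(false);
descending.foreach(t -> System.out.println(t._2 + " : " + t._1));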

  • join: performs an inner join of two RDDs on their keys
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Tuple2<Integer, String>> studentList = Arrays.asList(
			new Tuple2<Integer, String>(101,"A"),
			new Tuple2<Integer, String>(102,"B"),
			new Tuple2<Integer, String>(103,"C"),
			new Tuple2<Integer, String>(104,"D"),
			new Tuple2<Integer, String>(105,"E"),
			new Tuple2<Integer, String>(106,"F")
							);
List<Tuple2<Integer, Integer>> stuScores = Arrays.asList(
			new Tuple2<Integer, Integer>(101,92),
			new Tuple2<Integer, Integer>(102,72),
			new Tuple2<Integer, Integer>(103,56),
			new Tuple2<Integer, Integer>(104,100),
			new Tuple2<Integer, Integer>(105,98),
			new Tuple2<Integer, Integer>(106,62)
							);
JavaPairRDD<Integer,String> students = sc.parallelizePairs(studentList);
JavaPairRDD<Integer,Integer> scores = sc.parallelizePairs(stuScores);
JavaPairRDD<Integer, Tuple2<String, Integer>> stu_score = students.join(scores);
stu_score.foreach(new VoidFunction<Tuple2<Integer,Tuple2<String,Integer>>>() {

	@Override
	public void call(Tuple2<Integer, Tuple2<String, Integer>> t) throws Exception {
		System.out.println("student id:"+t._1);
		System.out.println("student name:"+t._2._1);
		System.out.println("student score:"+t._2._2);
		System.out.println("=================================");
	}
});

sc.close();

  • cogroup: for two pair RDDs, collects the elements that share the same key in each RDD into separate iterables
SparkConf conf = new SparkConf()
         .setMaster("local")
         .setAppName("Transformation");
JavaSparkContext sc = new JavaSparkContext(conf);
//build the test data
List<Tuple2<Integer, String>> studentList = Arrays.asList(
			new Tuple2<Integer, String>(101,"A"),
			new Tuple2<Integer, String>(102,"B"),
			new Tuple2<Integer, String>(103,"C")
							);
List<Tuple2<Integer, Integer>> stuScores = Arrays.asList(
			new Tuple2<Integer, Integer>(101,92),
			new Tuple2<Integer, Integer>(102,72),
			new Tuple2<Integer, Integer>(103,56),
			new Tuple2<Integer, Integer>(103,96),
			new Tuple2<Integer, Integer>(101,96)
			
							);
JavaPairRDD<Integer,String> students = sc.parallelizePairs(studentList);
JavaPairRDD<Integer,Integer> scores = sc.parallelizePairs(stuScores);
JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> stu_score = students.cogroup(scores);
stu_score.foreach(new VoidFunction<Tuple2<Integer,Tuple2<Iterable<String>,Iterable<Integer>>>>() {

	@Override
	public void call(Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> t) throws Exception {
		System.out.println("student id:"+t._1);
		System.out.println("student name:"+t._2._1);
		System.out.println("student score:"+t._2._2);
		System.out.println("=================================");
		
	}
});

sc.close();


action: the final computation or processing on an RDD, such as iterating over it, reducing it, saving data, or returning a result to the driver; an action is what triggers the computation.
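
A short sketch of common actions, applied to the wordsCount pair RDD from the wordCount example in section 3 (the output path is illustrative):

long n = wordsCount.count();                                     //number of distinct words
List<Tuple2<String, Integer>> all = wordsCount.collect();        //bring all results to the driver
Tuple2<String, Integer> first = wordsCount.first();              //first element
wordsCount.saveAsTextFile("hdfs://node:8020/output/wordcount");  //save the results to external storage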

To be continued...
