Spark WordCount
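If "hello world" is the first thing you type when learning Java, then WordCount is the first thing you type when learning Spark.
Back when I first learned Hadoop, I ran the sample jar shipped by the Hadoop project; its wordcount program uses MapReduce to count word frequencies.
Spark is no different: its examples package also includes a WordCount sample.
In short, WordCount counts how many times each word occurs in a file or data set.
Here is the official source first: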
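import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public final class JavaWordCount {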
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: JavaWordCount <file>");
System.exit(1);
}
SparkSession spark = SparkSession
.builder()
.appName("JavaWordCount")
.getOrCreate();
JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<?,?> tuple : output) {
System.out.println(tuple._1() + ": " + tuple._2());
}
// spark.wait(1000000);
spark.stop();
}
}
For the RDD (SparkContext) API I only found this Java version among the official examples. I could not find a Scala one, so we can write it ourselves:
package it.luke.spark.sql
import org.apache.spark.sql.SparkSession
object WordCount {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.master("local")
.appName("Spark Examples")
.getOrCreate()
val sc = spark.sparkContext
sc.setLogLevel("WARN")
val rdd1 = sc.textFile("examples/data/WordCount.txt")
val rdd2 = rdd1.flatMap(item=>item.split(" "))
val rdd3 = rdd2.map(item=>(item,1))
val rdd4 = rdd3.reduceByKey((curr,agg)=>curr+agg)
val res = rdd4.collect()
res.foreach(println(_))
}
}
Prepare the file WordCount.txt under the corresponding path (examples/data/):
hadoop hadoop hadoop flume flume
java php
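Run the program.
Analysis:
This run consists of the following steps:
- Create the SparkContext object
- Read the target file to produce an RDD
- Split the data set into words and give each word an initial count of 1
- Aggregate the counts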
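The steps above can be sketched without Spark at all. The following is a minimal plain-Java illustration of the same logic applied to the sample input; the class name `WordCountSketch` and the method `count` are hypothetical names introduced only for this sketch, and it says nothing about how Spark distributes the work.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the WordCount steps; no Spark involved.
class WordCountSketch {

    // Splits each line on single spaces, pairs every word with 1, then sums per word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {    // like flatMap: line -> words
                counts.merge(word, 1, Integer::sum); // like map to (word, 1) + reduceByKey
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop hadoop hadoop flume flume", "java php");
        // like collect: the result is already local, so just print it
        count(lines).forEach((word, n) -> System.out.println("(" + word + "," + n + ")"));
    }
}
```

For the sample file this gives hadoop 3, flume 2, java 1 and php 1. The Spark job should produce the same counts, although the order of its output is not guaranteed.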
sc.textFile("examples/data/WordCount.txt")
textFile is a method of SparkContext:
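/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 *
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */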
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
// minPartitions falls back to defaultMinPartitions when the caller passes no value
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)
.map(pair => pair._2.toString)
.setName(path)
}
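The source shows that textFile delegates to hadoopFile, so it reads through Hadoop's file system layer and supports any Hadoop-supported file system.
So what is the minimum number of partitions?
When minPartitions is not passed, the default is min(defaultParallelism, 2), so it never exceeds 2: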
/**
* Default min number of partitions for Hadoop RDDs when not given by user
* Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
* The reasons for this are discussed in https://github.com/mesos/spark/pull/718
*
*/
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
def defaultParallelism: Int = {
assertNotStopped()
taskScheduler.defaultParallelism
}
// This delegates to taskScheduler; TaskScheduler is a trait, so we need its implementation, TaskSchedulerImpl
TaskSchedulerImpl implements the abstract defaultParallelism method:
override def defaultParallelism(): Int = backend.defaultParallelism()
// This delegates to backend; SchedulerBackend is also a trait with its own implementations
Among the existing implementations:
// CoarseGrainedSchedulerBackend: cluster mode
// LocalSchedulerBackend: local mode
private[spark] trait SchedulerBackend {
private val appId = "spark-application-" + System.currentTimeMillis
def start(): Unit
def stop(): Unit
def reviveOffers(): Unit
// Implemented by CoarseGrainedSchedulerBackend (cluster mode) and LocalSchedulerBackend (local mode)
def defaultParallelism(): Int
// ... (remaining members omitted)
}
Cluster mode: CoarseGrainedSchedulerBackend
override def defaultParallelism(): Int = {
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
// Without spark.default.parallelism, this takes the larger of the total core count and 2, so the default is at least 2
Local mode: LocalSchedulerBackend
override def defaultParallelism(): Int =
scheduler.conf.getInt("spark.default.parallelism", totalCores)
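// Without spark.default.parallelism, this falls back to the number of local cores (threads)
So when textFile() is called without an explicit value, the default minPartitions never exceeds 2. This is only a lower-bound hint: the actual number of partitions is decided by getSplits and can be larger.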
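To make the chain of defaults concrete, here is a small sketch that mirrors the three formulas quoted from the source. The class and method names are hypothetical, and a `null` argument stands in for "spark.default.parallelism is not set"; it is an illustration of the arithmetic, not Spark's actual code path.

```java
// Sketch of how the default minPartitions is derived; names are illustrative only.
class DefaultPartitionsSketch {

    // Cluster mode: the configured value if present, else max(totalCores, 2).
    static int clusterDefaultParallelism(Integer configured, int totalCores) {
        return configured != null ? configured : Math.max(totalCores, 2);
    }

    // Local mode: the configured value if present, else the number of local cores.
    static int localDefaultParallelism(Integer configured, int totalCores) {
        return configured != null ? configured : totalCores;
    }

    // SparkContext.defaultMinPartitions: capped at 2 by math.min.
    static int defaultMinPartitions(int defaultParallelism) {
        return Math.min(defaultParallelism, 2);
    }

    public static void main(String[] args) {
        // 16 cores in cluster mode still yields a default minPartitions of 2
        System.out.println(defaultMinPartitions(clusterDefaultParallelism(null, 16)));
        // a single local core yields 1
        System.out.println(defaultMinPartitions(localDefaultParallelism(null, 1)));
    }
}
```

Whatever the core count, the result is 1 or 2, which is why passing an explicit minPartitions matters for large inputs.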
With the minimum partition count covered, the next key piece inside textFile() is hadoopFile:
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
// This is a hack to enforce loading hdfs-site.xml.
// See SPARK-11227 for details.
// Force the Hadoop configuration files to be loaded via the local file system:
// core-default.xml, core-site.xml, mapred-default.xml,
// mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml
FileSystem.getLocal(hadoopConfiguration)
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
// Broadcasting makes this data available to every executor in the cluster
val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
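The core of this method is new HadoopRDD(...).
HadoopRDD extends the base RDD class and overrides some of its methods.
Some of the overridden methods:
getPartitions()
compute()
getPreferredLocations()
checkpoint()
persist()
The resulting HadoopRDD then becomes the first RDD in the job's lineage.
The focus here is getPartitions(): many transformations need the data of each partition, so how are those partitions determined?
Here is the source: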
override def getPartitions: Array[Partition] = {
// Read the JobConf from the broadcast variable
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
val inputFormat = getInputFormat(jobConf)
val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
val array = new Array[Partition](inputSplits.size)
for (i <- 0 until inputSplits.size) {
array(i) = new HadoopPartition(id, i, inputSplits(i))
}
array
}
getJobConf() fetches the configuration that was broadcast:
protected def getJobConf(): JobConf = {
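// Read the value from the broadcast variable
val conf: Configuration = broadcastedConf.value.value
// .... (the rest decides whether to clone the configuration, because Hadoop's Configuration object is not thread-safe)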
}
addCredentials() merges the current user's credentials into the job configuration:
override def addCredentials(conf: JobConf) {
val jobCreds = conf.getCredentials()
jobCreds.mergeAll(UserGroupInformation.getCurrentUser().getCredentials())
}
getInputFormat()
protected def getInputFormat(conf: JobConf): InputFormat[K, V] = {
// Create an instance of inputFormatClass (TextInputFormat here) via reflection
val newInputFormat = ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
.asInstanceOf[InputFormat[K, V]]
newInputFormat match {
case c: Configurable => c.setConf(conf)
case _ =>
}
newInputFormat
}
inputFormat.getSplits() is implemented by the FileInputFormat class:
public InputSplit[] getSplits(JobConf job, int numSplits)
throws IOException {
StopWatch sw = new StopWatch().start();
FileStatus[] files = listStatus(job);
// Save the number of input files for metrics/loadgen
job.setLong(NUM_INPUT_FILES, files.length);
long totalSize = 0; // compute total size
// Sum the sizes of all input files
for (FileStatus file: files) { // check we have valid files
if (file.isDirectory()) {
throw new IOException("Not a file: "+ file.getPath());
}
totalSize += file.getLen();
}
// Divide the total size by the requested number of splits to get the target size per split
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
// Minimum split size; the configured value defaults to 1
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);
// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
FileSystem fs = path.getFileSystem(job);
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(fs, path)) {
// Read the block size from the FileStatus
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
// Remaining bytes of the current file
long bytesRemaining = length;
// Splitting loop: while the remaining bytes divided by the split size exceed
// SPLIT_SLOP, cut off another split of splitSize bytes
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
// Record the file path, offset, length and hosts of this split
// and add it to the split list
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
splitHosts[0], splitHosts[1]));
// Subtract the bytes just split off and continue the loop
bytesRemaining -= splitSize;
}
// Add whatever remains after the loop as the last split
if (bytesRemaining != 0) {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
- bytesRemaining, bytesRemaining, clusterMap);
splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
splitHosts[0], splitHosts[1]));
}
} else {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,0,length,clusterMap);
splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size()
+ ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
}
// Return the splits as an array
return splits.toArray(new FileSplit[splits.size()]);
}
protected long computeSplitSize(long goalSize, long minSize,
long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
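// The target size per split is compared with the block size: if the data to read is small,
// a split does not need to be a full block, so the smaller of the two is used (but never less than minSize)
Summary: the source shows that the second argument of textFile, minPartitions, can change the split size used when reading a file. When the target size per split (total size / minPartitions) is smaller than the block size, the target size wins.
So choosing file sizes and partition counts sensibly, so that files are not cut into too many small splits, can go some way toward optimizing Spark's read and compute stages.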
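A worked example may help here. The sketch below replays the split arithmetic from getSplits for a single splittable file. `computeSplitSize` and the 1.1 value of `SPLIT_SLOP` follow the Hadoop source; the class name and the `splitLengths` helper are hypothetical, sizes are in arbitrary units, and minSize is assumed to be at least 1 (its default).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the FileInputFormat split arithmetic for one splittable file.
class SplitSizeSketch {
    static final double SPLIT_SLOP = 1.1; // the last split may be up to 10% larger

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns the length of every split; assumes minSize >= 1 so splitSize is never 0.
    static List<Long> splitLengths(long fileLength, int numSplits, long minSize, long blockSize) {
        long goalSize = fileLength / (numSplits == 0 ? 1 : numSplits);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);      // cut off one full split
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining); // the tail becomes the last split
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(splitLengths(300, 2, 1, 128)); // block size caps the split size
        System.out.println(splitLengths(100, 4, 1, 128)); // goal size is below the block size
        System.out.println(splitLengths(105, 2, 1, 128)); // tail within the 10% slop
    }
}
```

With a length of 300, minPartitions 2 and block size 128, the goal size is 150 but the block size caps it at 128, giving three splits of 128, 128 and 44. With a length of 100 and minPartitions 4, the goal size 25 wins and the file is cut into four splits of 25. With a length of 105 and minPartitions 2, the tail of 53 is within the slop of the split size 52, so it stays one split.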
With reading covered, let's look at the processing logic:
val rdd2 = rdd1.flatMap(item=>item.split(" "))
val rdd3 = rdd2.map(item=>(item,1))
val rdd4 = rdd3.reduceByKey((curr,agg)=>curr+agg)
val res = rdd4.collect()
This uses several very common operators: flatMap(), map(), reduceByKey() and collect().
Here flatMap() splits each line on spaces and flattens the resulting words into a single RDD:
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
// Clean the closure and check that it is serializable
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
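This calls flatMap on the partition's iterator.
Its logic is a loop rather than recursion: whenever the current inner iterator is exhausted, it applies f to the next source element and continues with the iterator that produces.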
def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
private var cur: Iterator[B] = empty
// Apply f to the next source element and use the result as the current inner iterator
private def nextCur() { cur = f(self.next()).toIterator }
def hasNext: Boolean = {
// Keep advancing until an inner iterator has elements or the source is exhausted
while (!cur.hasNext) {
if (!self.hasNext) return false
nextCur()
}
true
}
// next() relies on hasNext above, so elements come out flattened
def next(): B = (if (hasNext) cur else empty).next()
}
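The same lazy flattening can be written in Java. This is a minimal sketch modeled on the Scala iterator above, not Spark or Scala library code; `FlatMapIterator` and `splitWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Lazily flattens the iterators produced by applying f to each source element.
class FlatMapIterator<A, B> implements Iterator<B> {
    private final Iterator<A> source;
    private final Function<A, Iterator<B>> f;
    private Iterator<B> cur = Collections.emptyIterator();

    FlatMapIterator(Iterator<A> source, Function<A, Iterator<B>> f) {
        this.source = source;
        this.f = f;
    }

    @Override
    public boolean hasNext() {
        // Loop (no recursion): skip exhausted or empty inner iterators
        while (!cur.hasNext()) {
            if (!source.hasNext()) {
                return false;
            }
            cur = f.apply(source.next());
        }
        return true;
    }

    @Override
    public B next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return cur.next();
    }

    // Usage: split every line on spaces and collect the flattened words.
    static List<String> splitWords(List<String> lines) {
        Iterator<String> words = new FlatMapIterator<String, String>(
                lines.iterator(), line -> Arrays.asList(line.split(" ")).iterator());
        List<String> out = new ArrayList<>();
        while (words.hasNext()) {
            out.add(words.next());
        }
        return out;
    }
}
```

Nothing is split until the consumer asks for the next word, which is how a chain of narrow transformations can process a partition in one pass.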
map() is one of the most common transformations. It does nothing more than apply f to each element as the partition's iterator is consumed:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
def hasNext = self.hasNext
def next() = f(self.next())
}
reduceByKey() is a common aggregation operator; here it adds up the counts of records that share the same key:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
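The "combiner" behaviour described in the doc comment can be illustrated with plain collections. In this sketch each inner list plays the role of one partition: values are first merged per key inside a partition, and only the partial results are merged across partitions. All names are hypothetical, and no shuffle or partitioner is modeled.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Sketch of reduceByKey with a map-side combine; partitions are plain lists.
class ReduceByKeySketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Step 1: merge values per key inside a single partition (the "combiner").
    static Map<String, Integer> combineLocally(List<Map.Entry<String, Integer>> partition,
                                               BinaryOperator<Integer> func) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : partition) {
            combined.merge(kv.getKey(), kv.getValue(), func);
        }
        return combined;
    }

    // Step 2: merge the partial results of all partitions with the same function.
    static Map<String, Integer> reduceByKey(List<List<Map.Entry<String, Integer>>> partitions,
                                            BinaryOperator<Integer> func) {
        Map<String, Integer> result = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            combineLocally(partition, func).forEach((k, v) -> result.merge(k, v, func));
        }
        return result;
    }

    // Two partitions holding the (word, 1) pairs of a small input.
    static Map<String, Integer> demo() {
        return reduceByKey(Arrays.asList(
                Arrays.asList(pair("hadoop", 1), pair("hadoop", 1), pair("flume", 1)),
                Arrays.asList(pair("hadoop", 1), pair("flume", 1), pair("java", 1))),
                Integer::sum);
    }
}
```

Because the same function is applied both within and across partitions, it has to be associative and commutative, as the doc comment requires.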
collect() triggers the job and gathers the results on the driver:
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
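reduceByKey is a transformation (one that involves a shuffle), while collect is an action. (Notes on action operators still to be organized…)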
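The difference between a lazy transformation and an action that triggers execution can be felt with Java streams, used here purely as an analogy: intermediate operations such as `map` run nothing until a terminal operation such as `collect` is called. The class name and the counter are hypothetical, and this is not Spark code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: stream intermediate ops are lazy, terminal ops trigger the work.
class LazyEvaluationSketch {

    // Returns {calls made before the terminal operation, calls made after it}.
    static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        Stream<String> upper = Arrays.asList("hadoop", "flume", "java").stream()
                .map(word -> {
                    calls.incrementAndGet(); // counts how often the function really runs
                    return word.toUpperCase();
                });
        int before = calls.get();            // the pipeline is only described so far
        List<String> result = upper.collect(Collectors.toList()); // the work happens here
        int after = calls.get();
        System.out.println(result);
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println(counts[0] + " calls before collect, " + counts[1] + " after");
    }
}
```

In the same spirit, flatMap, map and reduceByKey only describe new RDDs; the job runs when collect is called.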
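To run on a cluster (YARN or standalone), submit the job with the spark-submit command.
This requires packaging our code as a jar and uploading it. Since the Spark installation already provides the runtime, we only need to package our own logic.
Create a Maven project and add the dependencies: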
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>it.luke</groupId>
<artifactId>SparkTest</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
<slf4j.version>1.7.16</slf4j.version>
<log4j.version>1.2.17</log4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>it.luke.spark.wordcount.WordCount</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Write the code:
package it.luke.spark.wordcount
import org.apache.spark.{SparkConf, SparkContext}
object WordCount {
def main(args: Array[String]): Unit = {
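// Create the Spark configuration. Values set on SparkConf take precedence over spark-submit flags,
// so the master is not hard-coded here and comes from --master instead
val conf = new SparkConf().setAppName("luke")
// Initialize Spark
val sc = new SparkContext(conf)
// Read the input file
val rdd1 = sc.textFile("/data/testData.txt")
// Split each line and flatten
val splitrdd = rdd1.flatMap(item=>item.split(" "))
// Give each word an initial count of 1
val rdd2 = splitrdd.map(item=>(item,1))
// Aggregate by key
val reducerdd = rdd2.reduceByKey((curr,agg) => curr+agg)
// Trigger execution
val res = reducerdd.collect()
// Print the result
res.foreach(println(_))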
}
}
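Package the jar and upload it to the Linux environment.
From the Spark directory, start the Spark cluster and submit the job. The --master spark://node01:7077 URL below targets a standalone cluster; to submit to YARN, pass --master yarn instead.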
bin/spark-submit --master spark://node01:7077 --class it.luke.spark.wordcount.WordCount /luke/upjars/original-SparkTest-1.0-SNAPSHOT.jar
The file read here is stored on Hadoop (HDFS).
That's the whole flow for submitting to a cluster.
To do: organize the notes on action operators.