scala> val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/input/wc.input")
scala> val wordRdd = rdd.flatMap(_.split(" "))
scala> val kvRdd = wordRdd.map((_,1))
scala> val wordCountRdd = kvRdd.reduceByKey(_ + _)
scala> wordCountRdd.collect
scala> wordCountRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/sparkOutput")
hdfs -> rdd -> wordRdd -> kvRdd -> wordCountRdd -> hdfs
RDD lineage: each RDD records the chain of transformations it was derived from
=====================================================
For key-value RDDs, a partition function (partitioner) can be specified
map-01
reduce-01
map-02
reduce-02
map-03
When an RDD split is processed, Spark tries to run the computation on the machine that holds that split's data
[Move the computation, not the data]
=====================================================
val keyRdd = kvRdd.groupByKey(2)
wordRdd        kvRdd          keyRdd
------- map
spark          (spark,1)      (spark,list(1,1,1,1,1))
hadoop         (hadoop,1)     (rdd,list(1,1))
hdfs           (hdfs,1)
spark          (spark,1)
-------- map
spark          (spark,1)
yarn           (yarn,1)       --------------------
rdd            (rdd,1)        (hadoop,list(1))
spark          (spark,1)      (hdfs,list(1))
--------- map                 (yarn,list(1))
spark          (spark,1)
rdd            (rdd,1)
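The two-partition grouping above depends on a partition function that maps every key to exactly one partition. The following is a minimal plain-Scala sketch of hash partitioning, the idea behind Spark's default HashPartitioner. It runs without Spark; the helpers nonNegativeMod and partitionOf and the sample words are assumptions made for illustration, and the actual placement of keys may differ from the diagram.

```scala
// Plain-Scala sketch (no Spark needed): assign each key to one of numPartitions buckets.
// nonNegativeMod and partitionOf are illustrative helpers, not Spark API.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw // hashCode can be negative
}

def partitionOf(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// Sample (word, 1) pairs, as produced by wordRdd.map((_, 1)).
val kvPairs = Seq("spark", "hadoop", "hdfs", "spark", "spark",
  "yarn", "rdd", "spark", "spark", "rdd").map((_, 1))

// Route every pair to its partition, then group the values per key inside each partition.
val numPartitions = 2
val grouped: Map[Int, Map[String, Seq[Int]]] =
  kvPairs
    .groupBy { case (key, _) => partitionOf(key, numPartitions) }
    .map { case (partition, pairs) =>
      partition -> pairs.groupBy(_._1).map { case (key, kv) => key -> kv.map(_._2) }
    }

grouped.foreach { case (partition, keys) => println(s"partition $partition: $keys") }
```

Because the partition is a pure function of the key, all values of one key end up together, which is what lets the grouping finish without further data movement.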
----------------------------------------------------------------------------------------------
Spark Standalone deployment
Master
Worker
## WordCount
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/input/wc.input")
val wordcount = rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
wordcount.collect
wordcount.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/sparkOutput99")
======result
(hive,2)
(mapreduce,1)
(mapreduce2,1)
(hadoop,5)
(hdfs,1)
Observation:
Unlike a MapReduce job, the Spark WordCount program does not sort its output by key.
## Key Sort
wordcount.sortByKey().collect // ascending by default
wordcount.sortByKey(true).collect
wordcount.sortByKey(false).collect
===============================================================
------result
(hive,2)
(mapreduce,1)
(mapreduce2,1)
(hadoop,5)
(hdfs,1)
Requirement:
Sort by value in descending order
## Value Sort
wordcount.map(x => (x._2,x._1)).sortByKey(false).collect
wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).collect
## Top N
wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).take(3)
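As a cross-check of the swap, sort, swap-back pattern above, here is a minimal sketch on a local Scala collection that holds the sample WordCount result shown earlier. It is not Spark code: counts, byValueDesc, and topN are illustrative names. On an RDD, sortBy(_._2, ascending = false) is another way to reach the same ordering.

```scala
// Local-collection sketch of the value sort; counts mirrors the sample WordCount result.
val counts = Seq(("hive", 2), ("mapreduce", 1), ("mapreduce2", 1), ("hadoop", 5), ("hdfs", 1))

// Swap to (count, word), sort by count in descending order, then swap back.
val byValueDesc = counts
  .map(_.swap)
  .sortBy(_._1)(Ordering[Int].reverse)
  .map(_.swap)

// Top N: keep only the first topN entries of the sorted result.
val topN = 3
val top = byValueDesc.take(topN)

top.foreach(println)
```

Sorting the whole data set just to keep a few entries is simple but does more work than necessary; for large inputs a dedicated top-N operation is usually the better fit.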
+++++++++++++++++++++++++++++++++++++++++++
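Group Top Key
A WordCount-style program that keeps the top values for each key
Requirement:
Similar to secondary sort in MapReduce
1) Group by the first field
2) Sort the second field within each group (descending)
3) Take the top N of each group, for example the first three values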
aa 78
bb 98
aa 80
cc 98
aa 69
cc 87
bb 97
cc 86
aa 97
bb 78
bb 34
cc 85
bb 92
cc 72
bb 32
bb 23
Analysis:
(aa,list(78,80,69,97)) -> (aa,list(97,80,78,69)) -> (aa,list(97,80,78))
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/spark/grouptop/input/score.input")
rdd.map(_.split(" ")).collect
Array(aa, 78)
rdd.map(_.split(" ")).map(x => (x(0),x(1))).collect // x(0) is the first element of the array
(aa,78)
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.collect
(aa,CompactBuffer(78, 80, 69, 97))
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1 // _1 is the first field of the tuple (the key)
val yy = x._2
yy
}
).collect
Iterable[String]
Iterable method:
def toList: List[A]
Returns a list containing all elements of this iterable
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1
val yy = x._2
yy.toList
}
).collect
List[String]
List(78, 80, 69, 97)
List method:
def sorted[B >: A](implicit ord: Ordering[B]): List[A]
Sorts the list according to an Ordering
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted
}
).collect
List[String]
List(69, 78, 80, 97)
List method:
def reverse: List[A]
Returns a new list with the elements in reverse order
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted.reverse
}
).collect
List[String]
List(97, 80, 78, 69)
List methods:
def take(n: Int): List[A]
Returns the first n elements
def takeRight(n: Int): List[A]
Returns the last n elements
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1
val yy = x._2
yy.toList.sorted.reverse.take(3)
}
).collect
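The result must be a (key, values) tuple:
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
x => {
val xx = x._1
val yy = x._2
(xx,yy.toList.sorted.reverse.take(3))
}
).collect
// The scores above are still Strings, so sorted compares them as text.
// Converting with toInt makes the sort numeric.
val groupTopKeyRdd = rdd.map(_.split(" ")).map(x => (x(0),x(1).toInt)).groupByKey.map(
x => {
val xx = x._1 // group key
val yy = x._2 // all scores of this group
(xx,yy.toList.sorted.reverse.take(3)) // descending order, keep the top 3
}
)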
groupTopKeyRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/spark/grouptop/output")
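The group, sort, take-top-N steps above can be checked on a local collection before running them on the cluster. The sketch below uses the sample scores from these notes and plain Scala collections instead of an RDD; lines, topN, and groupTop are illustrative names. It parses the score with toInt, so the ordering is numeric rather than textual.

```scala
// Local-collection sketch of "group, sort descending, take top N"; not Spark code.
val lines = Seq(
  "aa 78", "bb 98", "aa 80", "cc 98", "aa 69", "cc 87", "bb 97", "cc 86",
  "aa 97", "bb 78", "bb 34", "cc 85", "bb 92", "cc 72", "bb 32", "bb 23")

val topN = 3

val groupTop: Map[String, List[Int]] = lines
  .map(_.split(" "))
  .map(fields => (fields(0), fields(1).toInt)) // parse the score as a number
  .groupBy(_._1)                               // local counterpart of groupByKey
  .map { case (key, pairs) =>
    // Sort this group's scores in descending order and keep the first topN.
    key -> pairs.map(_._2).toList.sorted(Ordering[Int].reverse).take(topN)
  }

groupTop.toSeq.sortBy(_._1).foreach(println)
```

Passing a reversed Ordering to sorted replaces the separate sorted.reverse step; both give the same result for this data.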
==================================================================
bin/spark-submit \
--master spark://hadoop-senior.ibeifeng.com:7077 \
jars/sparkApp.jar
--------------------------------------------------------------------------------
Spark Core
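>> Four main advantages
>> Build
Compile against the specific Hadoop and Hive versions in use
>> local mode
spark-shell
>> cluster
Install, configure, and deploy Standalone
>> RDD
Five main properties
transformation, action, persistence
WordCount
>> Spark Core examples
How to import the source code into IDEA and browse it
>> Spark scheduling
DAG -> Task -> Run on Executor
>> How to configure the HistoryServer
jetty
Similar to the JobHistoryServer in MapReduce
>> Spark Application programming model
How to develop, test, package, and run a Spark application in IDEA
>> Spark on YARN
deploy-mode
cluster
client
Common interview questions:
1. Understanding RDDs and their five main properties
2. Spark Application Scheduler
3. Spark on YARN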
=========================================================================
Spark Streaming Demo
Read data from a socket in real time and process it as it arrives
# rpm -ivh nc-1.84-22.el6.x86_64.rpm
## Run nc listening on port 9999
$ nc -lk 9999
## Run the demo
bin/run-example streaming.NetworkWordCount hadoop-senior.ibeifeng.com 9999
-------------------------------------
nc -lk 9991
## This command ran successfully
spark-submit --master local[*] --class org.apache.spark.examples.streaming.NetworkWordCount --name wordCount /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/jars/spark-examples-1.2.0-cdh5.3.6-hadoop2.5.0-cdh5.3.6.jar master 9999
hadoop spark hdfs a b hadoop spark hdfs a b hadoop spark hdfs a b hadoop spark hdfs
---------------------------------------
=====================Initializing StreamingContext===================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
// read data
val lines = ssc.socketTextStream("localhost", 9999)
// process
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
=====================HDFS ===================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(5))
// read data
val lines = ssc.textFileStream("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/input/hdfs/")
// process
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
-----------------------------------
How to run a Scala source file inside spark-shell
:load /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/HDFSWordCount.scala
==========================Save the processed data to HDFS================================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(5))
// read data
val lines = ssc.textFileStream("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/input/hdfs/")
// process
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/output/")
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
===============Spark Streaming + Flume Integration=================
Flume has three components
Source ---> Channel ---> Sink(Spark Streaming)
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.flume._
import org.apache.spark.storage.StorageLevel
val ssc = new StreamingContext(sc, Seconds(5))
// read data
val stream = FlumeUtils.createStream(ssc, "hadoop-senior.ibeifeng.com", 9999, StorageLevel.MEMORY_ONLY_SER_2)
stream.count().map(cnt => "Received " + cnt + " flume events." ).print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-flume_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-avro-source-1.5.0-cdh5.3.6.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-ng-sdk-1.5.0-cdh5.3.6.jar
------
run command:
bin/flume-ng agent -c conf -n a2 -f conf/flume-spark-push.sh -Dflume.root.logger=DEBUG,console
====================================================================
Start the Kafka cluster
nohup bin/kafka-server-start.sh config/server.properties &
Create a topic
bin/kafka-topics.sh --create --zookeeper hadoop-senior.ibeifeng.com:2181 --replication-factor 1 --partitions 1 --topic test
List existing topics
bin/kafka-topics.sh --list --zookeeper hadoop-senior.ibeifeng.com:2181
Produce data
bin/kafka-console-producer.sh --broker-list hadoop-senior.ibeifeng.com:9092 --topic test
Consume data
bin/kafka-console-consumer.sh --zookeeper hadoop-senior.ibeifeng.com:2181 --topic test --from-beginning
===============Spark Streaming + Kafka Integration=================
import java.util.HashMap
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
val ssc = new StreamingContext(sc, Seconds(5))
val topicMap = Map("test" -> 1)
// read data
val lines = KafkaUtils.createStream(ssc, "hadoop-senior.ibeifeng.com:2181", "testWordCountGroup", topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/metrics-core-2.2.0.jar
Second approach (direct stream, without a receiver)
===============Spark Streaming + Kafka Integration=================
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map[String, String]("metadata.broker.list" -> "hadoop-senior.ibeifeng.com:9092")
val topicsSet = Set("test")
// read data
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/metrics-core-2.2.0.jar
================UpdateStateByKey===============================
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint(".")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "hadoop-senior.ibeifeng.com:9092")
val topicsSet = Set("test")
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum
val previousCount = state.getOrElse(0)
Some(currentCount + previousCount)
}
// read data
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
stateDstream.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
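To see how the update function above accumulates counts across batches, the following plain-Scala sketch replays it over a few hand-made batches without Spark or Kafka. The batches data and the step helper are assumptions made for illustration; step only imitates the per-key bookkeeping that updateStateByKey performs, including calling the function for keys that received no new values in a batch.

```scala
// Same update logic as in the notes, with an explicit function type.
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] = (values, state) => {
  val currentCount = values.sum          // counts seen in this batch
  val previousCount = state.getOrElse(0) // running total so far, 0 for a new key
  Some(currentCount + previousCount)
}

// Hypothetical input: the words received in three consecutive batches.
val batches = Seq(
  Seq("spark", "hadoop", "spark"),
  Seq("spark", "hdfs"),
  Seq("hadoop"))

// Illustrative helper: apply updateFunc to every key known so far or seen in this batch.
def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
  val newValues: Map[String, Seq[Int]] =
    batch.groupBy(identity).map { case (word, ws) => word -> ws.map(_ => 1) }
  val keys = state.keySet ++ newValues.keySet
  keys.flatMap { key =>
    // A None result would drop the key from the state.
    updateFunc(newValues.getOrElse(key, Seq.empty), state.get(key)).map(total => key -> total)
  }.toMap
}

val finalState = batches.foldLeft(Map.empty[String, Int])(step)
println(finalState)
```

The state map here plays the role of the checkpointed state in the streaming job, which is why the real job needs ssc.checkpoint to be set before updateStateByKey is used.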
=====================
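Scala has an important feature: implicit conversions
For example,
an object of class A -> is converted by a function into another object of class B
This is how a DStream of key-value pairs gains updateStateByKey: the method is defined in PairDStreamFunctions and is reached through an implicit conversion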
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner())
}
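The implicit conversion from an A object to a B object described in these notes can be sketched in a few lines of plain Scala. The classes A and B, the method countWords, and the conversion aToB are hypothetical stand-ins chosen for illustration, not Spark classes; the snippet is written for the REPL or a script, where top-level definitions are allowed.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins for "class A" and "class B" from the notes.
class A(val words: Seq[String])

class B(val words: Seq[String]) {
  // A method that exists only on B.
  def countWords: Map[String, Int] =
    words.groupBy(identity).map { case (word, ws) => word -> ws.size }
}

// The implicit conversion: a function that turns an A into a B.
implicit def aToB(a: A): B = new B(a.words)

val a = new A(Seq("spark", "hadoop", "spark"))

// A has no countWords method, so the compiler inserts aToB(a) before the call.
val counts = a.countWords
println(counts)
```

The call site never mentions B, which mirrors how the streaming code calls updateStateByKey directly on a DStream of pairs.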