spark

wanghenghengheng
Article link: https://blog.csdn.net/wanghenghengheng/article/details/90168788
scala> val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/input/wc.input")

scala> val wordRdd = rdd.flatMap(_.split(" "))

scala> val kvRdd = wordRdd.map((_,1))

scala> val wordCountRdd = kvRdd.reduceByKey(_ + _)

scala> wordCountRdd.collect

scala> wordCountRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/sparkOutput")

hdfs -> rdd  -> wordRdd  -> kvRdd  -> wordCountRdd -> hdfs
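
The flow above can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not Spark itself: `LocalWordCount`, `countWords`, and the sample lines are made up for illustration, and `groupBy` plus a sum stands in for `reduceByKey`.

```scala
object LocalWordCount {
  // Hypothetical helper: mirrors rdd -> wordRdd -> kvRdd -> wordCountRdd
  // using ordinary Scala collections instead of RDDs.
  def countWords(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" ").toSeq) // like wordRdd
    val pairs = words.map((_, 1))                 // like kvRdd
    // groupBy + sum plays the role of reduceByKey(_ + _)
    pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("hadoop spark", "spark hdfs spark"))
    println(counts) // a Map; iteration order is not guaranteed
  }
}
```

As with the RDD version, nothing here sorts the keys; any ordering has to be requested explicitly.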

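RDD lineage: each RDD records how it was derived from its parent RDDs, that is, the chain of transformations that produced it.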

=====================================================
For key-value RDDs, a partitioning function can be specified.

map-01
   					reduce-01
map-02
   					reduce-02
map-03
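
How a partitioning function assigns keys to reduce-side partitions can be sketched as below. This is an illustration modeled on hash partitioning, not Spark's actual code; `HashPartitionSketch`, `partitionFor`, `assign`, and the sample keys are hypothetical.

```scala
object HashPartitionSketch {
  // Hypothetical partition function: non-negative hashCode modulo the partition count.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // hashCode can be negative
  }

  // Group key-value pairs by the partition their key would land in.
  def assign[K](pairs: Seq[(K, Int)], numPartitions: Int): Map[Int, Seq[(K, Int)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("spark", 1), ("hadoop", 1), ("spark", 1), ("hdfs", 1))
    assign(pairs, 2).foreach { case (p, kvs) => println(s"reduce-$p: $kvs") }
  }
}
```

The property that matters is that equal keys always map to the same partition, so every value for a key reaches the same reduce task.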



When computing on an RDD split, try to run the computation on the machine where that split's data is stored.


[Move the computation, not the data]


=====================================================
val keyRdd = kvRdd.groupByKey(2)

wordRdd 				kvRdd 						keyRdd
-------      map
spark 					(spark,1)					(spark,list(1,1,1,1,1))
hadoop   				(hadoop,1)					(rdd,list(1,1))	
hdfs					(hdfs,1)					
spark 					(spark,1)
--------    map
spark 					(spark,1)
yarn					(yarn,1)					--------------------
rdd 					(rdd,1)						(hadoop,list(1))
spark 					(spark,1)					(hdfs,list(1))
---------   map 									(yarn,list(1))
spark 					(spark,1)
rdd 					(rdd,1)
----------------------------------------------------------------------------------------------
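
The diagram can be reproduced on local collections. This is a rough sketch that assumes the three map-side splits shown above; `GroupByKeySketch` is a made-up name, and plain `groupBy` stands in for the shuffle that `groupByKey` performs.

```scala
object GroupByKeySketch {
  // The three map-side splits of wordRdd from the diagram.
  val splits: Seq[Seq[String]] = Seq(
    Seq("spark", "hadoop", "hdfs", "spark"),
    Seq("spark", "yarn", "rdd", "spark"),
    Seq("spark", "rdd")
  )

  // Map side: word -> (word, 1), applied split by split (kvRdd).
  val kv: Seq[(String, Int)] = splits.flatMap(_.map(w => (w, 1)))

  // Shuffle + group: all values of one key end up together (keyRdd).
  val grouped: Map[String, List[Int]] =
    kv.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).toList) }

  def main(args: Array[String]): Unit =
    grouped.toSeq.sortBy(_._1).foreach(println)
}
```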

Spark Standalone deployment
	Master
	Worker

## WordCount
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/input/wc.input")
val wordcount = rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
wordcount.collect
wordcount.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/mapreduce/wordcount/sparkOutput99")

======result
(hive,2)
(mapreduce,1)
(mapreduce2,1)
(hadoop,5)
(hdfs,1)
Observation:
	When Spark runs the WordCount program, it does not sort by key the way a MapReduce program does.
## Key Sort
wordcount.sortByKey().collect    // ascending by default
wordcount.sortByKey(true).collect
wordcount.sortByKey(false).collect
	
===============================================================	
------result
(hive,2)
(mapreduce,1)
(mapreduce2,1)
(hadoop,5)
(hdfs,1)	
Requirement:
	Sort by value in descending order
## Value Sort
wordcount.map(x => (x._2,x._1)).sortByKey(false).collect
wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).collect
	
## Top N
wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).take(3)
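
The swap, sort, swap-back steps are needed because `sortByKey` only looks at the key. On a local collection the same ordering can be written directly. The sketch below uses the result shown above as input; `ValueSortSketch`, `byValueDesc`, and `topN` are hypothetical names, and ties are broken by word only to make the order deterministic.

```scala
object ValueSortSketch {
  val wordcount: Seq[(String, Int)] =
    Seq(("hive", 2), ("mapreduce", 1), ("mapreduce2", 1), ("hadoop", 5), ("hdfs", 1))

  // Descending by count; ties broken by word so the order is deterministic.
  def byValueDesc(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy { case (word, count) => (-count, word) }

  // Top N = sort, then keep the first n entries.
  def topN(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    byValueDesc(counts).take(n)

  def main(args: Array[String]): Unit =
    println(topN(wordcount, 3)) // List((hadoop,5), (hive,2), (hdfs,1))
}
```

On an RDD, `wordcount.sortBy(_._2, false)` should give a comparable ordering in Spark versions that provide `RDD.sortBy`.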

+++++++++++++++++++++++++++++++++++++++++++
Group Top Key
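	WordCount-style program: top values per key

Requirement:
	Similar to secondary sort in MapReduce
	1) Group by the first field
	2) Sort the second field within each group (descending)
	3) Take the top N of each group, for example the top three values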
aa 78
bb 98
aa 80
cc 98				
aa 69
cc 87
bb 97
cc 86
aa 97
bb 78
bb 34
cc 85
bb 92
cc 72
bb 32
bb 23	
	
Functional analysis:
	(aa,list(78,80,69,97)) ->  (aa,list(97,80,78,69))  -> (aa,list(97,80,78))
	
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/spark/grouptop/input/score.input")
	
rdd.map(_.split(" ")).collect	
	Array(aa, 78)
	
rdd.map(_.split(" ")).map(x => (x(0),x(1))).collect  // x(0) is the first element of the array
	(aa,78)
	
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.collect
	(aa,CompactBuffer(78, 80, 69, 97))

rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1   // _1 is the first element of the tuple (the key)
       val yy = x._2
       yy
    }
).collect


	Iterable[String]
	
Iterable method:
	def toList: List[A]
	Returns a list containing all elements of this iterable
	
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1
       val yy = x._2
       yy.toList   
    }
).collect
	List[String]
	List(78, 80, 69, 97)
	
List method:
	def sorted[B >: A](implicit ord: Ordering[B]): List[A]
	Sorts the list according to an Ordering
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1
       val yy = x._2
       yy.toList.sorted
    }
).collect	
	List[String]
	List(69, 78, 80, 97)
	
List method:
	def reverse: List[A]
	Returns a new list with the elements in reverse order
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1
       val yy = x._2
       yy.toList.sorted.reverse
    }
).collect
	List[String]
	List(97, 80, 78, 69)

List methods:
	def take(n: Int): List[A]
	Returns the first n elements
	
	def takeRight(n: Int): List[A]
	Returns the last n elements
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1
       val yy = x._2
       yy.toList.sorted.reverse.take(3)
    }
).collect

The result should be a (key, values) tuple:
rdd.map(_.split(" ")).map(x => (x(0),x(1))).groupByKey.map(
    x => {
       val xx = x._1
       val yy = x._2
       (xx,yy.toList.sorted.reverse.take(3))
    }
).collect

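// toInt so that scores are sorted numerically rather than lexicographically
val groupTopKeyRdd = rdd.map(_.split(" ")).map(x => (x(0),x(1).toInt)).groupByKey.map(
    x => {
       val xx = x._1   // key
       val yy = x._2   // all scores for this key
       (xx,yy.toList.sorted.reverse.take(3))
    }
)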
groupTopKeyRdd.saveAsTextFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/spark/grouptop/output")
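
The same grouping, descending sort, and `take(3)` can be checked on local collections. This is a minimal sketch using the sample data above; `GroupTopN` and `topPerGroup` are illustrative names, not part of Spark. Scores are parsed with `toInt` so the sort is numeric. Sorting the raw strings would be lexicographic, which matches numeric order here only because every score has two digits.

```scala
object GroupTopN {
  // Sample lines in the same "key score" format as score.input.
  val lines: Seq[String] = Seq(
    "aa 78", "bb 98", "aa 80", "cc 98", "aa 69", "cc 87", "bb 97", "cc 86",
    "aa 97", "bb 78", "bb 34", "cc 85", "bb 92", "cc 72", "bb 32", "bb 23"
  )

  def topPerGroup(lines: Seq[String], n: Int): Map[String, List[Int]] =
    lines
      .map(_.split(" "))
      .map(x => (x(0), x(1).toInt))        // parse the score as a number
      .groupBy(_._1)                       // 1) group by the first field
      .map { case (key, kvs) =>
        val top = kvs.map(_._2).toList
          .sorted(Ordering[Int].reverse)   // 2) sort descending
          .take(n)                         // 3) keep the top n
        (key, top)
      }

  def main(args: Array[String]): Unit =
    topPerGroup(lines, 3).toSeq.sortBy(_._1).foreach(println)
}
```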


==================================================================

bin/spark-submit \
  --master spark://hadoop-senior.ibeifeng.com:7077 \
  jars/sparkApp.jar
--------------------------------------------------------------------------------

Spark Core
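	>> Four major advantages
	>> Building
		Build against the specific Hadoop and Hive versions in use
	>> local mode
		spark-shell
	>> cluster
		Install, configure, and deploy Standalone
	>> RDD
		Five key properties
		transformation, action, persistence
		WordCount
	>> Spark Core examples
		How to import the source code into IDEA and browse it
	>> Spark scheduling
		DAG -> Task -> Run on Executor
	>> How to configure the HistoryServer
		jetty
		Similar to the JobHistoryServer in MapReduce
	>> Spark Application programming model
		How to develop, test, package, and run a Spark application in IDEA
	>> Spark on YARN
		deploy-mode
			cluster
			client
Common interview questions:
	1. Understanding RDDs and their five key properties
	2. Spark Application Scheduler
	3. Spark on YARN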

=========================================================================

Spark Streaming Demo

Read data from a socket in real time and process it as it arrives
# rpm -ivh nc-1.84-22.el6.x86_64.rpm

## Run nc listening on port 9999
$ nc -lk 9999

## Run the demo
bin/run-example streaming.NetworkWordCount hadoop-senior.ibeifeng.com 9999

-------------------------------------
nc -lk 9999
The following command succeeded:
spark-submit --master local[*] --class org.apache.spark.examples.streaming.NetworkWordCount --name wordCount /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/jars/spark-examples-1.2.0-cdh5.3.6-hadoop2.5.0-cdh5.3.6.jar master 9999
hadoop spark hdfs a b hadoop spark hdfs a b hadoop spark hdfs a b hadoop spark hdfs

---------------------------------------
=====================Initializing StreamingContext===================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))

// read data
val lines = ssc.socketTextStream("localhost", 9999)

// process
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
 
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate



=====================HDFS ===================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 

val ssc = new StreamingContext(sc, Seconds(5))

// read data
val lines = ssc.textFileStream("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/input/hdfs/")

// process
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
 
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

-----------------------------------
How to run a Scala file in spark-shell
:load /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/HDFSWordCount.scala

==========================Save the processed data to HDFS================================
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 

val ssc = new StreamingContext(sc, Seconds(5))

// read data
val lines = ssc.textFileStream("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/input/hdfs/")

// process
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/streaming/output/")
 
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate


===============Spark Streaming + Flume Integration=================
Flume has three components
	Source  --->   Channel  --->   Sink(Spark Streaming)

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 
import org.apache.spark.streaming.flume._
import org.apache.spark.storage.StorageLevel

val ssc = new StreamingContext(sc, Seconds(5))

// read data
val stream = FlumeUtils.createStream(ssc, "hadoop-senior.ibeifeng.com", 9999, StorageLevel.MEMORY_ONLY_SER_2)

stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-flume_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-avro-source-1.5.0-cdh5.3.6.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-ng-sdk-1.5.0-cdh5.3.6.jar

------
run command:
	bin/flume-ng agent -c conf -n a2 -f conf/flume-spark-push.sh -Dflume.root.logger=DEBUG,console


====================================================================

Start the Kafka cluster
	nohup bin/kafka-server-start.sh config/server.properties & 

Create a topic
	bin/kafka-topics.sh --create --zookeeper hadoop-senior.ibeifeng.com:2181 --replication-factor 1 --partitions 1 --topic test

List existing topics
	bin/kafka-topics.sh --list --zookeeper hadoop-senior.ibeifeng.com:2181

Produce data
	bin/kafka-console-producer.sh --broker-list hadoop-senior.ibeifeng.com:9092 --topic test

Consume data
	bin/kafka-console-consumer.sh --zookeeper hadoop-senior.ibeifeng.com:2181 --topic test --from-beginning



===============Spark Streaming + Kafka Integration=================
import java.util.HashMap
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 
import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

val topicMap = Map("test" -> 1)

// read data
val lines = KafkaUtils.createStream(ssc, "hadoop-senior.ibeifeng.com:2181", "testWordCountGroup", topicMap).map(_._2)

val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/metrics-core-2.2.0.jar

Second approach (direct stream)
===============Spark Streaming + Kafka Integration=================
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 
import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

val kafkaParams = Map[String, String]("metadata.broker.list" -> "hadoop-senior.ibeifeng.com:9092")
val topicsSet = Set("test")

// read data
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

------------------
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/metrics-core-2.2.0.jar

================UpdateStateByKey===============================
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ 
import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint(".")

val kafkaParams = Map[String, String]("metadata.broker.list" -> "hadoop-senior.ibeifeng.com:9092")
val topicsSet = Set("test")

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
	val currentCount = values.sum
	val previousCount = state.getOrElse(0)
	Some(currentCount + previousCount)
}

// read data
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))

val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

stateDstream.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
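
What `updateStateByKey` does with `updateFunc` across batches can be imitated without a streaming context. The sketch below is a simplified model, not Spark's implementation: `StateSketch`, `step`, and the two sample batches are invented, and the state is an ordinary `Map` carried from one batch to the next.

```scala
object StateSketch {
  // Same shape as updateFunc above: new values for a key plus its previous state.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Hypothetical single-batch step: apply updateFunc to every key seen so far.
  def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    // word -> Seq(1, 1, ...) for the words in this batch
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      // A None result would drop the key from the state.
      updateFunc(newValues.getOrElse(k, Seq.empty[Int]), state.get(k)).map(v => (k, v))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("hadoop", "spark", "spark"), Seq("spark", "hdfs"))
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    println(finalState) // counts accumulate across batches
  }
}
```

Unlike `reduceByKey` on a DStream, which counts each batch on its own, the state here keeps growing, which is why the real job needs `ssc.checkpoint(...)`.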


=====================
Scala has an important feature: implicit conversions
	For example
		an object of class A  -> is converted by a function into another object, of class B
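
A small stand-alone illustration of an implicit conversion follows. The names `ImplicitSketch`, `PairOps`, `toPairOps`, and `sumByKey` are made up for this sketch. Spark uses the same mechanism: pair operations such as `reduceByKey` and `updateStateByKey` are, as far as I know, not defined on `RDD` or `DStream` directly but become available through implicit conversions to `PairRDDFunctions` and `PairDStreamFunctions`.

```scala
import scala.language.implicitConversions

object ImplicitSketch {
  // Class B: wraps a sequence of pairs and adds a key-based operation.
  class PairOps(pairs: Seq[(String, Int)]) {
    def sumByKey: Map[String, Int] =
      pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
  }

  // The implicit function that converts A (Seq[(String, Int)]) into B (PairOps).
  implicit def toPairOps(pairs: Seq[(String, Int)]): PairOps = new PairOps(pairs)

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("spark", 1), ("hdfs", 1), ("spark", 1))
    // Seq has no sumByKey; the compiler inserts toPairOps(pairs) automatically.
    println(pairs.sumByKey)
  }
}
```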

/**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S]
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner())
  }