Spark Operators and Their Applications

Spark supports two types of operators: Transformation and Action.

Transformation

A Transformation takes an existing RDD and produces a new RDD from it. Transformations are lazy (deferred execution): the code of a Transformation operator is not executed right away. Only when the program encounters an Action operator is the accumulated code actually run. This design lets Spark execute more efficiently.
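
A minimal sketch of the lazy behavior, assuming a running spark-shell (so sc is already defined): the map below does not launch any job, only the count action does.

val squares = sc.makeRDD(List(1, 2, 3, 4, 5)).map(x => x * x) // lazy: nothing is computed yet
squares.count // count is an Action, so only now does Spark run the map and return 5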

Common Transformations:

| Transformation | Meaning |
| --- | --- |
| map(func) √ | Returns a new RDD formed by passing each element of the source RDD through the function func. |
| filter(func) √ | Returns a new RDD consisting of those input elements for which func returns true. |
| flatMap(func) √ | Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element). |
| mapPartitions(func) √ | Similar to map, but runs independently on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]. |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also receives an integer parameter giving the partition index; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]. |
| sample(withReplacement, fraction, seed) √ | Samples the data at the ratio given by fraction, with or without replacement; seed specifies the seed of the random number generator. |
| union(otherDataset) | Returns a new RDD that is the union of the source RDD and the argument RDD. |
| intersection(otherDataset) | Returns a new RDD that is the intersection of the source RDD and the argument RDD. |
| distinct([numTasks]) | Returns a new RDD with the duplicate elements of the source RDD removed. |
| groupByKey([numTasks]) √ | When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs. |
| reduceByKey(func, [numTasks]) √ | When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values of each key are aggregated with the given reduce function; like groupByKey, the number of reduce tasks can be set with the optional second parameter. |
| aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | Aggregates within each partition first and then across partitions, combining with the initial value at each step, e.g. aggregateByKey(0)(_+_, _+_); operates on (K, V) RDDs. |
| sortByKey([ascending], [numTasks]) | When called on an RDD of (K, V) pairs where K implements Ordered, returns an RDD of (K, V) pairs sorted by key. |
| sortBy(func, [ascending], [numTasks]) √ | Similar to sortByKey but more flexible: the first parameter selects what to sort by, the second the order (false = descending), the third the number of partitions after sorting (defaults to the same as the source RDD). |
| join(otherDataset, [numTasks]) √ | When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key; equivalent to an inner join (intersection of keys). |
| cogroup(otherDataset, [numTasks]) | When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (Iterable[V], Iterable[W])) tuples. |
| cartesian(otherDataset) | Cartesian product of the two RDDs, producing many (K, V) pairs. |
| pipe(command, [envVars]) | Pipes the RDD through an external program. |
| coalesce(numPartitions, [shuffle]) | Repartitions the RDD: the first parameter is the target number of partitions, the second whether to shuffle (default false); increasing the number of partitions requires shuffle = true, decreasing works with false. |
| repartition(numPartitions) √ | Repartitions the RDD and always shuffles; the parameter is the target number of partitions, which may be larger than before. |
| repartitionAndSortWithinPartitions(partitioner) | Repartitions and sorts within each partition; more efficient than repartitioning and then sorting. Operates on (K, V) RDDs. |
| foldByKey(zeroValue)(seqOp) | Folds/merges the values of a (K, V) RDD, similar to aggregate: the value in the first parameter list is applied to each V, and the function in the second parameter list does the aggregation, e.g. _+_. |
| combineByKey | Combines the values of the same key, e.g. rdd1.combineByKey(x => x, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n); see the sketch after this table. |
| partitionBy(partitioner) | Partitions the RDD with the given partitioner, e.g. new HashPartitioner(2). |
| cache √ / persist | Caches the RDD, avoiding recomputation and saving time. Difference: cache internally calls persist and uses the single storage level MEMORY_ONLY, whereas persist lets you choose the storage level; see the sketch after this table. |
| subtract | Returns an RDD containing the elements of the source RDD that do not appear in the other RDD. |
| leftOuterJoin | Like a SQL left outer join: the result is driven by the left (source) RDD, and unmatched records are empty (None). It only joins two RDDs; to join more, chain several joins. |
| rightOuterJoin √ | Like a SQL right outer join: the result is driven by the argument RDD, and unmatched records are empty (None). It only joins two RDDs; to join more, chain several joins. |
| subtractByKey | Like subtract but keyed: returns the elements whose keys appear in the source RDD but not in the other RDD. |
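
A few of the operators above (combineByKey, leftOuterJoin, cache/persist) do not appear in the detailed examples that follow, so here is a minimal sketch of how they might be used, assuming a spark-shell session:

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

// combineByKey: createCombiner, mergeValue (within a partition), mergeCombiners (across partitions)
pairs.combineByKey(v => v, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n).collect
// => Array((a,4), (b,2))  (order may vary)

// leftOuterJoin: keys with no match on the right side get None
val other = sc.parallelize(List(("a", "x")))
pairs.leftOuterJoin(other).collect
// => Array((a,(1,Some(x))), (a,(3,Some(x))), (b,(2,None)))  (order may vary)

// cache / persist: keep the RDD in memory to avoid recomputation
import org.apache.spark.storage.StorageLevel
pairs.cache()                                // same as persist(StorageLevel.MEMORY_ONLY)
// persist lets you choose the storage level instead, e.g.
// pairs.persist(StorageLevel.MEMORY_AND_DISK)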
map(func) √

Return a new distributed dataset formed by passing each element of the source through a function func. (RDD[T] -> RDD[U])

scala> sc.makeRDD(List(1,2,3,4,5)).map(item=>item*item).collect
res0: Array[Int] = Array(1, 4, 9, 16, 25)
filter(func) √

Return a new dataset formed by selecting those elements of the source on which func returns true.

scala> sc.makeRDD(List(1,2,3,4,5)).filter(item=>item%2==0).collect
res2: Array[Int] = Array(2, 4)
flatMap(func) √

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

scala> sc.makeRDD(List("hello world","hello spark")).flatMap(line=>line.split(" ")).collect
res3: Array[String] = Array(hello, world, hello, spark)
mapPartitions(func) √

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T.

scala> sc.makeRDD(List("a","b","c","d","e"),3).mapPartitions(vs=> vs.map(v=>(v,1))).collect
res5: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (d,1), (e,1))

scala> sc.makeRDD(List("a","b","c","d","e"),3).map(v=>(v,1)).collect
res6: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (d,1), (e,1))
mapPartitionsWithIndex(func) √

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T.

scala> sc.makeRDD(List("a","b","c","d","e"),4).mapPartitionsWithIndex((index,vs)=> vs.map(t=>(t,index))).collect
res9: Array[(String, Int)] = Array((a,0), (b,1), (c,2), (d,3), (e,3))
sample(withReplacement, fraction, seed)

Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

scala> sc.makeRDD(List("a","b","c","d","e"),4).sample(false,0.8,5L).collect
res15: Array[String] = Array(a, b, c, d)

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

scala> var rdd1=sc.makeRDD(List("a","b","c","d","e"),4)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at makeRDD at <console>:24

scala> var rdd2=sc.makeRDD(List("a","b","c","d","e"),4)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:24

scala> rdd1.union(rdd2).collect
res17: Array[String] = Array(a, b, c, d, e, a, b, c, d, e)

intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

scala> var rdd1=sc.makeRDD(List("a","b","c","d","e"),4)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at makeRDD at <console>:24

scala> var rdd2=sc.makeRDD(List("a","g","e"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[37] at makeRDD at <console>:24

scala> rdd1.intersection(rdd2).collect()
res18: Array[String] = Array(e, a)
distinct([numPartitions]) √

Return a new dataset that contains the distinct elements of the source dataset.

scala> var rdd1=sc.makeRDD(List("a","b","c","d","e","a"),4)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at makeRDD at <console>:24

scala> rdd1.distinct.collect
res20: Array[String] = Array(d, e, a, b, c)

cartesian(otherDataset) - for reference

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

scala> var rdd1=sc.makeRDD(List("a","b","c"),4)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[54] at makeRDD at <console>:24

scala> var rdd2=sc.makeRDD(List(1,2,3))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at makeRDD at <console>:24

scala> rdd1.cartesian(rdd2).collect
res22: Array[(String, Int)] = Array((a,1), (a,2), (a,3), (b,1), (b,2), (b,3), (c,1), (c,2), (c,3))
coalesce(numPartitions) √

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

scala> var rdd1=sc.makeRDD(List("a","b","c"),4).getNumPartitions
rdd1: Int = 4

scala> var rdd1=sc.makeRDD(List("a","b","c"),4).coalesce(2).getNumPartitions
rdd1: Int = 2

With the default shuffle = false, coalesce can only shrink the number of partitions, not increase it.

repartition(numPartitions) √

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

scala> var rdd1=sc.makeRDD(List("a","b","c","e","f"),4).mapPartitionsWithIndex((i,vs)=> vs.map(v=>(v,i))).collect
rdd1: Array[(String, Int)] = Array((a,0), (b,1), (c,2), (e,3), (f,3))

scala> var rdd1=sc.makeRDD(List("a","b","c","e","f"),4).repartition(5).mapPartitionsWithIndex((i,vs)=> vs.map(v=>(v,i))).collect
rdd1: Array[(String, Int)] = Array((a,1), (c,2), (e,2), (b,3), (f,3))

repartition can both increase and decrease the number of partitions.

groupByKey([numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.

scala> var linesRDD=sc.textFile("hdfs:///demo/words")
linesRDD: org.apache.spark.rdd.RDD[String] = hdfs:///demo/words MapPartitionsRDD[2] at textFile at <console>:24
scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).groupByKey(3)
res3: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[9] at groupByKey at <console>:26

scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).groupByKey(3).map(t=>(t._1,t._2.sum)).collect
res6: Array[(String, Int)] = Array((day,2), (come,1), (baby,1), (up,1), (is,1), (a,1), (demo,1), (this,1), (on,1), (good,2), (study,1))
reduceByKey(func, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

scala> var linesRDD=sc.textFile("hdfs:///demo/words")
linesRDD: org.apache.spark.rdd.RDD[String] = hdfs:///demo/words MapPartitionsRDD[2] at textFile at <console>:24
scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res7: Array[(String, Int)] = Array((up,1), (this,1), (is,1), (a,1), (on,1), (day,2), (demo,1), (come,1), (good,2), (study,1), (baby,1))
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
scala> var linesRDD=sc.textFile("hdfs:///demo/words")
linesRDD: org.apache.spark.rdd.RDD[String] = hdfs:///demo/words MapPartitionsRDD[2] at textFile at <console>:24
scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)((z,v)=>z+v,(c1,c2)=>c1+c2).collect
res9: Array[(String, Int)] = Array((up,1), (this,1), (is,1), (a,1), (on,1), (day,2), (demo,1), (come,1), (good,2), (study,1), (baby,1))

sortByKey

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

scala> var linesRDD=sc.textFile("hdfs:///demo/words")
linesRDD: org.apache.spark.rdd.RDD[String] = hdfs:///demo/words MapPartitionsRDD[2] at textFile at <console>:24
scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)((z,v)=>z+v,(c1,c2)=>c1+c2).sortByKey(true,4).collect
res10: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1), (good,2), (is,1), (on,1), (study,1), (this,1), (up,1))

By comparison, sortBy is more flexible than sortByKey, so sortBy is used more often in practice.

scala> var linesRDD=sc.textFile("hdfs:///demo/words")
linesRDD: org.apache.spark.rdd.RDD[String] = hdfs:///demo/words MapPartitionsRDD[2] at textFile at <console>:24

scala> linesRDD.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)((z,v)=>z+v,(c1,c2)=>c1+c2).sortBy(_._2,false,4).collect
res12: Array[(String, Int)] = Array((day,2), (good,2), (up,1), (this,1), (is,1), (a,1), (on,1), (demo,1), (come,1), (study,1), (baby,1))
join(otherDataset, [numPartitions])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

scala> var rdd1=sc.parallelize(Array(("001","张三"),("002","李四"),("003","王五")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[47] at parallelize at <console>:24

scala>  var rdd2=sc.parallelize(Array(("001",("apple",18.0)),("001",("orange",18.0))))
rdd2: org.apache.spark.rdd.RDD[(String, (String, Double))] = ParallelCollectionRDD[48] at parallelize at <console>:24

scala> rdd1.join(rdd2).collect
res13: Array[(String, (String, (String, Double)))] = Array((001,(张三,(apple,18.0))), (001,(张三,(orange,18.0))))

scala> rdd1.join(rdd2).map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2)).collect
res15: Array[(String, String, String, Double)] = Array((001,张三,apple,18.0), (001,张三,orange,18.0))

cogroup(otherDataset, [numPartitions]) - for reference

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable[V], Iterable[W])) tuples. This operation is also called groupWith.

scala> var rdd1=sc.parallelize(Array(("001","张三"),("002","李四"),("003","王五")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[47] at parallelize at <console>:24

scala>  var rdd2=sc.parallelize(Array(("001",("apple",18.0)),("001",("orange",18.0))))
rdd2: org.apache.spark.rdd.RDD[(String, (String, Double))] = ParallelCollectionRDD[48] at parallelize at <console>:24

scala> rdd1.cogroup(rdd2).collect
res33: Array[(String, (Iterable[String], Iterable[(String, Double)]))] = Array((003,(CompactBuffer(王五),CompactBuffer())), (002,(CompactBuffer(李四),CompactBuffer())), (001,(CompactBuffer(张三),CompactBuffer((apple,18.0), (orange,18.0)))))

scala> rdd1.groupWith(rdd2).collect
res34: Array[(String, (Iterable[String], Iterable[(String, Double)]))] = Array((003,(CompactBuffer(王五),CompactBuffer())), (002,(CompactBuffer(李四),CompactBuffer())), (001,(CompactBuffer(张三),CompactBuffer((apple,18.0), (orange,18.0)))))

The difference from join is that cogroup does not actually perform the join; it simply groups the values from both RDDs by key.

Action

Actions trigger the actual execution of the code; a piece of Spark code must contain at least one Action.

Common Actions:

| Action | Meaning |
| --- | --- |
| reduce(func) | Aggregates all elements of the RDD with func; the function must be commutative and associative so that it can be computed correctly in parallel. |
| collect() | Returns all elements of the dataset to the driver program as an array. |
| count() | Returns the number of elements in the RDD. |
| first() | Returns the first element of the RDD (similar to take(1)). |
| take(n) | Returns an array of the first n elements of the dataset. |
| takeSample(withReplacement, num, [seed]) | Returns an array of num elements randomly sampled from the dataset, with or without replacement; seed specifies the seed of the random number generator. |
| takeOrdered(n, [ordering]) | Returns the first n elements of the RDD using either their natural order or a custom comparator. |
| saveAsTextFile(path) | Saves the elements of the dataset as a text file to HDFS or another supported file system; for each element Spark calls toString to convert it into a line of text. |
| saveAsSequenceFile(path) | Saves the elements of the dataset as a Hadoop SequenceFile under the given directory on HDFS or another Hadoop-supported file system. |
| saveAsObjectFile(path) | Writes the elements of the dataset in a simple format using Java serialization, which can later be loaded with SparkContext.objectFile(); see the sketch after this table. |
| countByKey() | For an RDD of (K, V) pairs, returns a (K, Int) map giving the number of elements for each key. |
| foreach(func) | Runs the function func on each element of the dataset (usually for side effects). |
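
saveAsObjectFile is the only action in this table without a dedicated example later in the article; a minimal sketch (the HDFS path is just an example):

val rdd = sc.parallelize(List(("a", 1), ("b", 2)))
rdd.saveAsObjectFile("hdfs:///demo/objects")                    // written using Java serialization
sc.objectFile[(String, Int)]("hdfs:///demo/objects").collect    // read it back as an RDD
// => Array((a,1), (b,2))  (order may vary)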
reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.reduce((v1,v2)=>v1+v2)
res0: Int = 21
collect() - results are returned to the Driver

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6)
count()

Return the number of elements in the dataset.

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> rdd.count
res2: Long = 6

first()

Return the first element of the dataset (similar to take(1)).

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.first
res3: Int = 1

scala> rdd.take(1)
res4: Array[Int] = Array(1)

scala> rdd.take(2)
res5: Array[Int] = Array(1, 2)
take(n)

Return an array with the first n elements of the dataset.

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.take(1)
res4: Array[Int] = Array(1)

scala> rdd.take(2)
res5: Array[Int] = Array(1, 2)

takeSample(withReplacement, num, [seed])

Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

scala> var rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> rdd.takeSample(true,3)
res6: Array[Int] = Array(6, 4, 2)
takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.

scala> var rdd=sc.parallelize(List(1,4,2,5,3,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd.takeOrdered(4)
res7: Array[Int] = Array(1, 2, 3, 4)

scala> rdd.takeOrdered(4)(new Ordering[Int]{ // descending order
          override def compare(x: Int, y: Int): Int = {
           -1*(x-y)
          }
   	 })
saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

scala> sc.textFile("hdfs:///demo/words").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).sortByKey(true,3).saveAsTextFile("hdfs:///demo/results001")
saveAsSequenceFile(path)

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
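
The read example below consumes hdfs://CentOS:9000/demo/results002; one plausible way such a directory could have been produced, assuming the same word-count pipeline used earlier, is:

scala> sc.textFile("hdfs:///demo/words").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).saveAsSequenceFile("hdfs:///demo/results002")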

scala> import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.{IntWritable, Text}

scala> sc.sequenceFile[Text,IntWritable]("hdfs://CentOS:9000/demo/results002/",classOf[Text],classOf[IntWritable]).map(t=>(t._1.toString,t._2.get)).collect
res1: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1), (good,2), (is,1), (on,1), (study,1), (this,1), (up,1))

countByKey()

Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

scala> sc.parallelize(List(("a",2),("b",1),("a",3))).countByKey
res0: scala.collection.Map[String,Long] = Map(a -> 2, b -> 1)
foreach(func) - executed remotely

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

[root@CentOS spark-2.4.3]# ./bin/spark-shell --master local[6]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20190815184546-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.sequenceFile[Text,IntWritable]("hdfs://CentOS:9000/demo/results002/",classOf[Text],classOf[IntWritable]).map(t=>(t._1.toString,t._2.get)).foreach(println) // local test, so the output appears here
(a,1)
(baby,1)
(come,1)
(day,2)
(demo,1)
(good,2)
(is,1)
(on,1)
(study,1)
(this,1)
(up,1)

Note that foreach is executed in parallel on the executors, so its output appears on the remote executor machines, not on the driver. When the job runs on a remote cluster:

[root@CentOS spark-2.4.3]# ./bin/spark-shell --master spark://CentOS:7077 --total-executor-cores 6
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20190815184546-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala>  sc.sequenceFile[Text,IntWritable]("hdfs://CentOS:9000/demo/results002/",classOf[Text],classOf[IntWritable]).map(t=>(t._1.toString,t._2.get)).foreach(println)
                                 
scala>

Writing to external systems with foreach
  • The connection must not be defined on the Driver: any variable defined on the Driver and referenced by an operator is serialized and shipped to the downstream Tasks when the job runs. Because a connection is special (it cannot be copied or serialized), it cannot be defined on the Driver.


Since it cannot be defined on the Driver, we can define the variable inside the operator instead.

  • Improvement 2 (creating a connection for every record degrades performance - not recommended)

  • Improvement 3 (create one connection, or use a connection pool, per partition)


If a single JVM hosts several partitions, the system may still create redundant connections. It is recommended to make the connection static.

  • Scheme 4 (static - created when the class is loaded)

This guarantees that each Executor process instantiates only one connection object, as in the sketch below.
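
A minimal sketch of schemes 3 and 4 combined, written as application code rather than spark-shell input. DummyConnection, ConnectionHolder and ForeachDemo are placeholder names standing in for whatever client library and application structure you actually use:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder connection type: stands in for a real client (JDBC, Redis, Kafka producer, ...)
class DummyConnection {
  def send(record: String): Unit = println(s"sending $record")
  def close(): Unit = ()
}

// Scheme 4: a Scala object is initialized at most once per JVM (i.e. once per Executor),
// so every partition running in that Executor shares the same connection instance.
object ConnectionHolder {
  lazy val conn = new DummyConnection
}

object ForeachDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("foreach-demo"))
    val rdd = sc.parallelize(List("a", "b", "c"), 3)

    // Scheme 3: do the setup work once per partition, not once per record
    rdd.foreachPartition { iter =>
      val conn = ConnectionHolder.conn   // resolved on the Executor, never serialized from the Driver
      iter.foreach(record => conn.send(record))
      // if the connection were created per partition instead, close it or return it to the pool here
    }
    sc.stop()
  }
}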
