Transformation:
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
val list = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4)))
val result = list.map(x => (x._1, x._2 + 1))
for (each <- result) {
  print(each)
}
console:(a,2)(a,3)(b,4)(b,5)
val list = sc.parallelize(List(1, 2, 3, 4, 5))
val result = list.map(_ + 1)
for (each <- result) {
  print(each)
}
console:23456
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
val list = sc.parallelize(List(1, 2, 3, 4, 5, 6))
val result = list.filter(_ % 2 == 0)
for (each <- result) {
  print(each)
}
console:246
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
e.g.:
val list = sc.parallelize(List("abc", "def"))
val result = list.flatMap(_.toList)
for (each <- result) {
  print(each)
}
console:abcdef
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
val list = sc.parallelize(List(1, 2, 3))
val list1 = sc.parallelize(List(4, 5, 6))
val result = list.union(list1)
for (each <- result) {
  print(each)
}
console:123456
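One point worth noting: Spark's union does not remove duplicates (unlike SQL's UNION); the elements of the two datasets are simply concatenated. A minimal sketch of the semantics using plain Scala collections (illustrative, not Spark API code):

```scala
// union keeps duplicates: the two collections are simply concatenated.
val a = List(1, 2, 3)
val b = List(3, 4, 5)

val unioned = a ++ b            // 3 appears twice: List(1, 2, 3, 3, 4, 5)
val deduped = (a ++ b).distinct // deduplicate explicitly if set semantics are wanted
```

The same applies in Spark: call distinct() on the result of union if duplicates should be dropped.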
join(otherDataset,[numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
val list2 = sc.parallelize(List(('a', 5), ('a', 6), ('b', 7), ('b', 8)))
for (each <- list1.join(list2)) {
  print(each + " ")
}
console:(a,(1,5)) (a,(1,6)) (a,(2,5)) (a,(2,6)) (b,(3,7)) (b,(3,8)) (b,(4,7)) (b,(4,8))
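The outer-join variants mentioned above differ in how unmatched keys are handled. A sketch of leftOuterJoin semantics over the same data, written with plain Scala collections (the helper names are illustrative, not Spark API code):

```scala
// leftOuterJoin keeps every left-side pair; unmatched keys get None.
val left  = List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4))
val right = List(('a', 5), ('a', 6), ('b', 7), ('b', 8))

val rightByKey = right.groupBy(_._1)
val result = left.flatMap { case (k, v) =>
  rightByKey.get(k) match {
    case Some(ms) => ms.map { case (_, w) => (k, (v, Some(w))) } // matched pairs
    case None     => List((k, (v, None)))  // 'c' has no match on the right
  }
}
```

In Spark, list1.leftOuterJoin(list2) produces values of type (V, Option[W]), so 'c' appears as (c,(4,None)), whereas the inner join above drops it.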
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
val list1 = sc.parallelize(List(('a', 1), ('a', 5), ('b', 3), ('b', 4), ('c', 4)))
val list2 = sc.parallelize(List(('a', 5), ('a', 6), ('b', 4), ('b', 8)))
for (each <- list1.intersection(list2)) {
  print(each + " ")
}
console:(b,4) (a,5)
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
val list1 = sc.parallelize(List(('a', 1), ('a', 1), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.distinct()) {
  print(each + " ")
}
console:(a,1) (b,4) (b,3) (c,4)
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.groupByKey()) {
  print(each + " ")
}
console:(a,CompactBuffer(1, 2)) (b,CompactBuffer(3, 4)) (c,CompactBuffer(4))
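To make the performance note above concrete: summing per key via groupByKey materializes every value for a key before aggregating, while reduceByKey combines values incrementally (in Spark, partially on the map side before the shuffle). The two aggregation shapes, sketched with plain Scala collections (illustrative, not Spark API code):

```scala
val pairs = List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4))

// groupByKey-style: build the full group per key, then sum it.
val viaGroup = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }

// reduceByKey-style: fold each value into a running sum per key; no groups are kept.
val viaReduce = pairs.foldLeft(Map.empty[Char, Int]) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}
```

Both produce the same result, but the second shape never holds all values for a key at once, which is why reduceByKey scales better for aggregations.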
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.reduceByKey(_ + _)) {
  print(each + " ")
}
console:(a,3) (b,7) (c,4)
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
val list1 = sc.parallelize(List(('a', 1), ('e', 2), ('b', 3), ('d', 4), ('c', 4)))
for (each <- list1.sortByKey(false)) {
  print(each + " ")
}
console:(e,2) (d,4) (c,4) (b,3) (a,1)
Action:
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
e.g.:
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.reduce(_ + _))
console:10
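Why the commutative/associative requirement matters: Spark reduces each partition independently and then merges the partial results, so a non-associative function can give partition-dependent answers. A small illustration with plain Scala collections:

```scala
val nums = List(1, 2, 3, 4)

// Addition is associative and commutative, so every grouping agrees:
val sum = nums.reduce(_ + _) // 10

// Subtraction is not associative, so different groupings disagree; a
// parallel reduce over it would depend on how the data is partitioned:
val leftFold  = ((1 - 2) - 3) - 4 // sequential left-to-right: -8
val splitWise = (1 - 2) - (3 - 4) // reduce each half, then merge: 0
```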
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val rdd = sc.parallelize(List(1, 2, 3, 4))
val result = rdd.filter(_ % 2 == 0).collect()
for (each <- result) {
  print(each + " ")
}
console:2 4
count(): Return the number of elements in the dataset.
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.count())
console:4
first(): Return the first element of the dataset (similar to take(1)).
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.first())
console:1
take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements.
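A minimal example of the semantics, shown here with a plain Scala collection (Spark's RDD.take(n) returns the first n elements to the driver in the same way):

```scala
val data = List(1, 2, 3, 4)

val firstTwo = data.take(2)   // the first two elements: List(1, 2)
// first() is equivalent to take(1).head:
val first = data.take(1).head // 1
```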