Transformation:
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
val list = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4)))
val result = list.map(x => (x._1, x._2 + 1))
for (each <- result) {
  print(each)
}
console:(a,2)(a,3)(b,4)(b,5)
val list = sc.parallelize(List(1, 2, 3, 4, 5))
val result = list.map(_ + 1)
for (each <- result) {
  print(each)
}
console:23456
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
val list = sc.parallelize(List(1, 2, 3, 4, 5, 6))
val result = list.filter(_ % 2 == 0)
for (each <- result) {
  print(each)
}
console:246
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
e.g.:
val list = sc.parallelize(List("abc", "def"))
val result = list.flatMap(_.toList)
for (each <- result) {
  print(each)
}
console:abcdef
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
val list = sc.parallelize(List(1, 2, 3))
val list1 = sc.parallelize(List(4, 5, 6))
val result = list.union(list1)
for (each <- result) {
  print(each)
}
console:123456
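One point worth noting: Spark's union does not remove duplicates (unlike SQL's UNION); the elements of the two datasets are simply concatenated. A minimal sketch of the semantics using plain Scala collections (illustrative, not Spark API code):

```scala
// union keeps duplicates: the two collections are simply concatenated.
val a = List(1, 2, 3)
val b = List(3, 4, 5)

val unioned = a ++ b            // 3 appears twice: List(1, 2, 3, 3, 4, 5)
val deduped = (a ++ b).distinct // deduplicate explicitly if set semantics are wanted
```

The same applies in Spark: call distinct() on the result of union if duplicates should be dropped.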
join(otherDataset,[numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
val list2 = sc.parallelize(List(('a', 5), ('a', 6), ('b', 7), ('b', 8)))
for (each <- list1.join(list2)) {
  print(each + " ")
}
console:(a,(1,5)) (a,(1,6)) (a,(2,5)) (a,(2,6)) (b,(3,7)) (b,(3,8)) (b,(4,7)) (b,(4,8))
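The outer-join variants mentioned above differ in how unmatched keys are handled. A sketch of leftOuterJoin semantics over the same data, written with plain Scala collections (the helper names are illustrative, not Spark API code):

```scala
// leftOuterJoin keeps every left-side pair; unmatched keys get None.
val left  = List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4))
val right = List(('a', 5), ('a', 6), ('b', 7), ('b', 8))

val rightByKey = right.groupBy(_._1)
val result = left.flatMap { case (k, v) =>
  rightByKey.get(k) match {
    case Some(ms) => ms.map { case (_, w) => (k, (v, Some(w))) } // matched pairs
    case None     => List((k, (v, None)))  // 'c' has no match on the right
  }
}
```

In Spark, list1.leftOuterJoin(list2) produces values of type (V, Option[W]), so 'c' appears as (c,(4,None)), whereas the inner join above drops it.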
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
val list1 = sc.parallelize(List(('a', 1), ('a', 5), ('b', 3), ('b', 4), ('c', 4)))
val list2 = sc.parallelize(List(('a', 5), ('a', 6), ('b', 4), ('b', 8)))
for (each <- list1.intersection(list2)) {
  print(each + " ")
}
console:(b,4) (a,5)
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
val list1 = sc.parallelize(List(('a', 1), ('a', 1), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.distinct()) {
  print(each + " ")
}
console:(a,1) (b,4) (b,3) (c,4)
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.groupByKey()) {
  print(each + " ")
}
console:(a,CompactBuffer(1, 2)) (b,CompactBuffer(3, 4)) (c,CompactBuffer(4))
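To make the performance note above concrete: summing per key via groupByKey materializes every value for a key before aggregating, while reduceByKey combines values incrementally (in Spark, partially on the map side before the shuffle). The two aggregation shapes, sketched with plain Scala collections (illustrative, not Spark API code):

```scala
val pairs = List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4))

// groupByKey-style: build the full group per key, then sum it.
val viaGroup = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }

// reduceByKey-style: fold each value into a running sum per key; no groups are kept.
val viaReduce = pairs.foldLeft(Map.empty[Char, Int]) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}
```

Both produce the same result, but the second shape never holds all values for a key at once, which is why reduceByKey scales better for aggregations.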
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
val list1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('b', 4), ('c', 4)))
for (each <- list1.reduceByKey(_ + _)) {
  print(each + " ")
}
console:(a,3) (b,7) (c,4)
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
val list1 = sc.parallelize(List(('a', 1), ('e', 2), ('b', 3), ('d', 4), ('c', 4)))
for (each <- list1.sortByKey(false)) {
  print(each + " ")
}
console:(e,2) (d,4) (c,4) (b,3) (a,1)
Action:
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
e.g.:
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.reduce(_ + _))
console:10
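Why the commutative/associative requirement matters: Spark reduces each partition independently and then merges the partial results, so a non-associative function can give partition-dependent answers. A small illustration with plain Scala collections:

```scala
val nums = List(1, 2, 3, 4)

// Addition is associative and commutative, so every grouping agrees:
val sum = nums.reduce(_ + _) // 10

// Subtraction is not associative, so different groupings disagree; a
// parallel reduce over it would depend on how the data is partitioned:
val leftFold  = ((1 - 2) - 3) - 4 // sequential left-to-right: -8
val splitWise = (1 - 2) - (3 - 4) // reduce each half, then merge: 0
```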
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val rdd = sc.parallelize(List(1, 2, 3, 4))
val result = rdd.filter(_ % 2 == 0).collect()
for (each <- result) {
  print(each + " ")
}
console:2 4
count(): Return the number of elements in the dataset.
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.count())
console:4
first(): Return the first element of the dataset (similar to take(1)).
val rdd = sc.parallelize(List(1, 2, 3, 4))
print(rdd.first())
console:1
take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements.
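A minimal example of the semantics, shown here with a plain Scala collection (Spark's RDD.take(n) returns the first n elements to the driver in the same way):

```scala
val data = List(1, 2, 3, 4)

val firstTwo = data.take(2)   // the first two elements: List(1, 2)
// first() is equivalent to take(1).head:
val first = data.take(1).head // 1
```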