I. RDD Operations (RDD operations = Transformations + Actions)
| Transformation | Meaning |
| --- | --- |
| `map(func)` | Return a new distributed dataset formed by passing each element of the source through a function `func`. |
| `filter(func)` | Return a new dataset formed by selecting those elements of the source on which `func` returns true. |
| `flatMap(func)` | Similar to map, but each input item can be mapped to 0 or more output items (so `func` should return a Seq rather than a single item). |
| `mapPartitions(func)` | Similar to map, but runs separately on each partition (block) of the RDD, so `func` must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. |
| `mapPartitionsWithSplit(func)` | Similar to mapPartitions, but also provides `func` with an integer value representing the index of the split, so `func` must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T. |
| `sample(withReplacement, fraction, seed)` | Sample a fraction `fraction` of the data, with or without replacement, using a given random number generator seed. |
| `union(otherDataset)` | Return a new dataset that contains the union of the elements in the source dataset and the argument. |
| `distinct([numTasks])` | Return a new dataset that contains the distinct elements of the source dataset. |
| `groupByKey([numTasks])` | When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, this uses only 8 parallel tasks to do the grouping; you can pass an optional `numTasks` argument to set a different number of tasks. |
| `reduceByKey(func, [numTasks])` | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. As in groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| `sortByKey([ascending], [numTasks])` | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean `ascending` argument. |
| `join(otherDataset, [numTasks])` | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. |
| `cogroup(otherDataset, [numTasks])` | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is also called `groupWith`. |
| `cartesian(otherDataset)` | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). |
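To make the transformations concrete, here is a minimal Scala sketch; the SparkContext `sc` and the sample data are assumptions for the example, not part of the API description above.

```scala
// Minimal sketch of several transformations; `sc` is an assumed SparkContext.
// Transformations are lazy: nothing executes until an action is called.
val nums  = sc.parallelize(Seq(1, 2, 2, 3, 4))
val lines = sc.parallelize(Seq("a b", "c"))

val doubled = nums.map(_ * 2)               // one output per input: 2, 4, 4, 6, 8
val evens   = nums.filter(_ % 2 == 0)       // keeps 2, 2, 4
val tokens  = lines.flatMap(_.split(" "))   // 0..n outputs per input: "a", "b", "c"
val uniques = nums.distinct()               // 1, 2, 3, 4

val pairs  = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k1", 3)))
val summed = pairs.reduceByKey(_ + _)       // ("k1", 4), ("k2", 2)
val sorted = summed.sortByKey()             // sorted by key, ascending by default
```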
| Action | Meaning |
| --- | --- |
| `reduce(func)` | Aggregate the elements of the dataset using a function `func` (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
| `collect()` | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
| `count()` | Return the number of elements in the dataset. |
| `first()` | Return the first element of the dataset (similar to take(1)). |
| `take(n)` | Return an array with the first `n` elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements. |
| `takeSample(withReplacement, num, seed)` | Return an array with a random sample of `num` elements of the dataset, with or without replacement, using the given random number generator seed. |
| `saveAsTextFile(path)` | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
| `saveAsSequenceFile(path)` | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.). |
| `countByKey()` | Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. |
| `foreach(func)` | Run a function `func` on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with external storage systems. |
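And a matching sketch for the actions; again `sc`, the sample data, and the output path are assumptions made for the example.

```scala
// Minimal sketch of the common actions; `sc` is an assumed SparkContext.
val nums  = sc.parallelize(Seq(3, 1, 2, 2))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

nums.reduce(_ + _)               // 8; the function must be commutative and associative
nums.collect()                   // Array(3, 1, 2, 2), materialized at the driver
nums.count()                     // 4
nums.first()                     // 3, like take(1) but returning a single element
nums.take(2)                     // Array(3, 1), computed at the driver
nums.takeSample(false, 2, 42L)   // 2 random elements, no replacement, seed 42
pairs.countByKey()               // Map(a -> 2, b -> 1)
nums.foreach(x => println(x))    // runs on the executors, for side effects only
nums.saveAsTextFile("output/nums")  // hypothetical output directory
```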
- map
Passes each element of the RDD through the function given to map, producing a new RDD.
Input and output partitions correspond one to one: there are exactly as many output partitions as input partitions.
Every element of the source RDD has exactly one corresponding element in the new RDD.
- distinct
Removes duplicate elements from the RDD.
- reduce
Aggregates the elements of an RDD: the function takes two arguments and returns one value.
reduce passes pairs of RDD elements to the input function, producing a new value; that new value is then passed to the input function together with the next element of the RDD, and so on until only a single value remains.
- reduceByKey
For (K, V) data, aggregates the values that share the same key. Unlike groupByKey, it performs a combine step similar to the combiner in MapReduce, which reduces the corresponding data I/O and improves efficiency. For per-key operations that are not simple aggregations, you can concatenate the values into a string (or some other format) so that the values for the same key are gathered together, then iterate over the combined data and split it apart again (see the sketch after this list).
- countByValue
Return the count of each unique value in this RDD as a map of (value, count) pairs.
- collect, collectAsMap
Converts the RDD into a local collection: the results are returned as a List (collect) or a HashMap (collectAsMap).
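The notes above, including the string-combining workaround for reduceByKey, are illustrated together in the following Scala sketch; `sc` and the sample data are assumptions.

```scala
// Sketch tying the notes above together; `sc` is an assumed SparkContext.
val nums  = sc.parallelize(Seq(1, 2, 2, 3))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.map(_ + 1).collect()            // one output element per input: 2, 3, 3, 4
nums.distinct().collect()            // duplicates removed: 1, 2, 3
nums.reduce(_ + _)                   // pairwise folding down to one value: 8

// reduceByKey combines values map-side (like a MapReduce combiner),
// reducing shuffle I/O compared with groupByKey:
pairs.reduceByKey(_ + _).collect()   // ("a", 4), ("b", 2)

// The string-combining trick for non-aggregating per-key operations:
pairs.mapValues(_.toString)
  .reduceByKey(_ + "," + _)          // ("a", "1,3"), ("b", "2")
  .mapValues(_.split(","))           // split the combined values apart again
  .collect()

nums.countByValue()                  // Map(1 -> 1, 2 -> 2, 3 -> 1) at the driver
pairs.collectAsMap()                 // HashMap; duplicate keys keep only one value
```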
II. Other Operations
1. Filter out empty strings: filter(!_.isEmpty)
2. Find elements containing "C": filter(_.contains("C"))
3. To total all the counts, apply a reduce step: reduceByKey(_ + _). The _ + _ shorthand conveniently sums the values for each key.
4. Save to a file: rdd.repartition(1).saveAsTextFile("path/filename.txt")
5. Convert a collection into an RDD: distData = sc.parallelize(data) (these snippets are combined in the sketch below)
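A small end-to-end sketch combining these snippets; the input data and the output path are made up for the example.

```scala
// Filter-and-count pipeline; `sc` is an assumed SparkContext.
val data     = Seq("Cat", "", "Cow", "dog", "Cat")
val distData = sc.parallelize(data)        // turn a local collection into an RDD

val counts = distData
  .filter(!_.isEmpty)                      // drop empty strings
  .filter(_.contains("C"))                 // keep elements containing "C"
  .map(word => (word, 1))
  .reduceByKey(_ + _)                      // total the counts per key

counts.repartition(1).saveAsTextFile("output/counts.txt")  // single output file
```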
III. Data Types
1. Local vectors
The base class for local vectors is Vector, and two implementations are provided: DenseVector and SparseVector. Creating local vectors through the factory methods implemented in Vectors is recommended (see the sketch after this list).
2. Labeled points
A point carrying a class label is represented by the case class LabeledPoint.
3. zipWithIndex
This function pairs each element of the RDD with that element's index (ID) in the RDD, forming key/value pairs.
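A sketch of these three items, assuming Spark with MLlib on the classpath and an existing SparkContext `sc`; the sample values are invented for illustration.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Local vectors, created through the Vectors factory methods:
val dense: Vector  = Vectors.dense(1.0, 0.0, 3.0)
// Sparse form: size, indices of the non-zero entries, and their values.
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Labeled points: a class label plus a feature vector.
val pos = LabeledPoint(1.0, dense)
val neg = LabeledPoint(0.0, sparse)

// zipWithIndex pairs each element with its index (ID) in the RDD.
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex().collect()   // ("a", 0), ("b", 1), ("c", 2)
```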