Spark & Python Basic Operations


I. RDD Operations
(RDD operations = Transformations + Actions)

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new distributed dataset formed by passing each element of the source through a function func. |
| filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). |
| mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. |
| mapPartitionsWithSplit(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the split, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T. |
| sample(withReplacement, fraction, seed) | Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed. |
| union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument. |
| distinct([numTasks]) | Return a new dataset that contains the distinct elements of the source dataset. |
| groupByKey([numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, this uses only 8 parallel tasks to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKey(func, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| sortByKey([ascending], [numTasks]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. |
| join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is also called groupWith. |
| cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). |
| Action | Meaning |
| --- | --- |
| reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
| collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
| count() | Return the number of elements in the dataset. |
| first() | Return the first element of the dataset (similar to take(1)). |
| take(n) | Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel. Instead, the driver program computes all the elements. |
| takeSample(withReplacement, num, seed) | Return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed. |
| saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
| saveAsSequenceFile(path) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.). |
| countByKey() | Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. |
| foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with external storage systems. |
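
The tables are easier to digest with a small end-to-end example. The following PySpark sketch (the `users` and `scores` data are purely illustrative) chains a couple of the transformations above and then triggers them with actions:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # in the pyspark shell, `sc` is already defined

# Two pair RDDs keyed by user id (illustrative data)
users = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])
scores = sc.parallelize([(1, 90), (2, 75), (1, 60)])

# join: (K, V) and (K, W) -> (K, (V, W)); sortByKey orders the result by key
joined = users.join(scores).sortByKey()

# Transformations are lazy; the actions below trigger the actual computation
print(joined.collect())   # e.g. [(1, ('alice', 90)), (1, ('alice', 60)), (2, ('bob', 75))]
print(scores.count())     # 3
print(scores.first())     # (1, 90)
print(scores.take(2))     # first two elements
```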

- map
Passes every element of the RDD through the function given to map, producing a new element for each.
Input and output partitions correspond one-to-one: the result has exactly as many partitions as the source.
Every element of the original RDD has exactly one corresponding element in the new RDD.
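
A minimal PySpark sketch (data chosen only for illustration); glom() is used here just to show that the number of partitions is preserved:

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()          # already defined as `sc` in the pyspark shell

nums = sc.parallelize([1, 2, 3, 4], 2)   # 2 input partitions
squared = nums.map(lambda x: x * x)      # exactly one output element per input element
print(squared.glom().collect())          # [[1, 4], [9, 16]]: still 2 partitions
```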

- distinct
Removes duplicate elements from the RDD.
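
For example (illustrative data):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()    # already defined as `sc` in the pyspark shell

data = sc.parallelize([1, 2, 2, 3, 3, 3])
print(data.distinct().collect())   # [1, 2, 3] (order not guaranteed)
```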

- reduce
Aggregates the elements of an RDD: the supplied function takes two arguments and returns one value.
reduce passes pairs of RDD elements to the input function, producing a new value; that value is then passed to the function together with the next element, and so on, until only a single value remains.
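
A short sketch of that pairwise folding (illustrative data):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()   # already defined as `sc` in the pyspark shell

nums = sc.parallelize([1, 2, 3, 4, 5])
# The two-argument function is applied pairwise until a single value remains
total = nums.reduce(lambda a, b: a + b)
print(total)   # 15
```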

- reduceByKey
For (K, V) data, aggregates the values that share the same key. Unlike groupByKey, it performs a combine step similar to the combiner in MapReduce, which cuts down data I/O and improves performance. For non-additive aggregations, you can concatenate the values for a key into a string (or another format) to pull them together, then iterate over the combined results and split them apart again.
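
A sketch of both the additive case and the "pack values into a string" idea (data and separator are only illustrative):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()   # already defined as `sc` in the pyspark shell

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 3)])

# Values sharing a key are combined map-side first, then merged across partitions
print(pairs.reduceByKey(lambda x, y: x + y).collect())   # [('a', 4), ('b', 1)]

# Non-additive case: pack the values for each key into one string,
# then split it apart later when iterating over the results
packed = pairs.map(lambda kv: (kv[0], str(kv[1]))) \
              .reduceByKey(lambda x, y: x + "," + y)
print(packed.collect())   # e.g. [('a', '1,3'), ('b', '1')]
```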

- countByValue()
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
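
For example (illustrative data; the result is wrapped in dict() only to print it cleanly):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()   # already defined as `sc` in the pyspark shell

words = sc.parallelize(["a", "b", "a", "c", "a"])
print(dict(words.countByValue()))   # {'a': 3, 'b': 1, 'c': 1}
```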

- collect, collectAsMap
Convert the RDD into a list or a map: the result comes back to the driver as a List or a HashMap.
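
In PySpark the results are a list and a dict respectively (illustrative data):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()   # already defined as `sc` in the pyspark shell

pairs = sc.parallelize([("a", 1), ("b", 2)])
print(pairs.collect())        # [('a', 1), ('b', 2)]  (a list)
print(pairs.collectAsMap())   # {'a': 1, 'b': 2}      (a dict; pair RDDs only)
```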

II. Other Operations
1. Filter out empty strings: filter(!_.isEmpty)
2. Find elements containing "C": filter(_.contains("C"))
3. To total up all the counts, apply a reduce step, reduceByKey(_ + _); the _ + _ shorthand conveniently sums the values for each key.
4. Save to a file: rdd.repartition(1).saveAsTextFile("path/filename.txt")
5. Convert a collection into an RDD: distData = sc.parallelize(data) (see the PySpark sketch below)
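
The snippets above use Scala's underscore placeholder syntax; a rough PySpark equivalent of items 1 to 5 might look like this (the sample lines and the output_dir path are only illustrative):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()              # already defined as `sc` in the pyspark shell

lines = sc.parallelize(["", "Cat", "dog", "Car", ""])   # illustrative data

non_empty = lines.filter(lambda s: s != "")             # 1. drop empty strings
with_c = lines.filter(lambda s: "C" in s)                # 2. keep elements containing "C"

counts = non_empty.map(lambda w: (w, 1)) \
                  .reduceByKey(lambda a, b: a + b)       # 3. sum the counts per key

counts.repartition(1).saveAsTextFile("output_dir")      # 4. one output file (hypothetical path)

dist_data = sc.parallelize([1, 2, 3, 4])                 # 5. local collection -> RDD
```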

III. Data Types
1. Local vectors
The base class of local vectors is Vector, and two implementations are provided: DenseVector and SparseVector. The recommended way to create local vectors is through the factory methods implemented in Vectors.
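
For instance, with the factory methods in pyspark.mllib.linalg.Vectors (values chosen only for illustration):

```python
from pyspark.mllib.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])
# sparse(size, indices, values): length 3, non-zeros at positions 0 and 2
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])
```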

2. Labeled points
A point with a class label is represented by the case class LabeledPoint.
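
In PySpark, LabeledPoint lives in pyspark.mllib.regression; the labels and features below are illustrative:

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

pos = LabeledPoint(1.0, Vectors.dense([1.0, 0.0, 3.0]))         # positive example
neg = LabeledPoint(0.0, Vectors.sparse(3, [0, 2], [1.0, 3.0]))  # negative example
```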

3. zipWithIndex
This function combines each element of the RDD with that element's ID (index) in the RDD, forming key/value pairs.
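
For example (illustrative data):

```python
from pyspark import SparkContext
sc = SparkContext.getOrCreate()   # already defined as `sc` in the pyspark shell

letters = sc.parallelize(["a", "b", "c"])
print(letters.zipWithIndex().collect())   # [('a', 0), ('b', 1), ('c', 2)]
```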