I. RDD Operations (RDD operations = Transformations + Actions)
| Transformation | Meaning |
| --- | --- |
| `map(func)` | Return a new distributed dataset formed by passing each element of the source through a function `func`. |
| `filter(func)` | Return a new dataset formed by selecting those elements of the source on which `func` returns true. |
| `flatMap(func)` | Similar to map, but each input item can be mapped to 0 or more output items (so `func` should return a Seq rather than a single item). |
| `mapPartitions(func)` | Similar to map, but runs separately on each partition (block) of the RDD, so `func` must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. |
| `mapPartitionsWithSplit(func)` | Similar to mapPartitions, but also provides `func` with an integer value representing the index of the split, so `func` must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T. |
| `sample(withReplacement, fraction, seed)` | Sample a fraction `fraction` of the data, with or without replacement, using a given random number generator seed. |
| `union(otherDataset)` | Return a new dataset that contains the union of the elements in the source dataset and the argument. |
| `distinct([numTasks])` | Return a new dataset that contains the distinct elements of the source dataset. |
| `groupByKey([numTasks])` | When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default, this uses only 8 parallel tasks to do the grouping; you can pass an optional `numTasks` argument to set a different number of tasks. |
| `reduceByKey(func, [numTasks])` | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. As in groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| `sortByKey([ascending], [numTasks])` | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean `ascending` argument. |
| `join(otherDataset, [numTasks])` | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. |
| `cogroup(otherDataset, [numTasks])` | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. This operation is also called `groupWith`. |
| `cartesian(otherDataset)` | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). |
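To make the transformations concrete, here is a minimal Scala sketch; the SparkContext `sc` and the sample data are assumptions for the example, not part of the API description above.

```scala
// Minimal sketch of several transformations; `sc` is an assumed SparkContext.
// Transformations are lazy: nothing executes until an action is called.
val nums  = sc.parallelize(Seq(1, 2, 2, 3, 4))
val lines = sc.parallelize(Seq("a b", "c"))

val doubled = nums.map(_ * 2)               // one output per input: 2, 4, 4, 6, 8
val evens   = nums.filter(_ % 2 == 0)       // keeps 2, 2, 4
val tokens  = lines.flatMap(_.split(" "))   // 0..n outputs per input: "a", "b", "c"
val uniques = nums.distinct()               // 1, 2, 3, 4

val pairs  = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k1", 3)))
val summed = pairs.reduceByKey(_ + _)       // ("k1", 4), ("k2", 2)
val sorted = summed.sortByKey()             // sorted by key, ascending by default
```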
| Action | Meaning |
| --- | --- |
| `reduce(func)` | Aggregate the elements of the dataset using a function `func` (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
| `collect()` | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
| `count()` | Return the number of elements in the dataset. |
| `first()` | Return the first element of the dataset (similar to take(1)). |
| `take(n)` | Return an array with the first `n` elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements. |
| `takeSample(withReplacement, num, seed)` | Return an array with a random sample of `num` elements of the dataset, with or without replacement, using the given random number generator seed. |
| `saveAsTextFile(path)` | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
| `saveAsSequenceFile(path)` | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.). |
| `countByKey()` | Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. |
| `foreach(func)` | Run a function `func` on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with external storage systems. |
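And a matching sketch for the actions; again `sc`, the sample data, and the output path are assumptions made for the example.

```scala
// Minimal sketch of the common actions; `sc` is an assumed SparkContext.
val nums  = sc.parallelize(Seq(3, 1, 2, 2))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

nums.reduce(_ + _)               // 8; the function must be commutative and associative
nums.collect()                   // Array(3, 1, 2, 2), materialized at the driver
nums.count()                     // 4
nums.first()                     // 3, like take(1) but returning a single element
nums.take(2)                     // Array(3, 1), computed at the driver
nums.takeSample(false, 2, 42L)   // 2 random elements, no replacement, seed 42
pairs.countByKey()               // Map(a -> 2, b -> 1)
nums.foreach(x => println(x))    // runs on the executors, for side effects only
nums.saveAsTextFile("output/nums")  // hypothetical output directory
```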
- map
Passes each element of the RDD through the function given to map, producing a new RDD.
Input and output partitions correspond one to one: there are exactly as many output partitions as input partitions.
Every element of the source RDD has exactly one corresponding element in the new RDD.
- distinct
Removes duplicate elements from the RDD.
- reduce
Aggregates the elements of an RDD: the function takes two arguments and returns one value.
reduce passes pairs of RDD elements to the input function, producing a new value; that new value is then passed to the input function together with the next element of the RDD, and so on until only a single value remains.
- reduceByKey
For (K, V) data, aggregates the values that share the same key. Unlike groupByKey, it performs a combine step similar to the combiner in MapReduce, which reduces the corresponding data I/O and improves efficiency. For per-key operations that are not simple aggregations, you can concatenate the values into a string (or some other format) so that the values for the same key are gathered together, then iterate over the combined data and split it apart again (see the sketch after this list).
- countByValue
Return the count of each unique value in this RDD as a map of (value, count) pairs.
- collect, collectAsMap
Converts the RDD into a local collection: the results are returned as a List (collect) or a HashMap (collectAsMap).
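The notes above, including the string-combining workaround for reduceByKey, are illustrated together in the following Scala sketch; `sc` and the sample data are assumptions.

```scala
// Sketch tying the notes above together; `sc` is an assumed SparkContext.
val nums  = sc.parallelize(Seq(1, 2, 2, 3))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.map(_ + 1).collect()            // one output element per input: 2, 3, 3, 4
nums.distinct().collect()            // duplicates removed: 1, 2, 3
nums.reduce(_ + _)                   // pairwise folding down to one value: 8

// reduceByKey combines values map-side (like a MapReduce combiner),
// reducing shuffle I/O compared with groupByKey:
pairs.reduceByKey(_ + _).collect()   // ("a", 4), ("b", 2)

// The string-combining trick for non-aggregating per-key operations:
pairs.mapValues(_.toString)
  .reduceByKey(_ + "," + _)          // ("a", "1,3"), ("b", "2")
  .mapValues(_.split(","))           // split the combined values apart again
  .collect()

nums.countByValue()                  // Map(1 -> 1, 2 -> 2, 3 -> 1) at the driver
pairs.collectAsMap()                 // HashMap; duplicate keys keep only one value
```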
II. Other Operations
1. Filter out empty strings: filter(!_.isEmpty)
2. Find elements containing "C": filter(_.contains("C"))
3. To total all the counts, apply a reduce step: reduceByKey(_ + _). The _ + _ shorthand conveniently sums the values for each key.
4. Save to a file: rdd.repartition(1).saveAsTextFile("path/filename.txt")
5. Convert a collection into an RDD: distData = sc.parallelize(data) (these snippets are combined in the sketch below)
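A small end-to-end sketch combining these snippets; the input data and the output path are made up for the example.

```scala
// Filter-and-count pipeline; `sc` is an assumed SparkContext.
val data     = Seq("Cat", "", "Cow", "dog", "Cat")
val distData = sc.parallelize(data)        // turn a local collection into an RDD

val counts = distData
  .filter(!_.isEmpty)                      // drop empty strings
  .filter(_.contains("C"))                 // keep elements containing "C"
  .map(word => (word, 1))
  .reduceByKey(_ + _)                      // total the counts per key

counts.repartition(1).saveAsTextFile("output/counts.txt")  // single output file
```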
III. Data Types
1. Local vectors
The base class for local vectors is Vector, and two implementations are provided: DenseVector and SparseVector. Creating local vectors through the factory methods implemented in Vectors is recommended (see the sketch after this list).
2. Labeled points
A point carrying a class label is represented by the case class LabeledPoint.
3. zipWithIndex
This function pairs each element of the RDD with that element's index (ID) in the RDD, forming key/value pairs.
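A sketch of these three items, assuming Spark with MLlib on the classpath and an existing SparkContext `sc`; the sample values are invented for illustration.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Local vectors, created through the Vectors factory methods:
val dense: Vector  = Vectors.dense(1.0, 0.0, 3.0)
// Sparse form: size, indices of the non-zero entries, and their values.
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Labeled points: a class label plus a feature vector.
val pos = LabeledPoint(1.0, dense)
val neg = LabeledPoint(0.0, sparse)

// zipWithIndex pairs each element with its index (ID) in the RDD.
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex().collect()   // ("a", 0), ("b", 1), ("c", 2)
```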