Spark Core
The five key properties of an RDD:
- A list of partitions: the RDD is made up of a series of partitions
- A function for computing each split: the computation is applied to each partition
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
One partition corresponds to one task.
scala> val L = Array(1,2,3,4,5)
L: Array[Int] = Array(1, 2, 3, 4, 5)
scala> sc.parallelize(L,3).collect
res7: Array[Int] = Array(1, 2, 3, 4, 5)
scala> sc.parallelize(L,2).collect
res8: Array[Int] = Array(1, 2, 3, 4, 5)
In the Spark UI, the corresponding jobs will show 3 tasks and 2 tasks respectively.
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
/**
 * In spark-shell on the server there is no need to create a SparkContext yourself; sc is already defined automatically.
 * Any parameter you want to pass into Spark must use a key that starts with "spark."
 * setAppName and setMaster already add the "spark." prefix under the hood.
 */
package spark

import org.apache.spark.{SparkConf, SparkContext}

object RDDApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("App01").setMaster("local")
    val sc = new SparkContext(conf)
    println(sc.appName)
    println(sc.master)
    sc.stop()
  }
}
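For example, a custom parameter can be passed through SparkConf.set as long as the key carries the "spark." prefix (the key and value below are only an illustration):
val conf = new SparkConf()
  .setAppName("App01")
  .setMaster("local")
  .set("spark.executor.memory", "1g")  // custom parameters must use a "spark."-prefixed key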
Creating an RDD:
1. sc.parallelize
scala> val L = Array(1,2,3,4,5)
L: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val rdd = sc.parallelize(L)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
scala> rdd.reduce(_+_)
res1: Int = 15
2. External Datasets
- sc.textFile(path) returns a MapPartitionsRDD
scala> sc.textFile("/home/wzj/data/test.txt")
res0: org.apache.spark.rdd.RDD[String] = /home/wzj/data/test.txt MapPartitionsRDD[1] at textFile at <console>:25
scala> sc.textFile("/home/wzj/data/test.txt").collect
res1: Array[String] = Array(study,study,study, no,no, i)
- sc.wholeTextFiles(path) returns (path, content) pairs
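A rough sketch of wholeTextFiles, assuming the directory contains the test.txt used above (result shown as a comment, not captured shell output):
scala> sc.wholeTextFiles("/home/wzj/data").collect
// roughly Array((file:/home/wzj/data/test.txt, "study,study,study\nno,no\ni"), ...) -- each element is (file path, full file content)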
3. transformations/actions: operators compute on an RDD and return a new RDD
Transformations are lazy: they do not execute immediately; a job is only submitted when the first action is encountered
scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> val b = a.map(_*2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:25
scala> b.collect
res5: Array[Int] = Array(2, 4, 6, 8, 10)
Transformation operators:
1. map/mapPartitions
map operates on each element
mapPartitions operates on each partition (it is handed the whole partition as an iterator)
scala> b.mapPartitions(partition => partition.map(_*2)).collect
res6: Array[Int] = Array(4, 8, 12, 16, 20)
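For comparison, the same doubling written with map on the same RDD b (result shown as a comment):
scala> b.map(_ * 2).collect
// Array(4, 8, 12, 16, 20) -- same result; mapPartitions simply processes one whole partition at a time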
2. filter: keeps only the elements that satisfy a predicate
rdd.filter(predicate)
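A small filter sketch (nums is just an example RDD; result shown as a comment):
scala> val nums = sc.parallelize(1 to 10)
scala> nums.filter(_ % 2 == 0).collect
// Array(2, 4, 6, 8, 10)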
3. zip: pairs the elements of two RDDs one-to-one
rdd1.zip(rdd2)
The two RDDs must have the same number of partitions and the same number of elements
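A minimal zip sketch, where both RDDs are created with matching partition and element counts (result shown as a comment):
scala> val x = sc.parallelize(Array(1, 2, 3), 3)
scala> val y = sc.parallelize(Array("a", "b", "c"), 3)
scala> x.zip(y).collect
// Array((1,a), (2,b), (3,c)); zip fails if the partition or element counts differ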
4. sortBy/sortByKey
scala> val rdd1 = sc.parallelize(Array(("qwe",20),("asd",15),("zxc",35)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.sortBy(x =>x._2).collect()
res0: Array[(String, Int)] = Array((asd,15), (qwe,20), (zxc,35))
scala> rdd1.map(x =>(x._2,x._1)).sortByKey().map(x=>(x._2,x._1)).collect()
res2: Array[(String, Int)] = Array((asd,15), (qwe,20), (zxc,35))
scala> rdd1.map(x =>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect()
res3: Array[(String, Int)] = Array((zxc,35), (qwe,20), (asd,15))
5. reduceByKey/groupByKey
groupByKey shuffles far more data than reduceByKey; reduceByKey performs local (map-side) aggregation first, much like the Combiner in MapReduce, so prefer reduceByKey
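rdd2 used in the examples below is not defined above; a definition consistent with the outputs shown would be:
scala> val rdd2 = sc.parallelize(Array(("a", 1), ("a", 99), ("b", 2), ("c", 23)))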
scala> rdd2.groupByKey()
res4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[13] at groupByKey at <console>:26
scala> rdd2.groupByKey().collect
res5: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 99)), (b,CompactBuffer(2)), (c,CompactBuffer(23)))
scala> rdd2.groupByKey().map(x=>(x._1,x._2.sum)).collect
res7: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
scala> rdd2.reduceByKey(_+_).collect
res9: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
scala> rdd2.reduceByKey((x,y)=>x+y).collect
res11: Array[(String, Int)] = Array((a,100), (b,2), (c,23))
6. join/leftOuterJoin/rightOuterJoin/fullOuterJoin
rdd1.join(rdd2)
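A quick sketch of the join family on two small pair RDDs (names are illustrative; results shown as comments, element order may vary after the shuffle):
scala> val r1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
scala> val r2 = sc.parallelize(Array(("a", "x"), ("b", "y"), ("d", "z")))
scala> r1.join(r2).collect
// Array((a,(1,x)), (b,(2,y)))  -- only keys present in both RDDs
scala> r1.leftOuterJoin(r2).collect
// Array((a,(1,Some(x))), (b,(2,Some(y))), (c,(3,None)))  -- keeps every key from r1
scala> r1.fullOuterJoin(r2).collect
// Array((a,(Some(1),Some(x))), (b,(Some(2),Some(y))), (c,(Some(3),None)), (d,(None,Some(z))))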
Action operators:
1. collect: loads all the data into the driver's memory, so it is only suitable for small test datasets; on large data it will cause an OOM
2. first: returns the first element
3. take(num): returns the first num elements
4. count: returns the number of elements
5. top(num): returns the largest num elements (in descending order)
6. takeOrdered(num): returns the smallest num elements (in ascending order)
7. reduce: aggregates all elements with a binary function
8. countByKey: counts how many elements there are for each key
9. lookup: returns all values for a given key, e.g. rdd.lookup("key")
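A quick sketch of several of these actions on a small example RDD (results shown as comments, assuming the data below):
scala> val nums = sc.parallelize(Array(5, 1, 4, 2, 3))
scala> nums.first                 // 5
scala> nums.take(2)               // Array(5, 1)
scala> nums.count                 // 5
scala> nums.top(2)                // Array(5, 4)
scala> nums.takeOrdered(2)        // Array(1, 2)
scala> nums.reduce(_ + _)         // 15
scala> sc.parallelize(Array(("a", 1), ("a", 2), ("b", 3))).countByKey  // Map(a -> 2, b -> 1)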
Writing data to the local filesystem
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app001").setMaster("local")
    val sc = new SparkContext(conf)
    saveTextFile(sc)
    sc.stop()
  }

  def saveTextFile(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(Array(("qwe", 10), ("asd", 123), ("zxc", 34)))
    // Output directory relative to the working directory; it must not already exist
    val out = "out"
    rdd1.saveAsTextFile(out)
  }
}
Writing data to HDFS
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // Run the job as the HDFS user "wzj"
    System.setProperty("HADOOP_USER_NAME", "wzj")
    val conf = new SparkConf().setAppName("app001").setMaster("local")
    val sc = new SparkContext(conf)
    saveTextFile(sc)
    sc.stop()
  }

  def saveTextFile(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(Array(("qwe", 10), ("asd", 123), ("zxc", 34)))
    // Point the Hadoop client at the remote NameNode and resolve DataNodes by hostname
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://hadoop001:9000")
    sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
    val out = "hdfs://hadoop001:9000/test"
    rdd1.saveAsTextFile(out)
  }
}