Spark Development Tutorial



Preface

Spark development is mainly based on RDD, Dataset, DataFrame and SQL. RDD is the core low-level abstraction; Dataset, DataFrame and SQL are higher-level APIs built on top of RDDs, and a DataFrame is simply a Dataset whose element type is Row (i.e. Dataset[Row]).


I. Initializing Spark

Submitting a Spark application starts a driver process, and the driver passes its execution logic to the executors through the SparkContext.
The SparkContext talks to the cluster manager (for example YARN's ResourceManager), which is responsible for allocating resources.

Initialization code:

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

In Spark 2.0 and later it is enough to create a single SparkSession; SparkConf, SparkContext and SQLContext are all wrapped inside the SparkSession.
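For example, a minimal sketch of creating a SparkSession (the application name and master are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-app")        // placeholder application name
  .master("local[*]")       // placeholder master; usually set by spark-submit instead
  .getOrCreate()

val sc = spark.sparkContext  // the wrapped SparkContext is still available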

II. Getting the Data Source

1. Creating data structures

  • Creating an RDD
val lines = sc.parallelize(List("pandas", "apple"))
  • Creating a DataFrame
val df = spark.createDataFrame(Seq(
      ("ming", 20, 15552211521L),
      ("hong", 19, 13287994007L),
      ("zhi", 21, 15552211523L)
    )).toDF("name", "age", "phone")
  • Creating a Dataset
case class Person(name: String, age: Int)   // a case class gets an Encoder via spark.implicits._
import spark.implicits._
val person1 = Person("Andy", 32)
val person2 = Person("katy", 33)
val personDS = spark.createDataset(List(person1, person2))

SparkSession wraps a SparkContext internally, so creating a SparkSession is all you need.

  • spark-sql
    Spark SQL is normally used by writing SQL directly: register a DataFrame as a temporary view and query it, as in the sketch below.
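    A minimal sketch, reusing the df created above (view and column names are only for illustration):

    df.createOrReplaceTempView("people")    // register the DataFrame as a temporary view
    val adults = spark.sql("select name, age from people where age >= 20")
    adults.show()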

2. Connecting to external data

textFile

Data can be read from the local file system or from HDFS:

spark.sparkContext.textFile(path,1)
  • If textFile is called with a partition count of 0 or 1, defaultMinPartitions is 1 and one partition is created per file.

  • If no partition count is specified, the default is 2. Spark divides the total size of all files (totalSize) by the number of partitions to get goalSize, compares goalSize with the HDFS block size (32 MB in this example), and takes the smaller of the two as the split size. Each file is split by that size, so a file larger than goalSize produces roughly fileSize / goalSize + 1 partitions.

  • If the specified partition count is 2 or more, that value is used as the partition count and the files are split by the same rule as in the previous case. A short sketch follows this list.
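A minimal sketch that applies these rules (the paths are placeholders):

val rdd1 = spark.sparkContext.textFile("hdfs:///tmp/input", 4)   // ask for at least 4 partitions
println(rdd1.getNumPartitions)                                   // actual number follows the split rules above
val rdd2 = spark.sparkContext.textFile("file:///tmp/local.txt")  // local file, default minPartitions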

jdbc
val jdbcDF = spark.read.format("jdbc").options(
     Map("url" ->  "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     "dbtable" -> "ontime.ontime_sm",
     "fetchSize" -> "10000",
     // the table is read in parallel: numPartitions partitions over the
     // [lowerBound, upperBound] range of partitionColumn
     "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2015", "numPartitions" -> "48"
     )).load()
hive
    val conf = new SparkConf()
    				.setAppName(this.getClass.getSimpleName)
    				.setMaster("local[*]")
    val session = SparkSession.builder()
      				.config(conf)
      				// Hive metastore port, 9083 by default; see hive-site.xml
      				.config("hive.metastore.uris", "thrift://hadoop-01:9083,thrift://hadoop-02:9083")
      				.enableHiveSupport()
      				.getOrCreate()

3. Data processing

RDD operators
Transformation operators
  1. map
  2. filter
  3. flatMap
  4. mapPartitions and mapPartitionsWithIndex
  5. sortBy and sortByKey
  6. groupBy and groupByKey
  7. reduceByKey, aggregateByKey, foldByKey, combineByKey
  8. distinct
  9. union
  10. intersection
  11. join, leftOuterJoin, rightOuterJoin
  12. cogroup (similar to fullOuterJoin)
  13. zip
Action operators
  1. collect
  2. reduce
  3. fold
  4. aggregate
  5. count
  6. take, top, first
  7. foreach, foreachPartition
  8. saveAsTextFile
  • map(func)
    Return a new distributed dataset formed by passing each element of the source through a function func.

  • filter(func)
    Return a new dataset formed by selecting those elements of the source on which func returns true.

  • flatMap(func)
    Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

  • union(otherDataset)
    Return a new dataset that contains the union of the elements in the source dataset and the argument.

  • join(otherDataset, [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin.

  • intersection(otherDataset)
    Return a new RDD that contains the intersection of elements in the source dataset and the argument.

  • distinct([numTasks])
    Return a new dataset that contains the distinct elements of the source dataset.

  • groupByKey([numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
    Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.

  • reduceByKey(func, [numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

  • sortByKey([ascending], [numTasks])
    When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

  • cogroup
    For each key k in this or other, return a resulting RDD that contains a tuple with the list of values for that key in this as well as other.
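
A small, self-contained sketch that combines a few of the operators above (the word list is made up):

val words = sc.parallelize(List("spark", "hadoop", "spark", "hive"))
val counts = words
  .map(w => (w, 1))               // transformation: pair each word with 1
  .reduceByKey(_ + _)             // transformation: sum the counts per key
counts.collect().foreach(println) // action: bring the results back to the driver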

DataFrame operations
  • selectExpr
    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
    The following are equivalent:
    ds.selectExpr("colA", "colB as newName", "abs(colC)")
    ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
    val df1 = spark.createDataFrame(List(("a", 1), ("a", 2), ("b", 3), ("a", 1), ("b", 4), ("c", 4))).toDF("a", "b")
    df1.selectExpr("concat(a, \"b\") as a", "b + 10").show()
  • select
    Selects a set of columns. This is a variant of select that can only select
    existing columns using column names (i.e. cannot construct expressions).
    ds.select("colA", "colB")
    ds.select($"colA", $"colB")
    val df1 = spark.createDataFrame(List(("a", 1), ("a", 2), ("b", 3), ("a", 1), ("b", 4), ("c", 4))).toDF("a", "b")
    df1.select("a", "b").show()
  • groupBy
    (Scala-specific) Compute aggregates by specifying the column names and
    aggregate methods. The resulting DataFrame will also contain the grouping columns.
    The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )

The other DataFrame operations are similar to the RDD operators.

Dataset operations

Differences between Dataset and DataFrame
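
As the preface notes, a DataFrame is simply Dataset[Row]. A minimal sketch of the practical difference, reusing the Person case class and the df created above: Dataset operations are checked against the Person fields at compile time, while DataFrame operations work on untyped rows and column names.

import spark.implicits._

// DataFrame: untyped, a wrong column name only fails at runtime
val namesDF = df.select("name", "age")

// Dataset: each element is a Person, field access is checked at compile time
val typedDS = df.select($"name", $"age").as[Person]
typedDS.map(p => p.name.toUpperCase).show()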

4. Shared variables

  • Accumulators
    val list1 = spark.sparkContext.parallelize(List(('a', 1), ('a', 2), ('b', 3), ('a', 1), ('b', 4), ('c', 4)), 4)
    val accum1 = spark.sparkContext.collectionAccumulator[String]("a")
    list1.foreachPartition(x => accum1.add("123"))
    println(accum1.value)
  • Broadcast variables
    val temp = List(1, 2, 2, 3, 4)
    val broad1 = spark.sparkContext.broadcast(temp)
    list1.foreachPartition(x => println(broad1.value))

broadcast: a broadcast variable is created for a variable by calling SparkContext's broadcast() method. When the broadcast variable is used inside an operator's function, each node holds only one copy of it instead of one per task, and reads it through the broadcast variable's value. Remember that broadcast variables are read-only.
Accumulator: used so that multiple nodes can collectively update one shared variable. An accumulator only provides an add operation, but it allows many tasks to operate on the same variable in parallel. Tasks can only add to an accumulator and cannot read its value; only the driver program can read the accumulator's value.
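A minimal sketch that ties the two together, using the longAccumulator variant and a broadcast lookup map (the lookup values are made up):

// broadcast: each executor keeps a single read-only copy of the lookup map
val lookup = spark.sparkContext.broadcast(Map('a' -> "first", 'b' -> "second"))
val labelled = list1.map { case (k, v) => (lookup.value.getOrElse(k, "other"), v) }
labelled.collect().foreach(println)

// accumulator: tasks only add, the driver reads the result after an action has run
val bigValues = spark.sparkContext.longAccumulator("bigValues")
list1.foreach { case (_, v) => if (v > 2) bigValues.add(1) }
println(bigValues.value)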

5. Writing data
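
A minimal sketch of writing results out, reusing df and the lines RDD from earlier (output paths are placeholders):

// DataFrame output, e.g. Parquet
df.write.mode("overwrite").parquet("hdfs:///tmp/output/people_parquet")

// RDD output as plain text: one directory with one part file per partition
lines.saveAsTextFile("hdfs:///tmp/output/lines_text")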

Summary

Spark is the king of offline (batch) development.
