Big Data, Spark Series: Basic Spark Concepts

在路上的小y

Last modified 2022-09-16 13:22:24

Category: Big Data. Tags: big data, spark

First published 2022-09-16 13:21:46

Article link: https://blog.csdn.net/qq_43709558/article/details/126888521

Spark VS MapReduce

Spark and MapReduce are both compute engines. They differ as follows:

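Aspect	MapReduce	Spark
Programming friendliness	Fair; only two operators, Map and Reduce	Good; a rich set of operators (map/filter/reduce/aggregate, etc.)
Performance	Fair. Intermediate data must be written to disk. When the logic is fairly complex, the work is split into several chained MapReduce jobs (multiple shuffles), and each shuffle incurs heavy disk I/O	Good. Computation is memory-based; the DAG captures the lineage of RDDs, so scheduling can avoid a large amount of unnecessary disk I/O, and the RDD sharing mechanism reduces network I/O overhead
Group ecosystem	Good; ODPS supports the MapReduce framework, and logs can be viewed through LogView	Good; ODPS supports the Spark engine, and logs can be viewed through LogView; stand-alone, cluster, and client modes are currently supported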
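To make the "rich set of operators" row concrete, here is a minimal sketch of map, filter, reduce, and aggregate on an RDD. The object name `OperatorSketch`, the helper names, and the local master setting are assumptions chosen for illustration; they do not come from the original post.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object OperatorSketch {
  // map + filter + reduce: square each number, keep even squares, sum them
  def evenSquareTotal(sc: SparkContext): Int =
    sc.parallelize(1 to 10)
      .map(n => n * n)
      .filter(_ % 2 == 0)
      .reduce(_ + _)

  // aggregate: compute (sum, count) in a single pass over the data
  def sumAndCount(sc: SparkContext): (Int, Int) =
    sc.parallelize(1 to 10).aggregate((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),   // fold one element into a partition's result
      (a, b) => (a._1 + b._1, a._2 + b._2)    // merge results from two partitions
    )

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder()
      .appName("operator-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    println(evenSquareTotal(sc)) // 4 + 16 + 36 + 64 + 100 = 220
    println(sumAndCount(sc))     // (55,10)

    spark.stop()
  }
}
```

Expressing the same logic in MapReduce would typically mean writing a Mapper and a Reducer class per step, which is the gap in friendliness the table describes.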

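A more intuitive comparison is shown below. (This does not mean Spark never writes to disk: when the DAG is split into stages, shuffles still occur, but overall disk I/O is lower than with MapReduce.)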
[Figure: comparison of MapReduce and Spark execution]
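The point about stages and shuffles can be observed directly. The following is a rough sketch, not the author's code: it builds a small word count, where `reduceByKey` forces a shuffle and therefore a stage boundary, and prints the RDD lineage with `toDebugString`. Reading the table's "RDD sharing" as `cache()` is an assumption on my part, as are the object name `LineageSketch`, the helper name, and the sample words.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Returns the word counts together with the lineage description of the result RDD
  def countsWithLineage(sc: SparkContext): (Map[String, Int], String) = {
    val words = sc.parallelize(Seq("spark", "mapreduce", "spark", "dag", "spark"))

    // map is a narrow transformation: it stays in the same stage as its parent
    val pairs = words.map(w => (w, 1))

    // reduceByKey needs a shuffle, so the DAG is split into two stages here
    val counts = pairs.reduceByKey(_ + _)

    // cache() asks Spark to keep computed partitions in memory for later actions
    counts.cache()

    (counts.collect().toMap, counts.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[*]")
      .getOrCreate()

    val (counts, lineage) = countsWithLineage(spark.sparkContext)
    println(counts)  // word -> occurrences
    println(lineage) // the ShuffledRDD entry marks the stage boundary

    spark.stop()
  }
}
```

Only the shuffle at `reduceByKey` moves data between stages; the `map` before it is pipelined in memory within a single stage.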

Programming model advantage: RDD + DataFrame
As the comparison with MapReduce above shows, Spark is somewhat more programmer-friendly than MapReduce, which makes it a good fit for back-end developers.

// Assumes a SparkSession named `spark` is already in scope (as in spark-shell)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")
// Print the selected names (this action triggers execution of the lazy query)
results.show()
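The snippet above depends on a `people.txt` file and an existing `spark` session. As a self-contained variant, the sketch below builds the same RDD-to-DataFrame pipeline from in-memory lines and types the `age` column as an integer so it can be filtered in SQL. The sample rows, the object name `PeopleSketch`, the helper `namesAtLeast`, and the local master setting are all hypothetical stand-ins, not part of the original example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object PeopleSketch {
  // Returns the names of people whose age is at least minAge, sorted by name
  def namesAtLeast(spark: SparkSession, minAge: Int): Seq[String] = {
    // Hypothetical in-memory stand-in for people.txt ("name, age" per line)
    val lines = spark.sparkContext.parallelize(
      Seq("Michael, 29", "Andy, 30", "Justin, 19"))

    // Explicit schema: name as string, age as integer
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    // Convert each text line to a Row matching the schema
    val rowRDD = lines
      .map(_.split(","))
      .map(a => Row(a(0).trim, a(1).trim.toInt))

    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.createOrReplaceTempView("people")

    spark.sql(s"SELECT name FROM people WHERE age >= $minAge ORDER BY name")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder()
      .appName("people-sketch")
      .master("local[*]")
      .getOrCreate()

    println(namesAtLeast(spark, 20))

    spark.stop()
  }
}
```

The RDD handles the free-form parsing, and the DataFrame takes over once the data has a schema, which is the division of labor the RDD + DataFrame model is meant to offer.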