Spark SQL and DataFrame Guide

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer operations available. DataFrames can be created from structured data files, Hive tables, external databases, or existing RDDs.

Entry Point: SQLContext

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

In addition, you can create a HiveContext, which provides a superset of the functionality of SQLContext: using it does not require an existing Hive installation, and all data sources available to a SQLContext are still available. HiveContext is only packaged separately so that Hive's dependencies are not pulled into the default Spark build.
The specific variant of SQL used to parse queries can also be selected with the spark.sql.dialect option. This parameter can be changed either with the setConf method on a SQLContext or with a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses the simple SQL parser provided by Spark SQL. For a HiveContext, the default is "hiveql", which is much more complete.
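As a minimal sketch (assuming an existing sqlContext; the values shown are just the defaults), the dialect can be changed either programmatically or from SQL:

sqlContext.setConf("spark.sql.dialect", "sql")  // "hiveql" would be the default on a HiveContext
sqlContext.sql("SET spark.sql.dialect=sql")     // equivalent, using a SET command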
The example below creates a DataFrame from a JSON file:

hadoop fs -put examples/src/main/resources/people.json /user/hadoop/people.json
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("people.json")

// Displays the content of the DataFrame to stdout
df.show()

15/08/27 08:46:00 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.json:36+37
15/08/27 08:46:00 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 3). 2117 bytes result sent to driver
15/08/27 08:46:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 28 ms on localhost (1/1)
15/08/27 08:46:00 INFO scheduler.DAGScheduler: ResultStage 2 (show at <console>:26) finished in 0.028 s
15/08/27 08:46:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
15/08/27 08:46:00 INFO scheduler.DAGScheduler: Job 2 finished: show at <console>:26, took 0.040699 s
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+


scala> 

DataFrame Operations



val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

Running SQL Queries Programmatically

The sql method on a SQLContext runs a SQL query programmatically and returns the result as a DataFrame.



val sqlContext = ...  // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

Java:



SQLContext sqlContext = ...  // An existing SQLContext
DataFrame df = sqlContext.sql("SELECT * FROM table");

Interoperating with RDDs

Spark SQL supports two different methods for converting RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains objects of a specific type.
The second method is through a programmatic interface that lets you construct a schema and then apply it to an existing RDD. This allows you to create DataFrames even when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of the case class arguments are read using reflection and become the column names. Case classes can also be nested or contain complex types such as Sequences or Arrays. The RDD can be converted to a DataFrame and registered as a table, and the table can then be queried with SQL statements.

 hadoop fs -put people.txt /user/hadoop
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
Output: 15/08/27 09:07:17 INFO spark.SparkContext: Created broadcast 23 from textFile at <console>:28
people: org.apache.spark.sql.DataFrame = [name: string, age: int]



people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

//Name: Justin

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, when the structure of records is encoded in a string, or a text dataset will be parsed and its fields projected differently by different users), a DataFrame can be created programmatically in three steps:

  1. Create an RDD of Rows from the original RDD
  2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD from step 1
  3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD; people.txt must first be uploaded to HDFS, e.g. /user/hadoop/people.txt
scala> val people = sc.textFile("people.txt")
15/08/28 08:55:50 INFO storage.MemoryStore: ensureFreeSpace(182784) called with curMem=97823, maxMem=278302556
15/08/28 08:55:50 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 178.5 KB, free 265.1 MB)
15/08/28 08:55:50 INFO storage.MemoryStore: ensureFreeSpace(17383) called with curMem=280607, maxMem=278302556
15/08/28 08:55:50 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 17.0 KB, free 265.1 MB)
15/08/28 08:55:50 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37236 (size: 17.0 KB, free: 265.4 MB)
15/08/28 08:55:50 INFO spark.SparkContext: Created broadcast 1 from textFile at <console>:23
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:23

scala> // The schema is encoded in a string

scala> val schemaString = "name age"
schemaString: String = name age

scala> 

scala> // Import Row.

scala> import org.apache.spark.sql.Row;
import org.apache.spark.sql.Row

scala> 

scala> // Import Spark SQL data types

scala> import org.apache.spark.sql.types.{StructType,StructField,StringType};
import org.apache.spark.sql.types.{StructType, StructField, StringType}

scala> 

scala> // Generate the schema based on the string of schema

scala> val schema =
     |   StructType(
     |     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))

scala> 

scala> // Convert records of the RDD (people) to Rows.

scala> val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[11] at map at <console>:28

scala> 

scala> // Apply the schema to the RDD.

scala> val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame: org.apache.spark.sql.DataFrame = [name: string, age: string]

scala> 

scala> // Register the DataFrames as a table.

scala> peopleDataFrame.registerTempTable("people")

scala> 

scala> // SQL statements can be run by using the sql methods provided by sqlContext.

scala> val results = sqlContext.sql("SELECT name FROM people")
results: org.apache.spark.sql.DataFrame = [name: string]

scala> 

scala> // The results of SQL queries are DataFrames and support all the normal RDD operations.

scala> // The columns of a row in the result can be accessed by field index or by field name.

scala> results.map(t => "Name: " + t(0)).collect().foreach(println)
15/08/28 08:56:01 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/28 08:56:02 INFO spark.SparkContext: Starting job: collect at <console>:31
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:31) with 2 output partitions (allowLocal=false)
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(collect at <console>:31)
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Missing parents: List()
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[15] at map at <console>:31), which has no missing parents
15/08/28 08:56:02 INFO storage.MemoryStore: ensureFreeSpace(5992) called with curMem=297990, maxMem=278302556
15/08/28 08:56:02 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.9 KB, free 265.1 MB)
15/08/28 08:56:02 INFO storage.MemoryStore: ensureFreeSpace(3049) called with curMem=303982, maxMem=278302556
15/08/28 08:56:02 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.0 KB, free 265.1 MB)
15/08/28 08:56:02 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:37236 (size: 3.0 KB, free: 265.4 MB)
15/08/28 08:56:02 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[15] at map at <console>:31)
15/08/28 08:56:02 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1417 bytes)
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1417 bytes)
15/08/28 08:56:02 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/28 08:56:02 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/28 08:56:02 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.txt:0+16
15/08/28 08:56:02 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.txt:16+16
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/08/28 08:56:02 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1821 bytes result sent to driver
15/08/28 08:56:02 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1807 bytes result sent to driver
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 417 ms on localhost (1/2)
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 410 ms on localhost (2/2)
15/08/28 08:56:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/08/28 08:56:02 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.432 s
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:31, took 0.549062 s
Name: Michael
Name: Andy
Name: Justin

Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD and can also be registered as a temporary table; registering a DataFrame as a table allows you to run SQL queries over its data.

Generic Load/Save Functions



val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

Save Modes

Save modes specify how to handle data that already exists at the output location.
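
For example (a minimal sketch, reusing the df from the generic load/save example above), the mode method on the DataFrame writer accepts a SaveMode such as Append, Overwrite, Ignore, or ErrorIfExists (the default):

import org.apache.spark.sql.SaveMode

// Overwrite any existing output instead of failing.
df.select("name", "favorite_color").write.mode(SaveMode.Overwrite).save("namesAndFavColors.parquet")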

Saving to Persistent Tables

When working with a HiveContext, a DataFrame can also be saved as a persistent table with the saveAsTable command. Unlike registerTempTable, saveAsTable materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore. Persistent tables still exist after your program restarts, as long as you connect to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SQLContext with the name of the table.
By default, saveAsTable creates a "managed table", meaning that the location of the data is controlled by the metastore. Managed tables also have their data deleted automatically when the table is dropped.
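
A minimal sketch, assuming a HiveContext-backed sqlContext and the df from the examples above ("people_table" is just an illustrative table name):

// Materialize the DataFrame as a persistent table in the metastore.
df.write.saveAsTable("people_table")

// Even after the application restarts (against the same metastore),
// the table can be loaded back as a DataFrame by name.
val persisted = sqlContext.table("people_table")
persisted.show()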

Parquet Files

Spark SQL supports reading and writing Parquet files, automatically preserving the schema of the original data.

Loading Data Programmatically



// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet.
people.write.parquet("people.parquet")

// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")

//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

To be continued...

JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This can be done with SQLContext.read.json(), applied either to an RDD of Strings or to a JSON file.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
//  |-- age: integer (nullable = true)
//  |-- name: string (nullable = true)

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

Hive Tables

Spark SQL also supports reading and writing data stored in Hive. However, Hive has a large number of dependencies that are not included in the default Spark distribution, so to enable Hive support Spark must be built with -Phive -Phive-thriftserver. The resulting jar must also be present on all worker nodes, since they need the Hive serialization and deserialization libraries (SerDes) to access data stored in Hive.
Hive is configured by placing your hive-site.xml file in the conf/ directory.

You must create a HiveContext, which inherits from SQLContext and adds support for finding tables in the Hive metastore and writing queries in HiveQL. When hive-site.xml is not configured, the context automatically creates metastore_db and warehouse in the current directory.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Original English documentation
