第64课:SparkSQL下Parquet的数据切分和压缩内幕详解学习笔记
本期内容:
1 SparkSQL下Parquet数据切分
2 SparkSQL下的Parquet数据压缩
以Spark官网上的SparkSQL操作Parquet的实例进行讲解:
Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
// sqlContext from the previous example is used in this example.// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
// Read the partitioned table
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()
// The final schema consists of all 3 columns in the Parquet files together// with the partitioning column appeared in the partition directory paths.// root// |-- single: int (nullable = true)// |-- double: int (nullable = true)// |-- triple: int (nullable = true)// |-- key : int (nullable = true)
实际运行结果:
scala> val df1 = sc.makeRDD(1 to 5).map(i => (i,i * 2)).toDF("single","double")
df1: org.apache.spark.sql.DataFrame = [single: int, double: int]
scala> df1.write.parquet("data/text_table/key=1")
16/04/02 04:27:07 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/04/02 04:27:07 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/04/02 04:27:07 INFO parquet.ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/04/02 04:27:07 INFO datasources.DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/04/02 04:27:09 INFO spark.SparkContext: Starting job: parquet at <console>:33
16/04/02 04:27:09 INFO scheduler.DAGScheduler: Got job 0 (parquet at <console>:33) with 3 output partitions
16/04/02 04:27:09 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (parquet at <console>:33)
16/04/02 04:27:09 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/02 04:27:09 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/02 04:27:09 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at parquet at <console>:33), which has no missing parents
16/04/02 04:27:12 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 68.0 KB, free 68.0 KB)
16/04/02 04:27:12 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.6 KB, free 92.5 KB)
16/04/02 04:27:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.121:56069 (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:27:12 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/04/02 04:27:12 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at parquet at <console>:33)
16/04/02 04:27:12 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slq1, partition 0,PROCESS_LOCAL, 2078 bytes)
16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slq2, partition 1,PROCESS_LOCAL, 2078 bytes)
16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, slq3, partition 2,PROCESS_LOCAL, 2135 bytes)
16/04/02 04:27:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq2:44836 (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:27:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq3:53765 (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:27:18 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq1:44043 (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:28:13 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 60174 ms on slq3 (1/3)
16/04/02 04:28:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 62700 ms on slq2 (2/3)
16/04/02 04:28:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 74088 ms on slq1 (3/3)
16/04/02 04:28:27 INFO scheduler.DAGScheduler: ResultStage 0 (parquet at <console>:33) finished in 74.105 s
16/04/02 04:28:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/02 04:28:27 INFO scheduler.DAGScheduler: Job 0 finished: parquet at <console>:33, took 78.540234 s
16/04/02 04:28:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/04/02 04:28:35 INFO datasources.DefaultWriterContainer: Job job_201604020427_0000 committed.
16/04/02 04:28:36 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=1 on driver
16/04/02 04:28:36 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=1 on driver
scala> 16/04/02 04:39:10 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq2:44836 in memory (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:39:10 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.121:56069 in memory (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:39:11 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq3:53765 in memory (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:39:11 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq1:44043 in memory (size: 24.6 KB, free: 517.4 MB)
16/04/02 04:39:11 INFO spark.ContextCleaner: Cleaned accumulator 3
16/04/02 04:39:11 INFO spark.ContextCleaner: Cleaned accumulator 2
scala> val df2 = sc.makeRDD(6 to 10).map(i => (i,i * 3)).toDF("single","triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]
scala> df2.write.parquet("data/text_table/key=2")
16/04/02 04:56:13 INFO parquet.ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/04/02 04:56:13 INFO datasources.DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/04/02 04:56:14 INFO spark.SparkContext: Starting job: parquet at <console>:33
16/04/02 04:56:14 INFO scheduler.DAGScheduler: Got job 1 (parquet at <console>:33) with 3 output partitions
16/04/02 04:56:14 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (parquet at <console>:33)
16/04/02 04:56:14 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/02 04:56:14 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/02 04:56:14 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[14] at