[Spark][Python][DataFrame][Write]DataFrame写入的例子

最新推荐文章于 2024-03-28 21:48:56 发布

转载最新推荐文章于 2024-03-28 21:48:56 发布 · 451 阅读

文章标签：

#python #大数据 #json

本文展示如何使用Spark将JSON文件读取为DataFrame，并将其写入分区的Parquet表中。通过具体步骤和代码示例，读者可以了解整个流程及其中涉及的关键技术点。

[Spark][Python][DataFrame][Write]DataFrame写入的例子

$ hdfs dfs -cat people.json

{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}

$pyspark

sqlContext = HiveContext(sc)

peopleDF = sqlContext.read.json("people.json")

peopleDF.write.format("parquet").mode("append").partitionBy("age").saveAsTable("people")

17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.5 KB, free 338.2 KB)
17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.4 KB, free 359.6 KB)
17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:59616 (size: 21.4 KB, free: 208.8 MB)
17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 2 from saveAsTable at NativeMethodAccessorImpl.java:-2
17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 251.1 KB, free 610.7 KB)
17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.6 KB, free 632.4 KB)
17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:59616 (size: 21.6 KB, free: 208.7 MB)
17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 3 from saveAsTable at NativeMethodAccessorImpl.java:-2
17/10/07 00:58:19 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/10/07 00:58:19 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/10/07 00:58:19 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/07 00:58:19 INFO spark.SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:-2
17/10/07 00:58:19 INFO scheduler.DAGScheduler: Got job 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/07 00:58:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2)
17/10/07 00:58:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/07 00:58:19 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/07 00:58:19 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/07 00:58:19 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 72.7 KB, free 705.0 KB)
17/10/07 00:58:20 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 26.4 KB, free 731.4 KB)
17/10/07 00:58:20 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:59616 (size: 26.4 KB, free: 208.7 MB)
17/10/07 00:58:20 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
17/10/07 00:58:20 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2)
17/10/07 00:58:20 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/10/07 00:58:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/07 00:58:20 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
17/10/07 00:58:20 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 314.888218 ms
17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/10/07 00:58:20 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 46.978197 ms
17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 64.665839 ms
17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 94.259071 ms
17/10/07 00:58:21 INFO codec.CodecConfig: Compression: GZIP
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Dictionary is on
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Validation is off
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
17/10/07 00:58:21 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "pcode",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "pcoe",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
optional binary name (UTF8);
optional binary pcode (UTF8);
optional binary pcoe (UTF8);
}

17/10/07 00:58:21 INFO compress.CodecPool: Got brand-new compressor [.gz]
17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Maximum partitions reached, falling back on sorting.
17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 34.281133 ms
17/10/07 00:58:21 INFO codegen.GenerateOrdering: Code generated in 85.573905 ms
17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Sorting complete. Writing out partition files one at a time.
17/10/07 00:58:21 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 54
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-hadoop-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-pig-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-format-2.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.7.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 80B for [name] BINARY: 2 values, 26B raw, 43B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 73B for [pcode] BINARY: 2 values, 24B raw, 38B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 2 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "pcode",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "pcoe",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
optional binary name (UTF8);
optional binary pcode (UTF8);
optional binary pcoe (UTF8);
}

17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 13
17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [name] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcode] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
17/10/07 00:58:22 INFO output.FileOutputCommitter: Saved output of task 'attempt_201710070058_0001_m_000000_0' to hdfs://localhost:8020/user/hive/warehouse/people/_temporary/0/task_201710070058_0001_m_000000
17/10/07 00:58:22 INFO mapred.SparkHadoopMapRedUtil: attempt_201710070058_0001_m_000000_0: Committed
17/10/07 00:58:22 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 2057 bytes result sent to driver
17/10/07 00:58:22 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) finished in 2.797 s
17/10/07 00:58:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2797 ms on localhost (1/1)
17/10/07 00:58:22 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTable at NativeMethodAccessorImpl.java:-2, took 3.236619 s
17/10/07 00:58:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/10/07 00:58:23 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
17/10/07 00:58:23 INFO datasources.DynamicPartitionWriterContainer: Job job_201710070058_0000 committed.
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
17/10/07 00:58:24 WARN hive.HiveContext$$anon$2: Persisting partitioned data source relation `people` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s):
hdfs://localhost:8020/user/hive/warehouse/people

[training@localhost ~]$ hive

hive>
> show tables like 'people';
OK
people
Time taken: 5.046 seconds, Fetched: 1 row(s)
hive>

sqlContext =HiveContext(sc)
newPeopleDF = sqlContext.read.table("people")

newPeopleDF.limit(5).show()

+-------+-----+-----+----+
| name|pcode| pcoe| age|
+-------+-----+-----+----+
|Brayden|94304| null| 30|
| Diana| null| null| 46|
| Carla| null|10036| 19|
| Alice|94304| null|null|
|Etienne|94104| null|null|
+-------+-----+-----+----+

可以看到，确实把一个从jason 读取得到的 DataFrame，写入了parquet 格式的表，表名为 people

然后，通过再一次地通过 HiveContext 来读取此表，得到并显示了它的数据。