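Sometimes we want to convert a Spark DataFrame into a pandas DataFrame. If the data is still an RDD or a plain list of rows, it first has to be turned into a Spark DataFrame. Here is one way to do that: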
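from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# In the pyspark shell `spark` already exists; in a script, create it first.
spark = SparkSession.builder.getOrCreate()
# Explicit schema: column names and types.
schema = StructType([StructField('name', StringType()), StructField('age', IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()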
Output:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+-------+---+
|   name|age|
+-------+---+
|Severin| 33|
|   John| 48|
+-------+---+
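Next, calling df.toPandas() converts the Spark DataFrame into a pandas DataFrame. Note that toPandas() collects all rows to the driver, so it is only suitable when the data fits in driver memory.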
references:
https://stackoverflow.com/questions/44948465/creating-a-dataframe-from-row-results-in-infer-schema-issue
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame