RDD, DataFrame, and reading Parquet
The difference between RDD and DataFrame
author:https://blog.csdn.net/u012925804/article/details/113641681
http://spark.apache.org/docs/2.4.3/sql-getting-started.html
- The Spark tutorial introduces RDDs right at the start: an RDD has two kinds of operations, transformations and actions, i.e. functions such as map and filter.
- A DataFrame does not expose all of these row-level operations directly (in PySpark there is no DataFrame.map, for example).
- So you sometimes need to convert the DataFrame into an RDD and do that kind of processing there.
Here is a good example:
- .rdd converts the DataFrame into an RDD, so that each row can be processed; the table's column names act as the keys on each Row.
- collect() pulls the distributed results back to a single machine (the driver).
from pyspark.sql import Row
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
# (The result is still an RDD of Row objects.)
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
# Then create a view over the data, i.e. a temporary table that goes away when the session ends.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")
# SQL can be run over DataFrames that have been registered as a table.
# Run the query with SQL
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
# .rdd converts the DataFrame back into an RDD so each row can be processed; the column names act as the keys on each Row.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin
Spark SQL
- There are several ways to interact with Spark SQL: you can write SQL directly, or use the function (DataFrame) API that Spark provides.
- Python does not have support for the Dataset API.
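A minimal sketch of the two styles side by side, assuming a running SparkSession named spark and the sample people.json shipped with Spark:
df = spark.read.json("examples/src/main/resources/people.json")
# Style 1 (SQL): register a temporary view and query it with a SQL string.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 20").show()
# Style 2 (function / DataFrame API): the same query expressed as method calls.
df.filter(df["age"] > 20).select("name").show()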
DataFrame
A DataFrame is a data structure; an RDD and a DataFrame are two different things, even though both hold data.
- A DataFrame is a structure inside Spark, much like the relationship between a CSV file and a DataFrame in pandas.
- A DataFrame is a Dataset organized into named columns.
- Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
- A DataFrame can be constructed from a wide range of sources, for example: structured data files, tables in Hive, external databases, or existing RDDs (see the sketch below).
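A minimal sketch of two of those construction paths, assuming an existing SparkSession named spark and the sample files shipped with Spark:
from pyspark.sql import Row
# From a structured data file (JSON).
df_from_file = spark.read.json("examples/src/main/resources/people.json")
# From an existing RDD of Row objects.
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=25), Row(name="Bob", age=30)])
df_from_rdd = spark.createDataFrame(rdd)
df_from_file.printSchema()
df_from_rdd.show()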
SparkSession
- The entry point to all of Spark's functionality is the SparkSession class.
- SparkSession is used here, which shows that a SparkSession is needed whenever you work with DataFrames.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Accessing columns
In Python, a DataFrame's columns can be accessed by attribute (df.age) or by index (df['age']).
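The df used below is the DataFrame loaded from the sample people.json in the Spark guide (the outputs shown match that file):
df = spark.read.json("examples/src/main/resources/people.json")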
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# | name|(age + 1)|
# +-------+---------+
# |Michael| null|
# | Andy| 31|
# | Justin| 20|
# +-------+---------+
# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+
# Count people by age
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# | 19| 1|
# |null| 1|
# | 30| 1|
# +----+-----+
Running SQL statements
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age| name|
# +----+-------+
# |null|Michael|
# | 30| Andy|
# | 19| Justin|
# +----+-------+
Datasets
In Spark 2, Python still does not support the Dataset API.
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Converting an RDD to a DataFrame
- Build a schema;
- then call spark.createDataFrame;
- the resulting DataFrame then supports SQL.
# Import data types
from pyspark.sql.types import *
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView("people")
# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql("SELECT name FROM people")
results.show()
# +-------+
# | name|
# +-------+
# |Michael|
# | Andy|
# | Justin|
# +-------+
Reading and writing Parquet
- Only after creating a view with parquetFile.createOrReplaceTempView("parquetFile") can the data be queried with SQL.
peopleDF = spark.read.json("examples/src/main/resources/people.json")
# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet("people.parquet")
# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet("people.parquet")
# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.show()
# +------+
# | name|
# +------+
# |Justin|
# +------+
Partition discovery
In a partitioned table, data is usually stored in different directories. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) can automatically discover and infer partition information. For example, we can store the population data used previously in a partitioned table with the following directory structure, where two extra columns, gender and country, serve as partition columns:
path
└── to
└── table
├── gender=male
│ ├── ...
│ │
│ ├── country=US
│ │ └── data.parquet
│ ├── country=CN
│ │ └── data.parquet
│ └── ...
└── gender=female
├── ...
│
├── country=US
│ └── data.parquet
├── country=CN
│ └── data.parquet
└── ...
By passing path/to/table to SparkSession.read.parquet or SparkSession.read.load, Spark SQL automatically extracts the partition information from the paths. The schema of the returned DataFrame then becomes:
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
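A minimal sketch of using such a layout, assuming the path/to/table directory shown above actually exists:
# Reading the table root lets Spark discover gender and country from the directory names.
df = spark.read.parquet("path/to/table")
df.printSchema()
# Filtering on a partition column lets Spark skip the other partition directories.
df.filter(df["country"] == "US").show()
# A layout like this is typically produced by writing with partitionBy, e.g.
# someDF.write.partitionBy("gender", "country").parquet("path/to/table")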
Reading Hive
To use Hive, you must add enableHiveSupport():
from os.path import abspath
# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
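The src table queried in the next section comes from the Spark Hive example; it is created and populated like this (kv1.txt is the sample key/value file shipped with Spark):
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
spark.sql("SELECT * FROM src").show()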
sqlDF.rdd.map
- By converting the DataFrame into an RDD, each row can be processed; the fields are accessed by name, with the column name acting as the key.
# The results of SQL queries are themselves DataFrames and support all normal functions.
sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
# The items in DataFrames are of type Row, which allows you to access each column by ordinal.
stringsDS = sqlDF.rdd.map(lambda row: "Key: %d, Value: %s" % (row.key, row.value))
for record in stringsDS.collect():
    print(record)
# Key: 0, Value: val_0
# Key: 0, Value: val_0
# Key: 0, Value: val_0
# ...
# You can also use DataFrames to create temporary views within a SparkSession.
Record = Row("key", "value")
recordsDF = spark.createDataFrame([Record(i, "val_" + str(i)) for i in range(1, 101)])
recordsDF.createOrReplaceTempView("records")
# Queries can then join DataFrame data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
# +---+------+---+------+
# |key| value|key| value|
# +---+------+---+------+
# | 2| val_2| 2| val_2|
# | 4| val_4| 4| val_4|
# | 5| val_5| 5| val_5|