An RDD is a distributed collection of objects. For example, RDD[Person] takes Person as its type parameter, but the internal structure of the Person class is opaque to the RDD.
A DataFrame is a distributed dataset built on top of RDDs: a distributed collection of Row objects (each Row represents one record) that also carries detailed structural information, i.e. the schema. With a schema, Spark SQL knows exactly which columns the dataset contains, along with each column's name and type.
一、Reading a Hive data source into a DataFrame
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
#Read data from a Hive table. Note: hive.metastore.warehouse.dir must be configured, otherwise Spark reads from the default local path
warehouse_location = abspath('hdfs://192.168.1.107:9000/usr/local/src/utils/apache-hive-1.2.2-bin/warehouse')
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
spark.sql("use traffic")
spark.sql("select * from traffic.monitor_camera_info ").show()
#The query result is a DataFrame
sqlDF = spark.sql("SELECT monitor_id,camera_id FROM traffic.monitor_camera_info WHERE monitor_id = '0006'")
#Items in a DataFrame are of type Row, which lets you access each column by name or position; here the DataFrame is also converted to an RDD
stringsDS = sqlDF.rdd.map(lambda row: "Key: %s, Value: %s" % (row.monitor_id, row.camera_id))
for record in stringsDS.collect():
    print(record)
#Convert an RDD to a DataFrame with named columns
Record = Row("key", "value")
#Leading-zero integer literals such as 0006 are invalid in Python 3; build the zero-padded keys as strings instead so they match monitor_id
recordsDF = spark.createDataFrame([Record(str(i).zfill(4), "val_" + str(i)) for i in range(6, 7)])
#Create a temporary view from the DataFrame in the SparkSession
recordsDF.createOrReplaceTempView("records")
spark.sql("SELECT * FROM records r JOIN traffic.monitor_camera_info s ON r.key = s.monitor_id").show()
二、Two ways to convert an RDD into a DataFrame
①Inferring the schema via reflection. Spark SQL can convert an RDD of Row objects into a DataFrame and infer the data types. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class; the keys of this list define the table's column names, and the types are inferred by sampling the whole dataset, similar to the inference performed on JSON files.
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark import SparkContext,SparkConf
spark = SparkSession.builder.appName("spark Sql").getOrCreate()
sc = spark.sparkContext
lines = sc.textFile("file:/usr/local/src/utils/spark-2.2.0-bin-hadoop2.6/examples/src/main/resources/people.txt")
parts = lines.map(lambda l:l.split(","))
#Create an RDD of Row objects from the original RDD; these rows are used for the inference
people = parts.map(lambda p:Row(name=p[0],age=int(p[1])))
#Convert the RDD to a DataFrame (schema inferred by reflection, no schema is created by hand)
schemaPeople = spark.createDataFrame(people)
#Create a temporary view so the DataFrame can be queried with SQL
schemaPeople.createOrReplaceTempView("people")
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
#Convert back to an RDD
teenNames = teenagers.rdd.map(lambda p:"Name:"+p.name).collect()
for name in teenNames:
    print(name)
②Specifying the schema explicitly
a. Create an RDD of tuples or lists (or Row objects) from the original RDD
b. Create the schema, represented by a StructType, matching the structure of the tuples or lists in the RDD from step a
c. Apply the schema to the RDD via the createDataFrame method provided by SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import *
if __name__ == '__main__':
    spark = SparkSession.builder.appName("spark Sql").config("spark.some.config.option","some-value").getOrCreate()
    sc = spark.sparkContext
    lines = sc.textFile("hdfs:/aaaa.aa")
    #a. Create an RDD of tuples from the original RDD
    parts = lines.map(lambda x: x.split("|"))
    people = parts.map(lambda x: (x[0], x[1].strip()))
    #b. Create the schema, represented by a StructType, matching the structure of the tuples in the RDD from step a
    #Define the column names
    schemaString = "name value"
    fields = [StructField(field_name, StringType()) for field_name in schemaString.split(" ")]
    schema = StructType(fields)
    #c. Apply the schema to the RDD via the createDataFrame method provided by SparkSession
    schemapeople = spark.createDataFrame(people, schema)
    schemapeople.createOrReplaceTempView("people")
    results = spark.sql("SELECT value FROM people")
    results.show()