DataFrame scenario: on-time flight performance
Analyzing airline on-time performance and the causes of flight delays using on-time flight data.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Set the file paths
flightPerfFilePath = './data/flights/departuredelays.csv'
airportFilePath = './data/flights/airport-codes-na.txt'
# Obtain the airports dataset
airports = spark.read.csv(airportFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView('airports')
# Obtain the departure-delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView('FlightPerformance')
# Cache the departure-delays dataset
flightPerf.cache()
DataFrame[date: string, delay: string, distance: string, origin: string, destination: string]
Note: the data is imported using the CSV reader (com.databricks.spark.csv), which works with any specified delimiter (the airports data is tab-separated, while the flight-performance data is comma-separated). Finally, the flight dataset is cached so that subsequent queries will run faster.
Joining flight performance and airports
One of the more common tasks with DataFrames/SQL is to join two different datasets together. With DataFrames, many performance optimizations for these joins are included by default:
# Query the sum of flight delays by city and origin code
spark.sql('''
select a.City, f.origin,
sum(f.delay) as Delays
from FlightPerformance f
join airports a
on a.IATA = f.origin
where a.State = "WA"
group by a.City, f.origin
order by sum(f.delay) desc'''
).show()
+-------+------+--------+
| City|origin| Delays|
+-------+------+--------+
|Seattle| SEA|159086.0|
|Spokane| GEG| 12404.0|
| Pasco| PSC| 949.0|
+-------+------+--------+
Visualizing flight-performance data
Break down the delays by state for the continental US.
Flight = spark.sql('''
select a.State, sum(f.delay) as Delays
from FlightPerformance f
join airports a
on a.IATA = f.origin
where a.Country = 'USA'
group by a.State
''')
Flight.show()
+-----+---------+
|State| Delays|
+-----+---------+
| SC| 80666.0|
| AZ| 401793.0|
| LA| 199136.0|
| MN| 256811.0|
| NJ| 452791.0|
| OR| 109333.0|
| VA| 98016.0|
| null| 397237.0|
| RI| 30760.0|
| WY| 15365.0|
| KY| 61156.0|
| NH| 20474.0|
| MI| 366486.0|
| NV| 474208.0|
| WI| 152311.0|
| ID| 22932.0|
| CA|1891919.0|
| CT| 54662.0|
| NE| 59376.0|
| MT| 19271.0|
+-----+---------+
only showing top 20 rows