11-Aggregating Streams

Reading Data

display(dbutils.fs.ls('/mnt/training/ecommerce/events/events-2020-07-03.json'))


schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

# hourly events logged from the BedBricks website on July 3, 2020
hourlyEventsPath = "/mnt/training/ecommerce/events/events-2020-07-03.json"

df = (spark.readStream
  .schema(schema)                    # streaming reads require an explicit schema
  .option("maxFilesPerTrigger", 1)   # process one file per trigger to simulate a stream of incoming data
  .json(hourlyEventsPath)
)
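
As a quick sanity check, the isStreaming property confirms that this DataFrame is backed by a streaming source:

df.isStreaming  # True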

Cast to timestamp and add a 2-hour watermark

  • Add a createdAt column by dividing event_timestamp by 1 million (it stores microseconds since epoch) and casting to timestamp

  • Add a 2-hour watermark, letting Spark discard aggregation state for windows more than 2 hours behind the latest event time seen

from pyspark.sql.functions import col

eventsDF = (df.withColumn("createdAt", (col("event_timestamp") / 1e6).cast("timestamp"))
  .withWatermark("createdAt", "2 hours")
)
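
To confirm the cast, print the schema; createdAt now appears as a timestamp column:

eventsDF.printSchema()  # ... |-- createdAt: timestamp (nullable = true)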

Aggregate active users by traffic source over 1-hour windows

  • Set the default number of shuffle partitions to the number of cores on your cluster (not required, but runs faster)

  • Group by traffic_source with a 1-hour window based on the createdAt column

  • Aggregate the approximate count of distinct users and alias it as “active_users”

  • Select traffic_source, active_users, and the hour extracted from window.start with alias “hour”

  • Sort by hour

spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)  # sc.defaultParallelism (spark.sparkContext.defaultParallelism) is the total number of cores

from pyspark.sql.functions import approx_count_distinct, hour, window

trafficDF = (eventsDF
  .groupBy("traffic_source", window(col("createdAt"), "1 hour"))
  .agg(approx_count_distinct("user_id").alias("active_users"))
  .select(col("traffic_source"), col("active_users"), hour(col("window.start")).alias("hour"))
  .sort("hour")
)
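
Sorting a streaming DataFrame is only supported after an aggregation and in complete output mode, so display() has to run this query in complete mode under the hood. As a rough sketch of an equivalent query without display(), assuming the in-memory sink (the query name here is arbitrary):

streamingQuery = (trafficDF.writeStream
  .format("memory")                    # keep results in an in-memory table for interactive queries
  .queryName("hourly_traffic_memory")  # also the name of the in-memory table
  .outputMode("complete")              # required for an aggregation followed by a sort
  .start()
)

spark.sql("SELECT * FROM hourly_traffic_memory").show()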

Execute query with display() and plot results

  • Display the results for trafficDF using display()

    • Set the streamName parameter to give the query a name

  • Plot the streaming query results as a bar graph

  • Configure the following plot options:

    • Keys: hour

    • Series groupings: traffic_source

    • Values: active_users

display(trafficDF, streamName="hourly_traffic_p")

This produces a live-updating chart that refreshes as each micro-batch is processed.


Manage streaming query

  • Iterate over the SparkSession’s list of active streams to find the one named “hourly_traffic_p”

  • Stop the streaming query

untilStreamIsReady("hourly_traffic_p")  # course-setup helper (sketched below): wait until the named stream is running

for s in spark.streams.active:   # all active streaming queries on this SparkSession
  if s.name == "hourly_traffic_p":
    s.stop()                     # stop the query and release its resources
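
untilStreamIsReady comes from the course setup notebook rather than from Spark itself. A minimal sketch of such a helper, assuming it simply polls the active streams until a query with the given name has made progress:

import time

def untilStreamIsReady(name):
  # poll the active streams until the named query has processed at least one batch
  while True:
    queries = [q for q in spark.streams.active if q.name == name]
    if queries and len(queries[0].recentProgress) > 0:
      break
    time.sleep(1)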