Common Python functions in Apache Beam (Part 2): aggregation functions

奋斗的源

Original link: https://beam.apache.org/documentation/transforms/python/elementwise/

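Aggregation functions

Function	Description
CoGroupByKey	Takes several keyed collections of elements and produces a collection where each element consists of a key and all values associated with that key.
CombineGlobally	Transforms to combine elements.
CombinePerKey	Transforms to combine elements for each key.
CombineValues	Transforms to combine keyed iterables.
Count	Counts the number of elements within each aggregation.
Distinct	Produces a collection containing the distinct elements from the input collection.
GroupByKey	Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.
GroupIntoBatches	Batches the input into the desired batch size.
Latest	Gets the element with the latest timestamp.
Max	Gets the element with the maximum value within each aggregation.
Mean	Computes the average within each aggregation.
Min	Gets the element with the minimum value within each aggregation.
Sample	Randomly selects some number of elements from each aggregation.
Sum	Sums all the elements within each aggregation.
Top	Computes the largest element(s) in each aggregation.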

1.CoGroupByKey

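Purpose: aggregates all input elements by their key and allows downstream processing to consume all values associated with the key. GroupByKey performs this operation over a single input collection, and therefore over a single type of input value, whereas CoGroupByKey operates over multiple input collections. As a result, the result for each key is a tuple of the values associated with that key in each input collection.

In the following example, we create a pipeline with two PCollections of produce, one with icons and one with durations, both sharing a common key: the produce name. Then, we apply CoGroupByKey to join both PCollections using their keys.

CoGroupByKey expects a dictionary of named keyed PCollections and produces elements joined by their keys. The value of each output element is a dictionary whose names correspond to the input dictionary, each holding a list of all the values found for that key.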

import apache_beam as beam

with beam.Pipeline() as pipeline:
  icon_pairs = pipeline | 'Create icons' >> beam.Create([
      ('Apple', '🍎'),
      ('Apple', '🍏'),
      ('Eggplant', '🍆'),
      ('Tomato', '🍅'),
  ])

  duration_pairs = pipeline | 'Create durations' >> beam.Create([
      ('Apple', 'perennial'),
      ('Carrot', 'biennial'),
      ('Tomato', 'perennial'),
      ('Tomato', 'annual'),
  ])

  plants = (({
      'icons': icon_pairs, 'durations': duration_pairs
  })
            | 'Merge' >> beam.CoGroupByKey()
            | beam.Map(print))
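
To make the shape of the joined output concrete, here is a plain-Python sketch that does not use Beam at all. `co_group_by_key` is a hypothetical helper written only for illustration, and the emoji icons are replaced by ASCII labels. Note that a key missing from one input still appears in the result, with an empty list for that input.

```python
from collections import defaultdict

def co_group_by_key(named_inputs):
    """Hypothetical stand-in for the join: not a Beam API."""
    names = list(named_inputs)
    # Every key starts with an empty list for each named input.
    result = defaultdict(lambda: {name: [] for name in names})
    for name, pairs in named_inputs.items():
        for key, value in pairs:
            result[key][name].append(value)
    return dict(result)

icon_pairs = [
    ('Apple', 'red apple'), ('Apple', 'green apple'),
    ('Eggplant', 'eggplant'), ('Tomato', 'tomato'),
]
duration_pairs = [
    ('Apple', 'perennial'), ('Carrot', 'biennial'),
    ('Tomato', 'perennial'), ('Tomato', 'annual'),
]

plants = co_group_by_key({'icons': icon_pairs, 'durations': duration_pairs})
for key, values in plants.items():
    print((key, values))
```

In a real pipeline the order of the output elements, and of the values inside each list, is not something to rely on.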

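2.CombineGlobally

Purpose: combines all elements in a collection.

In the following examples, we create a pipeline with a PCollection of produce. Then, we apply CombineGlobally in multiple ways to combine all the elements in the PCollection.

CombineGlobally accepts a function that takes an iterable of elements as input and combines them to return a single element.

(1) Using a function
We define a function get_common_items that takes an iterable of sets as input and computes the intersection (the common items) of those sets.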

import apache_beam as beam

def get_common_items(sets):
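  # set.intersection() takes multiple sets as separate arguments.
  # We unpack the `sets` list into multiple arguments with the * operator.
  # The combine transform might give us an empty list of `sets`,
  # so we use a list with an empty set as a default value.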
  return set.intersection(*(sets or [set()]))

with beam.Pipeline() as pipeline:
  common_items = (
      pipeline
      | 'Create produce' >> beam.Create([
          {'🍓', '🥕', '🍌', '🍅', '🌶️'},
          {'🍇', '🥕', '🥝', '🍅', '🥔'},
          {'🍉', '🥕', '🍆', '🍅', '🍍'},
          {'🥑', '🥕', '🌽', '🍅', '🥥'},
      ])
      | 'Get common items' >> beam.CombineGlobally(get_common_items)
      | beam.Map(print))
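
The combining function can be checked on its own, outside any pipeline. The sketch below calls the same expression directly on plain lists (ASCII names stand in for the emoji) and also exercises the empty-input case that the `or [set()]` default guards against.

```python
def get_common_items(sets):
    # Same expression as in the pipeline example: fall back to a single
    # empty set so set.intersection() always gets at least one argument.
    return set.intersection(*(sets or [set()]))

baskets = [
    {'strawberry', 'carrot', 'banana', 'tomato', 'pepper'},
    {'grape', 'carrot', 'kiwi', 'tomato', 'potato'},
    {'melon', 'carrot', 'eggplant', 'tomato', 'pineapple'},
    {'avocado', 'carrot', 'corn', 'tomato', 'coconut'},
]

common = get_common_items(baskets)
empty_case = get_common_items([])

print(sorted(common))
print(empty_case)
```

Only the items present in every basket survive, and an empty input yields an empty set instead of an error.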

(2) Using a lambda function

import apache_beam as beam

with beam.Pipeline() as pipeline:
  common_items = (
      pipeline
      | 'Create produce' >> beam.Create([
          {'🍓', '🥕', '🍌', '🍅', '🌶️'},
          {'🍇', '🥕', '🥝', '🍅', '🥔'},
          {'🍉', '🥕', '🍆', '🍅', '🍍'},
          {'🥑', '🥕', '🌽', '🍅', '🥥'},
      ])
      | 'Get common items' >>
      beam.CombineGlobally(lambda sets: set.intersection(*(sets or [set()])))
      | beam.Map(print))

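(3) Using multiple arguments
You can pass functions with multiple arguments to CombineGlobally. The extra arguments are passed to the function as additional positional or keyword arguments.

In this example, the lambda function takes sets and exclude as arguments.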

import apache_beam as beam

with beam.Pipeline() as pipeline:
  common_items_with_exceptions = (
      pipeline
      | 'Create produce' >> beam.Create([
          {'🍓', '🥕', '🍌', '🍅', '🌶️'},
          {'🍇', '🥕', '🥝', '🍅', '🥔'},
          {'🍉', '🥕', '🍆', '🍅', '🍍'},
          {'🥑', '🥕', '🌽', '🍅', '🥥'},
      ])
      | 'Get common items with exceptions' >> beam.CombineGlobally(
          lambda sets, exclude: \
              set.intersection(*(sets or [set()])) - exclude,
          exclude={'🥕'})
      | beam.Map(print)
  )

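(4) Using a CombineFn
The more general, and most flexible, way to combine elements is with a class that inherits from CombineFn.

CombineFn.create_accumulator(): creates an empty accumulator. For example, the empty accumulator for a sum is 0, while the empty accumulator for a product (multiplication) is 1.

CombineFn.add_input(): called once per element. It takes an accumulator and an input element, combines them, and returns the updated accumulator.

CombineFn.merge_accumulators(): multiple accumulators may be processed in parallel, so this method merges them into a single accumulator.

CombineFn.extract_output(): allows additional calculations before the result is extracted.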

import apache_beam as beam

class PercentagesFn(beam.CombineFn):
  def create_accumulator(self):
    return {}

  def add_input(self, accumulator, input):
    # accumulator == {}
    # input == '🥕'
    if input not in accumulator:
      accumulator[input] = 0  # {'🥕': 0}
    accumulator[input] += 1  # {'🥕': 1}
    return accumulator

  def merge_accumulators(self, accumulators):
    # accumulators == [
    #     {'🥕': 1, '🍅': 2},
    #     {'🥕': 1, '🍅': 1, '🍆': 1},
    #     {'🥕': 1, '🍅': 3},
    # ]
    merged = {}
    for accum in accumulators:
      for item, count in accum.items():
        if item not in merged:
          merged[item] = 0
        merged[item] += count
    # merged == {'🥕': 3, '🍅': 6, '🍆': 1}
    return merged

  def extract_output(self, accumulator):
    # accumulator == {'🥕': 3, '🍅': 6, '🍆': 1}
    total = sum(accumulator.values())  # 10
    percentages = {item: count / total for item, count in accumulator.items()}
    # percentages == {'🥕': 0.3, '🍅': 0.6, '🍆': 0.1}
    return percentages

with beam.Pipeline() as pipeline:
  percentages = (
      pipeline
      | 'Create produce' >> beam.Create(
          ['🥕', '🍅', '🍅', '🥕', '🍆', '🍅', '🍅', '🍅', '🥕', '🍅'])
      | 'Get percentages' >> beam.CombineGlobally(PercentagesFn())
      | beam.Map(print))
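
To see how the four methods fit together, here is a plain-Python sketch with no Beam dependency. `PercentagesSketch` mirrors the class above, and `run_combine` is a hypothetical driver that imitates one possible execution: one accumulator per bundle, then a merge, then the final extraction. A real runner chooses the bundling itself, so the three-way split below is only an assumption for illustration.

```python
class PercentagesSketch:
    """Plain class exposing the four CombineFn-style methods."""

    def create_accumulator(self):
        return {}

    def add_input(self, accumulator, element):
        accumulator[element] = accumulator.get(element, 0) + 1
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = {}
        for accumulator in accumulators:
            for item, count in accumulator.items():
                merged[item] = merged.get(item, 0) + count
        return merged

    def extract_output(self, accumulator):
        total = sum(accumulator.values())
        return {item: count / total for item, count in accumulator.items()}


def run_combine(combine_fn, bundles):
    """Hypothetical driver: accumulate each bundle, merge, then extract."""
    partials = []
    for bundle in bundles:
        accumulator = combine_fn.create_accumulator()
        for element in bundle:
            accumulator = combine_fn.add_input(accumulator, element)
        partials.append(accumulator)
    merged = combine_fn.merge_accumulators(partials)
    return combine_fn.extract_output(merged)


# The ten elements of the example, split arbitrarily into three bundles.
bundles = [
    ['carrot', 'tomato', 'tomato'],
    ['carrot', 'eggplant', 'tomato'],
    ['tomato', 'tomato', 'carrot', 'tomato'],
]
percentages = run_combine(PercentagesSketch(), bundles)
print(percentages)
```

The result does not depend on how the elements are split, which is the property that lets a combine run in parallel.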

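3.CombinePerKey

Purpose: combines all elements for each key in a collection.

In the following examples, we create a pipeline with a PCollection of produce. Then, we apply CombinePerKey in multiple ways to combine all the elements in the PCollection.

CombinePerKey accepts a function that takes a list of values as input and combines them for each key.

(1) Using a predefined function
We use the function sum, which takes an iterable of numbers and adds them together.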

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Sum' >> beam.CombinePerKey(sum)
      | beam.Map(print))

(2) Using a function
We define a function saturated_sum that takes an iterable of numbers and adds them together, up to a predefined maximum.

import apache_beam as beam

def saturated_sum(values):
  max_value = 8
  return min(sum(values), max_value)

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Saturated sum' >> beam.CombinePerKey(saturated_sum)
      | beam.Map(print))

(3) Using a lambda function

import apache_beam as beam

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Saturated sum' >>
      beam.CombinePerKey(lambda values: min(sum(values), 8))
      | beam.Map(print))

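(4) Using multiple arguments
You can pass functions with multiple arguments to CombinePerKey. The extra arguments are passed to the function as additional positional or keyword arguments.

In this example, the lambda function takes values and max_value as arguments.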

import apache_beam as beam

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Saturated sum' >> beam.CombinePerKey(
          lambda values, max_value: min(sum(values), max_value), max_value=8)
      | beam.Map(print))
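
The per-key variants above all follow the same pattern: gather the values of each key, then hand them to the combining function together with any extra arguments. A plain-Python sketch of that pattern follows; `combine_per_key` is a hypothetical helper, not a Beam API, and ASCII names stand in for the emoji keys.

```python
from collections import defaultdict

def combine_per_key(pairs, combiner, **kwargs):
    """Hypothetical stand-in: group values by key, then combine each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Extra keyword arguments are forwarded to the combiner unchanged.
    return {key: combiner(values, **kwargs) for key, values in grouped.items()}

plant_counts = [
    ('carrot', 3), ('carrot', 2), ('eggplant', 1),
    ('tomato', 4), ('tomato', 5), ('tomato', 3),
]

totals = combine_per_key(plant_counts, sum)
saturated = combine_per_key(
    plant_counts,
    lambda values, max_value: min(sum(values), max_value),
    max_value=8)

print(totals)
print(saturated)
```

Only the tomato total (12) exceeds the cap, so it is the only value that changes in the saturated result.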

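(5) Using a CombineFn
The more general, and most flexible, way to combine elements is with a class that inherits from CombineFn.

CombineFn.create_accumulator(): creates an empty accumulator. For example, the empty accumulator for a sum is 0, while the empty accumulator for a product (multiplication) is 1.

CombineFn.add_input(): called once per element. It takes an accumulator and an input element, combines them, and returns the updated accumulator.

CombineFn.merge_accumulators(): multiple accumulators may be processed in parallel, so this method merges them into a single accumulator.

CombineFn.extract_output(): allows additional calculations before the result is extracted.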

import apache_beam as beam

class AverageFn(beam.CombineFn):
  def create_accumulator(self):
    sum = 0.0
    count = 0
    accumulator = sum, count
    return accumulator

  def add_input(self, accumulator, input):
    sum, count = accumulator
    return sum + input, count + 1

  def merge_accumulators(self, accumulators):
    # accumulators = [(sum1, count1), (sum2, count2), (sum3, count3), ...]
    sums, counts = zip(*accumulators)
    # sums = [sum1, sum2, sum3, ...]
    # counts = [count1, count2, count3, ...]
    return sum(sums), sum(counts)

  def extract_output(self, accumulator):
    sum, count = accumulator
    if count == 0:
      return float('NaN')
    return sum / count

with beam.Pipeline() as pipeline:
  average = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Average' >> beam.CombinePerKey(AverageFn())
      | beam.Map(print))
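
The accumulator in AverageFn is a (sum, count) pair rather than a running average, and the small sketch below suggests why. It assumes, purely for illustration, that the three tomato values were split unevenly across two workers: averaging the per-worker averages then gives the wrong answer, while merging (sum, count) pairs does not.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical split of the tomato values (4, 5, 3) across two workers.
bundle_a = [4, 5]
bundle_b = [3]

# Naive approach: average the per-bundle averages.
# This is wrong whenever the bundles have different sizes.
mean_of_means = mean([mean(bundle_a), mean(bundle_b)])

# Accumulator approach: keep (sum, count) per bundle and merge those.
partials = [(sum(bundle_a), len(bundle_a)), (sum(bundle_b), len(bundle_b))]
total, count = (sum(column) for column in zip(*partials))
merged_mean = total / count

print(mean_of_means)  # 3.75
print(merged_mean)    # 4.0
```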

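4.CombineValues

Purpose: combines an iterable of values in a keyed collection of elements.

In the following examples, we create a pipeline with a PCollection of produce. Then, we apply CombineValues in multiple ways to combine the values of each key.

CombineValues accepts a function that takes an iterable of elements as input and combines them to return a single element. CombineValues expects a keyed PCollection of elements, where each value is an iterable of elements to be combined.

(1) Using a predefined function
We use the function sum, which takes an iterable of numbers and adds them together.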

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total = (
      pipeline
      | 'Create produce counts' >> beam.Create([
          ('🥕', [3, 2]),
          ('🍆', [1]),
          ('🍅', [4, 5, 3]),
      ])
      | 'Sum' >> beam.CombineValues(sum)
      | beam.Map(print))
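
Unlike CombinePerKey, the input here is already grouped: each element is a key paired with an iterable. A plain-Python sketch of that behaviour, with `combine_values` as a hypothetical helper and ASCII names in place of the emoji keys:

```python
def combine_values(keyed_iterables, combiner):
    """Hypothetical stand-in: apply the combiner to each key's iterable."""
    return [(key, combiner(values)) for key, values in keyed_iterables]

produce_counts = [
    ('carrot', [3, 2]),
    ('eggplant', [1]),
    ('tomato', [4, 5, 3]),
]

totals = combine_values(produce_counts, sum)
print(totals)
```

Each key is handled independently, so the number of output elements equals the number of input elements.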

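(2) Using a function
We want the sum to be capped at a maximum value, so we use saturated arithmetic.

We define a function saturated_sum that takes an iterable of numbers and adds them together, up to a predefined maximum.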

import apache_beam as beam

def saturated_sum(values):
  max_value = 8
  return min(sum(values), max_value)

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', [3, 2]),
          ('🍆', [1]),
          ('🍅', [4, 5, 3]),
      ])
      | 'Saturated sum' >> beam.CombineValues(saturated_sum)
      | beam.Map(print))

(3) Using a lambda function

import apache_beam as beam

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', [3, 2]),
          ('🍆', [1]),
          ('🍅', [4, 5, 3]),
      ])
      | 'Saturated sum' >>
      beam.CombineValues(lambda values: min(sum(values), 8))
      | beam.Map(print))

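(4) Using multiple arguments
You can pass functions with multiple arguments to CombineValues. The extra arguments are passed to the function as additional positional or keyword arguments.

In this example, the lambda function takes values and max_value as arguments.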

import apache_beam as beam

with beam.Pipeline() as pipeline:
  saturated_total = (
      pipeline
      | 'Create plant counts' >> beam.Create([
          ('🥕', [3, 2]),
          ('🍆', [1]),
          ('🍅', [4, 5, 3]),
      ])
      | 'Saturated sum' >> beam.CombineValues(
          lambda values, max_value: min(sum(values), max_value), max_value=8)
      | beam.Map(print))

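(5) Using a CombineFn
The more general, and most flexible, way to combine elements is with a class that inherits from CombineFn.

CombineFn.create_accumulator(): creates an empty accumulator. For example, the empty accumulator for a sum is 0, while the empty accumulator for a product (multiplication) is 1.

CombineFn.add_input(): called once per element. It takes an accumulator and an input element, combines them, and returns the updated accumulator.

CombineFn.merge_accumulators(): multiple accumulators may be processed in parallel, so this method merges them into a single accumulator.

CombineFn.extract_output(): allows additional calculations before the result is extracted.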

import apache_beam as beam

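# Note: despite its name, this CombineFn computes each item's share of the
# total (percentages), not an average.
class AverageFn(beam.CombineFn):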
  def create_accumulator(self):
    return {}

  def add_input(self, accumulator, input):
    # accumulator == {}
    # input == '🥕'
    if input not in accumulator:
      accumulator[input] = 0  # {'🥕': 0}
    accumulator[input] += 1  # {'🥕': 1}
    return accumulator

  def merge_accumulators(self, accumulators):
    # accumulators == [
    #     {'🥕': 1, '🍅': 1},
    #     {'🥕': 1, '🍅': 1, '🍆': 1},
    # ]
    merged = {}
    for accum in accumulators:
      for item, count in accum.items():
        if item not in merged:
          merged[item] = 0
        merged[item] += count
    # merged == {'🥕': 2, '🍅': 2, '🍆': 1}
    return merged

  def extract_output(self, accumulator):
    # accumulator == {'🥕': 2, '🍅': 2, '🍆': 1}
    total = sum(accumulator.values())  # 5
    percentages = {item: count / total for item, count in accumulator.items()}
    # percentages == {'🥕': 0.4, '🍅': 0.4, '🍆': 0.2}
    return percentages

with beam.Pipeline() as pipeline:
  percentages_per_season = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('spring', ['🥕', '🍅', '🥕', '🍅', '🍆']),
          ('summer', ['🥕', '🍅', '🌽', '🍅', '🍅']),
          ('fall', ['🥕', '🥕', '🍅', '🍅']),
          ('winter', ['🍆', '🍆']),
      ])
      | 'Average' >> beam.CombineValues(AverageFn())
      | beam.Map(print))

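5.Count

Purpose: counts the number of elements within each aggregation.

In the following examples, we create pipelines with PCollections of produce. Then, we apply Count in different ways to get the number of elements.

(1) Counting all elements in a PCollection
We use Count.Globally() to count all elements in a PCollection, including duplicates.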

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total_elements = (
      pipeline
      | 'Create plants' >> beam.Create(
          ['🍓', '🥕', '🥕', '🥕', '🍆', '🍆', '🍅', '🍅', '🍅', '🌽'])
      | 'Count all elements' >> beam.combiners.Count.Globally()
      | beam.Map(print))

(2) Counting elements for each key
We use Count.PerKey() to count the elements for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total_elements_per_keys = (
      pipeline
      | 'Create plants' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('summer', '🥕'),
          ('fall', '🥕'),
          ('spring', '🍆'),
          ('winter', '🍆'),
          ('spring', '🍅'),
          ('summer', '🍅'),
          ('fall', '🍅'),
          ('summer', '🌽'),
      ])
      | 'Count elements per key' >> beam.combiners.Count.PerKey()
      | beam.Map(print))

(3) Counting all unique elements
We use Count.PerElement() to count the occurrences of each unique element in a PCollection.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total_unique_elements = (
      pipeline
      | 'Create produce' >> beam.Create(
          ['🍓', '🥕', '🥕', '🥕', '🍆', '🍆', '🍅', '🍅', '🍅', '🌽'])
      | 'Count unique elements' >> beam.combiners.Count.PerElement()
      | beam.Map(print))
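
The three counting modes differ only in what they treat as an aggregation. The plain-Python sketch below uses `collections.Counter` on the same data (with ASCII names in place of the emoji) to show what each mode would count; it is an illustration of the semantics, not of how Beam computes them.

```python
from collections import Counter

plants = ['strawberry', 'carrot', 'carrot', 'carrot', 'eggplant',
          'eggplant', 'tomato', 'tomato', 'tomato', 'corn']

plants_by_season = [
    ('spring', 'strawberry'), ('spring', 'carrot'), ('summer', 'carrot'),
    ('fall', 'carrot'), ('spring', 'eggplant'), ('winter', 'eggplant'),
    ('spring', 'tomato'), ('summer', 'tomato'), ('fall', 'tomato'),
    ('summer', 'corn'),
]

# Like Count.Globally(): one number for the whole collection.
total = len(plants)

# Like Count.PerKey(): one number per key, ignoring the values.
per_key = Counter(season for season, _ in plants_by_season)

# Like Count.PerElement(): one number per distinct element.
per_element = Counter(plants)

print(total)
print(dict(per_key))
print(dict(per_element))
```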

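6.Distinct

Purpose: produces a collection containing the distinct elements of the input collection.

In the following example, we create a pipeline with a PCollection of produce.

We use Distinct to remove duplicate elements, which outputs a PCollection of all the unique elements.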

import apache_beam as beam

with beam.Pipeline() as pipeline:
  unique_elements = (
      pipeline
      | 'Create produce' >> beam.Create([
          '🥕',
          '🥕',
          '🍆',
          '🍅',
          '🍅',
          '🍅',
      ])
      | 'Deduplicate elements' >> beam.Distinct()
      | beam.Map(print))

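7.GroupByKey

Purpose: takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.

In the following example, we create a pipeline with a PCollection of produce keyed by season.
We use GroupByKey to group all the produce for each season.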

import apache_beam as beam

with beam.Pipeline() as pipeline:
  produce_counts = (
      pipeline
      | 'Create produce counts' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('spring', '🍆'),
          ('spring', '🍅'),
          ('summer', '🥕'),
          ('summer', '🍅'),
          ('summer', '🌽'),
          ('fall', '🥕'),
          ('fall', '🍅'),
          ('winter', '🍆'),
      ])
      | 'Group counts per produce' >> beam.GroupByKey()
      | beam.Map(print))

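8.GroupBy

Purpose: takes a collection of elements and produces a collection grouped by properties of those elements.
Unlike GroupByKey, the key is created dynamically from the elements themselves.

In the following example, we create a pipeline with a PCollection of fruits.
We use GroupBy to group all fruits by the first letter of their names.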

import apache_beam as beam

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana'])
      | beam.GroupBy(lambda s: s[0])
      | beam.Map(print))

If needed, we can group by a composite key made up of multiple properties.

import apache_beam as beam
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana'])
      | beam.GroupBy(letter=lambda s: s[0], is_berry=lambda s: 'berry' in s)
      | beam.Map(print))
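
A composite key can be thought of as one value per named key function, collected into a tuple. The plain-Python sketch below derives such keys for the same fruits and groups on them; the variable names are illustrative, and the exact output type that Beam prints for the key is not reproduced here.

```python
from collections import defaultdict

fruits = ['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana']

# One key function per named field, as in the GroupBy call above.
key_fns = {
    'letter': lambda s: s[0],
    'is_berry': lambda s: 'berry' in s,
}

grouped = defaultdict(list)
for fruit in fruits:
    # The composite key is the tuple of all key-function results.
    key = tuple(fn(fruit) for fn in key_fns.values())
    grouped[key].append(fruit)

for key, members in grouped.items():
    print(dict(zip(key_fns, key)), members)
```

Here 'banana' lands in its own group: it shares the letter 'b' with two berries but differs on is_berry.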

When grouping by an attribute, a string can be passed to GroupBy in place of a callable expression. For example, suppose we have the following data:

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]

We can then do the following:

import apache_beam as beam

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
with beam.Pipeline() as p:
    grouped = p | beam.Create(GROCERY_LIST) | beam.GroupBy('recipe') | beam.Map(print)

It is also possible to mix and match attributes and expressions.

import apache_beam as beam

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
with beam.Pipeline() as p:
  grouped = (
      p | beam.Create(GROCERY_LIST)
        | beam.GroupBy('recipe', is_berry=lambda x: 'berry' in x.fruit)
        | beam.Map(print))

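Aggregation
Grouping is often used together with aggregation, and the aggregate_field method of the GroupBy transform makes this easy. The method takes three parameters: the field (or expression) to aggregate, the CombineFn (or associative callable) with which to aggregate it, and finally the name of the field in which to store the result. For example, suppose we want to compute the quantity of each fruit to buy.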

import apache_beam as beam

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('fruit')
          .aggregate_field('quantity', sum, 'total_quantity')
      | beam.Map(print))

As with the arguments of GroupBy, multiple fields and expressions can be aggregated.

import apache_beam as beam

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('recipe')
          .aggregate_field('quantity', sum, 'total_quantity')
          .aggregate_field(lambda x: x.quantity * x.unit_price, sum, 'price')
      | beam.Map(print))
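
The two aggregations above can be checked by hand. The plain-Python sketch below recomputes them per recipe, using a `namedtuple` called `Item` as a hypothetical stand-in for `beam.Row`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for beam.Row with the same four fields.
Item = namedtuple('Item', ['recipe', 'fruit', 'quantity', 'unit_price'])

GROCERY_LIST = [
    Item('pie', 'strawberry', 3, 1.50),
    Item('pie', 'raspberry', 1, 3.50),
    Item('pie', 'blackberry', 1, 4.00),
    Item('pie', 'blueberry', 1, 2.00),
    Item('muffin', 'blueberry', 2, 2.00),
    Item('muffin', 'banana', 3, 1.00),
]

# Group the rows by recipe.
groups = defaultdict(list)
for item in GROCERY_LIST:
    groups[item.recipe].append(item)

# One output row per recipe, with one field per aggregation.
summary = {
    recipe: {
        'total_quantity': sum(item.quantity for item in items),
        'price': sum(item.quantity * item.unit_price for item in items),
    }
    for recipe, items in groups.items()
}

print(summary)
```

The pie needs 6 pieces of fruit costing 14.0 in total, and the muffins need 5 pieces costing 7.0.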

Of course, the same field can also be aggregated multiple times. This example also demonstrates a global grouping, since the grouping key is empty.

import apache_beam as beam
from apache_beam.transforms.combiners import MeanCombineFn

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy()
          .aggregate_field('unit_price', min, 'min_price')
          .aggregate_field('unit_price', MeanCombineFn(), 'mean_price')
          .aggregate_field('unit_price', max, 'max_price')
      | beam.Map(print))

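9.GroupIntoBatches

Purpose: batches the input into the desired batch size.

In the following example, we create a pipeline with a PCollection of produce keyed by season.

We use GroupIntoBatches to get fixed-size batches for every key, which outputs a list of elements for each key.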

import apache_beam as beam

with beam.Pipeline() as pipeline:
  batches_with_keys = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('spring', '🍆'),
          ('spring', '🍅'),
          ('summer', '🥕'),
          ('summer', '🍅'),
          ('summer', '🌽'),
          ('fall', '🥕'),
          ('fall', '🍅'),
          ('winter', '🍆'),
      ])
      | 'Group into batches' >> beam.GroupIntoBatches(3)
      | beam.Map(print))
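
The effect of a batch size of 3 is easiest to see on the keys that hold more or fewer than three elements. The plain-Python sketch below chunks each key's values in arrival order; `group_into_batches` is a hypothetical helper, and in a real runner which elements share a batch is not guaranteed to match this.

```python
from collections import defaultdict

def group_into_batches(pairs, batch_size):
    """Hypothetical stand-in: per-key lists of at most batch_size values."""
    per_key = defaultdict(list)
    for key, value in pairs:
        per_key[key].append(value)
    batches = []
    for key, values in per_key.items():
        # Slice each key's values into consecutive chunks.
        for start in range(0, len(values), batch_size):
            batches.append((key, values[start:start + batch_size]))
    return batches

produce = [
    ('spring', 'strawberry'), ('spring', 'carrot'), ('spring', 'eggplant'),
    ('spring', 'tomato'), ('summer', 'carrot'), ('summer', 'tomato'),
    ('summer', 'corn'), ('fall', 'carrot'), ('fall', 'tomato'),
    ('winter', 'eggplant'),
]

batches = group_into_batches(produce, 3)
for batch in batches:
    print(batch)
```

Spring has four elements, so it produces one full batch and one leftover batch of a single element.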

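10.Latest

Purpose: gets the element with the latest timestamp.

In the following examples, we create a pipeline with a PCollection of produce, timestamped with the harvest date.

We use Latest to get the element with the latest timestamp from the PCollection.

(1) Latest element globally
We use Latest.Globally() to get the element with the latest timestamp in the entire PCollection.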

import apache_beam as beam
import time

def to_unix_time(time_str, format='%Y-%m-%d %H:%M:%S'):
  return time.mktime(time.strptime(time_str, format))

with beam.Pipeline() as pipeline:
  latest_element = (
      pipeline
      | 'Create crops' >> beam.Create([
          {
              'item': '🥬', 'harvest': '2020-02-24 00:00:00'
          },
          {
              'item': '🍓', 'harvest': '2020-06-16 00:00:00'
          },
          {
              'item': '🥕', 'harvest': '2020-07-17 00:00:00'
          },
          {
              'item': '🍆', 'harvest': '2020-10-26 00:00:00'
          },
          {
              'item': '🍅', 'harvest': '2020-10-01 00:00:00'
          },
      ])
      | 'With timestamps' >> beam.Map(
          lambda crop: beam.window.TimestampedValue(
              crop['item'], to_unix_time(crop['harvest'])))
      | 'Get latest element' >> beam.combiners.Latest.Globally()
      | beam.Map(print))
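
One detail of the helper above: `time.mktime()` interprets the parsed time in the machine's local timezone, so the resulting timestamps can differ between machines. If UTC is what you want, a variant based on `calendar.timegm()` is one option. The sketch below (plain Python, ASCII names in place of the emoji, `to_unix_time_utc` being a hypothetical name) also picks the latest crop by that timestamp.

```python
import calendar
import time

def to_unix_time_utc(time_str, fmt='%Y-%m-%d %H:%M:%S'):
    # calendar.timegm() treats the struct_time as UTC, unlike time.mktime().
    return calendar.timegm(time.strptime(time_str, fmt))

crops = [
    {'item': 'leafy green', 'harvest': '2020-02-24 00:00:00'},
    {'item': 'strawberry', 'harvest': '2020-06-16 00:00:00'},
    {'item': 'carrot', 'harvest': '2020-07-17 00:00:00'},
    {'item': 'eggplant', 'harvest': '2020-10-26 00:00:00'},
    {'item': 'tomato', 'harvest': '2020-10-01 00:00:00'},
]

# The element with the largest timestamp is the latest one.
latest = max(crops, key=lambda crop: to_unix_time_utc(crop['harvest']))
print(latest['item'])
```

Because all the dates here are whole days apart, the choice of timezone does not change which element is latest; it only changes the absolute timestamp values.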

(2) Latest element for each key
We use Latest.PerKey() to get the element with the latest timestamp for each key in a PCollection of key-value pairs.

import apache_beam as beam
import time

def to_unix_time(time_str, format='%Y-%m-%d %H:%M:%S'):
  return time.mktime(time.strptime(time_str, format))

with beam.Pipeline() as pipeline:
  latest_elements_per_key = (
      pipeline
      | 'Create crops' >> beam.Create([
          ('spring', {
              'item': '🥕', 'harvest': '2020-06-28 00:00:00'
          }),
          ('spring', {
              'item': '🍓', 'harvest': '2020-06-16 00:00:00'
          }),
          ('summer', {
              'item': '🥕', 'harvest': '2020-07-17 00:00:00'
          }),
          ('summer', {
              'item': '🍓', 'harvest': '2020-08-26 00:00:00'
          }),
          ('summer', {
              'item': '🍆', 'harvest': '2020-09-04 00:00:00'
          }),
          ('summer', {
              'item': '🥬', 'harvest': '2020-09-18 00:00:00'
          }),
          ('summer', {
              'item': '🍅', 'harvest': '2020-09-22 00:00:00'
          }),
          ('autumn', {
              'item': '🍅', 'harvest': '2020-10-01 00:00:00'
          }),
          ('autumn', {
              'item': '🥬', 'harvest': '2020-10-20 00:00:00'
          }),
          ('autumn', {
              'item': '🍆', 'harvest': '2020-10-26 00:00:00'
          }),
          ('winter', {
              'item': '🥬', 'harvest': '2020-02-24 00:00:00'
          }),
      ])
      | 'With timestamps' >> beam.Map(
          lambda pair: beam.window.TimestampedValue(
              (pair[0], pair[1]['item']), to_unix_time(pair[1]['harvest'])))
      | 'Get latest elements per key' >> beam.combiners.Latest.PerKey()
      | beam.Map(print))

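11.Max

Purpose: gets the element with the maximum value within each aggregation.

In the following examples, we create a pipeline with a PCollection. Then, we get the element with the maximum value in different ways.

(1) Maximum element in a PCollection
We use CombineGlobally() to get the maximum element from the entire PCollection.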

import apache_beam as beam

with beam.Pipeline() as pipeline:
  max_element = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2])
      | 'Get max value' >>
      beam.CombineGlobally(lambda elements: max(elements or [None]))
      | beam.Map(print))
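
The `elements or [None]` default in the lambda follows the same reasoning as the empty-set default in the CombineGlobally section: the combining function may be handed an empty input, and `max()` raises on an empty sequence. A short plain-Python check, with `safe_max` as a hypothetical name for the lambda:

```python
def safe_max(elements):
    # An empty list is falsy, so `elements or [None]` substitutes [None].
    return max(elements or [None])

print(safe_max([3, 4, 1, 2]))
print(safe_max([]))

# Without the default, an empty input is an error.
try:
    max([])
except ValueError as error:
    print('max([]) raises:', error)
```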

(2) Maximum element for each key
We use CombinePerKey() to get the maximum element for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  elements_with_max_value_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Get max value per key' >> beam.CombinePerKey(max)
      | beam.Map(print))

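12.Min

Purpose: gets the element with the minimum value within each aggregation.

In the following examples, we create a pipeline with a PCollection. Then, we get the element with the minimum value in different ways.

(1) Minimum element in a PCollection
We use CombineGlobally() to get the minimum element from the entire PCollection.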

import apache_beam as beam

with beam.Pipeline() as pipeline:
  min_element = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2])
      | 'Get min value' >>
      beam.CombineGlobally(lambda elements: min(elements or [-1]))
      | beam.Map(print))

(2) Minimum element for each key
We use CombinePerKey() to get the minimum element for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  elements_with_min_value_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Get min value per key' >> beam.CombinePerKey(min)
      | beam.Map(print))

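13.Mean

Purpose: computes the arithmetic mean of the elements in a collection, or the mean of the values associated with each key in a collection of key-value pairs.

In the following examples, we create a pipeline with a PCollection. Then, we get the mean value in different ways.

(1) Mean of the elements in a PCollection
We use Mean.Globally() to get the mean of the elements in the entire PCollection.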

import apache_beam as beam

with beam.Pipeline() as pipeline:
  mean_element = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2])
      | 'Get mean value' >> beam.combiners.Mean.Globally()
      | beam.Map(print))

(2) Mean of the elements for each key
We use Mean.PerKey() to get the mean of the elements for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  elements_with_mean_value_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Get mean value per key' >> beam.combiners.Mean.PerKey()
      | beam.Map(print))

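14.Sample

Purpose: gets a sample of the elements in a collection, or a sample of the values associated with each key in a collection of key-value pairs.

In the following examples, we create a pipeline with a PCollection. Then, we get a random sample of elements in different ways.

(1) Sample elements from a PCollection
We use Sample.FixedSizeGlobally() to get a fixed-size random sample of elements from the entire PCollection.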

import apache_beam as beam

with beam.Pipeline() as pipeline:
  sample = (
      pipeline
      | 'Create produce' >> beam.Create([
          '🍓 Strawberry',
          '🥕 Carrot',
          '🍆 Eggplant',
          '🍅 Tomato',
          '🥔 Potato',
      ])
      | 'Sample N elements' >> beam.combiners.Sample.FixedSizeGlobally(3)
      | beam.Map(print))
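
A fixed-size sample can be drawn in a single pass without knowing the collection size in advance. The sketch below shows the textbook reservoir-sampling technique (Algorithm R) in plain Python; it is offered as general background on fixed-size sampling and is not a claim about how Beam implements Sample. The name `reservoir_sample` and the seeded generator are assumptions made for the example.

```python
import random

def reservoir_sample(elements, k, rng=random):
    """Algorithm R: uniform sample of k items from a stream."""
    reservoir = []
    for index, element in enumerate(elements):
        if index < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(element)
        else:
            # Keep the new element with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = element
    return reservoir

produce = ['Strawberry', 'Carrot', 'Eggplant', 'Tomato', 'Potato']
sample = reservoir_sample(produce, 3, rng=random.Random(0))
print(sample)
```

The sample always has exactly three elements here, but which three depends on the random generator.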

(2) Sample elements for each key
We use Sample.FixedSizePerKey() to get a fixed-size random sample for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  samples_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('spring', '🍆'),
          ('spring', '🍅'),
          ('summer', '🥕'),
          ('summer', '🍅'),
          ('summer', '🌽'),
          ('fall', '🥕'),
          ('fall', '🍅'),
          ('winter', '🍆'),
      ])
      | 'Samples per key' >> beam.combiners.Sample.FixedSizePerKey(3)
      | beam.Map(print))

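15.Sum

Purpose: sums all the elements within each aggregation.

In the following examples, we create a pipeline with a PCollection. Then, we get the sum of all the element values in different ways.

(1) Sum of the elements in a PCollection
We use CombineGlobally() to get the sum of all the element values in the entire PCollection.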

import apache_beam as beam

with beam.Pipeline() as pipeline:
  total = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2, 2])
      | 'Sum values' >> beam.CombineGlobally(sum)
      | beam.Map(print))

(2) Sum of the elements for each key
We use CombinePerKey() to get the sum of all the element values for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  totals_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Sum values per key' >> beam.CombinePerKey(sum)
      | beam.Map(print))

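16.Top

Purpose: finds the largest (or smallest) set of elements in a collection, or the largest (or smallest) set of values associated with each key in a collection of key-value pairs.

In the following examples, we create a pipeline with a PCollection. Then, we get the largest or smallest elements in different ways.

(1) Largest elements in a PCollection
We use Top.Largest() to get the largest elements from the entire PCollection.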

import apache_beam as beam

with beam.Pipeline() as pipeline:
  largest_elements = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2])
      | 'Largest N values' >> beam.combiners.Top.Largest(2)
      | beam.Map(print))

(2) Largest elements for each key
We use Top.LargestPerKey() to get the largest elements for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  largest_elements_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Largest N values per key' >> beam.combiners.Top.LargestPerKey(2)
      | beam.Map(print))

(3) Smallest elements in a PCollection
We use Top.Smallest() to get the smallest elements from the entire PCollection.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  smallest_elements = (
      pipeline
      | 'Create numbers' >> beam.Create([3, 4, 1, 2])
      | 'Smallest N values' >> beam.combiners.Top.Smallest(2)
      | beam.Map(print))

(4) Smallest elements for each key
We use Top.SmallestPerKey() to get the smallest elements for each unique key in a PCollection of key-value pairs.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  smallest_elements_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('🥕', 3),
          ('🥕', 2),
          ('🍆', 1),
          ('🍅', 4),
          ('🍅', 5),
          ('🍅', 3),
      ])
      | 'Smallest N values per key' >> beam.combiners.Top.SmallestPerKey(2)
      | beam.Map(print))

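(5) Custom elements from a PCollection
We use Top.Of() to get elements with a custom rule from the entire PCollection.

You can change how the elements are compared by passing key. By default you get the largest elements, but you can get the smallest ones by setting reverse=True.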

import apache_beam as beam

with beam.Pipeline() as pipeline:
  shortest_elements = (
      pipeline
      | 'Create produce names' >> beam.Create([
          '🍓 Strawberry',
          '🥕 Carrot',
          '🍏 Green apple',
          '🍆 Eggplant',
          '🌽 Corn',
      ])
      | 'Shortest names' >> beam.combiners.Top.Of(
          2,             # number of elements to output
          key=len,       # optional, defaults to the element itself
          reverse=True,  # optional, defaults to False (descending)
      )
      | beam.Map(print)
  )
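
For intuition about what key and reverse select, the standard library's `heapq` module offers a close plain-Python analogue: `nlargest` corresponds to the default behaviour and `nsmallest` to reverse=True. The sketch below is only an analogy for the selection rule, with the emoji prefixes dropped from the names.

```python
import heapq

names = ['Strawberry', 'Carrot', 'Green apple', 'Eggplant', 'Corn']

# Comparable to reverse=True with key=len: the two shortest names.
shortest = heapq.nsmallest(2, names, key=len)

# Comparable to the default (reverse=False) with key=len: the two longest.
longest = heapq.nlargest(2, names, key=len)

# Without a key, the elements themselves are compared.
largest_numbers = heapq.nlargest(2, [3, 4, 1, 2])
smallest_numbers = heapq.nsmallest(2, [3, 4, 1, 2])

print(shortest)
print(longest)
print(largest_numbers)
print(smallest_numbers)
```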

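(6) Custom elements for each key
We use Top.PerKey() to get elements with a custom rule for each unique key in a PCollection of key-value pairs.

You can change how the elements are compared by passing key. By default you get the largest elements, but you can get the smallest ones by setting reverse=True.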

import apache_beam as beam

with beam.Pipeline() as pipeline:
  shortest_elements_per_key = (
      pipeline
      | 'Create produce names' >> beam.Create([
          ('spring', '🥕 Carrot'),
          ('spring', '🍓 Strawberry'),
          ('summer', '🥕 Carrot'),
          ('summer', '🌽 Corn'),
          ('summer', '🍏 Green apple'),
          ('fall', '🥕 Carrot'),
          ('fall', '🍏 Green apple'),
          ('winter', '🍆 Eggplant'),
      ])
      | 'Shortest names per key' >> beam.combiners.Top.PerKey(2, key=len, reverse=True)
      | beam.Map(print)
  )