Learn Flink: Data Pipelines & ETL

Data Pipelines & ETL

One very common use case for Apache Flink is to implement ETL (extract, transform, load) pipelines that take data from one or more sources, perform some transformations and/or enrichments, and then store the results somewhere.
In this section we are going to look at how to use Flink’s DataStream API to implement this kind of application.

Note that Flink’s Table and SQL APIs are well suited for many ETL use cases.
But regardless of whether you ultimately use the DataStream API directly, a solid understanding of the basics presented here will prove valuable.

Stateless Transformations

This section covers map() and flatMap(), the basic operations used to implement stateless transformations.
The examples in this section assume you are familiar with the Taxi Ride data used in the hands-on exercises in the flink-training-repo.

map()

In the first exercise you filtered a stream of taxi ride events.
In that same code base there’s a GeoUtils class that provides a static method GeoUtils.mapToGridCell(float lon, float lat) which maps a location (longitude, latitude) to a grid cell that refers to an area that is approximately 100x100 meters in size.
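
To make the idea of a grid cell concrete, here is a minimal sketch of how a (longitude, latitude) pair could be turned into a single integer cell ID. It is not the actual GeoUtils code: the class name GridCellSketch, the method toGridCell, the bounding box, the cell sizes, and the row width are all assumptions chosen only for illustration.

```java
// Hypothetical sketch: NOT the real GeoUtils.mapToGridCell implementation.
class GridCellSketch {

    // Assumed north-west corner of the covered area (illustrative values).
    static final double LON_WEST = -74.05;
    static final double LAT_NORTH = 41.0;

    // Assumed cell size in degrees (illustrative values).
    static final double DELTA_LON = 0.0014;
    static final double DELTA_LAT = 0.00125;

    // Assumed number of cells in one west-to-east row.
    static final int CELLS_PER_ROW = 250;

    // Count whole cells east of the western edge and south of the northern
    // edge, then flatten the (column, row) pair into a single integer ID.
    static int toGridCell(double lon, double lat) {
        int column = (int) Math.floor((lon - LON_WEST) / DELTA_LON);
        int row = (int) Math.floor((LAT_NORTH - lat) / DELTA_LAT);
        return column + row * CELLS_PER_ROW;
    }

    public static void main(String[] args) {
        // A point half a cell inside the north-west corner.
        System.out.println(toGridCell(-74.0493, 40.999375));
        // A point 1.5 cells east and 2.5 cells south of that corner.
        System.out.println(toGridCell(-74.0479, 40.996875));
    }
}
```

The useful property is that nearby locations share the same ID, so the ID can later serve as a compact field on each event.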

Now let’s enrich our stream of taxi ride objects by adding startCell and endCell fields to each event.
You can create an EnrichedRide object that extends TaxiRide, adding these fields:

public static class EnrichedRide extends TaxiRide {
    public int startCell;
    public int endCell;

    // Public no-argument constructor, so that Flink can treat this class as a POJO.
    public EnrichedRide() {}

    public EnrichedRide(TaxiRide ride) {
        this.rideId = ride.rideId;
        this.isStart = ride.isStart;
        ...
        this.startCell = GeoUtils.mapToGridCell(ride.startLon, ride.startLat);
        this.endCell = GeoUtils.mapToGridCell(ride.endLon, ride.endLat);
    }

    @Override
    public String toString() {
        return super.toString() + "," +
            Integer.toString(this.startCell) + "," +
            Integer.toString(this.endCell);
    }
}

You can then create an application that transforms the stream:

DataStream<TaxiRide> rides = env.addSource(new TaxiRideSource(...));

DataStream<EnrichedRide> enrichedNYCRides = rides
    .filter(new RideCleansingSolution.NYCFilter())
    .map(new Enrichment());

enrichedNYCRides.print();

with this MapFunction:

public static class Enrichment implements MapFunction<TaxiRide, EnrichedRide> {

    @Override
    public EnrichedRide map(TaxiRide taxiRide) throws Exception {
        return new EnrichedRide(taxiRide);
    }
}

flatMap()

A MapFunction is suitable only when performing a one-to-one transformation: for each and every stream element coming in, map() will emit exactly one transformed element. Otherwise, you will want to use flatMap():

DataStream<TaxiRide> rides = env.addSource(new TaxiRideSource(...));

DataStream<EnrichedRide> enrichedNYCRides = rides
    .flatMap(new NYCEnrichment());
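
The difference between the two operations can be tried out without a Flink cluster. The following sketch uses plain java.util.stream as an analogy, not the DataStream API: map() always produces one output per input, while flatMap() may produce zero, one, or many. The class MapVsFlatMapSketch, the simplified Ride type, and its inNyc flag are hypothetical stand-ins for the filter-then-enrich logic above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream, not Flink's DataStream API.
class MapVsFlatMapSketch {

    // Hypothetical, heavily simplified stand-in for a taxi ride event.
    static final class Ride {
        final long rideId;
        final boolean inNyc;

        Ride(long rideId, boolean inNyc) {
            this.rideId = rideId;
            this.inNyc = inNyc;
        }
    }

    // map(): one-to-one, every input ride yields exactly one output.
    static List<String> enrichAll(List<Ride> rides) {
        return rides.stream()
            .map(r -> "ride-" + r.rideId)
            .collect(Collectors.toList());
    }

    // flatMap(): each input yields zero or one outputs here, which
    // combines filtering and transforming in a single step.
    static List<String> enrichNycOnly(List<Ride> rides) {
        return rides.stream()
            .flatMap(r -> r.inNyc
                ? Stream.of("ride-" + r.rideId)
                : Stream.<String>empty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Ride> rides = Arrays.asList(
            new Ride(1, true), new Ride(2, false), new Ride(3, true));
        System.out.println(enrichAll(rides));
        System.out.println(enrichNycOnly(rides));
    }
}
```

In Flink the same idea is usually expressed by emitting elements through a collector rather than by returning a stream, but the cardinality rule is the same.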