Output to Files
Flink provides a streaming file system connector, StreamingFileSink, which extends the abstract class RichSinkFunction and integrates with Flink's checkpoint mechanism to guarantee exactly-once consistency semantics.
StreamingFileSink offers a unified sink for batch and streaming jobs and writes partitioned files to any file system supported by Flink. It guarantees exactly-once state consistency, a big improvement over earlier streaming file sinks. Its core operation is writing data into buckets; the data in each bucket is further split into size-bounded part files. The default bucket assignment is time-based, so a new bucket is started every hour.
StreamingFileSink supports both row-encoded (Row-encoded) and bulk-encoded (Bulk-encoded, e.g. Parquet) formats. Each format has its own builder, and both are easy to obtain by calling the corresponding static method on StreamingFileSink:
- Row encoding: StreamingFileSink.forRowFormat(basePath, rowEncoder)
- Bulk encoding: StreamingFileSink.forBulkFormat(basePath, bulkWriterFactory)
When creating either sink we pass two arguments: the base path of the buckets and the encoding logic for the data. The row-format example below shows the complete setup; a bulk-format sketch follows it.
import com.yingzi.chapter05.Source.Event;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.util.concurrent.TimeUnit;

public class SinkToFile {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // read data from a collection of elements
        DataStreamSource<Event> stream = env.fromElements(
                new Event("Mary", "./home", 1000L),
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Bob", "./prod?id=1", 3300L),
                new Event("Bob", "./home", 3500L),
                new Event("Alice", "./prod?id=200", 3200L),
                new Event("Bob", "./prod?id=2", 3800L),
                new Event("Bob", "./prod?id=3", 4200L)
        );

        // row-encoded sink: write events as UTF-8 strings under ./output
        StreamingFileSink<String> streamingFileSink = StreamingFileSink
                .<String>forRowFormat(new Path("./output"), new SimpleStringEncoder<>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withMaxPartSize(1024 * 1024 * 1024)
                                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                                .build()
                )
                .build();

        stream.map(data -> data.toString()).addSink(streamingFileSink);

        env.execute();
    }
}
Here we create a simple sink and use .withRollingPolicy() to specify a rolling policy. "Rolling" is a concept familiar from log files: content keeps being appended, so we need a rule for when to start a new file and archive the previous one. The settings above roll over to a new part file in any of the following three cases:
- the file size reaches 1 GB
- the file contains at least 15 minutes of data
- no new data has been received for the last 5 minutes
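For bulk-encoded output the structure is the same, only using forBulkFormat() together with a BulkWriter factory. The snippet below is a minimal sketch, assuming the flink-parquet (Avro) dependency is on the classpath, reusing the Event stream from the example above, and using an illustrative output path; bulk formats roll part files only on checkpoints, so checkpointing is enabled to finalize the files.
// additional import: org.apache.flink.formats.parquet.avro.ParquetAvroWriters
env.enableCheckpointing(10 * 1000L); // bulk-encoded part files are rolled and finalized on checkpoints
StreamingFileSink<Event> parquetSink = StreamingFileSink
        .<Event>forBulkFormat(
                new Path("./output-parquet"),
                ParquetAvroWriters.forReflectRecord(Event.class))
        .build();
stream.addSink(parquetSink);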
Output to Kafka
Flink officially provides source and sink connectors for Kafka, so we can conveniently read data from and write data to Kafka. The Flink-Kafka connector can provide an end-to-end exactly-once semantic guarantee, the highest level of consistency guarantee seen in real-world projects.
import com.yingzi.chapter05.Source.Event;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class SinkToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // 1. read data from Kafka
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "hadoop102:9092");
        properties.setProperty("group.id", "consumer-group");
        DataStreamSource<String> kafkaStream = env.addSource(
                new FlinkKafkaConsumer<String>("clicks", new SimpleStringSchema(), properties));

        // 2. transform the data with Flink
        SingleOutputStreamOperator<String> result = kafkaStream.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                // parse the comma-separated line into an Event and emit it as a string
                String[] fields = value.split(",");
                return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim())).toString();
            }
        });

        // 3. write the result back to Kafka
        result.addSink(new FlinkKafkaProducer<String>("hadoop102:9092", "events", new SimpleStringSchema()));

        env.execute();
    }
}
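The producer above uses the connector's default at-least-once semantics. To actually get the end-to-end exactly-once guarantee mentioned earlier, checkpointing must be enabled and the producer created with Semantic.EXACTLY_ONCE, and Kafka's transaction timeout has to cover the checkpoint interval. The following is a hedged sketch along those lines, dropped into the same main method; the timeout value and the inline KafkaSerializationSchema are illustrative assumptions.
// additional imports: org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema,
//                      org.apache.kafka.clients.producer.ProducerRecord, java.nio.charset.StandardCharsets
env.enableCheckpointing(10 * 1000L); // transactions are committed on checkpoints

Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "hadoop102:9092");
// must cover the checkpoint interval and stay below the broker's transaction.max.timeout.ms (value is illustrative)
producerProps.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));

result.addSink(new FlinkKafkaProducer<>(
        "events",
        new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                return new ProducerRecord<>("events", element.getBytes(StandardCharsets.UTF_8));
            }
        },
        producerProps,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE));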
Output to Redis
Flink does not provide an official Redis connector, but the Apache Bahir project steps in as a capable helper and provides a Flink-Redis connector for us.
The RedisSink constructor takes two arguments:
- FlinkJedisConfigBase: the Jedis connection configuration
- RedisMapper: the Redis mapper interface, describing how the data is converted into a type that can be written to Redis
import com.yingzi.chapter05.Source.ClickSource; // assumed package; the custom source defined earlier in this chapter
import com.yingzi.chapter05.Source.Event;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

public class SinkToRedis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Event> stream = env.addSource(new ClickSource());

        // create a Jedis connection configuration
        FlinkJedisPoolConfig config = new FlinkJedisPoolConfig.Builder()
                .setHost("hadoop102")
                .build();

        // write to Redis
        stream.addSink(new RedisSink<>(config, new MyRedisMapper()));

        env.execute();
    }

    // custom class implementing the RedisMapper interface
    public static class MyRedisMapper implements RedisMapper<Event> {
        @Override
        public RedisCommandDescription getCommandDescription() {
            // HSET into a Redis hash named "clicks"
            return new RedisCommandDescription(RedisCommand.HSET, "clicks");
        }

        @Override
        public String getKeyFromData(Event event) {
            return event.user;
        }

        @Override
        public String getValueFromData(Event event) {
            return event.url;
        }
    }
}
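With this mapper every Event is written via HSET into the Redis hash clicks, using the user as the hash field and the URL as the value, so each user's latest URL overwrites the previous one; after running the job the contents can be inspected with HGETALL clicks in redis-cli.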
Output to Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine suitable for all types of data. It offers a clean REST-style API, is known for its distributed design, speed, and scalability, and is very widely used in the big data field.
The constructor of ElasticsearchSink is private, so we have to use its nested static Builder class and call its build() method to obtain the actual SinkFunction. The Builder's constructor in turn takes two arguments:
- httpHosts: the list of hosts of the Elasticsearch cluster to connect to
- elasticsearchSinkFunction: not the SinkFunction we usually talk about, but a function describing the concrete processing logic, i.e. how to prepare the data and build the requests sent to Elasticsearch
Concretely, we override the process method of elasticsearchSinkFunction, put the data to be sent into a HashMap, wrap it in an IndexRequest, and add it to the indexer so that an HTTP request is sent to Elasticsearch.
import com.yingzi.chapter05.Source.Event;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.ArrayList;
import java.util.HashMap;

public class SinkToEs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // read data from a collection of elements
        DataStreamSource<Event> stream = env.fromElements(
                new Event("Mary", "./home", 1000L),
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Bob", "./prod?id=1", 3300L),
                new Event("Bob", "./home", 3500L),
                new Event("Alice", "./prod?id=200", 3200L),
                new Event("Bob", "./prod?id=2", 3800L),
                new Event("Bob", "./prod?id=3", 4200L)
        );

        // define the list of Elasticsearch hosts
        ArrayList<HttpHost> httpHosts = new ArrayList<>();
        httpHosts.add(new HttpHost("hadoop102", 9200));

        // define the ElasticsearchSinkFunction: how each Event becomes an index request
        ElasticsearchSinkFunction<Event> elasticsearchSinkFunction = new ElasticsearchSinkFunction<Event>() {
            @Override
            public void process(Event event, RuntimeContext runtimeContext, RequestIndexer requestIndexer) {
                HashMap<String, String> map = new HashMap<>();
                map.put(event.user, event.url);
                // build an IndexRequest (note: document types are deprecated in Elasticsearch 7)
                IndexRequest request = Requests.indexRequest()
                        .index("clicks")
                        .type("type")
                        .source(map);
                requestIndexer.add(request);
            }
        };

        // write to Elasticsearch
        stream.addSink(new ElasticsearchSink.Builder<Event>(httpHosts, elasticsearchSinkFunction).build());

        env.execute();
    }
}
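Note that the ElasticsearchSink collects requests and sends them to the cluster in bulk, so during local testing records may not show up immediately. As a small, hedged adjustment (using the elasticsearch7 builder from the example above), the sink can be told to flush after every record:
ElasticsearchSink.Builder<Event> esSinkBuilder =
        new ElasticsearchSink.Builder<>(httpHosts, elasticsearchSinkFunction);
// flush the bulk request after every single element (only sensible for testing)
esSinkBuilder.setBulkFlushMaxActions(1);
stream.addSink(esSinkBuilder.build());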
Output to MySQL (JDBC)
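Flink ships a JDBC connector (the flink-connector-jdbc module, plus the MySQL driver dependency) whose JdbcSink.sink() method takes an SQL statement, a statement builder that fills in the placeholders for each record, and the connection options. The example below assumes a MySQL database test containing a table clicks with user and url columns.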
import com.yingzi.chapter05.Source.Event;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SinkToMySQL {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // read data from a collection of elements
        DataStreamSource<Event> stream = env.fromElements(
                new Event("Mary", "./home", 1000L),
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Bob", "./prod?id=1", 3300L),
                new Event("Bob", "./home", 3500L),
                new Event("Alice", "./prod?id=200", 3200L),
                new Event("Bob", "./prod?id=2", 3800L),
                new Event("Bob", "./prod?id=3", 4200L)
        );

        stream.addSink(JdbcSink.sink(
                "INSERT INTO clicks (user, url) VALUES (?, ?)",
                // fill the ? placeholders of the prepared statement for each Event
                ((statement, event) -> {
                    statement.setString(1, event.user);
                    statement.setString(2, event.url);
                }),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:mysql://localhost:3306/test")
                        // MySQL 5.x driver class; for MySQL 8.x use com.mysql.cj.jdbc.Driver
                        .withDriverName("com.mysql.jdbc.Driver")
                        .withUsername("root")
                        .withPassword("000000")
                        .build()
        ));

        env.execute();
    }
}
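By default the JDBC sink buffers records and executes them in batches. For finer control over batching and retries, an additional JdbcExecutionOptions argument can be passed between the statement builder and the connection options; the sketch below shows the shape of the call (the concrete numbers are illustrative, and the extra import is org.apache.flink.connector.jdbc.JdbcExecutionOptions).
stream.addSink(JdbcSink.sink(
        "INSERT INTO clicks (user, url) VALUES (?, ?)",
        ((statement, event) -> {
            statement.setString(1, event.user);
            statement.setString(2, event.url);
        }),
        // batching and retry behaviour (illustrative values)
        JdbcExecutionOptions.builder()
                .withBatchSize(100)
                .withBatchIntervalMs(1000)
                .withMaxRetries(3)
                .build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/test")
                .withDriverName("com.mysql.jdbc.Driver")
                .withUsername("root")
                .withPassword("000000")
                .build()
));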