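When testing with Flink, you often have to write a lot of code just to produce test data, which wastes time. Flink provides data-generating sources for both the DataStream API and the Table & SQL API, and they make this much easier.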
DataGeneratorSource
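DataGeneratorSource wraps a data generator in a source, so generating data takes very little code.
DataGeneratorSource is stateful and can run in parallel.
The DataGeneratorSource constructor takes two parameters, generator and rowsPerSecond: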
/**
 * @param generator DataGenerator is the data generator interface. Flink ships with SequenceGenerator and RandomGenerator, which produce sequential data and random data respectively
 * @param rowsPerSecond number of rows to emit per second; the default is Long.MAX_VALUE
 */
public DataGeneratorSource(DataGenerator<T> generator, long rowsPerSecond) {
    this.generator = generator;
    this.rowsPerSecond = rowsPerSecond;
}
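With DataGeneratorSource, you can generate data using the methods that SequenceGenerator and RandomGenerator provide.
In the Flink Table module, the DataGenTableSourceFactory source code implements a RowGenerator to produce RowData. It is a useful reference when writing a custom DataGenerator for your own classes.
Sample code: a custom DataGenerator that produces TrafficData records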
import org.apache.commons.math3.random.RandomDataGenerator;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.streaming.api.functions.source.datagen.DataGenerator;

import java.io.Serializable;

public class TrafficData implements Serializable {
    private static final long serialVersionUID = 1L;

    /** User id */
    private long userId;
    /** Id of the city the user belongs to */
    private int cityId;
    /** Timestamp of the traffic record, in epoch milliseconds */
    private long trafficTime;
    /** Traffic volume */
    private final double traffic;

    public TrafficData(long userId, int cityId, long trafficTime, double traffic) {
        this.userId = userId;
        this.cityId = cityId;
        this.trafficTime = trafficTime;
        this.traffic = traffic;
    }

    /**
     * Custom data generator that produces random TrafficData objects
     */
    static class TrafficDataGenerator implements DataGenerator<TrafficData> {
        /** Random data generator, created in open() */
        private RandomDataGenerator generator;

        @Override
        public void open(String name, FunctionInitializationContext context, RuntimeContext runtimeContext) throws Exception {
            // Instantiate the random generator
            generator = new RandomDataGenerator();
        }

        /**
         * Whether another element is available.
         *
         * @return always true, so this generator never runs out of data
         */
        @Override
        public boolean hasNext() {
            return true;
        }

        @Override
        public TrafficData next() {
            // Build a traffic record from randomly generated values
            return new TrafficData(
                    generator.nextInt(1, 100),
                    generator.nextInt(1, 10),
                    System.currentTimeMillis(),
                    generator.nextUniform(0, 1)
            );
        }
    }

    public long getUserId() {
        return userId;
    }

    public void setUserId(long userId) {
        this.userId = userId;
    }

    public int getCityId() {
        return cityId;
    }

    public void setCityId(int cityId) {
        this.cityId = cityId;
    }

    public long getTrafficTime() {
        return trafficTime;
    }

    public void setTrafficTime(long trafficTime) {
        this.trafficTime = trafficTime;
    }

    public double getTraffic() {
        return traffic;
    }

    @Override
    public String toString() {
        return "TrafficData{" +
                "userId=" + userId +
                ", cityId=" + cityId +
                ", trafficTime=" + trafficTime +
                ", traffic=" + traffic +
                '}';
    }
}
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.datagen.DataGeneratorSource;

/**
 * Uses the custom data generator as a source
 */
public class TestSource {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create a DataGeneratorSource from the custom generator defined above
        DataGeneratorSource<TrafficData> trafficDataDataGeneratorSource = new DataGeneratorSource<>(new TrafficData.TrafficDataGenerator());
        // Add the source
        env.addSource(trafficDataDataGeneratorSource)
                // Declare the produced type explicitly
                .returns(new TypeHint<TrafficData>() {
                })
                // Print each record
                .print();
        env.execute();
    }
}
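TrafficDataGenerator.hasNext() always returns true, so the job above keeps emitting records until it is cancelled. As far as I can tell, a generator whose hasNext() eventually returns false gives a bounded stream instead. The sketch below illustrates that contract in plain Java, with no Flink dependency. SimpleGenerator, BoundedSequence and drain are hypothetical names made up for this illustration; SimpleGenerator is only a stand-in for the hasNext()/next() part of Flink's DataGenerator, not the real interface.

```java
import java.util.ArrayList;
import java.util.List;

final class GeneratorLoopDemo {

    /** Hypothetical stand-in for the hasNext()/next() part of a data generator. */
    interface SimpleGenerator<T> {
        boolean hasNext();

        T next();
    }

    /** Emits start..end (inclusive) and then reports that it is exhausted. */
    static final class BoundedSequence implements SimpleGenerator<Long> {
        private final long end;
        private long current;

        BoundedSequence(long start, long end) {
            this.current = start;
            this.end = end;
        }

        @Override
        public boolean hasNext() {
            return current <= end;
        }

        @Override
        public Long next() {
            return current++;
        }
    }

    /**
     * Polls the generator until hasNext() is false or maxRows rows were collected.
     * This mimics, under the assumption stated above, how a source might consume a generator.
     */
    static <T> List<T> drain(SimpleGenerator<T> generator, int maxRows) {
        List<T> out = new ArrayList<>();
        while (out.size() < maxRows && generator.hasNext()) {
            out.add(generator.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // The sequence 1..5 is shorter than the cap, so the loop ends when hasNext() turns false
        System.out.println(drain(new BoundedSequence(1, 5), 100)); // prints [1, 2, 3, 4, 5]
    }
}
```

A check like this is handy for testing the generation logic of a custom generator before wiring it into a Flink job.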
DataGen Connector
The DataGen connector generates data for the Table and SQL APIs. Under the hood it is built on the DataGenerator described above.
Connector options:
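Option | Required | Default | Type | Description |
---|---|---|---|---|
connector | required | (none) | String | The connector to use; here it is 'datagen' |
rows-per-second | optional | 10000 | Long | Rows emitted per second; controls the emit rate |
fields.#.kind | optional | random | String | Generator for field '#'; either 'sequence' or 'random' |
fields.#.min | optional | (Minimum value of type) | (Type of field) | Minimum value for the random generator; applies to numeric types |
fields.#.max | optional | (Maximum value of type) | (Type of field) | Maximum value for the random generator; applies to numeric types |
fields.#.length | optional | 100 | Integer | Length of the strings produced by the random generator; applies to char, varchar and string |
fields.#.start | optional | (none) | (Type of field) | Start value of the sequence generator |
fields.#.end | optional | (none) | (Type of field) | End value of the sequence generator |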
Usage example:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TestSQLSource {
    public static void main(String[] args) throws Exception {
        // Create the stream execution environment
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create the table environment
        StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(executionEnvironment);
        // Create the source table backed by the datagen connector
        streamTableEnvironment.executeSql(""
                + " CREATE TABLE source_table ("
                + " user_id INT,"
                + " cost DOUBLE,"
                + " ts AS LOCALTIMESTAMP,"
                + " WATERMARK FOR ts AS ts"
                + " ) WITH ("
                + " 'connector'='datagen',"
                + " 'rows-per-second'='1',"
                + " 'fields.user_id.kind'='random',"
                + " 'fields.user_id.min'='1',"
                + " 'fields.user_id.max'='10',"
                + " 'fields.cost.kind'='random',"
                + " 'fields.cost.min'='1',"
                + " 'fields.cost.max'='100'"
                + " )");
        // Run the query and print the result as an append stream
        Table table = streamTableEnvironment.sqlQuery("SELECT * FROM source_table");
        streamTableEnvironment.toAppendStream(table, Row.class).print("sql");
        executionEnvironment.execute();
    }
}
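The example above only uses the random generator. For the sequence generator, the options from the table would be fields.#.kind, fields.#.start and fields.#.end. Here is a small sketch that assembles such a DDL string in plain Java. The class DataGenDdl, the method sequenceTableDdl, and the table and column names seq_table and id are all made up for illustration; only the option keys come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class DataGenDdl {

    /**
     * Builds a CREATE TABLE statement for a single BIGINT column filled by the
     * datagen sequence generator. Hypothetical helper, not part of Flink.
     */
    static String sequenceTableDdl(String table, String column, long start, long end, long rowsPerSecond) {
        // LinkedHashMap keeps the options in insertion order, so the output is deterministic
        Map<String, String> options = new LinkedHashMap<>();
        options.put("connector", "datagen");
        options.put("rows-per-second", Long.toString(rowsPerSecond));
        options.put("fields." + column + ".kind", "sequence");
        options.put("fields." + column + ".start", Long.toString(start));
        options.put("fields." + column + ".end", Long.toString(end));

        // Render each option as 'key'='value', separated by commas
        StringJoiner with = new StringJoiner(", ");
        options.forEach((key, value) -> with.add("'" + key + "'='" + value + "'"));

        return "CREATE TABLE " + table + " (" + column + " BIGINT) WITH (" + with + ")";
    }

    public static void main(String[] args) {
        System.out.println(sequenceTableDdl("seq_table", "id", 1, 5, 1));
    }
}
```

The resulting string could be passed to executeSql in place of the DDL used in TestSQLSource. As I understand the connector documentation, a sequence field makes the table bounded: reading ends once the sequence reaches its end value.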