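When Flink reads a CSV file, the output can take one of three types, and the choice is controlled by the InputFormat you use: PojoCsvInputFormat produces POJOs, RowCsvInputFormat produces Row, and TupleCsvInputFormat produces Tuple. This example uses RowCsvInputFormat.
Looking inside RowCsvInputFormat, these are its constructors: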
public RowCsvInputFormat(
        Path filePath,
        TypeInformation[] fieldTypeInfos,
        String lineDelimiter,
        String fieldDelimiter,
        int[] selectedFields,
        boolean emptyColumnAsNull) {
    super(filePath);
    this.arity = fieldTypeInfos.length;
    if (arity != selectedFields.length) {
        throw new IllegalArgumentException("Number of field types and selected fields must be the same");
    }
    this.fieldTypeInfos = fieldTypeInfos;
    this.fieldPosMap = toFieldPosMap(selectedFields);
    this.emptyColumnAsNull = emptyColumnAsNull;
    boolean[] fieldsMask = toFieldMask(selectedFields);
    setDelimiter(lineDelimiter);
    setFieldDelimiter(fieldDelimiter);
    setFieldsGeneric(fieldsMask, extractTypeClasses(fieldTypeInfos));
}

public RowCsvInputFormat(Path filePath, TypeInformation[] fieldTypes, String lineDelimiter, String fieldDelimiter, int[] selectedFields) {
    this(filePath, fieldTypes, lineDelimiter, fieldDelimiter, selectedFields, false);
}

public RowCsvInputFormat(Path filePath, TypeInformation[] fieldTypes, String lineDelimiter, String fieldDelimiter) {
    this(filePath, fieldTypes, lineDelimiter, fieldDelimiter, sequentialScanOrder(fieldTypes.length));
}

public RowCsvInputFormat(Path filePath, TypeInformation[] fieldTypes, int[] selectedFields) {
    this(filePath, fieldTypes, DEFAULT_LINE_DELIMITER, DEFAULT_FIELD_DELIMITER, selectedFields);
}

public RowCsvInputFormat(Path filePath, TypeInformation[] fieldTypes, boolean emptyColumnAsNull) {
    this(filePath, fieldTypes, DEFAULT_LINE_DELIMITER, DEFAULT_FIELD_DELIMITER, sequentialScanOrder(fieldTypes.length), emptyColumnAsNull);
}

public RowCsvInputFormat(Path filePath, TypeInformation[] fieldTypes) {
    this(filePath, fieldTypes, false);
}
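The constructors call helpers such as toFieldMask and sequentialScanOrder whose bodies are not shown here. The following plain-Java sketch illustrates what helpers with those names could plausibly do, along with the arity check from the first constructor. The class name FieldSelectionSketch and the method bodies are assumptions made for illustration, not Flink's actual implementation, and the code has no Flink dependency.

```java
import java.util.Arrays;

// Hypothetical stand-in for the selection logic; not Flink source code.
class FieldSelectionSketch {

    // Same precondition as the first constructor: one type per selected column.
    static void checkArity(int typeCount, int[] selectedFields) {
        if (typeCount != selectedFields.length) {
            throw new IllegalArgumentException(
                    "Number of field types and selected fields must be the same");
        }
    }

    // Assumed behavior: mask[i] is true when column i of the file is selected.
    static boolean[] toFieldMask(int[] selectedFields) {
        int maxField = 0;
        for (int field : selectedFields) {
            maxField = Math.max(maxField, field);
        }
        boolean[] mask = new boolean[maxField + 1];
        for (int field : selectedFields) {
            mask[field] = true;
        }
        return mask;
    }

    // Assumed behavior: select every column in file order, 0..arity-1.
    static int[] sequentialScanOrder(int arity) {
        int[] order = new int[arity];
        for (int i = 0; i < arity; i++) {
            order[i] = i;
        }
        return order;
    }

    public static void main(String[] args) {
        // Selecting columns 0 and 3 skips columns 1 and 2.
        System.out.println(Arrays.toString(toFieldMask(new int[] {0, 3})));
        // prints [true, false, false, true]
        System.out.println(Arrays.toString(sequentialScanOrder(3)));
        // prints [0, 1, 2]
    }
}
```

Under these assumptions, the constructors that take no selectedFields argument simply select all columns in order.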
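The constructors show that the following parameters are accepted:
Path filePath: the path of the CSV file.
TypeInformation[] fieldTypeInfos: the types of the input data, one entry per selected column, in column order.
String lineDelimiter: the line (record) delimiter.
String fieldDelimiter: the column (field) delimiter.
int[] selectedFields: the columns to read. When you do not want every column, pass the indexes of the columns you want.
boolean emptyColumnAsNull: set this to true to have empty column values returned as null.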
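To make the interplay of fieldDelimiter, selectedFields and emptyColumnAsNull concrete, here is a minimal plain-Java sketch that applies the same three parameters to a single line of text. The class CsvLineSketch and its parseLine method are hypothetical and only mimic the described semantics for String columns; they are not part of Flink, and details such as output ordering and quoting are simplified assumptions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration of the parameter semantics; not Flink code.
class CsvLineSketch {

    static String[] parseLine(
            String line,
            String fieldDelimiter,
            int[] selectedFields,
            boolean emptyColumnAsNull) {
        // Limit -1 keeps trailing empty columns instead of dropping them.
        String[] columns = line.split(Pattern.quote(fieldDelimiter), -1);
        String[] selected = new String[selectedFields.length];
        for (int i = 0; i < selectedFields.length; i++) {
            String value = columns[selectedFields[i]];
            // Assumption: an empty column becomes null only when the flag is set.
            selected[i] = (emptyColumnAsNull && value.isEmpty()) ? null : value;
        }
        return selected;
    }

    public static void main(String[] args) {
        int[] select = {0, 1, 3};
        System.out.println(Arrays.toString(parseLine("a,,c,d", ",", select, true)));
        // prints [a, null, d]
        System.out.println(Arrays.toString(parseLine("a,,c,d", ",", select, false)));
        // prints [a, , d]
    }
}
```

The third column ("c") is never returned because index 2 is not in the selection.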
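An example:
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.io.RowCsvInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.types.Row;

// Note: the number of selected columns must match the number of field types.
// env is assumed to be an existing StreamExecutionEnvironment.
String pathString = "hdfs://hostname:9000/flinkCsv/csvTese.csv";
String charsetName = "UTF-8";
int interval = 10; // interval between path scans, in milliseconds
Path path = new Path(pathString);
String lineDelimiter = "\n";
String fieldDelimiter = ",";
boolean setEmptyColumnAsNull = true;
// All 5 columns are of type String
TypeInformation[] fieldTypeInfos = new TypeInformation[5];
for (int i = 0; i < 5; i++) {
    fieldTypeInfos[i] = TypeInformation.of(String.class);
}
// Select the first 5 columns for output
int[] select = new int[5];
for (int i = 0; i < 5; i++) {
    select[i] = i;
}
// Argument order follows the six-argument constructor: delimiters come before selectedFields
RowCsvInputFormat format = new RowCsvInputFormat(path, fieldTypeInfos, lineDelimiter, fieldDelimiter, select, setEmptyColumnAsNull);
format.setNestedFileEnumeration(true);
// readFile (not readTextFile) accepts a custom FileInputFormat
DataStreamSource<Row> text = env.readFile(format, pathString, FileProcessingMode.PROCESS_CONTINUOUSLY, interval);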
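The elements of the resulting stream are of Flink's Row type.
Two settings matter for stream processing here. format.setNestedFileEnumeration(true) makes the format also pick up files in nested subdirectories of the path, and FileProcessingMode.PROCESS_CONTINUOUSLY makes the source keep rescanning the path at the given interval instead of reading it once.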
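The effect of nested file enumeration can be illustrated without Flink. The sketch below uses only java.nio.file to list a directory either flat or recursively. NestedEnumerationSketch and listFiles are hypothetical names, and Path here is java.nio.file.Path rather than Flink's Path class. It is an analogy for the setting, not a description of how Flink enumerates files internally, and it does not model the periodic rescanning done in continuous mode.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogy for nested enumeration; not Flink code.
class NestedEnumerationSketch {

    // nested = false: only files directly under dir.
    // nested = true: files in dir and in all of its subdirectories.
    static List<Path> listFiles(Path dir, boolean nested) throws IOException {
        try (Stream<Path> paths = nested ? Files.walk(dir) : Files.list(dir)) {
            return paths.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary tree: root/a.csv and root/sub/b.csv
        Path root = Files.createTempDirectory("csv-sketch");
        Files.createFile(root.resolve("a.csv"));
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(sub.resolve("b.csv"));

        System.out.println(listFiles(root, false).size()); // prints 1
        System.out.println(listFiles(root, true).size());  // prints 2
    }
}
```

With the flat listing, sub/b.csv is missed, which mirrors what happens to files in subdirectories when nested enumeration is left off.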