When exporting a DEL file from a DB2 database, the default character delimiter is " (double quote) and the default column delimiter is , (comma). Hive, by contrast, uses '\001', a non-printing control character, as its field delimiter. The task was to load a 120 GB DEL file into the big-data environment. Loading the DEL file back into DB2 and pulling it with Sqoop would sidestep any delimiter issues, but with this much data the DB2 import fails, so the file had to be put onto Hadoop directly. After the upload, the columns in the Hive table were misaligned: DB2 string-typed columns can contain commas in their values, and those values are wrapped in double quotes. The Java program below cleans the DEL file by converting the field delimiter to '\001' and stripping the double quotes.
package utils;

import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiveDataETL {
    public static void main(String[] args) throws IOException {
        int lineNum = 0;
        // DB2 exports DEL files in the database code page; here it is GBK.
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(args[0]), "gbk"));
             BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(args[1]), "gbk"))) {
            // Non-greedy match for one double-quoted (string) field.
            Pattern compile = Pattern.compile("\".*?\"");
            String line;
            while ((line = br.readLine()) != null) {
                lineNum++;
                Matcher matcher = compile.matcher(line);
                StringBuilder sb = new StringBuilder();
                int end = 0;
                boolean matched = false;
                while (matcher.find()) {
                    if (!matched) {
                        // Copy everything before the first quoted field as-is.
                        sb.append(line, 0, matcher.start());
                    } else {
                        // Skip the comma right after the previous quoted field;
                        // it is re-appended below.
                        sb.append(line, end + 1, matcher.start());
                    }
                    matched = true;
                    end = matcher.end();
                    // Protect commas inside the quoted value with a placeholder
                    // (assumes "|||" never occurs in the data).
                    sb.append(matcher.group().replace(",", "|||"));
                    if (end != line.length()) {
                        sb.append(',');
                    }
                }
                if (!matched) {
                    // No quoted field on this line: keep it whole. (The original
                    // unconditionally appended substring(end + 1), which dropped
                    // the first character of such lines.)
                    sb.append(line);
                } else if (end < line.length()) {
                    sb.append(line.substring(end + 1));
                }
                // Remaining commas are real field separators: turn them into
                // \001, drop the quotes, then restore the protected commas.
                String result = sb.toString().replace(',', '\001')
                        .replace("\"", "")
                        .replace("|||", ",");
                bw.write(result);
                bw.newLine();
            }
        }
        System.out.println("lines processed: " + lineNum);
    }
}
Finally, the class was packaged into a jar in IDEA, uploaded to the server, and run with java -jar XXXX to clean and transform the DEL file. After re-uploading the cleaned file to HDFS, the data loaded into Hive correctly.
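The per-line transformation can be sanity-checked on a sample DEL record before running the full 120 GB job. Below is a minimal standalone sketch of the same logic; the ConvertLineDemo class, the convertLine helper name, and the sample values are illustrative, not part of the original program:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConvertLineDemo {
    private static final Pattern QUOTED = Pattern.compile("\".*?\"");

    // Same per-line transformation as HiveDataETL: protect commas inside
    // quoted fields, swap the remaining field separators to \001, then
    // strip the quotes (assumes "|||" never occurs in the data).
    static String convertLine(String line) {
        Matcher m = QUOTED.matcher(line);
        StringBuilder sb = new StringBuilder();
        int end = 0;
        boolean matched = false;
        while (m.find()) {
            if (!matched) {
                sb.append(line, 0, m.start());
            } else {
                // Skip the comma after the previous quoted field; re-added below.
                sb.append(line, end + 1, m.start());
            }
            matched = true;
            end = m.end();
            sb.append(m.group().replace(",", "|||"));
            if (end != line.length()) {
                sb.append(',');
            }
        }
        if (!matched) {
            sb.append(line);                      // no quoted field on this line
        } else if (end < line.length()) {
            sb.append(line.substring(end + 1));
        }
        return sb.toString().replace(',', '\001')
                .replace("\"", "")
                .replace("|||", ",");
    }

    public static void main(String[] args) {
        String del = "1001,\"Smith, John\",25,\"New York\"";
        // Render the invisible \001 as '|' so the result is readable:
        System.out.println(convertLine(del).replace('\001', '|'));
        // prints 1001|Smith, John|25|New York
    }
}
```

The embedded comma survives inside the second field, and the quotes are gone, which is exactly what keeps the Hive columns aligned.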