A Source-Code Analysis of Hive's RegexSerDe

Latest recommended article published on 2022-05-24 15:39:48

huaixixi1988


Tags: hadoop, source code, session, select

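Recently one of our business tables was created with RegexSerDe. I had used it before to parse nginx logs, but had never looked into it closely. This time I read through its implementation.

The CREATE TABLE statement: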

    CREATE EXTERNAL TABLE ods_cart_log
    (
      time_local STRING,
      request_json STRING,
      trace_id_num STRING
    )
    PARTITIONED BY
    (
      dt STRING,
      hour STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES
    ("input.regex" =
     "\\\[(.*?)\\\] .*\\\|(.*?) (.*?) \\\[(.*?)\\\]",
     "output.format.string" = "%1$s %2$s %4$s")
    STORED AS TEXTFILE;

Test data (a single log line, wrapped here for display):

    [2014-07-24 15:54:54] [6] OperationData.php:
    :89|{"action":"add","redis_key_hash":9,"time":"1406188494.73745500","source":"web",
    "mars_cid":"","session_id":"","info":{"cart_id":26885,"user_id":4,"size_id":"2784145",
    "num":"1","warehouse":"VIP_NH","brand_id":"7379","cart_record_id":26885,"channel":"te"}}
    trace_id [40618849399972881308]

I expected trace_id_num to hold the fourth capture group (that is, 40618849399972881308), but the query actually returned the third group (the literal text trace_id).
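The mismatch is easy to reproduce outside Hive with plain java.util.regex. The following is a minimal sketch, not Hive code: it assumes the effective pattern is the DDL's input.regex with the escaping reduced to ordinary Java string escapes, and it uses an abbreviated version of the test row (the JSON payload is shortened). The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupDemo {
    // Assumed effective form of input.regex, written as a Java string literal.
    static final String REGEX = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]";

    // Abbreviated test row: same shape as the real one, shorter JSON payload.
    static final String ROW = "[2014-07-24 15:54:54] [6] OperationData.php::89|"
            + "{\"action\":\"add\",\"source\":\"web\"}"
            + " trace_id [40618849399972881308]";

    // Returns capture groups 1..n of a full-line match, or null if the row does not match.
    static String[] groups(String regex, String row) {
        Matcher m = Pattern.compile(regex).matcher(row);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups(REGEX, ROW);
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + (i + 1) + " = " + g[i]);
        }
    }
}
```

The pattern defines four groups, and the numeric trace id only appears in the fourth, while the third holds the word trace_id.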

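Looking at the implementation:

RegexSerDe is driven mainly by three properties:

1) input.regex: the regular expression

2) output.format.string: the output format

3) input.regex.case.insensitive: whether matching ignores case

input.regex is used in the deserialization method, that is, when data is read (Hive reading files from HDFS). Correspondingly, output.format.string is used in the serialization method, that is, when data is written (Hive writing files to HDFS).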
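To make the division of labour concrete, here is a small sketch of the two directions. It is an illustration under assumptions, not the Hive source: it assumes the read side boils down to compiling the pattern (optionally with Pattern.CASE_INSENSITIVE) and the write side to a String.format call on the column values. RegexSerDePropsSketch, matches and write are hypothetical names.

```java
import java.util.regex.Pattern;

public class RegexSerDePropsSketch {
    // Read side: input.regex decides whether a text row is accepted;
    // input.regex.case.insensitive is modelled as a regex flag.
    static boolean matches(String inputRegex, boolean caseInsensitive, String row) {
        int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(inputRegex, flags).matcher(row).matches();
    }

    // Write side: output.format.string turns column values back into one text row.
    static String write(String outputFormat, Object... fields) {
        return String.format(outputFormat, fields);
    }

    public static void main(String[] args) {
        System.out.println(matches("(get) (.*)", true, "GET /index"));
        System.out.println(matches("(get) (.*)", false, "GET /index"));
        // Positional specifiers such as %2$s pick fields by index, not by order of appearance.
        System.out.println(write("%2$s:%1$s", "user", "alice"));
    }
}
```

Note that the format string only ever sees column values on the write path; nothing in this sketch lets it influence which group a column is read from.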

The deserialize method contains the following code, which fills the row with the matched fields:

    // numColumns is the number of columns declared in the table:
    // numColumns = columnNames.size(). In this example columnNames is
    // [time_local, request_json, trace_id_num].
    for (int c = 0; c < numColumns; c++) {
      try {
        // Columns are indexed from 0 and mapped to groups without gaps:
        // column c always gets group c + 1. So trace_id_num is the third
        // group of the regex, regardless of output.format.string.
        row.set(c, m.group(c + 1));
      } catch (RuntimeException e) {
        partialMatchedRows++;
        if (partialMatchedRows >= nextPartialMatchedRows) {
          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);
          // Report the row
          LOG.warn("" + partialMatchedRows
              + " partially unmatched rows are found, " + " cannot find group "
              + c + ": " + rowText);
        }
        row.set(c, null);
      }
    }
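The positional mapping and its fallback can be imitated in a few lines. This is a simplified sketch rather than the Hive code: it keeps only the loop and the catch, drops the logging counters, and returns null for a row that does not match at all. ColumnMappingSketch and mapColumns are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnMappingSketch {
    // Fills numColumns fields positionally: column c takes capture group c + 1.
    // Returns null when the row does not match at all (a simplification).
    static List<String> mapColumns(String regex, String row, int numColumns) {
        Matcher m = Pattern.compile(regex).matcher(row);
        if (!m.matches()) {
            return null;
        }
        List<String> out = new ArrayList<>();
        for (int c = 0; c < numColumns; c++) {
            try {
                out.add(m.group(c + 1));
            } catch (RuntimeException e) {
                // The pattern has no group c + 1 (IndexOutOfBoundsException).
                out.add(null);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four groups, three columns: the fourth group is silently ignored.
        System.out.println(mapColumns("(\\w+) (\\w+) (\\w+) (\\w+)", "a b c d", 3));
        // Two groups, three columns: the third column falls back to null.
        System.out.println(mapColumns("(\\w+) (\\w+)", "a b", 3));
    }
}
```

With more groups than columns the extra groups are dropped from the end; with fewer groups than columns the trailing columns become null. In neither case is a group skipped in the middle.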

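There are two workarounds. One is to declare a column for every capture group in the regex. The other is to change the grouping so that only the parts you care about are captured. For example, the regex above can be rewritten as:

    \\\[(.*?)\\\] .*\\\|(.*?) .*? \\\[(.*?)\\\]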
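A quick check of the second workaround, again with plain java.util.regex and an abbreviated test row (shortened JSON payload). The patterns are assumed to be the effective forms of the two regexes written with Java string escapes; RevisedRegexDemo and thirdColumn are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RevisedRegexDemo {
    // Original grouping: the label "trace_id" is captured as group 3.
    static final String ORIGINAL = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]";
    // Revised grouping: the label is matched but no longer captured.
    static final String REVISED = "\\[(.*?)\\] .*\\|(.*?) .*? \\[(.*?)\\]";

    // Abbreviated test row: same shape as the real one, shorter JSON payload.
    static final String ROW = "[2014-07-24 15:54:54] [6] OperationData.php::89|"
            + "{\"action\":\"add\",\"source\":\"web\"}"
            + " trace_id [40618849399972881308]";

    // Value the third table column would receive, or null if the row does not match.
    static String thirdColumn(String regex, String row) {
        Matcher m = Pattern.compile(regex).matcher(row);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println("original: " + thirdColumn(ORIGINAL, ROW));
        System.out.println("revised:  " + thirdColumn(REVISED, ROW));
    }
}
```

With the revised pattern the third group is the numeric trace id, so the three declared columns line up with the three groups.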

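On reflection, the output.format.string setting seems to serve little purpose here. RegexSerDe only works with TEXTFILE storage, so data can be brought into the table with LOAD. But LOAD is a file operation at the HDFS level and involves no serialization. To exercise serialization you have to insert with INSERT INTO ... SELECT, and in that case what gets written appears to depend on the selected data rather than on output.format.string.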

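There are actually two RegexSerDe classes, located at

    ./serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java
    ./contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

Both extend the abstract class AbstractSerDe. The code shows that the class under contrib implements both serialize and deserialize, whereas the one under serde implements only deserialize. This suggests that the serialize method in RegexSerDe may not be of much use.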

A few other points to note:

1. If a row does not match the regex, every field of that row is output as NULL.

    if (!m.matches()) {
      unmatchedRows++;
      if (unmatchedRows >= nextUnmatchedRows) {
        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);
        // Report the row