I. Time Format Conversion
Goal: convert a time field in the log into the required string format.
II. Implementation Approaches
1. UDF
(1) Goal: convert the date "31/Aug/2015:00:04:37 +0800" into "2015-08-31 00:04:37".
(2) Create the class TestDateUDF:
package com.ray.hive.mapreduce;

import java.util.Date;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class TestDateUDF extends UDF {
    // SimpleDateFormat is not thread-safe, but Hive instantiates the UDF
    // per task, so instance fields are acceptable here.
    private final SimpleDateFormat inputDate =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputDate =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public Text evaluate(Text time) {
        if (time == null || StringUtils.isBlank(time.toString())) {
            return null;
        }
        String parse = time.toString().replaceAll("\"", "");
        try {
            // Always wrap the parse in try/catch: a malformed record should
            // yield NULL instead of failing the whole job.
            Date parseDate = this.inputDate.parse(parse);
            return new Text(this.outputDate.format(parseDate));
        } catch (ParseException e) {
            // Log line did not match the expected date format; returning null
            // here avoids the NullPointerException that new Text(null) would throw.
            return null;
        }
    }
}
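The same conversion can be sanity-checked outside Hive. A minimal Python sketch of the UDF's logic (assuming an English locale so that %b parses "Aug"; the trailing " +0800" offset is dropped, mirroring what the Java pattern ignores):

```python
from datetime import datetime

def convert_log_time(raw):
    """Convert '31/Aug/2015:00:04:37 +0800' to '2015-08-31 00:04:37'."""
    # Strip surrounding quotes and the trailing timezone offset.
    cleaned = raw.replace('"', '').split(' ')[0]
    parsed = datetime.strptime(cleaned, '%d/%b/%Y:%H:%M:%S')
    return parsed.strftime('%Y-%m-%d %H:%M:%S')

print(convert_log_time('"31/Aug/2015:00:04:37 +0800"'))  # 2015-08-31 00:04:37
```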
(3) Package the code into a jar and upload it to the Linux server:
/opt/datas/hivedateudf.jar
(4) Register the jar with Hive:
add jar /opt/datas/hivedateudf.jar;
(5) Create the function:
create temporary function pdate as 'com.ray.hive.mapreduce.TestDateUDF';
(6) Verify that the UDF works:
select pdate(time_local) from log_src limit 10;
Result:
2015-08-31 00:04:37
2015-08-31 00:04:37
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
2015-08-31 00:04:53
(7) To make the function permanent, upload the jar to HDFS and create the function from there:
create function pdate2 as 'com.ray.hive.HiveDateTest' using jar 'hdfs://hadoop:8020/date_udf.jar';
select pdate2(time_local) from log_src limit 10;
2. Preprocessing Data with a Python Script
Official docs: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SimpleExampleUseCases
(1) Create the table:
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
(2) Download the data and unzip it:
wget -P /opt/datas/movie http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
(3) Load the data:
LOAD DATA LOCAL INPATH '/opt/datas/movie/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
(4) Write the Python script /opt/datas/movie/weekday_mapper.py:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
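Before registering the script with Hive, the mapper's core transformation can be checked locally on a sample row. Note that the script's fromtimestamp applies the server's local timezone; the sketch below uses UTC so the result is machine-independent:

```python
import datetime

line = "196\t242\t3\t881250949"  # a sample row in u.data's format
userid, movieid, rating, unixtime = line.strip().split('\t')
# 881250949 is 1997-12-04 15:55:49 UTC, a Thursday (ISO weekday 4).
weekday = datetime.datetime.fromtimestamp(
    float(unixtime), tz=datetime.timezone.utc).isoweekday()
print('\t'.join([userid, movieid, rating, str(weekday)]))  # prints 196\t242\t3\t4
```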
(5) Create the new table:
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
(6) Register the script with Hive (similar to add jar):
add FILE /opt/datas/movie/weekday_mapper.py;
(7) Use it. After add FILE, the script is shipped to each task's working directory, so it is referenced by file name only:
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
(8) Inspect the results:
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Result:
weekday _c1
1 12254
2 13579
3 14430
4 15114
5 14743
6 18229
7 11651
3. Hive's Built-in Time Conversion Functions
3.1 unix_timestamp
(1) Get the current Unix timestamp:
hive (temp)> select unix_timestamp();
OK
_c0
1542026782
(2) Get the timestamp for a specific date:
select unix_timestamp("2017-12-16 11:38:55");
Note: the input must be in "yyyy-MM-dd HH:mm:ss" format; if the format does not match, NULL is returned.
Wrong example: select unix_timestamp("20171216 11:38:55");
(3) Convert a date in a specified format to a timestamp:
select unix_timestamp("20171216 11:38:55","yyyyMMdd HH:mm:ss");
Result:
OK
_c0
1513395535
Note: this converts a date in "yyyyMMdd HH:mm:ss" format into a bigint timestamp.
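Hive evaluates unix_timestamp in the server's local timezone; the 1513395535 above comes from a UTC+8 machine. A UTC-based Python analogue (note Python uses strptime directives rather than Java's yyyyMMdd pattern letters):

```python
import calendar
import time

def unix_timestamp_utc(date_str, fmt):
    """UTC-based analogue of Hive's unix_timestamp(str, fmt)."""
    return calendar.timegm(time.strptime(date_str, fmt))

print(unix_timestamp_utc("20171216 11:38:55", "%Y%m%d %H:%M:%S"))  # 1513424335
# 1513424335 - 1513395535 = 28800 s, i.e. the 8-hour offset of the UTC+8 server.
```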
3.2 from_unixtime
Converts a Unix timestamp into a formatted date string in the server's timezone, e.g.:
select from_unixtime(881250949,"yyyy-MM-dd HH:mm:ss");
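A UTC-based Python sketch of the same conversion (Hive's from_unixtime itself uses the server's local timezone, so its output for this value will differ by the server's UTC offset):

```python
import datetime

def from_unixtime_utc(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """UTC-based analogue of Hive's from_unixtime(bigint, fmt)."""
    return datetime.datetime.fromtimestamp(
        ts, tz=datetime.timezone.utc).strftime(fmt)

print(from_unixtime_utc(881250949))  # 1997-12-04 15:55:49
```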