离线日志分析平台day03

最新推荐文章于 2023-08-31 11:33:23 发布

KujyouRuri

最新推荐文章于 2023-08-31 11:33:23 发布

阅读量110

点赞数

本文链接：https://blog.csdn.net/KujyouRuri/article/details/116668664

版权

hive与hbase的整合

在hive中如果关联hbase中已经存在的表，要使用外部表，反之使用内部关联的话，会在hbase中创建对应的新表:
实例：

CREATE EXTERNAL TABLE hhh1 (
id int, 
name string,
age int
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:name,f:age")
TBLPROPERTIES ("hbase.table.name" = "xyz");

将hbase当中的表与hive相关联


CREATE EXTERNAL TABLE event_log_hourly (
rowkey string, 
u_ud string,
pl string,
s_time bigint,
bt string,
bv string,
province string
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:u_ud, f:pl,f:s_time,f:bt, f:bv, f:province")
TBLPROPERTIES ("hbase.table.name" = "nginx_js");

需求：
每天每个小时（24小时）每个平台每个浏览器每个浏览器版本的活跃用户数

create temporary table event_log_hourly_tmp (
	date_ string comment 'yyyy-MM-dd',
	hour string comment '00|01...23',
	pl string, 
	bt string,
	bv string,
	u_ud string
);

常见hive函数：
1、current_database()—>查看当前的数据库

		select current_database() from event_log_hourly limit 1;
		OK
		event_log
		在mysql中查看当前数据库使用select database();

2、current_date--->查看当前日期

		select current_date() from event_log_hourly limit 1;
		OK
		2018-08-08

3、current_timestamp--->查看当前时间

		hive (event_log)> select current_timestamp() from event_log_hourly limit 1;
		OK
		2018-08-08 19:01:30.75

获取日期

		select from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd') from event_log_hourly limit 1;

获取小时

		select from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd HH:mm:ss') from event_log_hourly limit 1;

临时表：

create table event_log_hourly_tmp (
	date_ string comment "yyyy-MM-dd",
	hour string comment "00|01...23",
	pl string, 
	bt string,
	bv string,
	u_ud string
); 

	insert into event_log_hourly_tmp select 
		from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd'),
		from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd HH:mm:ss'),
		pl, 
		bt,
		bv,
		u_ud 
	from event_log_hourly;

说明：为了获取用户的操作时间戳和u_ud，完全可以从rowkey中进行拆分获取，这样减少数据的冗余。

	select split(rowkey, "_")[0], split(rowkey, "_")[1] from event_log_hourly limit 1;

每天 每个小时（24小时） 每个平台 每个浏览器 每个浏览器版本 的活跃用户数

	select date_, hour, pl, bt, bv, count(distinct u_ud) as u_count from event_log_hourly_tmp group by date_, hour, pl, bt, bv;

echarts呈现（因为要从mysql中读取对应的数据，所以需要从hive将结果迁移至mysql）
mysql表

		create table event_log_hourly(
		date_ varchar(20) comment "yyyy-MM-dd",
		hour_  varchar(20) comment "00|01...23",
		pl 	  varchar(100), 
		bt    varchar(100),
		bv    varchar(100),
		u_count int
	);

	create table event_log_hourly_tmp1 (
		date_ string comment "yyyy-MM-dd",
		hour_ string  comment "00|01...23",
		pl 	  string , 
		bt    string ,
		bv    string ,
		u_count int
	) as select date_, hour, pl, bt, bv, count(distinct u_ud) as u_count from event_log_hourly_tmp group by date_, hour, pl, bt, bv;

insert <into | overwrite table> tblName

insert overwrite table event_log_hourly_tmp1 select date_, hour, pl, bt, bv, count(distinct u_ud) as u_count from event_log_hourly_tmp t group by date_, hour, pl, bt, bv;

自动化脚本


#!/bin/sh

###############
##
##  collect mysql data import into hdfs 
##  约定：变量都用大写，多个单词之间使用下划线分割
##	mysql: test/t_user 
##  hdfs : /input/data-clean/t_user
##  在""中会执行``命令，但是在''中，无法解析${}
###############

##source配置文件


function run() {
    source /home/bigdata/shells/db-conf.conf
    
    HIVE_BIN=/home/bigdata/app/hive/bin/hive
    SQOOP_BIN=/home/bigdata/app/sqoop/bin/sqoop
    
    
    ##创建hive中外部表关联hbase
    
    ##查询对应的数据，生成临时表
    
    ##统计数据，生成临时表1
    
       sql="
    		set hive.exec.mode.local.auto=true;     ## 本地模式
    		USE event_log;
    		CREATE EXTERNAL TABLE event_log_hourly ( rowkey string,  u_ud string, pl string, s_time bigint, bt string, bv string, province string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (\"hbase.columns.mapping\" = \":key,f:u_ud, f:pl,f:s_time,f:bt, f:bv, f:province\") TBLPROPERTIES (\"hbase.table.name\" = \"nginx_js\");
    		CREATE TABLE event_log_hourly_tmp(date_ string, hour string, pl string, bt string, bv string, u_ud string); 
    		insert overwrite table event_log_hourly_tmp select from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd'), from_unixtime(cast((s_time/1000) as int), 'yyyy-MM-dd HH:mm:ss'), pl, bt, bv, u_ud from event_log_hourly;
    		CREATE TABLE event_log_hourly_tmp1 (date_ string, hour_ string, pl string, bt string , bv string, u_count int);
    		insert overwrite table event_log_hourly_tmp1 select date_, hour, pl, bt, bv, count(distinct u_ud) as u_count from event_log_hourly_tmp t group by date_, hour, pl, bt, bv;
       "
    	
    	${HIVE_BIN} -e "${sql}"
    
    ##将结果通过sqoop迁移到mysql
    	
    ${SQOOP_BIN} export \
    --connect jdbc:mysql://${DB_PRODUCT_HOST}:${DB_PRODUCT_PORT}/test \
    --username ${DB_PRODUCT_USER} \
    --password ${DB_PRODUCT_PASSWORD} \
    --export-dir hdfs://ns1/user/hive/warehouse/event_log.db/event_log_hourly_tmp1 \
    --table event_log_hourly \
    --fields-terminated-by '\001';
    
	
	
}

run $*

KujyouRuri

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
离线日志分析平台day03

hive与hbase的整合在hive中如果关联hbase中已经存在的表，要使用外部表，反之使用内部关联的话，会在hbase中创建对应的新表:实例：CREATE EXTERNAL TABLE hhh1 (id int, name string,age int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:na
复制链接

扫一扫