E-commerce Data Warehouse (Part 4)

Welcome

Hi! This is a project I completed after a month and a half of study, built on both plain Apache components and CDH; this series walks through the Apache version.
Thanks for reading!

Chapters 1-4: E-commerce Data Warehouse (Part 1)
Chapter 5: E-commerce Data Warehouse (Part 2)
Chapters 6-7: E-commerce Data Warehouse (Part 3)
Chapters 8-9: E-commerce Data Warehouse (Part 4, this post)
Chapters 10-12: E-commerce Data Warehouse (Part 5)
Chapters 13-15: E-commerce Data Warehouse (Part 6)
Chapters 16-17: E-commerce Data Warehouse (Part 7)
Chapters 18-19: E-commerce Data Warehouse (Part 8)
Chapters 20-21: E-commerce Data Warehouse (Part 9)
Chapters 22-24: E-commerce Data Warehouse (Part 10)
Chapter 24 concludes the series.

Chapter 8: The ODS Layer

8.1 Create the database

create database gmall;
use gmall;

8.2 Create the start log table

8.2.1 Create a partitioned external table that reads LZO-compressed input and keeps each raw JSON line as a single string column
CREATE EXTERNAL TABLE ods_start_log (`line` string)
PARTITIONED BY (`dt` string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
8.2.2 Load the data
load data inpath '/origin_data/gmall/log/topic_start/2021-xx-xx' into table gmall.ods_start_log partition(dt='2021-xx-xx');
Replace 2021-xx-xx with the date on which you generated the logs.
8.2.3 Verify the load
select * from ods_start_log limit 10;
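Besides sampling a few rows, you can also confirm that the partition was registered:
show partitions ods_start_log;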
8.2.4 Build an index for the LZO files (the index is what lets MapReduce split large .lzo files)
hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /warehouse/gmall/ods/ods_start_log/dt=2021-xx-xx

8.3 Create the event log table

8.3.1 Create the same kind of partitioned table: LZO-compressed input, one raw JSON line per row
CREATE EXTERNAL TABLE ods_event_log(`line` string)
PARTITIONED BY (`dt` string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_event_log';
8.3.2 Load the data
load data inpath '/origin_data/gmall/log/topic_event/2021-xx-xx' into table gmall.ods_event_log partition(dt='2021-xx-xx');
Replace 2021-xx-xx with the date on which you generated the logs.
8.3.3 Verify the load
select * from ods_event_log limit 10;
8.3.4 Build an index for the LZO files
hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /warehouse/gmall/ods/ods_event_log/dt=2021-xx-xx

8.4 Loading script

Both load statements and both indexing jobs can be wrapped into one daily script; save it (for example as ods_log.sh), make it executable, and pass the log date as its first argument:

#!/bin/bash
# Use the date passed as the first argument, or default to yesterday.
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d yesterday +%F)
fi

echo "=== log date: $do_date ==="

APP=gmall
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table $APP.ods_start_log partition(dt='$do_date');
load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table $APP.ods_event_log partition(dt='$do_date');
"
hive -e "$sql"

hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /warehouse/gmall/ods/ods_start_log/dt=$do_date
hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /warehouse/gmall/ods/ods_event_log/dt=$do_date

Chapter 9: Building the DWD Layer

9.1 The start table (walked through in full)

9.1.1 Create the start log table
CREATE EXTERNAL TABLE dwd_start_log(
`mid_id` string,
`user_id` string, 
`version_code` string, 
`version_name` string, 
`lang` string, 
`source` string, 
`os` string, 
`area` string, 
`model` string,
`brand` string, 
`sdk_version` string, 
`gmail` string, 
`height_width` string,  
`app_time` string,
`network` string, 
`lng` string, 
`lat` string, 
`entry` string, 
`open_ad_type` string, 
`action` string, 
`loading_time` string, 
`detail` string, 
`extend1` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_start_log/'
TBLPROPERTIES('parquet.compression'='lzo');
9.1.2 Load the data
insert overwrite table dwd_start_log
PARTITION (dt='2021-xx-xx')
select 
    get_json_object(line,'$.mid') mid_id,
    get_json_object(line,'$.uid') user_id,
    get_json_object(line,'$.vc') version_code,
    get_json_object(line,'$.vn') version_name,
    get_json_object(line,'$.l') lang,
    get_json_object(line,'$.sr') source,
    get_json_object(line,'$.os') os,
    get_json_object(line,'$.ar') area,
    get_json_object(line,'$.md') model,
    get_json_object(line,'$.ba') brand,
    get_json_object(line,'$.sv') sdk_version,
    get_json_object(line,'$.g') gmail,
    get_json_object(line,'$.hw') height_width,
    get_json_object(line,'$.t') app_time,
    get_json_object(line,'$.nw') network,
    get_json_object(line,'$.ln') lng,
    get_json_object(line,'$.la') lat,
    get_json_object(line,'$.entry') entry,
    get_json_object(line,'$.open_ad_type') open_ad_type,
    get_json_object(line,'$.action') action,
    get_json_object(line,'$.loading_time') loading_time,
    get_json_object(line,'$.detail') detail,
    get_json_object(line,'$.extend1') extend1
from ods_start_log 
where dt='2021-xx-xx';
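If get_json_object is unfamiliar, it is easy to try on a string literal first; the sample JSON below is made up but has the same shape as a start-log line:
select get_json_object('{"mid":"m001","uid":"u123","vc":"10"}', '$.mid');
-- returns: m001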
9.1.3 Inspect the data
select * from dwd_start_log limit 10;
9.1.4 Loading script
#!/bin/bash
# Common script template: every DWD load script has this shape. Paste the
# insert statement from the section above into sql, with the hard-coded date
# replaced by '$do_date'. Only this template is spelled out; the later
# scripts differ only in the SQL they carry.
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d yesterday +%F)
fi

echo "=== log date: $do_date ==="

APP=gmall
sql="
-- the insert SQL from 9.1.2 goes here, using dt='$do_date'
"
hive -e "$sql"

9.2 The event table

9.2.1 Create the base event log table
CREATE EXTERNAL TABLE dwd_base_event_log(
`mid_id` string,
`user_id` string, 
`version_code` string, 
`version_name` string, 
`lang` string, 
`source` string, 
`os` string, 
`area` string, 
`model` string,
`brand` string, 
`sdk_version` string, 
`gmail` string, 
`height_width` string, 
`app_time` string, 
`network` string, 
`lng` string, 
`lat` string, 
`event_name` string, 
`event_json` string, 
`server_time` string)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_event_log/'
TBLPROPERTIES('parquet.compression'='lzo');
9.2.2 Analysis

1. Requirement

Explode each raw event record into one row per event, keyed by event name and event JSON.

2. Source table

ods_event_log

3. Approach

Use a UDF, base_analizer, to extract the attributes of the cm object, the timestamp, and the JSON-array string under et from a raw line;
then use a UDTF, flat_analizer, to emit one (event_name, event_json) pair per element of that array. Both functions are smoke-tested right after their code below.

9.2.3 UDF
package com.atguigu.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

// Each raw line has the shape "timestamp|json": the JSON object carries an
// "ap" field, a "cm" object of common fields, and an "et" event array.
public class MyUDF extends UDF {

	public String evaluate(String source, String param) throws JSONException {

		// Parameter check: if the line does not mention the requested key
		// and the key is not the timestamp, there is nothing to extract.
		if (!source.contains(param) && !"ts".equals(param)) {
			return "";
		}

		String[] words = source.split("\\|");
		JSONObject root = new JSONObject(words[1]);

		if ("ts".equals(param)) {
			// the timestamp sits before the '|'
			return words[0].trim();
		} else if ("ap".equals(param)) {
			return root.getString("ap");
		} else if ("et".equals(param)) {
			// the event array, returned as a raw JSON string
			return root.getString("et");
		} else {
			// everything else is a common field under "cm"
			return root.getJSONObject("cm").getString(param);
		}
	}
}
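Before packaging, the UDF can be smoke-tested through a temporary function. The jar path and the sample line below are made up for illustration; only the 'timestamp|json' shape matters:
add jar /opt/module/hive/jars/hivefunction.jar;
create temporary function base_analizer as 'com.atguigu.udf.MyUDF';
select base_analizer('1621123200000|{"ap":"gmall","cm":{"ln":"-89.2","os":"8.1.3"},"et":[]}', 'ln');
-- returns: -89.2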
9.2.4 UDTF

package com.atguigu.udtf;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class MyUDTF extends GenericUDTF {

	@Override
	public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {

		// Declare the two output columns: event_name and event_json.
		List<String> fieldNames = new ArrayList<>();
		fieldNames.add("event_name");
		fieldNames.add("event_json");

		List<ObjectInspector> fieldOIs = new ArrayList<>();
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

		return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
	}

	// Emit one (event_name, event_json) row per element of the input JSON array.
	@Override
	public void process(Object[] args) throws HiveException {

		// Check the length before dereferencing args[0].
		if (args.length == 0 || args[0] == null) {
			return;
		}

		try {
			JSONArray jsonArray = new JSONArray(args[0].toString());

			if (jsonArray.length() == 0) {
				return;
			}

			for (int i = 0; i < jsonArray.length(); i++) {
				try {
					JSONObject jsonObject = jsonArray.getJSONObject(i);

					String[] result = new String[2];
					result[0] = jsonObject.getString("en"); // event name
					result[1] = jsonObject.toString();      // full event JSON

					forward(result);
				} catch (Exception e) {
					// Skip a malformed element but keep processing the rest.
					continue;
				}
			}
		} catch (JSONException e) {
			e.printStackTrace();
		}
	}

	@Override
	public void close() throws HiveException {
	}
}
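The UDTF can be checked the same way, by exploding a made-up literal JSON array through a lateral view (assumes the jar is added as in the previous test):
create temporary function flat_analizer as 'com.atguigu.udtf.MyUDTF';
select event_name, event_json
from (select '[{"en":"ad","kv":{"activityId":"1"}},{"en":"loading","kv":{}}]' as et) t
lateral view flat_analizer(et) tmp as event_name, event_json;
-- returns two rows, ad and loading, each with its full event JSON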
9.2.5 Package into a jar

Build the project into a jar and upload it to /user/hive/jars on HDFS.

9.2.6 Create permanent functions bound to the jar
In the gmall database, create the functions with a USING JAR clause pointing at the uploaded jar (substitute your NameNode address and jar file name for the placeholders):
create function base_analizer as 'com.atguigu.udf.MyUDF'
    using jar 'hdfs://<namenode>:9000/user/hive/jars/<your-jar>.jar';
create function flat_analizer as 'com.atguigu.udtf.MyUDTF'
    using jar 'hdfs://<namenode>:9000/user/hive/jars/<your-jar>.jar';
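To confirm that the functions were registered, describe them:
desc function base_analizer;
desc function flat_analizer;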
9.2.7 Load the data
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_base_event_log partition(dt='2021-xx-xx')
select
    base_analizer(line,'mid') as mid_id,
    base_analizer(line,'uid') as user_id,
    base_analizer(line,'vc') as version_code,
    base_analizer(line,'vn') as version_name,
    base_analizer(line,'l') as lang,
    base_analizer(line,'sr') as source,
    base_analizer(line,'os') as os,
    base_analizer(line,'ar') as area,
    base_analizer(line,'md') as model,
    base_analizer(line,'ba') as brand,
    base_analizer(line,'sv') as sdk_version,
    base_analizer(line,'g') as gmail,
    base_analizer(line,'hw') as height_width,
    base_analizer(line,'t') as app_time,
    base_analizer(line,'nw') as network,
    base_analizer(line,'ln') as lng,
    base_analizer(line,'la') as lat,
    event_name,
    event_json,
    base_analizer(line,'ts') as server_time
from ods_event_log lateral view flat_analizer(base_analizer(line,'et')) tmp_flat as event_name,event_json
where dt='2021-xx-xx' and base_analizer(line,'et')<>'';
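A quick sanity check on the result is to count rows per event type (the exact event names depend on your log generator):
select event_name, count(*) ct
from dwd_base_event_log
where dt='2021-xx-xx'
group by event_name;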
9.2.8 Script

See the common script in 9.1.4.

9.3 Unpacking event_name and event_json into per-event tables

9.3.1 Click table
CREATE EXTERNAL TABLE dwd_display_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`action` string,
`goodsid` string,
`place` string,
`extend1` string,
`category` string,
`server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_display_log/'
TBLPROPERTIES('parquet.compression'='lzo');
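The table is filled from dwd_base_event_log by filtering on one event name and pulling the business fields out of event_json. A sketch, assuming the click events are named 'display' and carry their fields under kv (adjust both to your generator's log format):
insert overwrite table dwd_display_log partition (dt='2021-xx-xx')
select
    mid_id,
    user_id,
    version_code,
    version_name,
    lang,
    `source`,
    os,
    area,
    model,
    brand,
    sdk_version,
    gmail,
    height_width,
    app_time,
    network,
    lng,
    lat,
    get_json_object(event_json,'$.kv.action') action,
    get_json_object(event_json,'$.kv.goodsid') goodsid,
    get_json_object(event_json,'$.kv.place') place,
    get_json_object(event_json,'$.kv.extend1') extend1,
    get_json_object(event_json,'$.kv.category') category,
    server_time
from dwd_base_event_log
where dt='2021-xx-xx' and event_name='display';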
9.3.2 Loading script

As with the tables above, see the common script in 9.1.4.

Chapter 10: The DWS Layer

Please continue to the Apache-based E-commerce Data Warehouse (Part 5).
