数仓项目一览
1. 项目架构
2. 需求分析
- 了解数据
维度表 dim_city.txt
bj,bj01,朝阳
bj,bj02,海淀
js,js01,南京
js,js02,宿迁
zj,zj01,杭州
zj,zj02,嘉兴
sh,sh01,徐汇
sh,sh02,虹口
gz,gz01,广州
gz,gz02,海珠33饿
维度表 dim_province.txt
bj,北京
js,江苏
zj,浙江
sh,上海
gz,广州
事实表 dw_user_click_d.txt 清洗过后得用户行为点击表
day user_id province_id city_id flow os pv
pv:页面访问量,或者说是点击量 用户去访问www.ruozedata.com 在今天访问了几次 count() sum()
uv:unique visitor 独立的访客 是需要进行去重的 今天1天www.ruozedata.com访问该网站的人数有多少(是需要进行去重的) count(distinct)
2020-04-04,1QJK6U3V94KBO5HEF8,gz,gz01,4210,Android,7
2020-04-11,PBWQC2T3L60ZYOHYTY,sh,sh01,8273,Android,2
2020-04-02,FWSNWBSNNLF347BORU,gz,gz01,2749,Android,7
2020-04-09,P715SX31XFNR7B3CM8,sh,sh01,189,Mac OS,9
2020-04-25,UT85MZURQT4FTZQ5KU,sh,sh01,8176,Mac OS,3
2020-04-05,S3DW8JL5G5NU5UN6I4,zj,zj01,1927,Android,9
2020-04-18,F41NSFYSF3CDJQNH8R,bj,bj01,9132,Android,7
2020-04-12,G6LZMMZBGWO4YSJW6X,sh,sh01,6733,Android,9
2020-04-23,58ZRRKRPCXPV0P3RK6,gz,gz01,7876,Android,5
2020-04-19,USZYXFMODFQ4Q9RP2C,gz,gz01,7702,Android,1
2020-04-27,GFNUCJ8TN9E0I1K56M,gz,gz01,5026,Android,2
2020-04-04,EPFD3I6HNOD9ZTBVJQ,gz,gz01,4963,Mac OS,5
2020-04-22,6KJKD01KVUEN9UO652,gz,gz01,2166,Mac OS,6
2020-04-25,57W2J6CKFL1Y345E3L,bj,bj01,6900,Mac OS,9
2020-04-09,0VRRDG4ZFKPXWE9JIZ,zj,zj01,2250,Mac OS,4
2020-04-17,POEYZVCWXS66N6L44V,sh,sh01,1843,Android,5
2020-04-27,Q1YSVRX8WGEYUL1C9N,sh,sh01,6886,Android,8
2020-04-04,TYBW7SP3C1VJ3G1X7M,bj,bj01,7922,Android,7
2020-04-09,Y97X1SVNMI48ST4YLY,gz,gz01,3146,Mac OS,6
2020-04-15,9WFC8KZFU83521P331,sh,sh01,2390,Android,9
2020-04-17,P1BL3NFXWFBC1POL88,bj,bj01,9838,Android,1
2020-04-24,XHQ0OWRXZY92URV96N,bj,bj01,9190,Android,9
2020-04-01,CD1MPF4GH4H66CCVUW,zj,zj01,2312,Mac OS,6
2020-04-02,XTYZ8G99GNKHTNVKZP,sh,sh01,6048,Mac OS,4
2020-04-03,C13XSESM5P5MXIUTP8,zj,zj01,3032,Mac OS,6
2020-04-24,MTQ3YNFW0IIS8LNX1L,gz,gz01,1554,Mac OS,9
2020-04-01,HYWBBQED30LVEJOQ5R,bj,bj01,6303,Mac OS,2
2020-04-12,HD6XHZ2HMZ4DQM90JO,bj,bj01,4871,Android,3
2020-04-21,46G7Z6Y7LI9HUV1O56,zj,zj01,6812,Mac OS,5
2020-04-09,D09Q5T1BV0FT18PRDG,sh,sh01,352,Android,6
2020-04-06,GPQBZKI2K28L615DUR,sh,sh01,5891,Mac OS,7
2020-04-18,M7WX6KLWB4T6YDGNK3,sh,sh01,9265,Mac OS,6
2020-04-17,379B6K879718ROC8ZI,gz,gz01,4868,Android,4
2张维度表来源于mysql业务库 日志表源于采集系统采集到hive得数据 经过etl之后得汇总数据
2.最终需要展示的数据指标/报表形式/可视化的形式
日期 省份 城市 pv uv
3. 数据源导入Hive数仓
1. 建表
create table dim_province(
province_id string comment '省份ID',
province_name string comment '省份名字'
)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/data/dim_province.txt' into table dim_province;
create table dim_city(
province_id string comment '省份ID',
city_id string comment '城市ID',
city_name string comment '城市名字'
)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/data/dim_city.txt' into table dim_city;
create table dw_user_click_d(
day date comment '日期',
user_id string comment '用户ID',
province_id string comment '省份ID',
city_id string comment '城市ID',
flow bigint comment '流量',
os string comment '操作系统',
pv bigint comment '页面访问量'
)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/data/dw_user_click_d.txt' into table dw_user_click_d;
4. 数据建模
4.1 创建project
4.2 导入数据
load table from tree
4.3 创建model
先有数据源,再有model, model建立在datasource上
现有model 再有cube,cube 建立在model上得
Data Source -- model -- cube
1. model info 创建
3. 选择 事实表: dw_user_click_d
分别去join两张维度表 , 建立起星型模型
3. 维度展示
4. 度量指标
5. setting
4.4 创建cube
KYLIN 去重方式有两种
bitmap 精准去重
hyperloglog 误差去重
4.5 构建cube
- cube 界面点击action
action – build – 时间范围选择 - monitor可以查看进度
手动web ui 构建cube 手动输入start data 和 end date
cube得自动调度
rest api 地址:http://kylin.apache.org/docs/howto/howto_use_restapi.html#list-cubes
List cubes http://hadoop003:7070/kylin/api/cubes/hpznyf_user_click_cube
hpznyf_user_click_cube http://hadoop001:7070/kylin/api/cube_desc/hpznyf_user_click_cube
Build cube
PUT /kylin/api/cubes/{cubeName}/build
Path Variable
cubeName - required string Cube name.
Request Body
startTime - required long Start timestamp of data to build, e.g. 1388563200000 for 2014-1-1
endTime - required long End timestamp of data to build
buildType - required string Supported build type: ‘BUILD’, ‘MERGE’, ‘REFRESH’
Curl Example
curl -X PUT -H "Authorization: Basic XXXXXXXXX" -H 'Content-Type: application/json' -d '{"startTime":'1423526400000', "endTime":'1423612800000', "buildType":"BUILD"}' http://<host>:<port>/kylin/api/cubes/{cubeName}/build
直接执行报错:
[hadoop@hadoop003 script]$ curl -X PUT --user ADMIN:KYLIN -H 'Content-Type: application/json' -d '{"startTime":'$start_day', "endTime":'$end_day', "buildType":"BUILD"}' http://hadoop003:7070/kylin/api/cubes/$v_cubename/build
查询cube:
展示数据:
日期 省份 城市 pv uv
select
"DAY",
PROVINCE_NAME,
CITY_NAME,
sum(PV),
count(distinct USER_ID)
from DW_USER_CLICK_D as user_click
join DIM_PROVINCE as province on province.PROVINCE_ID=user_click.PROVINCE_ID
join DIM_CITY as city on city.CITY_ID=user_click.CITY_ID
group by "DAY",PROVINCE_NAME,CITY_NAME
4. zeeplin
安装直接解压 http://zeppelin.apache.org/download.html
如果端口有问题,需要修改conf内得zeplin-site
1.create notebook
ruozedata_kylin
Interpreter:kylin
2.Interpreters界面 搜索 kylin 配置对应的内容
kylin.query.project learn_kylin(默认) 我们需要修改为:ruozedata_kylin
使用
文档:http://zeppelin.apache.org/docs/0.7.3/interpreter/kylin.html
%kylin
select
"DAY",
PROVINCE_NAME,
CITY_NAME,
sum(PV),
count(distinct USER_ID)
from DW_USER_CLICK_D as user_click
join DIM_PROVINCE as province on province.PROVINCE_ID=user_click.PROVINCE_ID
join DIM_CITY as city on city.CITY_ID=user_click.CITY_ID
group by "DAY",PROVINCE_NAME,CITY_NAME
运行查询就报错:
Failed : HTTP error code 500
修改为:
%kylin select \"DAY\", PROVINCE_NAME, CITY_NAME, sum(PV) as \"PV\", count(distinct USER_ID) as \"UV\" from DW_USER_CLICK_D as user_click join DIM_PROVINCE as province on province.PROVINCE_ID=user_click.PROVINCE_ID join DIM_CITY as city on city.CITY_ID=user_click.CITY_ID group by \"DAY\",PROVINCE_NAME,CITY_NAME
注意点:在配置kylin sql时,对于""需要进行转义,否则是查询不出来数据的
4.Run note with cron scheduler.
配置定时的数据刷新策略