Project:Snowflake处理json完整

系列文章目录

提示:这里可以添加系列文章的所有文章的目录,目录需要自己手动添加
例如:第一章 Python 机器学习入门之pandas的使用



一、创建项目所需的snowflake环境

1. 创建数据库和schema

  1. 创建数据库
create database if not exists cricket;
  1. 使用当前数据库
use database cricket
  1. 创建所需的schema
create or replace schema land;
create or replace schema raw;
create or replace schema clean;
create or replace schema consumption;

2. 创建json的格式和internal stage

  1. 创建本次项目所需要处理的json的格式
create or replace file format my_json_format
 type = json
 null_if = ('\\n', 'null', '')
    strip_outer_array = true
    comment = 'Json File Format with outer stip array flag true'; 
  1. 创建internal stage用来将我们的文件上传至此(PS:一般创建的是external stage来自外部的)
create or replace stage my_stg; 

二、Landing Layer 原始数据的存放层(存放数据的源文件)

Landing层是将数据源保存在snowflake里的最初层,这里由于是internal stage,类似于将snowflake当成了一个数据库;如果我们使用external stage 外部cloud上的数据,就不需要直接将数据保存在snowflake里,该层则只是一个引用external stage的层。

1. 上传文件到internal stage里

  1. 在UI里直接上传到internal stage里的/cricket/json(PS:可以使用snowflake cli 结合airflow来上传)
    在这里插入图片描述
  2. 查看刚才上传的文件
list @my_stg;

2. 查看上传的文件

  • 这时候landing层的文件已经存放完毕
select 
        t.$1:meta::variant as meta, 
        t.$1:info::variant as info, 
        t.$1:innings::array as innings, 
        metadata$filename as file_name,
        metadata$file_row_number int,
        metadata$file_content_key text,
        metadata$file_last_modified stg_modified_ts
     from  @my_stg/cricket/json/1384405.json (file_format => 'my_json_format') t;

注意:如果json里的是对象,则它在snowflake里的格式是object,取值时使用[ ];如果是一个数组,它在snowflake里则是array,取值时使用:

三、Raw Layer 数据初始处理(只转为可用的snowflake数据类型)

对应snowflake里schema的raw,这里我们将landing层的最原始的数据只进行了snowflake里的数据类型转换,这样以后的操作都是snowflake里的数据对象。
由于该层是从原始数据转为snowflake 的类型,所以需要添加上文件源的信息

1. 创建表match_raw_tbl用来存放所有json file

  1. 创建一个表,将所有的json文件存放进来; meta/info/innings 是Json文件的3个主要的大结构
  • 只有大括号的形式:类型是object
  • 大括号开始,里面包含中括号:类型是variant,英语变体的意思
  • 中括号开始:array类型
-- lets create a table inside the raw layer
create or replace transient table cricket.raw.match_raw_tbl (
    meta object not null,
    info variant not null,
    innings ARRAY not null,
    stg_file_name text not null,
    stg_file_row_number int not null,
    stg_file_hashkey text not null,
    stg_modified_ts timestamp not null
)
comment = 'This is raw table to store all the json data file with root elements extracted'
;

2. 复制land层的里的所有json file到该表match_raw_tbl里

  • 其中metadata$filenam/file_row_number/file_content_key/file_last_modified 系统函数
copy into cricket.raw.match_raw_tbl from 
    (
    select 
        t.$1:meta::object as meta, 
        t.$1:info::variant as info, 
        t.$1:innings::array as innings, 
        --
        metadata$filename,
        metadata$file_row_number,
        metadata$file_content_key,
        metadata$file_last_modified
    from @cricket.land.my_stg/cricket/json (file_format => 'cricket.land.my_json_format') t
    )
    on_error = continue;

3. 查看数据

  • 此时,所有6个文件的json主要分支已经存放在该原始表里,该表没有对json数据进行任何的处理,只是将数据源的数据进行了snowflake的数据类型转换。
    在这里插入图片描述

四、Clean层将json转换为我们可以使用表

将多层结构的json字段进行扁平化和字段的类型调整,这一阶段,我们将json文件正式的转成了table
该层由于是生成了信息的表,所以原表的信息一定要记录。

1. 对meta字段清洗

  • 读取meta里的数据,因为meta是object 所以需要用[ ] 来读取
select 
    meta['data_version']:: text as data_version,
    meta['created']:: date as data_version,
    meta['revision']:: number as revision
from cricket.raw.match_raw_tbl

2. 对info字段清洗

2.1 清洗match字段,生成match_detail_clean 表

  • 由于info是vairent 类型,所以使用:取数
create or replace transient table cricket.clean.match_detail_clean as
select
    info:match_type_number::int as match_type_number, 
    info:event.name::text as event_name,
    case
    when 
        info:event.match_number::text is not null then info:event.match_number::text
    when 
        info:event.stage::text is not null then info:event.stage::text
    else
        'NA'
    end as match_stage,   
    info:dates[0]::date as event_date,
    date_part('year',info:dates[0]::date) as event_year,
    date_part('month',info:dates[0]::date) as event_month,
    date_part('day',info:dates[0]::date) as event_day,
    info:match_type::text as match_type,
    info:season::text as season,
    info:team_type::text as team_type,
    info:overs::text as overs,
    info:city::text as city,
    info:venue::text as venue, 
    info:gender::text as gender,
    info:teams[0]::text as first_team,
    info:teams[1]::text as second_team,
    case 
        when info:outcome.winner is not null then 'Result Declared'
        when info:outcome.result = 'tie' then 'Tie'
        when info:outcome.result = 'no result' then 'No Result'
        else info:outcome.result
    end as matach_result,
    case 
        when info:outcome.winner is not null then info:outcome.winner
        else 'NA'
    end as winner,   

    info:toss.winner::text as toss_winner,
    initcap(info:toss.decision::text) as toss_decision,
    --
    stg_file_name ,
    stg_file_row_number,
    stg_file_hashkey,
    stg_modified_ts
    from 
    cricket.raw.match_raw_tbl;

取数的核心:varient用: ;object类型可以使用:也可以使用[]

2.2 清洗players字段,生成player_clean_tbl表

  • 从player的结构看出来,每个国家包含11个队员的名称,所以需要进行扁平化处理
  1. 查看player的结构,player是一个value是array
select 
raw.info:match_type_number:: int as match_type_number,
raw.info:players,
from cricket.raw.match_raw_tbl raw
where match_type_number = 4669
  1. 扁平化处理player,可以看出p的key是国家名称,value则是 每个运动员的名称
select 
raw.info:match_type_number:: int as match_type_number,
p.*
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
where match_type_number = 4669

在这里插入图片描述
3. 接着对p.value进行flatten,就可以获得里面的运动员名称 ,至此就完成了一个Json文件的展开

select 
raw.info:match_type_number:: int as match_type_number,
p.key:: text as country,
team.value :: text as player_name,
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
lateral flatten(input => p.value) team,
where match_type_number = 4669
  1. 对表里的所有文件的player进行flatten,生成最终的player_clean_tbl表
create or replace table player_clean_tbl as 
select 
raw.info:match_type_number:: int as match_type_number,
p.key:: text as country,
team.value :: text as player_name,
stg_file_name ,
stg_file_row_number,
stg_file_hashkey,
stg_modified_ts
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
lateral flatten(input => p.value) team

#在这里插入图片描述

2.3 为上面的表生成关系

  • 上面的表在创建的时候,都是独立的没有通过键来进行链接,所以我们需要找到一个建立练习的key;不难看出来,这个是比赛的信息,所以比赛的ID是唯一的且和这些球员是有联系的,所以我们使用MATCH_DETAIL_CLEAN表里的MATCH_TYPE_NUMBER当做外键
  1. 设置MATCH_TYPE_NUMBER 为primary key
alter table cricket.clean.match_detail_clean
add constraint pk_match_type_number primary key (match_type_number);
  1. PLAYER_CLEAN_TBL 里的match_type_number添加外键
alter table player_clean_tbl
add constraint fk_match_id
foreign key (match_type_number)
references MATCH_DETAIL_CLEAN (match_type_number);
  1. 检查外键是否添加成功,查看建表语句
 select get_ddl('table','PLAYER_CLEAN_TBL')

在这里插入图片描述

3.对innings字段进行清洗

3.1 Delivery表的数据

  1. innings字段进行faltten
select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    i.*,
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
where match_type_number = 4669

在这里插入图片描述
2. 对 innings里的value的overs进行flatten

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    o.*
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
where match_type_number = 4669

在这里插入图片描述

3.对overs的value里的deliveries进行flatten

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    d.*
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
lateral flatten (input => o.value:deliveries) d, 
where match_type_number = 4669

在这里插入图片描述
4. 此时,我们需要的deliveries里的信息已经全部显示出来,我们只需要根据最后的结构取数

{
  "batter": "Abdullah Shafique",
  "bowler": "JJ Bumrah",
  "non_striker": "Imam-ul-Haq",
  "runs": {
    "batter": 4,
    "extras": 0,
    "total": 4
  }
}
select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team::text as country,
    o.value:over::int as over,
    d.value:bowler::text as bowler,
    d.value:batter::text as batter,
    d.value:non_striker::text as non_striker,
    d.value:runs.batter::text as runs,
    d.value:runs.extras::text as extras,
    d.value:runs.total::text as total,
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
lateral flatten (input => o.value:deliveries) d, 
where match_type_number = 4669
  1. 添加deliveries的foreign key与delivery_clean_tbl相关联
-- fk relationship
alter table cricket.clean.delivery_clean_tbl
add constraint fk_delivery_match_id
foreign key (match_type_number)
references cricket.clean.match_detail_clean (match_type_number);

五、Consumption层,生成Fact Table和Dimention Table

  • 该层是根据上面我们生成的clean后的table,生成实际项目所需要的dimention table 和 fact table

在这里插入图片描述

1. 创建Date Deimension

  • 一般情况下,项目都会有一个date表,该表里存放的是年月日,季度,周,周末,公众假期,以及是否是周末的标志,这样也方便之后的数据分析使用,不需要每次计算,增加计算时间
create or replace table date_dim (
    date_id int primary key autoincrement,
    full_dt date,
    day int,
    month int,
    year int,
    quarter int,
    dayofweek int,
    dayofmonth int,
    dayofyear int,
    dayofweekname varchar(3), -- to store day names (e.g., "Mon")
    isweekend boolean -- to indicate if it's a weekend (True/False Sat/Sun both falls under weekend)
);

2. 创建 裁判referee/球队team/球员player 的Dimention Table

  1. 创建referee_dim
create or replace table referee_dim (
    referee_id int primary key autoincrement,
    referee_name text not null,
    referee_type text not null
);
  1. 创建team_dim player_dim 表,并且添加两者直接的关系team_id
create or replace table team_dim (
    team_id int primary key autoincrement,
    team_name text not null
);

-- player..
create or replace table player_dim (
    player_id int primary key autoincrement,
    team_id int not null,
    player_name text not null
);

alter table cricket.consumption.player_dim
add constraint fk_team_player_id
foreign key (team_id)
references cricket.consumption.team_dim (team_id);

3. 创建 场地venue/比赛match 的Dimention Table,Fact Table

  1. 创建venue_dim
create or replace table venue_dim (
    venue_id int primary key autoincrement,
    venue_name text not null,
    city text not null,
    state text,
    country text,
    continent text,
    end_Names text,
    capacity number,
    pitch text,
    flood_light boolean,
    established_dt date,
    playing_area text,
    other_sports text,
    curator text,
    lattitude number(10,6),
    longitude number(10,6)
);
  1. match_type_dim
create or replace table match_type_dim (
    match_type_id int primary key autoincrement,
    match_type text not null
);
  1. match_fact
CREATE or replace TABLE match_fact (
    match_id INT PRIMARY KEY,
    date_id INT NOT NULL,
    referee_id INT NOT NULL,
    team_a_id INT NOT NULL,
    team_b_id INT NOT NULL,
    match_type_id INT NOT NULL,
    venue_id INT NOT NULL,
    total_overs number(3),
    balls_per_over number(1),

    overs_played_by_team_a number(2),
    bowls_played_by_team_a number(3),
    extra_bowls_played_by_team_a number(3),
    extra_runs_scored_by_team_a number(3),
    fours_by_team_a number(3),
    sixes_by_team_a number(3),
    total_score_by_team_a number(3),
    wicket_lost_by_team_a number(2),

    overs_played_by_team_b number(2),
    bowls_played_by_team_b number(3),
    extra_bowls_played_by_team_b number(3),
    extra_runs_scored_by_team_b number(3),
    fours_by_team_b number(3),
    sixes_by_team_b number(3),
    total_score_by_team_b number(3),
    wicket_lost_by_team_b number(2),

    toss_winner_team_id int not null, 
    toss_decision text not null, 
    match_result text not null, 
    winner_team_id int not null,

    CONSTRAINT fk_date FOREIGN KEY (date_id) REFERENCES date_dim (date_id),
    CONSTRAINT fk_referee FOREIGN KEY (referee_id) REFERENCES referee_dim (referee_id),
    CONSTRAINT fk_team1 FOREIGN KEY (team_a_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_team2 FOREIGN KEY (team_b_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_match_type FOREIGN KEY (match_type_id) REFERENCES match_type_dim (match_type_id),
    CONSTRAINT fk_venue FOREIGN KEY (venue_id) REFERENCES venue_dim (venue_id),

    CONSTRAINT fk_toss_winner_team FOREIGN KEY (toss_winner_team_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_winner_team FOREIGN KEY (winner_team_id) REFERENCES team_dim (team_id)
);
  • 写入数据
insert into cricket.consumption.match_fact 
select 
    m.match_type_number as match_id,
    dd.date_id as date_id,
    0 as referee_id,
    ftd.team_id as first_team_id,
    std.team_id as second_team_id,
    mtd.match_type_id as match_type_id,
    vd.venue_id as venue_id,
    50 as total_overs,
    6 as balls_per_overs,
    max(case when d.team_name = m.first_team then  d.over else 0 end ) as OVERS_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  1 else 0 end ) as balls_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  d.extras else 0 end ) as extra_balls_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  d.extra_runs else 0 end ) as extra_runs_scored_BY_TEAM_A,
    0 fours_by_team_a,
    0 sixes_by_team_a,
    (sum(case when d.team_name = m.first_team then  d.runs else 0 end ) + sum(case when d.team_name = m.first_team then  d.extra_runs else 0 end ) ) as total_runs_scored_BY_TEAM_A,
    sum(case when d.team_name = m.first_team and player_out is not null then  1 else 0 end ) as wicket_lost_by_team_a,    
    
    max(case when d.team_name = m.second_team then  d.over else 0 end ) as OVERS_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  1 else 0 end ) as balls_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  d.extras else 0 end ) as extra_balls_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  d.extra_runs else 0 end ) as extra_runs_scored_BY_TEAM_B,
    0 fours_by_team_b,
    0 sixes_by_team_b,
    (sum(case when d.team_name = m.second_team then  d.runs else 0 end ) + sum(case when d.team_name = m.second_team then  d.extra_runs else 0 end ) ) as total_runs_scored_BY_TEAM_B,
    sum(case when d.team_name = m.second_team and player_out is not null then  1 else 0 end ) as wicket_lost_by_team_b,
    tw.team_id as toss_winner_team_id,
    m.toss_decision as toss_decision,
    m.matach_result as matach_result,
    mw.team_id as winner_team_id
     
from 
    cricket.clean.match_detail_clean m
    join date_dim dd on m.event_date = dd.full_dt
    join team_dim ftd on m.first_team = ftd.team_name 
    join team_dim std on m.second_team = std.team_name 
    join match_type_dim mtd on m.match_type = mtd.match_type
    join venue_dim vd on m.venue = vd.venue_name and m.city = vd.city
    join cricket.clean.delivery_clean_tbl d  on d.match_type_number = m.match_type_number 
    join team_dim tw on m.toss_winner = tw.team_name 
    join team_dim mw on m.winner= mw.team_name 
    --where m.match_type_number = 4686
    group by
        m.match_type_number,
        date_id,
        referee_id,
        first_team_id,
        second_team_id,
        match_type_id,
        venue_id,
        total_overs,
        toss_winner_team_id,
        toss_decision,
        matach_result,
        winner_team_id
        ;
  • 15
    点赞
  • 25
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值