Project：Snowflake处理json完整

山猪打不过家猪

已于 2024-06-29 06:27:33 修改

阅读量710

点赞数 15

分类专栏： DE 文章标签： json

于 2024-06-25 19:17:38 首次发布

本文链接：https://blog.csdn.net/weixin_42067536/article/details/139891976

版权

DE 专栏收录该内容

10 篇文章 0 订阅

订阅专栏

系列文章目录

提示：这里可以添加系列文章的所有文章的目录，目录需要自己手动添加
例如：第一章 Python 机器学习入门之pandas的使用

文章目录

系列文章目录
一、创建项目所需的snowflake环境
- 1. 创建数据库和schema
- 2. 创建json的格式和internal stage
二、Landing Layer 原始数据的存放层（存放数据的源文件）
- 1. 上传文件到internal stage里
- 2. 查看上传的文件
三、Raw Layer 数据初始处理（只转为可用的snowflake数据类型）
四、Clean层将json转换为我们可以使用表
五、Consumption层，生成Fact Table和Dimention Table

一、创建项目所需的snowflake环境

项目地址medium

1. 创建数据库和schema

创建数据库

create database if not exists cricket;

使用当前数据库

use database cricket

创建所需的schema

create or replace schema land;
create or replace schema raw;
create or replace schema clean;
create or replace schema consumption;

2. 创建json的格式和internal stage

创建本次项目所需要处理的json的格式

create or replace file format my_json_format
 type = json
 null_if = ('\\n', 'null', '')
    strip_outer_array = true
    comment = 'Json File Format with outer stip array flag true';

创建internal stage用来将我们的文件上传至此（PS：一般创建的是external stage来自外部的）

create or replace stage my_stg;

二、Landing Layer 原始数据的存放层（存放数据的源文件）

Landing层是将数据源保存在snowflake里的最初层，这里由于是internal stage，类似于将snowflake当成了一个数据库；如果我们使用external stage 外部cloud上的数据，就不需要直接将数据保存在snowflake里，该层则只是一个引用external stage的层。

1. 上传文件到internal stage里

在UI里直接上传到internal stage里的/cricket/json（PS：可以使用snowflake cli 结合airflow来上传）
查看刚才上传的文件

list @my_stg;

2. 查看上传的文件

这时候landing层的文件已经存放完毕

select 
        t.$1:meta::variant as meta, 
        t.$1:info::variant as info, 
        t.$1:innings::array as innings, 
        metadata$filename as file_name,
        metadata$file_row_number int,
        metadata$file_content_key text,
        metadata$file_last_modified stg_modified_ts
     from  @my_stg/cricket/json/1384405.json (file_format => 'my_json_format') t;

注意：如果json里的是对象，则它在snowflake里的格式是object，取值时使用[ ]；如果是一个数组，它在snowflake里则是array，取值时使用：

三、Raw Layer 数据初始处理（只转为可用的snowflake数据类型）

对应snowflake里schema的raw，这里我们将landing层的最原始的数据只进行了snowflake里的数据类型转换，这样以后的操作都是snowflake里的数据对象。
由于该层是从原始数据转为snowflake 的类型，所以需要添加上文件源的信息

1. 创建表match_raw_tbl用来存放所有json file

创建一个表，将所有的json文件存放进来； meta/info/innings 是Json文件的3个主要的大结构

只有大括号的形式：类型是object
大括号开始，里面包含中括号：类型是variant，英语变体的意思
中括号开始：array类型

-- lets create a table inside the raw layer
create or replace transient table cricket.raw.match_raw_tbl (
    meta object not null,
    info variant not null,
    innings ARRAY not null,
    stg_file_name text not null,
    stg_file_row_number int not null,
    stg_file_hashkey text not null,
    stg_modified_ts timestamp not null
)
comment = 'This is raw table to store all the json data file with root elements extracted'
;

2. 复制land层的里的所有json file到该表match_raw_tbl里

其中metadata$filenam/file_row_number/file_content_key/file_last_modified 系统函数

copy into cricket.raw.match_raw_tbl from 
    (
    select 
        t.$1:meta::object as meta, 
        t.$1:info::variant as info, 
        t.$1:innings::array as innings, 
        --
        metadata$filename,
        metadata$file_row_number,
        metadata$file_content_key,
        metadata$file_last_modified
    from @cricket.land.my_stg/cricket/json (file_format => 'cricket.land.my_json_format') t
    )
    on_error = continue;

3. 查看数据

此时，所有6个文件的json主要分支已经存放在该原始表里，该表没有对json数据进行任何的处理，只是将数据源的数据进行了snowflake的数据类型转换。

四、Clean层将json转换为我们可以使用表

将多层结构的json字段进行扁平化和字段的类型调整，这一阶段，我们将json文件正式的转成了table
该层由于是生成了信息的表，所以原表的信息一定要记录。

1. 对meta字段清洗

读取meta里的数据，因为meta是object 所以需要用[ ] 来读取

select 
    meta['data_version']:: text as data_version,
    meta['created']:: date as data_version,
    meta['revision']:: number as revision
from cricket.raw.match_raw_tbl

2. 对info字段清洗

2.1 清洗match字段，生成match_detail_clean 表

由于info是vairent 类型，所以使用：取数

create or replace transient table cricket.clean.match_detail_clean as
select
    info:match_type_number::int as match_type_number, 
    info:event.name::text as event_name,
    case
    when 
        info:event.match_number::text is not null then info:event.match_number::text
    when 
        info:event.stage::text is not null then info:event.stage::text
    else
        'NA'
    end as match_stage,   
    info:dates[0]::date as event_date,
    date_part('year',info:dates[0]::date) as event_year,
    date_part('month',info:dates[0]::date) as event_month,
    date_part('day',info:dates[0]::date) as event_day,
    info:match_type::text as match_type,
    info:season::text as season,
    info:team_type::text as team_type,
    info:overs::text as overs,
    info:city::text as city,
    info:venue::text as venue, 
    info:gender::text as gender,
    info:teams[0]::text as first_team,
    info:teams[1]::text as second_team,
    case 
        when info:outcome.winner is not null then 'Result Declared'
        when info:outcome.result = 'tie' then 'Tie'
        when info:outcome.result = 'no result' then 'No Result'
        else info:outcome.result
    end as matach_result,
    case 
        when info:outcome.winner is not null then info:outcome.winner
        else 'NA'
    end as winner,   

    info:toss.winner::text as toss_winner,
    initcap(info:toss.decision::text) as toss_decision,
    --
    stg_file_name ,
    stg_file_row_number,
    stg_file_hashkey,
    stg_modified_ts
    from 
    cricket.raw.match_raw_tbl;

取数的核心：varient用: ；object类型可以使用:也可以使用[]

2.2 清洗players字段，生成player_clean_tbl表

从player的结构看出来，每个国家包含11个队员的名称，所以需要进行扁平化处理

查看player的结构，player是一个value是array

select 
raw.info:match_type_number:: int as match_type_number,
raw.info:players,
from cricket.raw.match_raw_tbl raw
where match_type_number = 4669

扁平化处理player，可以看出p的key是国家名称，value则是每个运动员的名称

select 
raw.info:match_type_number:: int as match_type_number,
p.*
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
where match_type_number = 4669

在这里插入图片描述
3. 接着对p.value进行flatten，就可以获得里面的运动员名称，至此就完成了一个Json文件的展开

select 
raw.info:match_type_number:: int as match_type_number,
p.key:: text as country,
team.value :: text as player_name,
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
lateral flatten(input => p.value) team,
where match_type_number = 4669

对表里的所有文件的player进行flatten，生成最终的player_clean_tbl表

create or replace table player_clean_tbl as 
select 
raw.info:match_type_number:: int as match_type_number,
p.key:: text as country,
team.value :: text as player_name,
stg_file_name ,
stg_file_row_number,
stg_file_hashkey,
stg_modified_ts
from cricket.raw.match_raw_tbl raw,
lateral flatten(input => raw.info:players) p,
lateral flatten(input => p.value) team

# 在这里插入图片描述

2.3 为上面的表生成关系

上面的表在创建的时候，都是独立的没有通过键来进行链接，所以我们需要找到一个建立练习的key；不难看出来，这个是比赛的信息，所以比赛的ID是唯一的且和这些球员是有联系的，所以我们使用MATCH_DETAIL_CLEAN表里的MATCH_TYPE_NUMBER当做外键

设置MATCH_TYPE_NUMBER 为primary key

alter table cricket.clean.match_detail_clean
add constraint pk_match_type_number primary key (match_type_number);

给PLAYER_CLEAN_TBL 里的match_type_number添加外键

alter table player_clean_tbl
add constraint fk_match_id
foreign key (match_type_number)
references MATCH_DETAIL_CLEAN (match_type_number);

检查外键是否添加成功,查看建表语句

 select get_ddl('table','PLAYER_CLEAN_TBL')

在这里插入图片描述

3.对innings字段进行清洗

3.1 Delivery表的数据

对innings字段进行faltten

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    i.*,
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
where match_type_number = 4669

在这里插入图片描述
2. 对 innings里的value的overs进行flatten

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    o.*
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
where match_type_number = 4669

在这里插入图片描述

3.对overs的value里的deliveries进行flatten

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team :: text as country,
    d.*
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
lateral flatten (input => o.value:deliveries) d, 
where match_type_number = 4669

在这里插入图片描述
4. 此时，我们需要的deliveries里的信息已经全部显示出来，我们只需要根据最后的结构取数

{
  "batter": "Abdullah Shafique",
  "bowler": "JJ Bumrah",
  "non_striker": "Imam-ul-Haq",
  "runs": {
    "batter": 4,
    "extras": 0,
    "total": 4
  }
}

select 
    raw.info:match_type_number:: text as match_type_number,
    i.value:team::text as country,
    o.value:over::int as over,
    d.value:bowler::text as bowler,
    d.value:batter::text as batter,
    d.value:non_striker::text as non_striker,
    d.value:runs.batter::text as runs,
    d.value:runs.extras::text as extras,
    d.value:runs.total::text as total,
from cricket.raw.match_raw_tbl raw,
lateral flatten (input => raw.innings) i,
lateral flatten (input => i.value:overs) o,
lateral flatten (input => o.value:deliveries) d, 
where match_type_number = 4669

添加deliveries的foreign key与delivery_clean_tbl相关联

-- fk relationship
alter table cricket.clean.delivery_clean_tbl
add constraint fk_delivery_match_id
foreign key (match_type_number)
references cricket.clean.match_detail_clean (match_type_number);

五、Consumption层，生成Fact Table和Dimention Table

该层是根据上面我们生成的clean后的table,生成实际项目所需要的dimention table 和 fact table

在这里插入图片描述

1. 创建Date Deimension

一般情况下，项目都会有一个date表，该表里存放的是年月日，季度，周，周末，公众假期，以及是否是周末的标志，这样也方便之后的数据分析使用，不需要每次计算，增加计算时间

create or replace table date_dim (
    date_id int primary key autoincrement,
    full_dt date,
    day int,
    month int,
    year int,
    quarter int,
    dayofweek int,
    dayofmonth int,
    dayofyear int,
    dayofweekname varchar(3), -- to store day names (e.g., "Mon")
    isweekend boolean -- to indicate if it's a weekend (True/False Sat/Sun both falls under weekend)
);

2. 创建裁判referee/球队team/球员player 的Dimention Table

创建referee_dim表

create or replace table referee_dim (
    referee_id int primary key autoincrement,
    referee_name text not null,
    referee_type text not null
);

创建team_dim 和player_dim 表，并且添加两者直接的关系team_id

create or replace table team_dim (
    team_id int primary key autoincrement,
    team_name text not null
);

-- player..
create or replace table player_dim (
    player_id int primary key autoincrement,
    team_id int not null,
    player_name text not null
);

alter table cricket.consumption.player_dim
add constraint fk_team_player_id
foreign key (team_id)
references cricket.consumption.team_dim (team_id);

3. 创建场地venue/比赛match 的Dimention Table,Fact Table

创建venue_dim 表

create or replace table venue_dim (
    venue_id int primary key autoincrement,
    venue_name text not null,
    city text not null,
    state text,
    country text,
    continent text,
    end_Names text,
    capacity number,
    pitch text,
    flood_light boolean,
    established_dt date,
    playing_area text,
    other_sports text,
    curator text,
    lattitude number(10,6),
    longitude number(10,6)
);

match_type_dim表

create or replace table match_type_dim (
    match_type_id int primary key autoincrement,
    match_type text not null
);

match_fact 表

CREATE or replace TABLE match_fact (
    match_id INT PRIMARY KEY,
    date_id INT NOT NULL,
    referee_id INT NOT NULL,
    team_a_id INT NOT NULL,
    team_b_id INT NOT NULL,
    match_type_id INT NOT NULL,
    venue_id INT NOT NULL,
    total_overs number(3),
    balls_per_over number(1),

    overs_played_by_team_a number(2),
    bowls_played_by_team_a number(3),
    extra_bowls_played_by_team_a number(3),
    extra_runs_scored_by_team_a number(3),
    fours_by_team_a number(3),
    sixes_by_team_a number(3),
    total_score_by_team_a number(3),
    wicket_lost_by_team_a number(2),

    overs_played_by_team_b number(2),
    bowls_played_by_team_b number(3),
    extra_bowls_played_by_team_b number(3),
    extra_runs_scored_by_team_b number(3),
    fours_by_team_b number(3),
    sixes_by_team_b number(3),
    total_score_by_team_b number(3),
    wicket_lost_by_team_b number(2),

    toss_winner_team_id int not null, 
    toss_decision text not null, 
    match_result text not null, 
    winner_team_id int not null,

    CONSTRAINT fk_date FOREIGN KEY (date_id) REFERENCES date_dim (date_id),
    CONSTRAINT fk_referee FOREIGN KEY (referee_id) REFERENCES referee_dim (referee_id),
    CONSTRAINT fk_team1 FOREIGN KEY (team_a_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_team2 FOREIGN KEY (team_b_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_match_type FOREIGN KEY (match_type_id) REFERENCES match_type_dim (match_type_id),
    CONSTRAINT fk_venue FOREIGN KEY (venue_id) REFERENCES venue_dim (venue_id),

    CONSTRAINT fk_toss_winner_team FOREIGN KEY (toss_winner_team_id) REFERENCES team_dim (team_id),
    CONSTRAINT fk_winner_team FOREIGN KEY (winner_team_id) REFERENCES team_dim (team_id)
);

写入数据

insert into cricket.consumption.match_fact 
select 
    m.match_type_number as match_id,
    dd.date_id as date_id,
    0 as referee_id,
    ftd.team_id as first_team_id,
    std.team_id as second_team_id,
    mtd.match_type_id as match_type_id,
    vd.venue_id as venue_id,
    50 as total_overs,
    6 as balls_per_overs,
    max(case when d.team_name = m.first_team then  d.over else 0 end ) as OVERS_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  1 else 0 end ) as balls_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  d.extras else 0 end ) as extra_balls_PLAYED_BY_TEAM_A,
    sum(case when d.team_name = m.first_team then  d.extra_runs else 0 end ) as extra_runs_scored_BY_TEAM_A,
    0 fours_by_team_a,
    0 sixes_by_team_a,
    (sum(case when d.team_name = m.first_team then  d.runs else 0 end ) + sum(case when d.team_name = m.first_team then  d.extra_runs else 0 end ) ) as total_runs_scored_BY_TEAM_A,
    sum(case when d.team_name = m.first_team and player_out is not null then  1 else 0 end ) as wicket_lost_by_team_a,    
    
    max(case when d.team_name = m.second_team then  d.over else 0 end ) as OVERS_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  1 else 0 end ) as balls_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  d.extras else 0 end ) as extra_balls_PLAYED_BY_TEAM_B,
    sum(case when d.team_name = m.second_team then  d.extra_runs else 0 end ) as extra_runs_scored_BY_TEAM_B,
    0 fours_by_team_b,
    0 sixes_by_team_b,
    (sum(case when d.team_name = m.second_team then  d.runs else 0 end ) + sum(case when d.team_name = m.second_team then  d.extra_runs else 0 end ) ) as total_runs_scored_BY_TEAM_B,
    sum(case when d.team_name = m.second_team and player_out is not null then  1 else 0 end ) as wicket_lost_by_team_b,
    tw.team_id as toss_winner_team_id,
    m.toss_decision as toss_decision,
    m.matach_result as matach_result,
    mw.team_id as winner_team_id
     
from 
    cricket.clean.match_detail_clean m
    join date_dim dd on m.event_date = dd.full_dt
    join team_dim ftd on m.first_team = ftd.team_name 
    join team_dim std on m.second_team = std.team_name 
    join match_type_dim mtd on m.match_type = mtd.match_type
    join venue_dim vd on m.venue = vd.venue_name and m.city = vd.city
    join cricket.clean.delivery_clean_tbl d  on d.match_type_number = m.match_type_number 
    join team_dim tw on m.toss_winner = tw.team_name 
    join team_dim mw on m.winner= mw.team_name 
    --where m.match_type_number = 4686
    group by
        m.match_type_number,
        date_id,
        referee_id,
        first_team_id,
        second_team_id,
        match_type_id,
        venue_id,
        total_overs,
        toss_winner_team_id,
        toss_decision,
        matach_result,
        winner_team_id
        ;