大数据工具之dbt

大数据工具之dbt

简介

ETL,是英文Extract-Transform-Load的缩写,用来描述将数据从来源端经过抽取(extract)、转换(transform)、加载(load)至目的端的过程。ETL一词较常用在数据仓库.

dbt 是一个很不错的进行ETL中的T处理的工具,灵活简单,我们需要写的就是select 语句,剩下的dbt 帮助我们进行处理

官网地址:https://docs.getdbt.com/

环境要求

1、Python版本需要3.7及以上

2、需要拉取国外镜像,有可能失败,添加国内镜像 daocloud镜像加速器:

curl -sSL https://get.daocloud.io/daotools/set_mirror.sh | sh -s http://d1d9aef0.m.daocloud.io  该脚本可以将 --registry-mirror 加入到你的 Docker 配置文件 /etc/docker/daemon.json 中

Docker安装

github地址:https://github.com/dbt-labs/dbt-core/tree/v1.0.0/docker

因为dbt一般依赖定时任务驱动执行数据转换,所以在官方的基础镜像上将crontab打包进去,一共四个文件(注意路径目录):

crontab

# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
#17 *   * * *   root    cd / && run-parts --report /etc/cron.hourly
#25 6   * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
#47 6   * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
#52 6   1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

#每小时执行一次dbt项目
0 0 0/1 * * root cd /usr/app/dbt_project && dbt run > /var/log/dbt-crontab.log 2>&1

Dockerfile

ARG BASE_IMAGE=python:3.8-slim-bullseye

FROM $BASE_IMAGE

RUN apt-get update \
  && apt-get dist-upgrade -y \
  && apt-get install -y --no-install-recommends \
  git \
  ssh-client \
  software-properties-common \
  make \
  build-essential \
  ca-certificates \
  libpq-dev \
  cron \
  vim \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

COPY ./requirements.txt /
COPY ./entrypoint.sh /
RUN  chmod +x entrypoint.sh
RUN pip install --upgrade pip setuptools
RUN pip install --requirement requirements.txt
ENV PYTHONIOENCODING=utf-8
ENV LANG C.UTF-8
WORKDIR /usr/app
VOLUME /usr/app
RUN /etc/init.d/cron start
ENTRYPOINT ["/entrypoint.sh"]

entrypoint.sh

#!/bin/bash
set -x

# 保存环境变量,开启crontab服务
env >> /etc/default/locale
/etc/init.d/cron start

/bin/bash

requirements.txt

agate==1.6.3
attrs==21.2.0
Babel==2.9.1
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.8
click==8.0.3
colorama==0.4.4
dbt-core==1.0.0
dbt-extractor==0.4.0
dbt-postgres==1.0.0
future==0.18.2
hologram==0.0.14
idna==3.3
importlib-metadata==4.8.2
isodate==0.6.0
Jinja2==2.11.3
jsonschema==3.1.1
leather==0.3.4
Logbook==1.5.3
MarkupSafe==2.0.1
mashumaro==2.9
minimal-snowplow-tracker==0.0.2
msgpack==1.0.3
networkx==2.6.3
packaging==21.3
parsedatetime==2.4
psycopg2-binary==2.9.2
pycparser==2.21
pyparsing==3.0.6
pyrsistent==0.18.0
python-dateutil==2.8.2
python-slugify==5.0.2
pytimeparse==1.1.8
pytz==2021.3
PyYAML==6.0
requests==2.26.0
six==1.16.0
sqlparse==0.4.2
text-unidecode==1.3
typing-extensions==3.10.0.2
urllib3==1.26.7
Werkzeug==2.0.2
zipp==3.6.0

执行docker build,制作镜像:

docker build --tag dbt:v1.0.0 .

docker compose启动:

version: '3'
services:
  dbt:
    image: dbt:v1.0.0
    container_name: dbt
    volumes:
      - /root/docker_vol/dbt/app:/usr/app	# dbt project
      - /root/docker_vol/dbt/dbt-profiles:/root/.dbt	# dbt profiles
      - /root/docker_vol/dbt/crontab-conf/crontab:/etc/crontab	# crontab config
    tty: true
    stdin_open: true

构建项目

创建一个新的 dbt 项目

[root@paratera128 dbt]# docker exec -it dbt /bin/bash
root@08fb74a70adf:/usr/app# dbt init dbt_project
07:05:56  Running with dbt=1.0.0
Which database would you like to use?
[1] mariadb
[2] mysql
[3] mysql5

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 3
07:06:08  Profile dbt_project written to /root/.dbt/profiles.yml using target's sample configuration. Once updated, you'll be able to start developing with dbt.
07:06:08
Your new dbt project "dbt_project" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!

root@08fb74a70adf:/usr/app# ls
dbt_project  logs

修改配置文件

进入挂载卷vol下profiles目录,修改profiles.yml文件,根据target修改对应环境配置即可:

dbt_project:
  target: dev
  outputs:
    dev:
      type: mysql
      server: 192.168.137.128
      port: 3306
      database: matomo
      schema: matomo
      username: root
      password: root
      driver: MySQL ODBC 8.0 ANSI Driver
    prod:
      type: mysql
      server: [server/host]
      port: [port]  # optional
      database: [schema] # optional, should be same as schema
      schema: [schema]
      username: [username]
      password: [password]
      driver: MySQL ODBC 8.0 ANSI Driver

调试测试

root@08fb74a70adf:/usr/app# dbt debug
07:35:34  Running with dbt=1.0.0
dbt version: 1.0.0
python version: 3.8.12
python path: /usr/local/bin/python
os info: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.2.5
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /usr/app/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  server: 192.168.137.128
  port: 3306
  database: None
  schema: matomo
  user: root
  Connection test: [OK connection ok]

All checks passed!

应用范例

有需求如下:

有原始表:matomo_log_visit,希望对数据进行如下处理

1、过滤数据,作为下一步的数据依据,因为只是过程数据,所以生成视图表:matomo_log_visit_material_view

2、从上一步视图数据中摘取有意义的字段插入物理表:matomo_log_visit_material

3、为了优化性能,希望对于产生的数据希望是以增量的形式生成

从dbt_project.yml项目配置文件可以看出,dbt run执行的脚本路径如下:

models:
  dbt_project:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: table #如果.sql脚本不指定类型,默认生成table

dbt 执行models脚本,生成的视图或者物理表名称都跟脚本名称一样,所以两步处理脚本如下:

matomo_log_visit_material_view.sql

{{ config(materialized='view')}}
SELECT a.*,now() as create_time FROM matomo.matomo_log_visit a WHERE idsite = 1 and visit_total_time > 0

matomo_log_visit_material.sql

select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time
from {{ ref('matomo_log_visit_material_view') }}

设置执行顺序:

schema.yml

version: 2
models:
  - name: matomo_log_visit_material_view
    description: "A starter dbt model"
    columns:
      - name: idvisit
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

  - name: matomo_log_visit_material
    description: "A starter dbt model"
    columns:
      - name: idvisit
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

执行dbt run

root@1dc4f79d8882:/usr/app/dbt_project# dbt run
07:16:40  Running with dbt=1.0.0
07:16:41  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
07:16:41
07:16:41  Concurrency: 1 threads (target='dev')
07:16:41
07:16:41  1 of 2 START view model matomo.matomo_log_visit_material_view................... [RUN]
07:16:41  1 of 2 OK created view model matomo.matomo_log_visit_material_view.............. [SUCCESS 0 in 0.09s]
07:16:41  2 of 2 START table model matomo.matomo_log_visit_material....................... [RUN]
07:16:41  2 of 2 OK created table model matomo.matomo_log_visit_material.................. [SUCCESS 8176 in 0.17s]
07:16:41
07:16:41  Finished running 1 view model, 1 table model in 0.32s.
07:16:41
07:16:41  Completed successfully
07:16:41
07:16:41  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

可以看到分别生成了视图和物理表:

在这里插入图片描述

到这里,其实已经完成了前两步,第三步还没有达到增量的效果,这一点可以通过两次执行dbt run看出来,数据量没变,create_time时间变了,说明是整体更新,不是增量更新!

怎么处理呢?

首先:对于视图表来说,他只是一个虚拟存在的表,不涉及到物理表譬如寻址那种性能损耗比较大的问题,所以我们明确的可以知道要进行修改的脚本是matomo_log_visit_material.sql

然后,对于增量更新来说,常用的参照物是有序id,时间等字段

1、如果使用id,那么我们取最大id跟新行数据id比较,我们修改matomo_log_visit_material.sql如下:

{{ config(materialized='incremental', unique_key='visit_first_action_time') }}
select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time
from {{ ref('matomo_log_visit_material_view') }}
{% if is_incremental() %}
        idvisit > (select max(idvisit) from {{ this }})
 {% endif %}

往原始表中新增一条数据,idvisit字段值比库中最大值还大,再次执行dbt run,你会发现并不会整体更新:

在这里插入图片描述

2、如果使用时间,我们可以取最大时间跟当前时间比较,我们修改matomo_log_visit_material.sql如下:

{{ config(materialized='incremental', unique_key='visit_first_action_time') }}
select idvisit,idsite,user_id,visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ip,create_time
from {{ ref('matomo_log_visit_material_view') }}
where
{% if is_incremental() %}
        visit_first_action_time >= (select max(visit_first_action_time) from {{ this }})
 {% endif %}

往原始表新增一条数据,visit_first_action_time字段时间设置成比当前库中最大时间还大,再次执行dbt run,你会发现并不会整体更新:

在这里插入图片描述

项目文档

  • 生成
root@1dc4f79d8882:/usr/app/dbt_project# dbt docs generate
09:11:41  Running with dbt=1.0.0
09:11:41  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
09:11:41
09:11:42  Concurrency: 1 threads (target='dev')
09:11:42
09:11:42  Done.
09:11:42  Building catalog
09:11:42  Catalog written to /usr/app/dbt_project/target/catalog.json
  • 部署(静态服务)
root@1dc4f79d8882:/usr/app/dbt_project# dbt docs serve
09:13:45  Running with dbt=1.0.0
09:13:45  Serving docs at 0.0.0.0:8080
09:13:45  To access from your browser, navigate to:  http://localhost:8080
09:13:45
09:13:45
09:13:45  Press Ctrl+C to exit.
  • 访问
    在这里插入图片描述
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值