学习大数据DAY56 业务理解和第一次接入

作业1

1 了解行业名词 ERP CRM OA MES WMS RPA SAAS 了解每个系统的功能和应用
ERP 系统,(Enterprise Resource Planning,企业资源计划系统):ERP 系统
是一种用于管理企业各类资源的软件系统,包括生产管理、采购管理、库存管理、
财务管理等功能模块,帮助企业实现资源的优化配置和管理。
CRM 系统,(Customer Relationship Management,客户关系管理系统):CRM
系统是一种用于管理客户关系的软件系统,包括客户信息管理、销售管理、客户
服务管理等功能模块,帮助企业提高客户满意度和市场竞争力。
OA 系统,(Office Automation System,办公自动化系统):OA 系统是一种用
于协调、管理和优化办公流程的软件系统,包括电子邮件、日程安排、文档管理、
工作流程管理等功能模块,帮助企业提高工作效率和管理水平。
MES 系统(Manufacturing Execution System,制造执行系统):MES 系统是一
种用于管理制造过程的软件系统,包括生产计划管理、生产调度管理、工艺管理、
质量管理等功能模块,帮助企业提高生产效率和质量水平。
WMS 系统(Warehouse Management System,仓库管理系统):WMS 系统是一种用
于管理仓库的软件系统,包括入库管理、出库管理、库存管理、配送管理等功能
模块,帮助企业提高仓库管理效率和准确性。
SaaS(Software as a service),软件即服务,即通过网络提供软件服务,SaaS
平台供应商将应用软件统一部署在自己的服务器上,客户可以根据工作实际需
求,通过互联网向厂商定购所需的应用软件服务,按定购的服务多少和时间长短
向厂商支付费用,并通过互联网获得 Saas 平台供应商提供的服务。
2 模型设计 雪花模型 星型模型 ER 图 逻辑模型图...
星座模型
雪花模型
ER 图
逻辑模型图
3 检查自己的 Hadoop 集群,正常标准 444 进程。 myhadoop status

作业2

1 Hive 有哪些文件格式,为什么不用 textfile 格式
TextFile: 这是 Hadoop 的默认文件格式,它将数据以行的形式存储。每行由换
行符(\n)分隔。
SequenceFile: 这是 Hadoop 自带的一种二进制文件格式,它可以存储键值对,
序列化后存储。它支持压缩,而且有同步标记,容易分块,便于并行处理。
RCFile(Record Columnar File): 这是一种列存储格式,它将数据首先按列存
储,然后在每一列中再按行存储。这种格式适用于读操作多于写操作的场景,可
以有效提高查询性能。
ORC(Optimized Row Columnar): 这是 Hive 中支持的另一种列存储格式,相比
于 RCFile,ORC 文件格式具有更高的压缩比和查询效率。它支持 Zlib、Snappy
和 LZ4 压缩方式,并且提供了更高级别的向量化读取、投影和过滤优化。
Parquet:这是另一种列存储格式,它旨在提供跨平台的文件格式,可以很好地
与 Hadoop 生态系统中的其他组件(如 Impala、Presto 等)协同工作。Parquet
支持嵌套数据结构,并且可以与多种语言的客户端库一起使用。
不使用 TextFile 格式的原因主要有以下几点:
性能问题: TextFile 格式数据没有压缩,导致占用更多的存储空间,这会增加
存储成本和网络传输的负担。
读写效率低: TextFile 格式的数据不支持列存储,这对于列式存储优化的 Hive
查询引擎来说,效率较低,因为它不能有效地跳过不需要的列,从而增加了查询
时的 I/O 开销。
数据组织效率: 由于是按行存储,对于执行大量列筛选和聚合操作的查询,性能
不够优化。列存储格式如 ORC 和 Parquet 可以更有效地利用列索引、压缩和编码
技术,从而加快数据处理速度。
压缩支持: Hive 中的其他文件格式如 ORC 和 Parquet 提供了对数据压缩的支持,
可以有效减少存储空间并提高处理速度,而 TextFile 格式不支持压缩。
2 完成环境的部署
Python 安装:
hadoop100,hadoop101,hadoop102 都要安装
阿里云下载 python 安装包
wget
https://mirrors.aliyun.com/python-release/source/Python-3.9.10.tgz
在 hadoop100 下载完成后,通过 scp 传输到另外两个虚拟机
scp /root/Python-3.9.10.tgz root@192.168.200.101:/root scp /root/Python-3.9.10.tgz root@192.168.200.102:/root
我的 hadoop100 的 python 在上个阶段就装好了,直接在 hadoop101 和 hadoop102
再安装一次:
tar -zxvf Python-3.9.10.tgz
cd Python-3.9.10
安装依赖包:
yum install -y gcc patch libffi-devel python-devel zlib-devel
bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel
tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
配置环境
./configure --prefix=/usr/include/openssl
make && make install
建立安装快捷指令
ln -s /usr/include/openssl/bin/python3 /usr/bin/python3
ln -s /usr/include/openssl/bin/pip3 /usr/bin/pip3
python3 -m pip install --upgrade pip
pip3 config set global.index-url
https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pymysql pyspark requests bs4 hdfs pandas numpy openpyxl
pip3 install urllib3==1.26.15
安装完成验证:
部署 Hive:
同步 Hive:
xsync /opt/module/apache-hive-3.1.2-bin
同步环境变量:
xsync /etc/profile.d/my_env.sh
部署调度平台:
cd /opt
mkdir xxljob
将‘Z:\班级\09_医药项目\xxl-job-student-20221220’中的所有文件上传到
xxljob 中
同步到其它虚拟机:
xsync xxljob
确保 mysql 账号密码均为 root
mysql 中执行 xxljob 建表文件
source /opt/xxljob/tables_xxl_job.sql;
编写启停脚本:
mkdir /root/bin
vim /root/bin/xxl
内容:
#!/bin/bash
act=$1
start()
{
echo "starting xxl-job"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-admin-2.3.0.jar > xxl-job.log 2>&1 &"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop101 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop102 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
}
stop(){
echo "stopping xxl-job"
ssh root@hadoop100 "ps -aux | grep xxl-job-admin | grep-v grep | awk '{print \$2}' | xargs kill -9"
ssh
root@hadoop100
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
}
status(){
echo "=============== hadoop102 ==============="
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop101 ==============="
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop100 ==============="
ssh root@hadoop100 "ps -aux | grep xxl-job | grep -v
grep"
}
case $act in
start)
start
status
;;
stop)
stop
status
;;
restart)
stop
start
;;
status)
status
;;
esac
增加执行权限:
chmod +x /root/bin/xxl
验证:
xxl start
netstat -nltp | grep 8080
安装 DataX:
wget
https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz
tar -zxvf datax.tar.gz
同步到其它节点:
xsync /opt/datax
3 完成第一个接入任务
客户提供的连接信息
需求:使用 DataX 每天凌晨 1 点定时抽取 c_org_busi 门店信息表
编写配置文件
打开 VScode:
mkdir -p /zhiyun/shihaihong
cd /zhiyun/shihaihong
mkdir data jobs shell sql python 读取
mysql https://github.com/alibaba/DataX/blob/master/mysqlreader/doc
/mysqlreader.md
写入
hive https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdf
swriter.md
job 目录中的 c_org_busi.json 文件下:
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": ["jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"print":true
}
}
}
]
}
}
运行抽取:
python /opt/datax/bin/datax.py /zhiyun/shihaihong/jobs/c_org_busi.json
写入到 Hive
json 文件改为:
{
"job": {"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc",
"path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
{"name":"id","type": "int"},
{"name":"busno","type": "string"},{"name":"orgname","type": "string"},
{"name":"orgsubno","type":
"string"},
{"name":"orgtype","type": "string"},
{"name":"salegroup","type":
"string"},
{"name":"org_tran_code","type":
"string"},
{"name":"accno","type": "string"},
{"name":"sendtype","type":
"string"},
{"name":"sendday","type": "string"},
{"name":"maxday","type": "string"},
{"name":"minday","type": "string"},
{"name":"notes","type": "string"},
{"name":"stamp","type": "string"},
{"name":"status","type": "string"},
{"name":"customid","type":
"string"},
{"name":"whl_vendorno","type":
"string"},
{"name":"whlgroup","type":
"string"},
{"name":"rate","type": "string"},
{"name":"creditamt","type":
"string"},
{"name":"creditday","type":
"string"},
{"name":"peoples","type": "string"},
{"name":"area","type": "string"},
{"name":"abc","type": "string"},
{"name":"address","type": "string"},
{"name":"tel","type": "string"},
{"name":"principal","type":
"string"},
{"name":"identity_card","type":
"string"},
{"name":"mobil","type": "string"},
{"name":"corporation","type":
"string"},
{"name":"saler","type": "string"},
{"name":"createtime","type":
"string"},{"name":"bank","type": "string"},
{"name":"bankno","type": "string"},
{"name":"bak1","type": "string"},
{"name":"bak2","type": "string"},
{"name":"a_bak1","type": "string"},
{"name":"aa_bak1","type": "string"},
{"name":"b_bak1","type": "string"},
{"name":"bb_bak1","type": "string"},
{"name":"y_bak1","type": "string"},
{"name":"t_bak1","type": "string"},
{"name":"ym_bak1","type": "string"},
{"name":"tm_bak1","type": "string"},
{"name":"supervise_code","type":
"string"},
{"name":"monthrent","type":
"string"},
{"name":"wms_warehid","type":
"string"},
{"name":"settlement_cycle","type":
"string"},
{"name":"apply_cycle","type":
"string"},
{"name":"applydate","type":
"string"},
{"name":"accounttype","type":
"string"},
{"name":"applydate_last","type":
"string"},
{"name":"paymode","type": "string"},
{"name":"yaolian_flag","type":
"string"},
{"name":"org_longitude","type":
"string"},
{"name":"org_latitude","type":
"string"},
{"name":"org_province","type":
"string"},
{"name":"org_city","type":
"string"},
{"name":"org_area","type":
"string"},
{"name":"business_time","type":
"string"},{"name":"yaolian_group","type":
"string"},
{"name":"pacard_storeid","type":
"string"},
{"name":"opening_time","type":
"string"},
{"name":"ret_ent_id","type":
"string"},
{"name":"ent_id","type": "string"}
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}
打开 myhadoop 集群和 hive:
创建 hdfs 文件:
hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi
运行抽取:
python
/opt/datax/bin/datax.py
/zhiyun/shihaihong/jobs/c_org_busi.json
Hive 建表
-- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,
stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,
org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi";

验证数据

调度脚本
自动运行,自动更新
c_org_busi.sh:
#!/bin/bash
# 作用: 完成从编写配置文件到验证数据的整个过程
# 需要在任何节点都可以执行
echo "开始抽取 c_org_busi 门店信息表"
echo "生成配置文件"mkdir -p /zhiyun/shihaihong/jobs
echo '
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc","path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
...
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}' > /zhiyun/shihaihong/jobs/c_org_busi.json
echo "生成 HDFS 路径"
hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi
echo "抽取数据"
python /opt/datax/bin/datax.py
/zhiyun/shihaihong/jobs/c_org_busi.json
echo "hive 建表"
hive -e '
-- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况
create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,
ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi";
'
echo "验证数据"
hive -e '
select count(1) from ods_shihaihong.c_org_busi;
select * from ods_shihaihong.c_org_busi limit 2;
'
echo "抽取完成!"
赋权:
chmod u=rwx c_org_busi.sh
./c_org_busi.sh
设置定时:
执行一次,查看执行日志:
执行一次
4 作业标准
4.1 所有截图必须全屏截图
4.2 把调度平台的日志复制到作业中
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值