1. 数仓概念
2. 项目需求及架构设计
2.1 项目需求分析
2.1.1 项目需求
2.1.2 离线需求
2.1.3 实时需求
2.2 项目框架
2.2.1 技术选型
2.2.2 系统数据流程设计
2.2.3 测试集群服务器规划
| 服务名称 | 子服务 | 服务器 hadoop111 | 服务器 hadoop112 | 服务器 hadoop113 |
| --- | --- | --- | --- | --- |
| HDFS | NameNode | √ | | |
| | DataNode | | √ | √ |
| | SecondaryNameNode | | | √ |
| Yarn | Resourcemanager | √ | | |
| | NodeManager | | √ | √ |
| Zookeeper | Zookeeper Server | √ | √ | √ |
| Flume(采集日志) | Flume | √ | | |
| Kafka | Kafka | √ | √ | √ |
| Flume(Kafka日志) | Flume | | | √ |
| Flume(Kafka业务) | Flume | | | √ |
| Hive | | √ | | |
| MySQL | MySQL | √ | | |
| DataX | | √ | | |
| Spark | | √ | √ | √ |
| DolphinScheduler | ApiApplicationServer | √ | | |
| | AlertServer | √ | | |
| | MasterServer | √ | | |
| | WorkerServer | √ | √ | √ |
| | LoggerServer | √ | √ | √ |
| Superset | Superset | √ | | |
| Flink | | √ | | |
| ClickHouse | | √ | √ | √ |
| Redis | | √ | | |
| Hbase | | √ | √ | √ |
| 服务数总计 | | 19 | 9 | 12 |
3. 行为数据采集平台搭建
3.1 jps
# jps 列出当前节点上的 Java 进程(进程号 + 主类名)
[seven@hadoop111 bin]$ jps
12961 Kafka
2418 JournalNode
2021 NameNode
2919 JobHistoryServer
23095 Jps
1786 QuorumPeerMain
2139 DataNode
2639 DFSZKFailoverController
2831 NodeManager
# jps -mv 查看jvm进程详细启动信息,包括堆内存大小配置等
[seven@hadoop111 bin]$ jps -mv
12961 Kafka /opt/module/kafka/config/server.properties -Xmx1G -Xms1G -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/opt/module/kafka/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/opt/module/kafka/bin/../logs -Dlog4j.configuration=file:/opt/module/kafka/bin/../config/log4j.properties
2418 JournalNode -Dproc_journalnode -Djava.net.preferIPv4Stack=true -Dyarn.log.dir=/opt/module/hadoop-3.3.6/logs -Dyarn.log.file=hadoop-seven-journalnode-hadoop111.log -Dyarn.home.dir=/opt/module/hadoop-3.3.6 -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/module/hadoop-3.3.6/lib/native -Dhadoop.log.dir=/opt/module/hadoop-3.3.6/logs -Dhadoop.log.file=hadoop-seven-journalnode-hadoop111.log -Dhadoop.home.dir=/opt/module/hadoop-3.3.6 -Dhadoop.id.str=seven -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender
2021 NameNode -Dproc_namenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/opt/module/hadoop-3.3.6/logs -Dyarn.log.file=hadoop-seven-namenode-hadoop111.log -Dyarn.home.dir=/opt/module/hadoop-3.3.6 -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/module/hadoop-3.3.6/lib/native -Dhadoop.log.dir=/opt/module/hadoop-3.3.6/logs -Dhadoop.log.file=hadoop-seven-namenode-hadoop111.log -Dhadoop.home.dir=/opt/module/hadoop-3.3.6 -Dhadoop.id.str=seven -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml
23142 Jps -mv -Dapplication.home=/opt/module/jdk1.8.0_371 -Xms8m
2919 JobHistoryServer -Dproc_historyserver -Djava.net.preferIPv4Stack=true -Dmapred.jobsummary.logger=INFO,RFA -Dyarn.log.dir=/opt/module/hadoop-3.3.6/logs -Dyarn.log.file=hadoop-seven-historyserver-hadoop111.log -Dyarn.home.dir=/opt/module/hadoop-3.3.6 -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/module/hadoop-3.3.6/lib/native -Dhadoop.log.dir=/opt/module/hadoop-3.3.6/logs -Dhadoop.log.file=hadoop-seven-historyserver-hadoop111.log -Dhadoop.home.dir=/opt/module/hadoop-3.3.6 -Dhadoop.id.str=seven -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender
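如果只关心某个进程的堆内存参数,可以用 grep 对 jps -mv 的输出做过滤。下面是一个简单示例(以 Kafka 进程为例,关键字和正则仅作演示,可按需替换):
# 只查看 Kafka 进程的完整 JVM 启动参数
jps -mv | grep -i kafka
# 进一步只提取 -Xms/-Xmx 堆内存配置
jps -mv | grep -i kafka | grep -oE '(-Xms|-Xmx)[0-9]+[GgMm]'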
3.2 xsync
#!/bin/bash
#1. check param numbers
if [ $# -lt 1 ]
then
echo "Not Enough Arguement!"
echo "command example: xsync --host=hadoop112,hadoop113 /src_dir/f_name or xsync /src_dir/f_name"
exit;
fi
#2. Check if rsync command exists,if not,install it
if ! command -v rsync &> /dev/null; then
if command -v yum &> /dev/null; then
sudo yum install -y rsync
elif command -v apt-get &> /dev/null; then
sudo apt-get install -y rsync
else
echo "Cannot install rsync,please install it manually."
exit 1
fi
fi
#3. Check if --host argument is provided
if [[ "$1" == "--host="* ]]; then
#Parse hosts from argument
hosts=$(echo ${1#*=}| tr ',' '\n')
else
hosts=$(cat ~/bin/host_ips)
fi
#4.Traverse through each host
for host in $hosts
do
echo ==================== $host ====================
#5.Check if file exists
# shellcheck disable=SC2068
for file in $@
do
#6.Ignore --host argument
if [[ "$file" == "--host="* ]]
then
continue
fi
if [ -e "$file" ]; then
# Get parent directory
pdir=$(cd -P $(dirname $file); pwd)
#Get file name
fname=$(basename $file)
ssh $host "mkdir -p $pdir"
rsync -av $pdir/$fname $host:$pdir
else
echo "$file does not exist!"
fi
done
done
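xsync 在不加 --host 参数时会读取 ~/bin/host_ips 作为默认主机列表(每行一个主机名)。下面是一个使用示例(主机列表和分发路径按实际集群调整):
# 生成默认主机列表,供 xsync/xcall 默认使用
cat > ~/bin/host_ips <<EOF
hadoop112
hadoop113
EOF
# 分发到默认主机列表中的所有机器
xsync /opt/module/jdk1.8.0_371/
# 只分发到指定主机
xsync --host=hadoop112,hadoop113 ~/bin/xsync ~/bin/xcall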
3.3 xcall
#! /bin/bash
if [ $# -lt 1 ]
then
echo "Not Enough Arguement!"
echo "command example: xcall -h=hadoop112,hadoop113 ls or xcall ls"
exit;
fi
#Check if -h argument is provided
if [[ "$1" == "-h="* ]]; then
#Parse hosts from argument
hosts=$(echo "${1#*=}"| tr ',' '\n')
else
hosts=$(cat ~/bin/host_ips)
fi
for i in $hosts
do
echo --------- "$i" ----------
ssh "$i" "$*"
done
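xcall 的典型用法示例:
# 在 host_ips 中的所有节点执行 jps,快速检查各节点进程
xcall jps
# 只在指定主机上执行命令
xcall -h=hadoop112,hadoop113 "df -h /opt"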
3.4 hdp.sh
#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit ;
fi
case $1 in
"start")
echo " =================== 启动 hadoop集群 ==================="
echo " --------------- 启动 hdfs ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/sbin/start-dfs.sh"
echo " --------------- 启动 yarn ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/sbin/start-yarn.sh"
echo " --------------- 启动 historyserver ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/bin/mapred --daemon start historyserver"
;;
"stop")
echo " =================== 关闭 hadoop集群 ==================="
echo " --------------- 关闭 historyserver ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/bin/mapred --daemon stop historyserver"
echo " --------------- 关闭 yarn ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/sbin/stop-yarn.sh"
echo " --------------- 关闭 hdfs ---------------"
ssh hadoop111 "/opt/module/hadoop-3.3.1/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error..."
;;
esac
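hdp.sh 的使用示例(启动后可结合 xcall 验证各节点进程是否与集群规划一致):
hdp.sh start
# 确认 NameNode/DataNode/ResourceManager/NodeManager 等进程已按规划启动
xcall jps
hdp.sh stop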
3.4.1 core-site.xml
<!-- 指定NameNode的地址 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop111:8020</value>
</property>
<!-- 指定hadoop数据的临时存储目录 -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.3.1/tmp/</value>
</property>
<!-- 配置HDFS网页登录使用的静态用户为root -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
<!-- 配置该root(superUser)允许通过代理访问的主机节点 -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<!-- 配置该root(superUser)允许通过代理用户所属组 -->
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<!-- 配置该root(superUser)允许通过代理的用户-->
<property>
<name>hadoop.proxyuser.root.users</name>
<value>*</value>
</property>
<!-- OBS -->
<property>
<name>fs.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBS</value>
</property>
<property>
<name>fs.obs.access.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.obs.secret.key</name>
<value>xxx</value>
</property>
<property>
<name>fs.obs.endpoint</name>
<value>obs.cn-north-4.myhuaweicloud.com</value>
</property>
<!--
<property>
<name>fs.obs.buffer.dir</name>
<value>/srv/Bigdata/hadoop/data1/obs,
/srv/Bigdata/hadoop/data2/obs,
/srv/Bigdata/hadoop/data3/obs
</value>
</property>
-->
<property>
<name>fs.obs.bufferdir.verify.enable</name>
<value>FALSE</value>
</property>
<property>
<name>fs.obs.readahead.policy</name>
<value>advance</value>
</property>
<property>
<name>fs.obs.readahead.range</name>
<value>4194304</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1440</value>
</property>
以下是另一套较完整的 core-site.xml 参考配置(hadoop101 集群,额外包含 OBS、OSS-HDFS、TOS、JuiceFS、CHDFS、COSN、ViewFS 等文件系统连接器配置):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop101:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop/data/tmp</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.users</name>
<value>*</value>
</property>
<!-- OBS start -->
<property>
<name>fs.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.obs.impl</name>
<value>org.apache.hadoop.fs.obs.OBS</value>
</property>
<property>
<name>fs.obs.access.key</name>
<value>JYLLB78GVN5IOIS1TVNU</value>
</property>
<property>
<name>fs.obs.secret.key</name>
<value>ektm4sRurdlePtE4vGguTyUY0HCdqrYWJZ3oX22w</value>
</property>
<!--临时Token -->
<!--
<property>
<name>fs.obs.session.token</name>
<value></value>
</property>
-->
<!--
<property>
<name>fs.obs.security.provider</name>
<value>com.obs.services.EcsObsCredentialsProvider</value>
</property>
-->
<property>
<name>fs.obs.endpoint</name>
<value>obs.cn-north-9.myhuaweicloud.com</value>
</property>
<property>
<name>fs.obs.buffer.dir</name>
<value>/opt/module/hadoop/data/obs</value>
</property>
<property>
<name>fs.obs.bufferdir.verify.enable</name>
<value>FALSE</value>
</property>
<!--OBSA Read -->
<property>
<name>fs.obs.readahead.policy</name>
<value>advance</value>
</property>
<property>
<name>fs.obs.readahead.range</name>
<value>1048576</value>
<!--<value>4194304</value>-->
</property>
<property>
<name>fs.obs.readahead.max.number</name>
<value>4</value>
</property>
<!-- OBSA Write -->
<property>
<name>fs.obs.fast.upload.buffer</name>
<value>disk</value>
<!--<value>bytebuffer</value>-->
</property>
<property>
<name>fs.obs.multipart.size</name>
<value>104857600</value>
</property>
<property>
<name>fs.obs.fast.upload.active.blocks</name>
<value>4</value>
</property>
<!-- OBSA Delete -->
<property>
<name>fs.obs.trash.enable</name>
<value>true</value>
</property>
<property>
<name>fs.obs.trash.dir</name>
<value>/user/root/.Trash/Current</value>
</property>
<!-- OBS end -->
<!-- 回收站 -->
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1440</value>
</property>
<!-- OSS-HDFS -->
<property>
<name>fs.AbstractFileSystem.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOSS</value>
</property>
<property>
<name>fs.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
</property>
<property>
<name>fs.oss.accessKeyId</name>
<value>LTAI5tRk9h5PyxvrfSvvdLwm</value>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>8uRVoh4DFNcEKTFl27NJ8aq5eIoIkt</value>
</property>
<property>
<name>fs.oss.endpoint</name>
<value>oss-cn-hangzhou.aliyuncs.com</value>
</property>
<!-- TOS config -->
<!--Required properties-->
<!--Filesystem implementation class for TOS-->
<property>
<name>fs.AbstractFileSystem.tos.impl</name>
<value>io.proton.fs.ProtonFS</value>
</property>
<property>
<name>fs.tos.impl</name>
<value>io.proton.fs.ProtonFileSystem</value>
</property>
<property>
<name>fs.tos.endpoint</name>
<value>http://tos-cn-beijing.volces.com</value>
</property>
<property>
<name>proton.cache.enable</name>
<value>false</value>
</property>
<!-- Credential provider mode 2: static AK/SK for all buckets -->
<property>
<name>fs.tos.credentials.provider</name>
<value>io.proton.common.object.tos.auth.SimpleCredentialsProvider</value>
</property>
<property>
<name>fs.tos.access-key-id</name>
<value>AKLTYTJiOWYwZTY2ZDQ0NDM4ODkwZWQ2YzljODBlZTQzZDk</value>
</property>
<property>
<name>fs.tos.secret-access-key</name>
<value>WVdVNFpUQTNOVFl5T0Rjd05EZzBNbUU0TTJJd05ERXpOamN6WkdJd09Uaw==</value>
</property>
<!--TOS Client configuration-->
<!--we can overwrite these properties to optimize TOS read and write performance-->
<property>
<name>fs.tos.http.maxConnections</name>
<value>1024</value>
</property>
<!-- JuiceFS-Local -->
<property>
<name>fs.jfs.impl</name>
<value>io.juicefs.JuiceFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.jfs.impl</name>
<value>io.juicefs.JuiceFS</value>
</property>
<property>
<name>juicefs.meta</name>
<value>mysql://root:000000@(hadoop101:3306)/jfs</value>
<!--<value>redis://hadoop101:6379/1</value>-->
</property>
<property>
<name>juicefs.cache-dir</name>
<value>/opt/module/hadoop/data/jfs</value>
</property>
<property>
<name>juicefs.cache-size</name>
<value>1024</value>
</property>
<property>
<name>juicefs.access-log</name>
<value>/tmp/juicefs.access.log</value>
</property>
<!-- JuiceFS-Online -->
<!--
<property>
<name>fs.jfs.impl</name>
<value>com.juicefs.JuiceFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.jfs.impl</name>
<value>com.juicefs.JuiceFS</value>
</property>
<property>
<name>juicefs.token</name>
<value>bb596db1e2019ae5f1ce48d3c950439a426df327</value>
</property>
<property>
<name>juicefs.accesskey</name>
<value>PHIEYFVTHHPA2SW5U446</value>
</property>
<property>
<name>juicefs.secretkey</name>
<value>aQxf03r1AE2FyvbMsLV3tDfxHQigClXWWNjq2712</value>
</property>
-->
<!-- CHDFS -->
<!--
<property>
<name>fs.AbstractFileSystem.ofs.impl</name>
<value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
<name>fs.ofs.impl</name>
<value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<property>
<name>fs.ofs.tmp.cache.dir</name>
<value>/opt/module/hadoop/data/chdfs</value>
</property>
<property>
<name>fs.ofs.user.appid</name>
<value>1303872390</value>
</property>
<property>
<name>fs.ofs.upload.flush.flag</name>
<value>false</value>
</property>
-->
<!-- COSN-->
<property>
<name>fs.cosn.impl</name>
<value>org.apache.hadoop.fs.CosFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.cosn.impl</name>
<value>org.apache.hadoop.fs.CosN</value>
</property>
<property>
<name>fs.cosn.bucket.region</name>
<value>ap-beijing</value>
</property>
<property>
<name>fs.cosn.trsf.fs.ofs.bucket.region</name>
<value>ap-beijing</value>
</property>
<property>
<name>fs.ofs.bucket.region</name>
<value>ap-beijing</value>
</property>
<property>
<name>fs.cosn.credentials.provider</name>
<value>org.apache.hadoop.fs.auth.SimpleCredentialProvider</value>
</property>
<property>
<name>fs.cosn.userinfo.secretId</name>
<value>AKIDNfpmUyEF2jDWpwrWw8IPLGzTiaVSK3e0</value>
</property>
<property>
<name>fs.cosn.userinfo.secretKey</name>
<value>AIGB75ze5QEsKlCgyJTQQSUZghZEewjL</value>
</property>
<!--配置账户的 appid,可登录腾讯云控制台(https://console.cloud.tencent.com/developer)查看-->
<property>
<name>fs.cosn.trsf.fs.ofs.user.appid</name>
<value>1303872390</value>
</property>
<property>
<name>fs.cosn.trsf.fs.ofs.tmp.cache.dir</name>
<value>/opt/module/hadoop/data/cos</value>
</property>
<property>
<name>fs.cosn.tmp.dir</name>
<value>/opt/module/hadoop/data/hadoop_cos</value>
</property>
<property>
<name>fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl</name>
<value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
<name>fs.cosn.trsf.fs.ofs.impl</name>
<value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<!-- ViewFS -->
<property>
<name>fs.viewfs.mounttable.cluster.link./hdfs</name>
<value>hdfs://hadoop101:8020/hdfs</value>
</property>
<property>
<name>fs.viewfs.mounttable.cluster.link./obs</name>
<value>obs://bigdata-teach/obs</value>
</property>
</configuration>
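相关连接器的 jar 放入 Hadoop classpath 并重启后,可以直接用 hadoop fs 验证对象存储是否可访问。下面是一个简单示例(以上文 ViewFS 挂载用到的 obs://bigdata-teach 桶为例,前提是 AK/SK、endpoint 配置正确):
# 验证 OBS 连接器
hadoop fs -ls obs://bigdata-teach/
# 验证 ViewFS 挂载表(cluster 为上文挂载表名)
hadoop fs -ls viewfs://cluster/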
3.4.2 hdfs-site.xml
<!-- nn web端访问地址-->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop111:9870</value>
</property>
<!-- 2nn web端访问地址-->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop111:9868</value>
</property>
<!-- 测试环境指定HDFS副本的数量 -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- NameNode 数据存储目录,默认:file://${hadoop.tmp.dir}/dfs/name -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/module/hadoop-3.3.1/nn</value>
</property>
<!-- DataNode 数据存储目录,默认:file://${hadoop.tmp.dir}/dfs/data -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/module/hadoop-3.3.1/dn</value>
</property>
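分发配置并重启 HDFS 后,可以用下面的命令确认副本数和 DataNode 状态(示例):
# 查看当前生效的副本数配置
hdfs getconf -confKey dfs.replication
# 查看 DataNode 注册情况和容量
hdfs dfsadmin -report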
3.4.3 yarn-site.xml
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/module/hadoop-3.3.1/nm-local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/opt/module/hadoop-3.3.1/nm-log</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>24576</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>24576</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop111</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
<!--
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
-->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop111:19888/jobhistory/logs</value>
</property>
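重启 YARN 后,可以确认 NodeManager 注册情况和资源上限是否与上面的配置一致(示例;8088 为 ResourceManager Web UI 默认端口,若有调整请按实际端口访问):
# 查看所有 NodeManager 及其状态
yarn node -list -all
# 通过 RM 的 REST 接口查看集群总内存/总 vcore
curl -s http://hadoop111:8088/ws/v1/cluster/metrics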
3.4.4 mapred-site.xml
<!--配置mapreduce框架用yarn启动-->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop111:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop111:19888</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xms1024M -Xmx3584M</value>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.task.timeout</name>
<value>1200000</value>
</property>
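可以跑一个 Hadoop 自带的示例作业来验证 MapReduce 与 JobHistoryServer 配置(jar 包版本号按实际安装目录调整):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar pi 2 10
# 作业结束后,在 http://hadoop111:19888 的 JobHistory 页面确认能看到该作业日志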
3.5 zk.sh
#!/bin/bash
hosts=$(cat ~/bin/host_ips)
case $1 in
"start") {
for host in $hosts; do
echo "-------- start zookeeper $host --------"
ssh "$host" "zkServer.sh start"
done
} ;;
"stop") {
for host in $hosts; do
echo "-------- stop zookeeper $host --------"
ssh "$host" "zkServer.sh stop"
done
} ;;
"status") {
for host in $hosts; do
echo "-------- status zookeeper $host --------"
ssh "$host" "zkServer.sh status"
done
} ;;
esac
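zk.sh 的使用示例(前提:各节点已将 zookeeper 的 bin 目录加入 PATH,否则把脚本中的 zkServer.sh 改为绝对路径):
zk.sh start
# 查看各节点角色(leader/follower)
zk.sh status
zk.sh stop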
3.5.1 zookeeper 安装配置
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
mv /opt/module/apache-zookeeper-3.5.7-bin/ /opt/module/zookeeper
mkdir -p /opt/module/zookeeper/zkData
vim /opt/module/zookeeper/zkData/myid
在 myid 中写入本机编号,三台服务器分别为 1、2、3(需与下面 zoo.cfg 中 server.X 的编号对应)
mv /opt/module/zookeeper/conf/zoo_sample.cfg /opt/module/zookeeper/conf/zoo.cfg
vim /opt/module/zookeeper/conf/zoo.cfg
########cluster###########
dataDir=/opt/module/zookeeper/zkData
server.1=hadoop111:2888:3888
server.2=hadoop112:2888:3888
server.3=hadoop113:2888:3888
配置完成后分发到其余两台机器,并分别修改各自的 myid(见下面的示例):
xsync /opt/module/zookeeper/
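下面是从任意一台机器批量写入各节点 myid 的示例(echo 的编号需与 zoo.cfg 中 server.X 一致;路径为本文的安装路径):
ssh hadoop111 'echo 1 > /opt/module/zookeeper/zkData/myid'
ssh hadoop112 'echo 2 > /opt/module/zookeeper/zkData/myid'
ssh hadoop113 'echo 3 > /opt/module/zookeeper/zkData/myid'
# 确认三台机器的编号
xcall cat /opt/module/zookeeper/zkData/myid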
3.5.2 zookeeper 客户端命令行
bin/zkCli.sh -server hadoop111:2181
| 命令基本语法 | 功能描述 |
| --- | --- |
| help | 显示所有操作命令 |
| ls path | 查看当前 znode 的子节点;-w 监听子节点变化;-s 附加次级信息 |
| create | 普通创建;-s 含有序列;-e 临时节点(客户端断开或会话超时后消失) |
| get path | 获得节点的值;-w 监听节点内容变化;-s 获取更多节点信息,比如版本、ACL(访问控制列表)等 |
| set | 设置节点的具体值 |
| stat | 查看节点状态,如:stat /my_node |
| delete | 删除节点 |
| deleteall | 递归删除节点 |
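下面是在 zkCli.sh 会话中的一组典型操作示例(节点名 /my_node 仅为演示):
create /my_node "hello"
get -s /my_node
set /my_node "world"
stat /my_node
ls -w /
deleteall /my_node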
3.6 lg.sh
#!/bin/bash
for i in hadoop111; do
echo "========== $i =========="
ssh $i "cd /opt/module/applog/; java -jar gmall-remake-mock-2023-05-15-3.jar $1 $2 $3 >/dev/null 2>&1 &"
done
将 G:\Bigdata\Projects\大数据项目之电商数仓V6.0\12.mock 目录下的 application.yml、gmall-remake-mock-2023-05-15-3.jar、logback.xml、path.json 上传到 hadoop111 的 /opt/module/applog/ 目录中。
测试:lg.sh test 10,只生成 60 条日志数据,并写到 /opt/module/applog/log/app.log 中。
java -jar gmall-remake-mock-2023-05-15-3.jar test 100 2022-06-08
① 增加test参数为测试模式,只生成用户行为数据不生成业务数据。
② 100 为产生的用户 session 数,一个 session 默认产生 1 条启动日志和 5 条页面访问日志。
③ 第三个参数为日志数据的日期,测试模式下不会加载配置文件,要指定数据日期只能通过命令行传参实现。
④ 三个参数的顺序必须与示例保持一致
⑤ 第二个参数和第三个参数可以省略,如果test后面不填写参数,默认为1000.
lg.sh 不接参数时会同时生成用户行为日志和业务数据库数据。
====生成模拟数据====
① 修改hadoop111节点的/opt/module/applog/application.yml文件,将mock.date、mock.clear.busi、mock.clear.user、mock.new.user、mock.log.db.enable五个参数调整为如下的值。
#业务日期
mock.date: "2022-06-04"
#是否重置业务数据
mock.clear.busi: 1
#是否重置用户数据
mock.clear.user: 1
# 批量生成新用户数量
mock.new.user: 100
# 日志是否写入数据库一份 写入z_log表中
mock.log.db.enable: 0
② 执行数据生成脚本,生成第一天2022-06-04的历史数据。
$ lg.sh
③ 修改/opt/module/applog/application.yml文件,将mock.date、mock.clear.busi、mock.clear.user、mock.new.user四个参数调整为如下所示的值。
#业务日期
mock.date: "2022-06-05"
#是否重置业务数据
mock.clear.busi: 0
#是否重置用户数据
mock.clear.user: 0
# 批量生成新用户
mock.new.user: 0
④ 执行数据生成脚本,生成第二天2022-06-05的历史数据。
$ lg.sh
⑤ 之后只修改/opt/module/applog/application.yml文件中的mock.date参数,依次改为2022-06-06、2022-06-07,并分别生成对应日期的数据(可参考下面的脚本示例批量执行)。
⑥ 删除/origin_data/gmall/log目录,将⑤中提到的参数修改为2022-06-08,并生成当日模拟数据。
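如果不想每天手工修改 mock.date,可以用 sed 批量改写后再调用 lg.sh。下面是一个示意脚本,假设 application.yml 中 mock.date 的写法与上文一致(日期列表按需调整):
#!/bin/bash
# 依次生成 2022-06-06、2022-06-07 两天的历史数据
for dt in 2022-06-06 2022-06-07; do
ssh hadoop111 "sed -i 's/^mock.date:.*/mock.date: \"$dt\"/' /opt/module/applog/application.yml"
lg.sh
done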
3.7 kf.sh
#!/bin/bash
if [ $# -lt 1 ]
then
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] |list |delete [topic] |describe [topic]}"
exit
fi
case $1 in
start)
for i in hadoop111 hadoop112 hadoop113
do
echo "====================> START $i KF <===================="
ssh $i kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties
done
;;
stop)
for i in hadoop111 hadoop112 hadoop113
do
echo "====================> STOP $i KF <===================="
ssh $i kafka-server-stop.sh
done
;;
create)
if [ $# -eq 1 ]; then
echo "Usage: kf.sh create topic_name [num_partitions [num_replications]]"
elif [ $# -eq 2 ]; then
kafka-topics.sh --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --create --topic $2
elif [ $# -eq 3 ]; then
kafka-topics.sh --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --create --partitions $3 --topic $2
elif [ $# -eq 4 ]; then
kafka-topics.sh --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --create --partitions $3 --replication-factor $4 --topic $2
else
echo "Incorrect number of arguments. Usage: kf.sh create topic_name [num_partitions [num_replications]]"
fi
;;
kc)
if [ $# -eq 2 ]
then
kafka-console-consumer.sh --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --topic $2
elif [ $# -ge 3 ]
then
topic=$2
shift 2
kafka-console-consumer.sh --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --topic $topic "$@"
else
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] | list |delete [topic] |describe [topic]}"
fi
;;
kp)
if [ $2 ]
then
kafka-console-producer.sh --broker-list hadoop111:9092,hadoop112:9092,hadoop113:9092 --topic $2
else
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] |list |delete [topic] |describe [topic]}"
fi
;;
list)
kafka-topics.sh --list --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092
;;
describe)
if [ $2 ]
then
kafka-topics.sh --describe --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --topic $2
else
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] |list |delete [topic] |describe [topic]}"
fi
;;
delete)
if [ $2 ]
then
kafka-topics.sh --delete --bootstrap-server hadoop111:9092,hadoop112:9092,hadoop113:9092 --topic $2
else
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] |list |delete [topic] |describe [topic]}"
fi
;;
*)
echo "Usage: kf.sh {start|stop|kc [topic]|kp [topic] |list |delete [topic] |describe [topic]}"
exit
;;
esac
3.7.1 脚本测试
kf.sh list
kf.sh create topic_db
kf.sh kc topic_log
kf.sh kp topic_log
kf.sh describe topic_log
kf.sh delete topic_db
3.7.2 Kafka安装配置
tar -zxvf kafka_2.12-3.3.1.tgz -C /opt/module/
cd /opt/module/
mv kafka_2.12-3.3.1/ kafka
[seven@hadoop111 kafka]$ cd config/
[seven@hadoop111 config]$ vim server.properties
输入以下内容:
# broker的全局唯一编号,不能重复,只能是数字,比如 1,2,3。
broker.id=1
#broker对外暴露的IP和端口 (每个节点单独配置)
advertised.listeners=PLAINTEXT://hadoop111:9092
#advertised.listeners=PLAINTEXT://hadoop112:9092
#advertised.listeners=PLAINTEXT://hadoop113:9092
#处理网络请求的线程数量
num.network.threads=3
#用来处理磁盘IO的线程数量
num.io.threads=8
#发送套接字的缓冲区大小
socket.send.buffer.bytes=102400
#接收套接字的缓冲区大小
socket.receive.buffer.bytes=102400
#请求套接字的缓冲区大小
socket.request.max.bytes=104857600
#kafka运行日志(数据)存放的路径,路径不需要提前创建,kafka自动帮你创建,可以配置多个磁盘路径,路径与路径之间可以用","分隔
log.dirs=/opt/module/kafka/datas
#topic在当前broker上的分区个数
num.partitions=1
#用来恢复和清理data下数据的线程数量
num.recovery.threads.per.data.dir=1
# offsets 主题(__consumer_offsets)的副本数,默认是1个副本
offsets.topic.replication.factor=1
#segment文件保留的最长时间,超时将被删除
log.retention.hours=168
#每个segment文件的大小,默认最大1G
log.segment.bytes=1073741824
# 检查过期数据的时间,默认5分钟检查一次是否数据过期
log.retention.check.interval.ms=300000
#配置连接Zookeeper集群地址(在zk根目录下创建/kafka,方便管理)
zookeeper.connect=hadoop111:2181,hadoop112:2181,hadoop113:2181/kafka
分发安装包,并修改/opt/module/kafka/config/server.properties中的broker.id及advertised.listeners
xsync kafka/
[seven@hadoop112 module]$ vim kafka/config/server.properties
修改:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=2
#broker对外暴露的IP和端口 (每个节点单独配置)
advertised.listeners=PLAINTEXT://hadoop112:9092
[seven@hadoop113 module]$ vim kafka/config/server.properties
修改:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=3
#broker对外暴露的IP和端口 (每个节点单独配置)
advertised.listeners=PLAINTEXT://hadoop113:9092
配置环境变量:vim /etc/profile.d/my_env.sh
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
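以下是一个简单的验证与启动顺序示例(假设三台节点的环境变量均已分发并生效;/etc/profile.d 下文件的分发通常需要 root 权限):
# 各节点验证 Kafka 命令与版本
xcall "source /etc/profile.d/my_env.sh; kafka-topics.sh --version"
# Kafka 依赖 Zookeeper,先启动 ZK,再启动 Kafka
zk.sh start
kf.sh start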
3.7.3 Kafka主题命令行
/opt/module/kafka/bin/kafka-topics.sh
| 参数 | 描述 |
| --- | --- |
| --bootstrap-server <String: server to connect to> | 连接的 Kafka Broker 主机名称和端口号 |
| --topic <String: topic> | 操作的 topic 名称 |
| --create | 创建主题。例:kafka-topics.sh --bootstrap-server hadoop111:9092 --create --partitions 1 --replication-factor 3 --topic first |
| --delete | 删除主题 |
| --alter | 修改主题(注意:分区数只能增加,不能减少)。例:kafka-topics.sh --bootstrap-server hadoop111:9092 --alter --topic first --partitions 3 |
| --list | 查看所有主题。例:kafka-topics.sh --bootstrap-server hadoop111:9092 --list |
| --describe | 查看主题详细描述。例:kafka-topics.sh --bootstrap-server hadoop111:9092 --describe --topic first |
| --partitions <Integer: # of partitions> | 设置分区数 |
| --replication-factor <Integer: replication factor> | 设置分区副本数。例:kafka-topics.sh --create --topic test_topic --bootstrap-server hadoop111:9092 --partitions 1 --replication-factor 1 |
| --config <String: name=value> | 更新系统默认的配置 |
下面以 --config <String: name=value>(更新主题配置)为例进行说明。
1. 创建主题
首先,确保您有一个主题可以进行配置更新:
bin/kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
2. 查看当前主题配置
在更新之前,您可以查看当前主题的配置:
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server localhost:9092
3. 更新主题配置
例如,将消息保留时间设置为 7 天(604800000 毫秒),并将清理策略设置为 compact。注意:使用 --bootstrap-server 时,kafka-topics.sh 的 --alter 不再支持 --config,动态修改主题配置需要改用 kafka-configs.sh:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test_topic --alter --add-config retention.ms=604800000,cleanup.policy=compact
4. 验证更新
更新配置后,您可以再次查看主题的配置,以确认更改是否生效:
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server localhost:9092
配置选项解释
- retention.ms: 消息保留时间,以毫秒为单位,指示 Kafka 在删除消息之前可以保存的时间。
- cleanup.policy: 指定清理策略,compact 意味着 Kafka 将保留每个键的最新消息,而不是按照时间清理。
完整示例
以下是整个流程的完整命令:
# 创建主题
bin/kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
# 查看当前主题配置
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server localhost:9092
# 更新主题配置(新版本需使用 kafka-configs.sh)
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test_topic --alter --add-config retention.ms=604800000,cleanup.policy=compact
# 验证更新
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server localhost:9092
3.7.4 Kafka生产者命令行
/opt/module/kafka/bin/kafka-console-producer.sh
(1) 创建 Kafka 主题
在发送消息之前,需要创建一个主题。例如,创建一个名为 test_topic 的主题:
kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
(2)发送消息
使用 Kafka 提供的命令行生产者工具 kafka-console-producer.sh 发送消息到指定主题:
kafka-console-producer.sh --topic test_topic --bootstrap-server localhost:9092
(3)发送带有键的消息
如果你希望发送带有键的消息,可以使用 --property 选项指定键和值的分隔符。以下示例将键值对消息发送到主题:
kafka-console-producer.sh --topic test_topic --bootstrap-server localhost:9092 --property "parse.key=true" --property "key.separator=:"
在输入时,使用 : 分隔键和值。例如:
key1:Hello, Kafka with key!
key2:Another message with key.
(4)发送 JSON 格式的消息
kafka-console-producer.sh 会把输入的每一行原样作为消息值发送,因此发送 JSON 消息不需要额外的序列化配置,直接输入 JSON 字符串即可:
kafka-console-producer.sh --topic test_topic --bootstrap-server localhost:9092
然后输入 JSON 格式的消息:
{"message": "Hello, JSON Kafka!"}
消费到的消息:
{"message": "Hello, JSON Kafka!"}
3.7.5 Kafka消费者命令行
/opt/module/kafka/bin/kafka-console-consumer.sh
| 参数 | 描述 |
| --- | --- |
| --bootstrap-server <String: server to connect to> | 连接的 Kafka Broker 主机名称和端口号 |
| --topic <String: topic> | 操作的 topic 名称。例:kafka-console-consumer.sh --bootstrap-server hadoop111:9092 --topic first |
| --from-beginning | 从头开始消费。例:kafka-console-consumer.sh --bootstrap-server hadoop111:9092 --from-beginning --topic first |
| --offset | 设置要消费的偏移量(需同时指定 --partition):earliest 从最早的可用消息开始;latest 从最新的消息开始;也可以指定具体偏移量。例:kafka-console-consumer.sh --bootstrap-server hadoop111:9092 --topic first --partition 2 --offset 1234 |
| --group <String: consumer group id> | 指定消费者组名称 |
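一个结合消费者组的示例(消费后可用 kafka-consumer-groups.sh 查看该组的消费位点,组名仅为演示):
# 以 test_group 消费者组从头消费 first 主题
kafka-console-consumer.sh --bootstrap-server hadoop111:9092 --topic first --group test_group --from-beginning
# 查看该消费者组各分区的 offset 与 lag
kafka-consumer-groups.sh --bootstrap-server hadoop111:9092 --describe --group test_group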
3.8 f1.sh Flume采集原始日志
3.8.1 Flume安装配置
tar -zxf /opt/software/apache-flume-1.10.1-bin.tar.gz -C /opt/module/
mv /opt/module/apache-flume-1.10.1-bin /opt/module/flume
修改conf目录下的log4j2.xml配置文件,配置日志文件路径
vim log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="ERROR">
<Properties>
<Property name="LOG_DIR">/opt/module/flume/log</Property>
</Properties>
<Appenders>
<Console name="Console" target="SYSTEM_ERR">
<PatternLayout pattern="%d (%t) [%p - %l] %m%n" />
</Console>
<RollingFile name="LogFile" fileName="${LOG_DIR}/flume.log" filePattern="${LOG_DIR}/archive/flume.log.%d{yyyyMMdd}-%i">
<PatternLayout pattern="%d{dd MMM yyyy HH:mm:ss,SSS} %-5p [%t] (%C.%M:%L) %equals{%x}{[]}{} - %m%n" />
<Policies>
<!-- Roll every night at midnight or when the file reaches 100MB -->
<SizeBasedTriggeringPolicy size="100 MB