1. Environment Setup
1.1 Prepare three virtual machines: hadoop102, hadoop103, and hadoop104
1.1.1 Vagrant VM configuration
- hadoop102
Vagrant.configure("2") do |config|
config.vm.box = "centos/7"
config.vm.hostname = "hadoop102"
config.vm.network "private_network", ip: "192.168.10.102"
# VM provider settings
config.vm.provider "virtualbox" do |vb|
vb.gui = false
vb.memory = "16384"
vb.cpus = 2
end
# Initial provisioning script
config.vm.provision "shell", inline: <<-SHELL
yum update -y
yum install -y vim wget
sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd
# Disable the firewall
systemctl stop firewalld.service
systemctl disable firewalld.service
# Disable SELinux
sed -i 's/enforcing/disabled/' /etc/selinux/config
sysctl --system
# Enable time synchronization
yum install ntpdate -y
ntpdate time.windows.com
# Set the time zone
timedatectl set-timezone Asia/Shanghai
# Install lrzsz
yum install -y lrzsz
# Switch the yum repo to the Aliyun mirror
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum makecache
SHELL
end
- hadoop103
Vagrant.configure("2") do |config|
config.vm.box = "centos/7"
config.vm.hostname = "hadoop103"
config.vm.network "private_network", ip: "192.168.10.103"
# VM provider settings
config.vm.provider "virtualbox" do |vb|
vb.gui = false
vb.memory = "8192"
vb.cpus = 2
end
# Initial provisioning script
config.vm.provision "shell", inline: <<-SHELL
yum update -y
yum install -y vim wget
sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd
# Disable the firewall
systemctl stop firewalld.service
systemctl disable firewalld.service
# Disable SELinux
sed -i 's/enforcing/disabled/' /etc/selinux/config
sysctl --system
# Enable time synchronization
yum install ntpdate -y
ntpdate time.windows.com
# Set the time zone
timedatectl set-timezone Asia/Shanghai
# Install lrzsz
yum install -y lrzsz
# Switch the yum repo to the Aliyun mirror
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum makecache
SHELL
end
- hadoop104
Vagrant.configure("2") do |config|
config.vm.box = "centos/7"
config.vm.hostname = "hadoop104"
config.vm.network "private_network", ip: "192.168.10.104"
# VM provider settings
config.vm.provider "virtualbox" do |vb|
vb.gui = false
vb.memory = "8192"
vb.cpus = 2
end
# Initial provisioning script
config.vm.provision "shell", inline: <<-SHELL
yum update -y
yum install -y vim wget
sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd
# Disable the firewall
systemctl stop firewalld.service
systemctl disable firewalld.service
# Disable SELinux
sed -i 's/enforcing/disabled/' /etc/selinux/config
sysctl --system
# Enable time synchronization
yum install ntpdate -y
ntpdate time.windows.com
# Set the time zone
timedatectl set-timezone Asia/Shanghai
# Install lrzsz
yum install -y lrzsz
# Switch the yum repo to the Aliyun mirror
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum makecache
SHELL
end
1.1.2 Bring up the virtual machines
vagrant up
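Assuming each VM's Vagrantfile lives in its own directory named after the host (hadoop102, hadoop103, hadoop104 — an assumption, since the directory layout is not shown above), a minimal sketch for bringing them all up and spot-checking one of them:
# Assumed layout: ./hadoop102/Vagrantfile, ./hadoop103/Vagrantfile, ./hadoop104/Vagrantfile
for vm in hadoop102 hadoop103 hadoop104; do
    (cd "$vm" && vagrant up)        # boot and provision each VM
done
(cd hadoop102 && vagrant status)    # should report the machine as "running"
(cd hadoop102 && vagrant ssh)       # log in and spot-check the provisioning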
1.2 Change the root password
[root@hadoop102 ~]# passwd
Changing password for user root.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.
1.3 Edit the hosts file
Add the following mappings to /etc/hosts on all three machines:
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
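One way to append these mappings on a node (a sketch; run it on each of the three machines, or distribute the file later with the xsync script from section 1.6):
sudo tee -a /etc/hosts <<'EOF'
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
EOF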
1.4 Create the hadoop user
# Create the user
useradd hadoop
# Set its password
passwd hadoop
vim /etc/sudoers
# Grant hadoop passwordless sudo (append this line)
hadoop ALL=(ALL) NOPASSWD: ALL
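Editing /etc/sudoers directly works, but a slightly safer alternative (a sketch, not the method used above) is a drop-in file that visudo can syntax-check:
echo 'hadoop ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/hadoop
sudo chmod 440 /etc/sudoers.d/hadoop
sudo visudo -c    # verify the syntax; a broken sudoers file can lock out sudo entirely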
1.5 Configure passwordless SSH
Passwordless login has to be set up from every one of the three servers to all three servers, including to the server itself.
- hadoop102
[hadoop@hadoop102 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:rdNMUzGFduBxFh8fSavrF9px7CnLmDnHldiNnAG3zjg hadoop@hadoop102
The key's randomart image is:
+---[RSA 2048]----+
| =+*=.|
| .+*oo=|
| .o+ oo|
| . . + |
| S + *o=o|
| = .E.B*+|
| o o .o+.=|
| . .*= +.|
| +o++ |
+----[SHA256]-----+
[hadoop@hadoop102 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop102 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (192.168.10.103)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop102 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (192.168.10.104)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password:
Permission denied, please try again.
hadoop@hadoop104's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.
- hadoop103
[hadoop@hadoop103 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:apWniTAC+OP8VkuBklBAtGj+dd5wGJ7NKUIA8vFEY3E hadoop@hadoop103
The key's randomart image is:
+---[RSA 2048]----+
|=*+o*.E |
|=..=.o |
|++...o . |
|ooo o o *.. |
| .+.oo OS=. |
| o.o.o*+=+ |
| o. o+oo. |
| .... |
| .. |
+----[SHA256]-----+
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop102 (192.168.10.102)' can't be established.
ECDSA key fingerprint is SHA256:I2sJV0iLNvRNEeK50zp6/pYiwfmp+wiopWJiqiU2xrA.
ECDSA key fingerprint is MD5:94:99:9d:0f:a5:80:c4:2f:2c:af:88:cb:6a:12:91:65.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (127.0.1.1)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop103 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (192.168.10.104)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.
- hadoop104
[hadoop@hadoop104 ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:Ke7bWIHCwv1Q1NEjCYENcRayseJCGgQqBqdXk2xomzU hadoop@hadoop104
The key's randomart image is:
+---[RSA 2048]----+
|+..oB*B=.+ |
|+oo.EX. + o |
|=+o=o.. . . |
|==o+ . . . |
|o + = o S |
| . . = . . |
| o . |
| . + |
| +.. |
+----[SHA256]-----+
[hadoop@hadoop104 ~]$ ssh-copy-id hadoop102
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop102 (192.168.10.102)' can't be established.
ECDSA key fingerprint is SHA256:I2sJV0iLNvRNEeK50zp6/pYiwfmp+wiopWJiqiU2xrA.
ECDSA key fingerprint is MD5:94:99:9d:0f:a5:80:c4:2f:2c:af:88:cb:6a:12:91:65.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop102's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop102'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop104 ~]$ ssh-copy-id hadoop103
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop103 (192.168.10.103)' can't be established.
ECDSA key fingerprint is SHA256:31Xoy7utrPqMjmdoHyUW5dUKuUc+v53ynTGDq7EUGuE.
ECDSA key fingerprint is MD5:e3:78:13:54:24:0c:98:57:7b:2a:d9:ef:b6:9d:50:e0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop103's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop103'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@hadoop104 ~]$ ssh-copy-id hadoop104
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop104 (127.0.1.1)' can't be established.
ECDSA key fingerprint is SHA256:ifCGrWm+9p1fAWo6In8zFGePF4jAZVJXKJ9FRTHt8mU.
ECDSA key fingerprint is MD5:56:a8:07:d6:52:8d:13:7b:57:e0:fd:3d:1c:96:46:a3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop104's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'hadoop104'"
and check to make sure that only the key(s) you wanted were added.
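With all nine source/target pairs covered, a quick loop (a sketch; run it on each of the three machines) confirms that no password prompt is left anywhere. BatchMode makes ssh fail instead of prompting:
for host in hadoop102 hadoop103 hadoop104; do
    ssh -o BatchMode=yes "$host" hostname    # should print the hostname without asking for a password
done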
1.6 Create the xsync distribution script
[hadoop@hadoop102 ~]$ mkdir /home/hadoop/bin
Create /home/hadoop/bin/xsync with the following contents:
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
#2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    #3. Send every file or directory that was passed in
    for file in "$@"
    do
        #4. Check that the file exists
        if [ -e "$file" ]
        then
            #5. Get the parent directory (resolving symlinks)
            pdir=$(cd -P "$(dirname "$file")"; pwd)
            #6. Get the file name itself
            fname=$(basename "$file")
            ssh $host "mkdir -p $pdir"
            rsync -av "$pdir/$fname" $host:"$pdir"
        else
            echo "$file does not exist!"
        fi
    done
done
[hadoop@hadoop102 ~]$ cd /home/hadoop/bin
[hadoop@hadoop102 bin]$ chmod 777 xsync
[hadoop@hadoop102 bin]$ ll
total 4
-rwxrwxrwx. 1 hadoop hadoop 624 Oct 2 17:56 xsync
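Typical usage, for example pushing the whole ~/bin directory (including xsync itself) to the other nodes; on CentOS 7 the default ~/.bash_profile normally puts $HOME/bin on the PATH, which is why the script can then be called by name:
xsync /home/hadoop/bin    # rsyncs /home/hadoop/bin to hadoop102, hadoop103 and hadoop104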
1.7 Create working directories
# Create the working directories
sudo mkdir /opt/software
sudo mkdir /opt/module
# Hand ownership to the hadoop user
sudo chown -R hadoop:hadoop /opt/module /opt/software
1.8 Install the JDK
# Find and remove any existing Java packages
rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps
# Upload the JDK tarball and extract it to the target directory
tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
- Add environment variables
# Edit the my_env.sh script
[hadoop@hadoop102 jdk1.8.0_212]$ sudo vim /etc/profile.d/my_env.sh
# Contents
# JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
# Apply the environment variables
[hadoop@hadoop102 jdk1.8.0_212]$ source /etc/profile.d/my_env.sh
# Verify the installation
[hadoop@hadoop102 jdk1.8.0_212]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
- Distribute from hadoop102
[hadoop@hadoop102 module]$ sudo /home/hadoop/bin/xsync /opt/module/jdk1.8.0_212/
[hadoop@hadoop102 module]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
- hadoop103
[hadoop@hadoop103 opt]$ sudo chown -R hadoop:hadoop module/
[hadoop@hadoop103 opt]$ ll
total 0
drwxr-xr-x. 3 hadoop hadoop 26 Oct 2 18:45 module
[hadoop@hadoop103 opt]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop103 opt]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
- hadoop104
[hadoop@hadoop104 opt]$ sudo chown -R hadoop:hadoop module/
[hadoop@hadoop104 opt]$ ll
total 0
drwxr-xr-x. 3 hadoop hadoop 26 Oct 2 18:45 module
[hadoop@hadoop104 opt]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop104 opt]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
1.9 Notes on bash startup files
- /etc/profile
- /etc/profile.d/*.sh
- ~/.bash_profile
When you log in with a username and password (a login shell), all three of the files above are loaded; a non-login shell (for example a command run over SSH) only picks up the second, which is why the environment variables are placed in /etc/profile.d/my_env.sh. A quick check is sketched below.
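A way to check this on this cluster (a sketch; the expected value assumes the JDK setup from section 1.8 has already been distributed):
ssh hadoop103 'echo $JAVA_HOME'    # non-login shell: on CentOS 7 /etc/profile.d/*.sh is still sourced via /etc/bashrc
# Expected output: /opt/module/jdk1.8.0_212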
1.10 Generate mock data
# Hand ownership of /opt to the hadoop user
[hadoop@hadoop102 ~]$ sudo chown -R hadoop:hadoop /opt
# Create the log directory
[hadoop@hadoop102 opt]$ mkdir applog
To be completed.
1.11 Create the xcall script
# Edit the xcall.sh file
[hadoop@hadoop102 bin]$ vim /home/hadoop/bin/xcall.sh
# Contents
#! /bin/bash
for i in hadoop102 hadoop103 hadoop104
do
    echo ============$i================
    ssh $i "$*"
done
# Make it executable
[hadoop@hadoop102 bin]$ chmod 777 /home/hadoop/bin/xcall.sh
# Test the script
[hadoop@hadoop102 bin]$ xcall.sh jps
============hadoop102================
1290 Jps
============hadoop103================
1099 Jps
============hadoop104================
1106 Jps
2. Install the distributed cluster
2.1 Install Hadoop
2.1.1 Installation
# Upload the Hadoop tarball and extract it
[hadoop@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/
2.1.2 Configure environment variables
sudo vim /etc/profile.d/my_env.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
# Apply the changes
source /etc/profile.d/my_env.sh
2.1.3 Distribute to the other nodes
[hadoop@hadoop102 hadoop-3.1.3]$ sudo /home/hadoop/bin/xsync /opt/module/hadoop-3.1.3/
[hadoop@hadoop102 hadoop-3.1.3]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
[hadoop@hadoop103 ~]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop104 ~]$ source /etc/profile.d/my_env.sh
2.1.4 Cluster configuration
| | hadoop102 | hadoop103 | hadoop104 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
The NameNode, SecondaryNameNode and ResourceManager all consume a lot of memory, so they should not be placed on the same server.
# Change into the Hadoop configuration directory
cd /opt/module/hadoop-3.1.3/etc/hadoop
- core-site.xml
<!-- Address of the NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<!-- Base directory for Hadoop data -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
</property>
<!-- Static user for the HDFS web UI: hadoop -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
<!-- Hosts from which the hadoop user (superuser) may act as a proxy -->
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<!-- Groups whose members the hadoop user (superuser) may proxy -->
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<!-- Users the hadoop user (superuser) may proxy -->
<property>
<name>hadoop.proxyuser.hadoop.users</name>
<value>*</value>
</property>
- hdfs-site.xml
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<!-- SecondaryNameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
<!-- Use a replication factor of 1 for the test environment -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
- yarn-site.xml
<!-- Tell MapReduce to use the shuffle auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Hostname of the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<!-- Environment variables to inherit -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Minimum and maximum memory a YARN container may be allocated -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<!-- Total physical memory YARN may manage on each node -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- Disable YARN's virtual-memory limit check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
- mapred-site.xml
<!-- Run MapReduce programs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
- workers
hadoop102
hadoop103
hadoop104
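The workers file must contain exactly one hostname per line, with no extra spaces or blank lines; one way to write it safely (a sketch):
cat > /opt/module/hadoop-3.1.3/etc/hadoop/workers <<'EOF'
hadoop102
hadoop103
hadoop104
EOF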
2.1.5 Distribute the configuration files
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync core-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 68 bytes received 12 bytes 160.00 bytes/sec
total size is 1,758 speedup is 21.98
==================== hadoop103 ====================
sending incremental file list
core-site.xml
sent 1,177 bytes received 47 bytes 2,448.00 bytes/sec
total size is 1,758 speedup is 1.44
==================== hadoop104 ====================
sending incremental file list
core-site.xml
sent 1,177 bytes received 47 bytes 2,448.00 bytes/sec
total size is 1,758 speedup is 1.44
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync hdfs-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 68 bytes received 12 bytes 53.33 bytes/sec
total size is 1,246 speedup is 15.57
==================== hadoop103 ====================
sending incremental file list
hdfs-site.xml
sent 665 bytes received 47 bytes 1,424.00 bytes/sec
total size is 1,246 speedup is 1.75
==================== hadoop104 ====================
sending incremental file list
hdfs-site.xml
sent 665 bytes received 47 bytes 1,424.00 bytes/sec
total size is 1,246 speedup is 1.75
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync yarn-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 67 bytes received 12 bytes 158.00 bytes/sec
total size is 1,909 speedup is 24.16
==================== hadoop103 ====================
sending incremental file list
yarn-site.xml
sent 2,023 bytes received 41 bytes 4,128.00 bytes/sec
total size is 1,909 speedup is 0.92
==================== hadoop104 ====================
sending incremental file list
yarn-site.xml
sent 2,023 bytes received 41 bytes 1,376.00 bytes/sec
total size is 1,909 speedup is 0.92
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync mapred-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 70 bytes received 12 bytes 54.67 bytes/sec
total size is 913 speedup is 11.13
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml
sent 334 bytes received 47 bytes 762.00 bytes/sec
total size is 913 speedup is 2.40
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml
sent 334 bytes received 47 bytes 762.00 bytes/sec
total size is 913 speedup is 2.40
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync workers
==================== hadoop102 ====================
sending incremental file list
sent 62 bytes received 12 bytes 148.00 bytes/sec
total size is 30 speedup is 0.41
==================== hadoop103 ====================
sending incremental file list
workers
sent 139 bytes received 41 bytes 120.00 bytes/sec
total size is 30 speedup is 0.17
==================== hadoop104 ====================
sending incremental file list
workers
sent 139 bytes received 41 bytes 360.00 bytes/sec
total size is 30 speedup is 0.17
2.1.6 Enable the job history server and log aggregation
2.1.6.1 Configuration
- mapred-site.xml
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>
- yarn-site.xml
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log aggregation server URL -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Retain aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
Note: after enabling log aggregation, the NodeManager, ResourceManager and JobHistoryServer have to be restarted for it to take effect; a sketch follows.
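Once the cluster is up and running (it has not been formatted or started yet at this point in the guide), such a restart could look like this, reusing the paths and host assignments used throughout this guide:
ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"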
2.1.6.2 Distribute
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync mapred-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 70 bytes received 12 bytes 164.00 bytes/sec
total size is 1,211 speedup is 14.77
==================== hadoop103 ====================
sending incremental file list
mapred-site.xml
sent 632 bytes received 47 bytes 452.67 bytes/sec
total size is 1,211 speedup is 1.78
==================== hadoop104 ====================
sending incremental file list
mapred-site.xml
sent 632 bytes received 47 bytes 1,358.00 bytes/sec
total size is 1,211 speedup is 1.78
[hadoop@hadoop102 hadoop]$ /home/hadoop/bin/xsync yarn-site.xml
==================== hadoop102 ====================
sending incremental file list
sent 67 bytes received 12 bytes 158.00 bytes/sec
total size is 1,909 speedup is 24.16
==================== hadoop103 ====================
sending incremental file list
sent 67 bytes received 12 bytes 52.67 bytes/sec
total size is 1,909 speedup is 24.16
==================== hadoop104 ====================
sending incremental file list
sent 67 bytes received 12 bytes 158.00 bytes/sec
total size is 1,909 speedup is 24.16
2.1.7 Format the NameNode
[hadoop@hadoop102 hadoop-3.1.3]$ bin/hdfs namenode -format
WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
2021-10-02 23:24:53,367 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop102/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.1.3
STARTUP_MSG: classpath = /opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/accessors-smart-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/animal-sniffer-annotations-1.17.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/asm-5.0.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/audience-annotations-0.5.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/avro-1.7.7.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/checker-qual-2.5.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-beanutils-1.9.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-codec-1.11.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-collections-3.2.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-compress-1.18.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-io-2.5.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-lang-2.6.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-lang3-3.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-logging-1.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-math3-3.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/commons-net-3.6.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/curator-client-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/curator-framework-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/curator-recipes-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/failureaccess-1.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/gson-2.2.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/hadoop-annotations-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/hadoop-auth-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/httpclient-4.5.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/httpcore-4.4.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/j2objc-annotations-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-annotations-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-core-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-databind-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/javax.servlet-api-3.1.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jaxb-api-2.2.11.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jcip-annotations-1.0-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jersey-core-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jersey-json-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jersey-server-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jersey-servlet-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jettison-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common
/lib/jetty-http-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-io-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-server-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-servlet-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-util-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-webapp-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jetty-xml-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jsch-0.1.54.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/json-smart-2.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jsr305-3.0.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jsr311-api-1.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-admin-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-client-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-common-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-core-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-crypto-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-identity-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-server-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-simplekdc-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerb-util-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerby-config-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerby-pkix-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerby-util-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/log4j-1.2.17.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/netty-3.10.5.Final.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/paranamer-2.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/re2j-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-api-1.7.25.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/snappy-java-1.0.5.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/stax2-api-3.1.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/token-provider-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/woodstox-core-5.0.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/zookeeper-3.4.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/metrics-core-3.2.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-nfs-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-kms-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-
util-ajax-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/okhttp-2.7.5.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/okio-1.6.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jersey-servlet-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jersey-json-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/hadoop-auth-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-codec-1.11.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/httpclient-4.5.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/httpcore-4.4.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/nimbus-jose-jwt-4.41.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jcip-annotations-1.0-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/json-smart-2.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/accessors-smart-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/asm-5.0.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/zookeeper-3.4.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/audience-annotations-0.5.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/netty-3.10.5.Final.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/curator-framework-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/curator-client-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/guava-27.0-jre.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/failureaccess-1.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/checker-qual-2.5.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/error_prone_annotations-2.2.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/j2objc-annotations-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/animal-sniffer-annotations-1.17.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-simplekdc-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-client-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerby-config-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-core-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerby-pkix-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerby-asn1-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerby-util-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-common-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-crypto-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-io-2.5.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-util-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/token-provider-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-admin-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-server-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerb-identity-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/kerby-xdr-1.0.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jersey-core-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jsr311-api-1.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jersey-server-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/javax.servlet-api-3.1.0.
jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/json-simple-1.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-server-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-http-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-util-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-io-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-webapp-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-xml-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-servlet-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jetty-security-9.3.24.v20180605.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/hadoop-annotations-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-math3-3.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-net-3.6.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-collections-3.2.2.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jettison-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jaxb-impl-2.2.3-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jaxb-api-2.2.11.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-jaxrs-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-xc-1.9.13.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-beanutils-1.9.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-configuration2-2.1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-lang3-3.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/avro-1.7.7.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/paranamer-2.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/snappy-java-1.0.5.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/commons-compress-1.18.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/re2j-1.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/gson-2.2.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jsch-0.1.54.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/curator-recipes-2.13.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-databind-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-annotations-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/jackson-core-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/stax2-api-3.1.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/woodstox-core-5.0.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-nfs-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-client-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-client-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-native-client-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-native-client-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-rbf-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-rbf-3.1.3.j
ar:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/hadoop-hdfs-httpfs-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/junit-4.11.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-app-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-nativetask-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-uploader-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/aopalliance-1.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/dnsjava-2.1.7.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/ehcache-3.3.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/fst-2.50.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/guice-4.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/guice-servlet-4.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/jackson-jaxrs-base-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/jackson-jaxrs-json-provider-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/jackson-module-jaxb-annotations-2.7.8.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/java-util-1.9.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/javax.inject-1.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/jersey-client-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/jersey-guice-1.19.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/json-io-2.5.1.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/metrics-core-3.2.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/objenesis-1.0.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/snakeyaml-1.16.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/swagger-annotations-1.5.4.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-api-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-client-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-common-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-registry-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-common-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-nodemanager-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-3
.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-router-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-tests-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-server-web-proxy-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-services-api-3.1.3.jar:/opt/module/hadoop-3.1.3/share/hadoop/yarn/hadoop-yarn-services-core-3.1.3.jar
STARTUP_MSG: build = https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579; compiled by 'ztang' on 2019-09-12T02:47Z
STARTUP_MSG: java = 1.8.0_212
************************************************************/
2021-10-02 23:24:53,376 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2021-10-02 23:24:53,454 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-dee766cc-c898-4983-b137-1902d4f74c0e
2021-10-02 23:24:53,884 INFO namenode.FSEditLog: Edit logging is async:true
2021-10-02 23:24:53,894 INFO namenode.FSNamesystem: KeyProvider: null
2021-10-02 23:24:53,895 INFO namenode.FSNamesystem: fsLock is fair: true
2021-10-02 23:24:53,895 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: fsOwner = hadoop (auth:SIMPLE)
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: supergroup = supergroup
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: isPermissionEnabled = true
2021-10-02 23:24:53,909 INFO namenode.FSNamesystem: HA Enabled: false
2021-10-02 23:24:53,941 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2021-10-02 23:24:53,949 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2021-10-02 23:24:53,956 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2021-10-02 23:24:53,961 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2021-10-02 23:24:53,961 INFO blockmanagement.BlockManager: The block deletion will start around 2021 Oct 02 23:24:53
2021-10-02 23:24:53,962 INFO util.GSet: Computing capacity for map BlocksMap
2021-10-02 23:24:53,962 INFO util.GSet: VM type = 64-bit
2021-10-02 23:24:53,964 INFO util.GSet: 2.0% max memory 3.4 GB = 70.6 MB
2021-10-02 23:24:53,964 INFO util.GSet: capacity = 2^23 = 8388608 entries
2021-10-02 23:24:53,974 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2021-10-02 23:24:54,029 INFO Configuration.deprecation: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
2021-10-02 23:24:54,029 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2021-10-02 23:24:54,029 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: defaultReplication = 1
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxReplication = 512
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: minReplication = 1
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: redundancyRecheckInterval = 3000ms
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: encryptDataTransfer = false
2021-10-02 23:24:54,030 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2021-10-02 23:24:54,068 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
2021-10-02 23:24:54,079 INFO util.GSet: Computing capacity for map INodeMap
2021-10-02 23:24:54,079 INFO util.GSet: VM type = 64-bit
2021-10-02 23:24:54,080 INFO util.GSet: 1.0% max memory 3.4 GB = 35.3 MB
2021-10-02 23:24:54,080 INFO util.GSet: capacity = 2^22 = 4194304 entries
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: ACLs enabled? false
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2021-10-02 23:24:54,082 INFO namenode.FSDirectory: XAttrs enabled? true
2021-10-02 23:24:54,082 INFO namenode.NameNode: Caching file names occurring more than 10 times
2021-10-02 23:24:54,086 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2021-10-02 23:24:54,088 INFO snapshot.SnapshotManager: SkipList is disabled
2021-10-02 23:24:54,092 INFO util.GSet: Computing capacity for map cachedBlocks
2021-10-02 23:24:54,092 INFO util.GSet: VM type = 64-bit
2021-10-02 23:24:54,092 INFO util.GSet: 0.25% max memory 3.4 GB = 8.8 MB
2021-10-02 23:24:54,092 INFO util.GSet: capacity = 2^20 = 1048576 entries
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2021-10-02 23:24:54,099 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2021-10-02 23:24:54,102 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2021-10-02 23:24:54,102 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2021-10-02 23:24:54,103 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2021-10-02 23:24:54,103 INFO util.GSet: VM type = 64-bit
2021-10-02 23:24:54,103 INFO util.GSet: 0.029999999329447746% max memory 3.4 GB = 1.1 MB
2021-10-02 23:24:54,104 INFO util.GSet: capacity = 2^17 = 131072 entries
2021-10-02 23:24:54,126 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1474479574-127.0.1.1-1633188294119
2021-10-02 23:24:54,134 INFO common.Storage: Storage directory /opt/module/hadoop-3.1.3/data/dfs/name has been successfully formatted.
2021-10-02 23:24:54,153 INFO namenode.FSImageFormatProtobuf: Saving image file /opt/module/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-10-02 23:24:54,224 INFO namenode.FSImageFormatProtobuf: Image file /opt/module/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 393 bytes saved in 0 seconds .
2021-10-02 23:24:54,235 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-10-02 23:24:54,241 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2021-10-02 23:24:54,242 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop102/127.0.1.1
************************************************************/
[hadoop@hadoop102 hadoop-3.1.3]$
Note: before formatting again, always stop any NameNode and DataNode processes left over from a previous start, and then delete the data and logs directories on every node (otherwise the newly generated cluster ID will not match the old data). A sketch of this follows.
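A sketch using the xcall script from section 1.11 (run from /opt/module/hadoop-3.1.3 on hadoop102; the data path matches hadoop.tmp.dir in core-site.xml):
sbin/stop-dfs.sh                                                               # stop any running HDFS daemons
xcall.sh rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs    # wipe data and logs on all three nodes
bin/hdfs namenode -format                                                      # then format again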
2.1.8 Start HDFS
[hadoop@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
Starting namenodes on [hadoop102]
Starting datanodes
hadoop103: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
hadoop104: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
Starting secondary namenodes [hadoop104]
[hadoop@hadoop102 hadoop-3.1.3]$ /home/hadoop/bin/xcall.sh jps
============hadoop102================
2610 DataNode
2838 Jps
2461 NameNode
============hadoop103================
1443 DataNode
1496 Jps
============hadoop104================
2002 Jps
1966 SecondaryNameNode
1871 DataNode
2.1.9 Start YARN
[hadoop@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[hadoop@hadoop102 hadoop-3.1.3]$ /home/hadoop/bin/xcall.sh jps
============hadoop102================
2610 DataNode
3221 Jps
3079 NodeManager
2461 NameNode
============hadoop103================
1443 DataNode
1564 NodeManager
1660 Jps
============hadoop104================
2070 NodeManager
2166 Jps
1966 SecondaryNameNode
1871 DataNode
Note: the services are started and checked from different servers: YARN is started on hadoop103 (where the ResourceManager lives), while the jps check is run from hadoop102.
2.1.10 Access the web UIs
http://hadoop102:9870/
http://hadoop103:8088/cluster
Note: a small pitfall here: the localhost-style entries for the hostnames had not been removed from /etc/hosts on the provisioned machines (the NameNode bound to 127.0.1.1, as the format log above shows), so the web pages could not be reached from other machines until those entries were removed. The machine running the browser also needs the hostname mappings sketched below.
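A sketch of the client-side mappings (append to /etc/hosts on Linux/macOS, or to C:\Windows\System32\drivers\etc\hosts on Windows):
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104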
2.1.11 Create a start/stop script
vim /home/hadoop/bin/hdp.sh
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi
case $1 in
"start")
    echo " =================== Starting the Hadoop cluster ==================="
    echo " --------------- Starting HDFS ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
    echo " --------------- Starting YARN ---------------"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
    echo " --------------- Starting the history server ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== Stopping the Hadoop cluster ==================="
    echo " --------------- Stopping the history server ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    echo " --------------- Stopping YARN ---------------"
    ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo " --------------- Stopping HDFS ---------------"
    ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
[hadoop@hadoop102 hadoop-3.1.3]$ cd /home/hadoop/bin/
[hadoop@hadoop102 bin]$ ls
hdp.sh xcall.sh xsync
[hadoop@hadoop102 bin]$ chmod 777 hdp.sh
[hadoop@hadoop102 bin]$ ll
total 12
-rwxrwxrwx 1 hadoop hadoop 1158 Oct 3 09:36 hdp.sh
-rwxrwxrwx. 1 hadoop hadoop 112 Oct 2 21:58 xcall.sh
-rwxrwxrwx. 1 hadoop hadoop 624 Oct 2 17:56 xsync
[hadoop@hadoop102 bin]$ ./hdp.sh stop
=================== Stopping the Hadoop cluster ===================
--------------- Stopping the history server ---------------
--------------- Stopping YARN ---------------
Stopping nodemanagers
Stopping resourcemanager
--------------- Stopping HDFS ---------------
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[hadoop@hadoop102 bin]$ xcall.sh jps
============hadoop102================
2239 Jps
============hadoop103================
2114 Jps
============hadoop104================
1451 Jps
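Starting everything back up is symmetric, and xcall.sh gives a quick health check:
./hdp.sh start    # starts HDFS on hadoop102, YARN on hadoop103 and the history server on hadoop102
xcall.sh jps      # each node should again show its expected daemons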
2.1.12 Hadoop tuning
2.1.12.1 Production experience: multiple HDFS storage directories
(1) Add an extra disk to the Linux system
Reference: https://www.cnblogs.com/yujianadu/p/10750698.html
(2) Disk layout of the production servers (figure omitted)
(3) Configure multiple directories in hdfs-site.xml, paying attention to the permissions on the newly mounted disks
The directories where a DataNode stores its data are controlled by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If a server has several disks, this parameter must be changed. For a disk layout like the one in the figure, the value would be set as follows.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
Note: because every server node has a different disk layout, this setting does not need to be distributed to the other nodes after it is configured.
2.1.12.2 Balancing data across the cluster
1) Balancing data between nodes
(1) Start the balancer
start-balancer.sh -threshold 10
The parameter 10 means the disk-space utilization of the individual nodes may differ by at most 10%; adjust it to the actual situation.
(2) Stop the balancer
stop-balancer.sh
Note: because HDFS launches a separate Rebalance Server to carry out the rebalancing, avoid running start-balancer.sh on the NameNode; pick a relatively idle machine instead.
2) Balancing data between disks
(1) Generate a balancing plan (with only a single disk, no plan will be generated)
hdfs diskbalancer -plan hadoop103
(2) Execute the balancing plan
hdfs diskbalancer -execute hadoop103.plan.json
(3) Check the status of the current balancing task
hdfs diskbalancer -query hadoop103
(4) Cancel the balancing task
hdfs diskbalancer -cancel hadoop103.plan.json
2.1.12.3 Production experience: LZO compression support
1) Building hadoop-lzo
Hadoop itself does not ship with LZO support, so the open-source hadoop-lzo component from Twitter is used. hadoop-lzo has to be compiled against Hadoop and lzo; the build steps are as follows.
Making Hadoop support LZO
0. Prepare the environment
maven (download and install it, configure the environment variables, and add the Aliyun mirror to settings.xml)
gcc-c++
zlib-devel
autoconf
automake
libtool
These can all be installed with yum: yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
1. Download, build and install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
tar -zxvf lzo-2.10.tar.gz
cd lzo-2.10
./configure --prefix=/usr/local/hadoop/lzo/
make
make install
2. Build the hadoop-lzo source
2.1 Download the hadoop-lzo source from https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 After unpacking, edit pom.xml
<hadoop.current.version>3.1.3</hadoop.current.version>
2.3 Export two temporary environment variables
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
2.4 Compile
Enter hadoop-lzo-master and run the maven build
mvn package -Dmaven.test.skip=true
2.5 In the target directory, hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo component
2) Copy the built hadoop-lzo-0.4.20.jar into hadoop-3.1.3/share/hadoop/common/
[hadoop@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[hadoop@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar
3) Distribute hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104
[hadoop@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar
4) Add the LZO compression settings to core-site.xml
<configuration>
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</configuration>
5) Distribute core-site.xml to hadoop103 and hadoop104
[hadoop@hadoop102 hadoop]$ xsync core-site.xml
6) Start and check the cluster
[hadoop@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hadoop@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
7) Test: prepare data
[hadoop@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /input
[hadoop@hadoop102 hadoop-3.1.3]$ hadoop fs -put README.txt /input
8) Test: compression
[hadoop@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output
2.1.12.4 Production experience: creating LZO indexes
1) Create an index for an LZO file
Whether an LZO-compressed file can be split depends on its index, so the index has to be created manually; without one, the LZO file yields only a single split.
hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
2) Test
(1) Upload bigtable.lzo (200 MB) to the cluster's /input directory
[hadoop@hadoop102 module]$ hadoop fs -mkdir /input
[hadoop@hadoop102 module]$ hadoop fs -put bigtable.lzo /input
(2) Run the wordcount job
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output1
(3) Build an index for the uploaded LZO file
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo
(4) Run the WordCount job again
[hadoop@hadoop102 module]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output2
3) Note: if any of the jobs above fails during execution with an exception like the following
Container [pid=8468,containerID=container_1594198338753_0001_01_000002] is running 318740992B beyond the ‘VIRTUAL’ memory limit. Current usage: 111.5 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1594198338753_0001_01_000002 :
Solution: add the following configuration to /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml on hadoop102, distribute it to hadoop103 and hadoop104, and then restart the cluster.
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
2.1.13.5 项目经验之基准测试
在企业中,大家非常关心每天从Java后台拉取过来的数据需要多久能上传到集群,消费者也关心多久能从HDFS上拉取到需要的数据。
为了搞清楚HDFS的读写性能,生产环境上非常需要对集群进行压测。
HDFS的读写性能主要受网络和磁盘影响比较大。为了方便测试,将hadoop102、hadoop103、hadoop104虚拟机网络都设置为100Mbps。
注意单位换算:100Mbps的单位是bit,磁盘速度常用的MB/s单位是Byte;1Byte=8bit,因此100Mbps/8=12.5MB/s。
测试网速:
(1)来到hadoop102的/opt/module目录,启动一个简易HTTP服务用于测试网速
[hadoop@hadoop102 software]$ python -m SimpleHTTPServer
(2)在Web页面上访问
hadoop102:8000
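如果想粗略验证限速是否生效,也可以在hadoop103上用wget从该HTTP服务下载一个较大的文件并观察速度(示意,big_file.tar.gz为假设放在hadoop102该目录下的任意大文件):
# 在 hadoop103 上执行,下载到 /dev/null 只看速度不落盘
wget -O /dev/null http://hadoop102:8000/big_file.tar.gz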
1)测试HDFS写性能
(1)写测试底层原理
(2)测试内容:向HDFS集群写10个128M的文件
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB
2021-02-09 10:43:16,853 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Date & time: Tue Feb 09 10:43:16 CST 2021
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Number of files: 10
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Total MBytes processed: 1280
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Throughput mb/sec: 1.61
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Average IO rate mb/sec: 1.9
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: IO rate std deviation: 0.76
2021-02-09 10:43:16,854 INFO fs.TestDFSIO: Test exec time sec: 133.05
2021-02-09 10:43:16,854 INFO fs.TestDFSIO:
注意:nrFiles n为生成mapTask的数量,生产环境一般可通过hadoop103:8088查看CPU核数,设置为(CPU核数 - 1)
Number of files:生成的mapTask数量,一般是集群中(CPU核数 - 1),我们测试虚拟机就按照实际的CPU核数 - 1分配即可。(目标:让每个节点都参与测试)
Total MBytes processed:所有mapTask处理的文件总大小
Throughput mb/sec:单个mapTask的吞吐量
计算方式:处理的总文件大小/每一个mapTask写数据的时间累加
集群整体吞吐量:生成mapTask数量*单个mapTask的吞吐量
Average IO rate mb/sec:平均mapTask的吞吐量
计算方式:每个mapTask处理文件大小/每一个mapTask写数据的时间
全部相加除以task数量
IO rate std deviation:标准差,反映各个mapTask处理的差值,越小越均衡
注意:如果测试过程中,出现异常
①可以在yarn-site.xml中设置虚拟内存检测为false
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
②分发配置并重启Yarn集群
(3)测试结果分析
①由于副本1就在本地,所以该副本不参与测试
一共参与测试的文件:10个文件 * 2个副本 = 20个
压测后的速度:1.61MB/s
实测速度:1.61MB/s * 20个文件 ≈ 32MB/s
三台服务器的带宽:12.5 + 12.5 + 12.5 = 37.5MB/s
实测速度已接近带宽上限,说明网络资源基本用满。
如果实测速度远远小于网络,并且实测速度不能满足工作需求,可以考虑采用固态硬盘或者增加磁盘个数。
②如果客户端不在集群节点,那就三个副本都参与计算
2)测试HDFS读性能
(1)测试内容:读取HDFS集群10个128M的文件
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB
2021-02-09 11:34:15,847 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2021-02-09 11:34:15,847 INFO fs.TestDFSIO: Date & time: Tue Feb 09 11:34:15 CST 2021
2021-02-09 11:34:15,847 INFO fs.TestDFSIO: Number of files: 10
2021-02-09 11:34:15,847 INFO fs.TestDFSIO: Total MBytes processed: 1280
2021-02-09 11:34:15,848 INFO fs.TestDFSIO: Throughput mb/sec: 200.28
2021-02-09 11:34:15,848 INFO fs.TestDFSIO: Average IO rate mb/sec: 266.74
2021-02-09 11:34:15,848 INFO fs.TestDFSIO: IO rate std deviation: 143.12
2021-02-09 11:34:15,848 INFO fs.TestDFSIO: Test exec time sec: 20.83
(2)删除测试生成数据
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -clean
(3)测试结果分析:为什么读取文件速度大于网络带宽?由于目前只有三台服务器,且有三个副本,数据读取就近原则,相当于都是读取的本地磁盘数据,没有走网络。
3)使用Sort程序评测MapReduce
(1)使用RandomWriter来产生随机数,每个节点运行10个Map任务,每个Map产生大约1G大小的二进制随机数
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar randomwriter random-data
(2)执行Sort程序
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar sort random-data sorted-data
(3)验证数据是否真正排好序了
[hadoop@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data
2.1.13.6 项目经验之Hadoop参数调优
1)HDFS参数调优hdfs-site.xml
The number of Namenode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured then Namenode RPC server threads listen to requests from all nodes.
NameNode有一个工作线程池,用来处理不同DataNode的并发心跳以及客户端并发的元数据操作。
对于大集群或者有大量客户端的集群来说,通常需要增大参数dfs.namenode.handler.count的默认值10。
<property>
<name>dfs.namenode.handler.count</name>
<value>10</value>
</property>
经验公式:dfs.namenode.handler.count = 20 × ln(集群规模),比如集群规模为8台时,此参数设置为20 × ln(8) ≈ 41。可通过简单的python代码计算该值,代码如下。
[hadoop@hadoop102 ~]$ python
Python 2.7.5 (default, Apr 11 2018, 07:36:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> print int(20*math.log(8))
41
>>> quit()
2)YARN参数调优yarn-site.xml
(1)情景描述:总共7台机器,每天几亿条数据,数据源->Flume->Kafka->HDFS->Hive
面临问题:数据统计主要用HiveSQL,没有数据倾斜,小文件已经做了合并处理,开启了JVM重用,而且IO没有阻塞,内存用了不到50%。但是还是跑得非常慢,而且数据量洪峰过来时,整个集群都会宕掉。基于这种情况有没有优化方案?
(2)解决办法:
NodeManager内存和服务器实际内存配置尽量接近,如服务器有128g内存,但是NodeManager默认内存8G,不修改该参数最多只能用8G内存。NodeManager使用的CPU核数和服务器CPU核数尽量接近。
- ①yarn.nodemanager.resource.memory-mb NodeManager使用内存数
- ②yarn.nodemanager.resource.cpu-vcores NodeManager使用CPU核数
2.2 安装Zookeeper
2.2.1 集群规划
hadoop102 | hadoop103 | hadoop104 |
---|---|---|
Zookeeper | Zookeeper | Zookeeper |
2.2.2 解压安装
[hadoop@hadoop102 software]$ tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
# 重命名
[hadoop@hadoop102 module]$ mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
# 分发
[hadoop@hadoop102 module]$ xsync zookeeper-3.5.7/
2.2.3 配置服务
# 创建数据目录
[hadoop@hadoop102 zookeeper-3.5.7]$ mkdir zkData
# 重命名
[hadoop@hadoop102 conf]$ mv zoo_sample.cfg zoo.cfg
[hadoop@hadoop102 conf]$ vim zoo.cfg
# 内容
dataDir=/opt/module/zookeeper-3.5.7/zkData
#######################cluster##########################
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
# 创建服务ID
[hadoop@hadoop102 zkData]$ touch myid
[hadoop@hadoop102 zkData]$ vim myid
# 内容
2
# 分发
[hadoop@hadoop102 zookeeper-3.5.7]$ xsync zkData/
==================== hadoop102 ====================
sending incremental file list
sent 94 bytes received 17 bytes 222.00 bytes/sec
total size is 2 speedup is 0.02
==================== hadoop103 ====================
sending incremental file list
zkData/
zkData/myid
sent 142 bytes received 39 bytes 362.00 bytes/sec
total size is 2 speedup is 0.01
==================== hadoop104 ====================
sending incremental file list
zkData/
zkData/myid
sent 142 bytes received 39 bytes 362.00 bytes/sec
total size is 2 speedup is 0.01
[hadoop@hadoop102 zookeeper-3.5.7]$ xsync conf/
configuration.xsl log4j.properties zoo.cfg
[hadoop@hadoop102 zookeeper-3.5.7]$ xsync conf/zoo.cfg
==================== hadoop102 ====================
sending incremental file list
sent 62 bytes received 12 bytes 148.00 bytes/sec
total size is 942 speedup is 12.73
==================== hadoop103 ====================
sending incremental file list
zoo.cfg
sent 1,051 bytes received 35 bytes 2,172.00 bytes/sec
total size is 942 speedup is 0.87
==================== hadoop104 ====================
sending incremental file list
zoo.cfg
sent 1,051 bytes received 35 bytes 2,172.00 bytes/sec
total size is 942 speedup is 0.87
hadoop102服务ID为2、hadoop103服务ID为3、hadoop104服务ID为4
2.2.4 启动服务
[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Error contacting service. It is probably not running.
[hadoop@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
[hadoop@hadoop103 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: leader
[hadoop@hadoop104 ~]$ /opt/module/zookeeper-3.5.7/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
2.2.5 ZK集群启动停止脚本
(1)在hadoop102的/home/hadoop/bin目录下创建脚本
[hadoop@hadoop102 bin]$ vim zk.sh
在脚本中编写如下内容
#!/bin/bash
case $1 in
"start"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- zookeeper $i 启动 ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
done
};;
"stop"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- zookeeper $i 停止 ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
done
};;
"status"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- zookeeper $i 状态 ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
done
};;
esac
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod u+x zk.sh
(3)Zookeeper集群启动脚本
[hadoop@hadoop102 module]$ zk.sh start
(4)Zookeeper集群停止脚本
[hadoop@hadoop102 module]$ zk.sh stop
2.3 安装kafka
2.3.1 集群规划
hadoop102 | hadoop103 | hadoop104 |
---|---|---|
zk | zk | zk |
kafka | kafka | kafka |
2.3.2 解压安装
[hadoop@hadoop102 software]$ tar -zxvf kafka_2.11-2.4.1.tgz -C /opt/module/
# 重命名
[hadoop@hadoop102 software]$ cd ../module/
[hadoop@hadoop102 module]$ ls
hadoop-3.1.3 jdk1.8.0_212 kafka_2.11-2.4.1 zookeeper-3.5.7
[hadoop@hadoop102 module]$ mv kafka_2.11-2.4.1/ kafka
2.3.3 配置
[hadoop@hadoop102 kafka]$ cd config/
[hadoop@hadoop102 config]$ vi server.properties
修改或者增加以下内容:
#broker的全局唯一编号,不能重复
broker.id=0
#删除topic功能使能
delete.topic.enable=true
#kafka运行日志存放的路径
log.dirs=/opt/module/kafka/data
#配置连接Zookeeper集群地址
zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
# 分发
[hadoop@hadoop102 module]$ xsync kafka/
注意:分别修改hadoop103、hadoop104两台机器上的broker.id为1和2。
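修改方式示意如下(通过ssh远程执行sed,也可以登录到对应机器手工编辑server.properties):
# 在 hadoop102 上远程修改 hadoop103、hadoop104 的 broker.id
ssh hadoop103 "sed -i 's/^broker.id=.*/broker.id=1/' /opt/module/kafka/config/server.properties"
ssh hadoop104 "sed -i 's/^broker.id=.*/broker.id=2/' /opt/module/kafka/config/server.properties"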
2.3.4 配置环境变量
[hadoop@hadoop102 kafka]$ sudo vim /etc/profile.d/my_env.sh
# KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
[hadoop@hadoop102 kafka]$ source /etc/profile.d/my_env.sh
# 分发
[hadoop@hadoop102 kafka]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
==================== hadoop102 ====================
Warning: Permanently added the ECDSA host key for IP address '192.168.10.102' to the list of known hosts.
root@hadoop102's password:
root@hadoop102's password:
sending incremental file list
sent 47 bytes received 12 bytes 23.60 bytes/sec
total size is 300 speedup is 5.08
==================== hadoop103 ====================
root@hadoop103's password:
root@hadoop103's password:
sending incremental file list
my_env.sh
sent 394 bytes received 41 bytes 174.00 bytes/sec
total size is 300 speedup is 0.69
==================== hadoop104 ====================
root@hadoop104's password:
root@hadoop104's password:
sending incremental file list
my_env.sh
sent 394 bytes received 41 bytes 124.29 bytes/sec
total size is 300 speedup is 0.69
2.3.5 Kafka集群启动停止脚本
(1)在/home/hadoop/bin目录下创建脚本kf.sh
[hadoop@hadoop102 bin]$ vim kf.sh
在脚本中填写如下内容
#! /bin/bash
case $1 in
"start"){
for i in hadoop102 hadoop103 hadoop104
do
echo " --------启动 $i Kafka-------"
ssh $i "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"
done
};;
"stop"){
for i in hadoop102 hadoop103 hadoop104
do
echo " --------停止 $i Kafka-------"
ssh $i "/opt/module/kafka/bin/kafka-server-stop.sh stop"
done
};;
esac
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod u+x kf.sh
(3)kf集群启动脚本
[hadoop@hadoop102 module]$ kf.sh start
(4)kf集群停止脚本
[hadoop@hadoop102 module]$ kf.sh stop
2.3.6 项目经验之Kafka机器数量计算
Kafka机器数量(经验公式)= 2 *(峰值生产速度 * 副本数 / 100)+ 1
先拿到峰值生产速度,再根据设定的副本数,就能预估出需要部署Kafka的数量。
1)峰值生产速度
峰值生产速度可以压测得到。
2)副本数
副本数默认是1个,在企业里面2-3个都有,2个居多。
副本多可以提高可靠性,但是会降低网络传输效率。
比如我们的峰值生产速度是50M/s。副本数为2。
Kafka机器数量 = 2 *(50 * 2 / 100)+ 1 = 3台
4.4.5 项目经验之Kafka压力测试
1)Kafka压测
用Kafka官方自带的脚本,对Kafka进行压测。
kafka-consumer-perf-test.sh
kafka-producer-perf-test.sh
Kafka压测时,在硬盘读写速度一定的情况下,可以查看到哪些地方出现了瓶颈(CPU,内存,网络IO)。一般都是网络IO达到瓶颈。
2)Kafka Producer压力测试
(0)压测环境准备
①hadoop102、hadoop103、hadoop104的网络带宽都设置为100Mbps。
②关闭hadoop102主机,并根据hadoop102克隆出hadoop105(修改IP和主机名称)
③hadoop105的带宽不设限
④创建一个test topic,设置为3个分区2个副本
[hadoop@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka --create --replication-factor 2 --partitions 3 --topic test
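创建完成后,可以用describe确认分区与副本的分布是否符合预期(示意命令,在/opt/module/kafka目录下执行):
bin/kafka-topics.sh --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka --describe --topic test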
(1)在/opt/module/kafka/bin目录下面有这两个文件。我们来测试一下
[hadoop@hadoop102 kafka]$ bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 10000000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
说明:
record-size是一条信息有多大,单位是字节。
num-records是总共发送多少条信息。
throughput 是每秒多少条信息,设成-1,表示不限流,尽可能快的生产数据,可测出生产者最大吞吐量。
(2)Kafka会打印下面的信息
699884 records sent, 139976.8 records/sec (13.35 MB/sec), 1345.6 ms avg latency, 2210.0 ms max latency.
713247 records sent, 141545.3 records/sec (13.50 MB/sec), 1577.4 ms avg latency, 3596.0 ms max latency.
773619 records sent, 153862.2 records/sec (14.67 MB/sec), 2326.8 ms avg latency, 4051.0 ms max latency.
773961 records sent, 154206.2 records/sec (15.71 MB/sec), 1964.1 ms avg latency, 2917.0 ms max latency.
776970 records sent, 154559.4 records/sec (15.74 MB/sec), 1960.2 ms avg latency, 2922.0 ms max latency.
776421 records sent, 154727.2 records/sec (15.76 MB/sec), 1960.4 ms avg latency, 2954.0 ms max latency.
(3)参数解析:Kafka的吞吐量15MB/s左右是否符合预期呢?
hadoop102、hadoop103、hadoop104三台集群的网络总带宽30MB/s左右,由于是两个副本,所以Kafka的吞吐量为30MB/s ÷ 2(副本)= 15MB/s
结论:网络带宽和副本都会影响吞吐量。
(4)调整batch.size
batch.size默认值是16k。
batch.size较小,会降低吞吐量。比如说,批次大小为0则完全禁用批处理,会一条一条发送消息;
batch.size过大,会增加消息发送延迟。比如说,Batch设置为64k,但是要等待5秒钟Batch才凑满了64k,才能发送出去。那这条消息的延迟就是5秒钟。
[hadoop@hadoop102 kafka]$ bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 10000000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092 batch.size=500
输出结果
69169 records sent, 13833.8 records/sec (1.32 MB/sec), 2517.6 ms avg latency, 4299.0 ms max latency.
105372 records sent, 21074.4 records/sec (2.01 MB/sec), 6748.4 ms avg latency, 9016.0 ms max latency.
113188 records sent, 22637.6 records/sec (2.16 MB/sec), 11348.0 ms avg latency, 13196.0 ms max latency.
108896 records sent, 21779.2 records/sec (2.08 MB/sec), 12272.6 ms avg latency, 12870.0 ms max latency.
(5)linger.ms
如果设置batch.size为64k,但是比如过了10分钟也没有凑够64k,怎么办?
可以设置linger.ms。比如linger.ms=5,那么即使要发送的数据不足64k,5ms后数据也会被发送出去。
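linger.ms同样可以用上面的压测脚本验证,在producer-props里追加对应参数即可(示意,具体取值按需调整):
# 保持默认 batch.size(16k),将 linger.ms 调大到 50ms,观察吞吐量与延迟的变化
bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 10000000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092 batch.size=16384 linger.ms=50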
(6)总结
同时设置batch.size和linger.ms时,哪个条件先满足,消息就会被发送出去。
Kafka需要考虑高吞吐量与延时的平衡。
3)Kafka Consumer压力测试
(1)Consumer的测试,如果这四个指标(IO,CPU,内存,网络)都不能改变,考虑增加分区数来提升性能。
[hadoop@hadoop102 kafka]$ bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test --fetch-size 10000 --messages 10000000 --threads 1
①参数说明:
--broker-list 指定Kafka集群地址
--topic 指定topic的名称
--fetch-size 指定每次fetch的数据的大小
--messages 总共要消费的消息个数
②测试结果说明:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2021-08-03 21:17:21:778, 2021-08-03 21:18:19:775, 514.7169, 8.8749, 5397198, 93059.9514
开始测试时间,结束测试时间,共消费数据514.7169MB,吞吐量8.8749MB/s,共消费5397198条消息,每秒消费93059.9514条
(2)调整fetch-size
①增加fetch-size值,观察消费吞吐量。
[hadoop@hadoop102 kafka]$ bin/kafka-consumer-perf-test.sh --broker-list hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test --fetch-size 100000 --messages 10000000 --threads 1
②测试结果说明:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2021-08-03 21:22:57:671, 2021-08-03 21:23:41:938, 514.7169, 11.6276, 5397198, 121923.7355
(3)总结
吞吐量受网络带宽和fetch-size的影响
4.4.6 项目经验之Kafka分区数计算
(1)创建一个只有1个分区的topic
(2)测试这个topic的producer吞吐量和consumer吞吐量。
(3)假设他们的值分别是Tp和Tc,单位可以是MB/s。
(4)然后假设总的目标吞吐量是Tt,那么分区数 = Tt / min(Tp,Tc)
例如:producer吞吐量 = 20m/s;consumer吞吐量 = 50m/s,期望吞吐量100m/s;
分区数 = 100 / 20 = 5分区
https://blog.csdn.net/weixin_42641909/article/details/89294698
分区数一般设置为:3-10个
2.4 flume安装
2.4.1 集群规划
hadoop102 | hadoop103 | hadoop104 |
---|---|---|
flume | flume | flume |
2.4.2 安装地址
(1) Flume官网地址:http://flume.apache.org/
(2)文档查看地址:http://flume.apache.org/FlumeUserGuide.html
(3)下载地址:http://archive.apache.org/dist/flume/
2.4.3 安装部署
(1)将apache-flume-1.9.0-bin.tar.gz上传到linux的/opt/software目录下
(2)解压apache-flume-1.9.0-bin.tar.gz到/opt/module/目录下
[hadoop@hadoop102 software]$ tar -zxf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/
(3)修改apache-flume-1.9.0-bin的名称为flume
[hadoop@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume
(4)将lib文件夹下的guava-11.0.2.jar删除以兼容Hadoop 3.1.3
[hadoop@hadoop102 module]$ rm /opt/module/flume/lib/guava-11.0.2.jar
注意:删除guava-11.0.2.jar的服务器节点,一定要配置hadoop环境变量。否则会报如下异常。
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Lists
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
… 1 more
(5)将flume/conf下的flume-env.sh.template文件修改为flume-env.sh,并配置flume-env.sh文件
[hadoop@hadoop102 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop102 conf]$ vi flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_212
2.4.4 分发
[hadoop@hadoop102 module]$ xsync flume/
2.4.5 项目经验之Flume组件选型
1)Source
(1)Taildir Source相比Exec Source、Spooling Directory Source的优势
TailDir Source:断点续传、多目录。Flume1.6以前需要自己自定义Source记录每次读取文件位置,实现断点续传。不会丢数据,但是有可能会导致数据重复。
Exec Source可以实时搜集数据,但是在Flume不运行或者Shell命令出错的情况下,数据将会丢失。
Spooling Directory Source监控目录,支持断点续传。
(2)batchSize大小如何设置?
答:Event 1K左右时,500-1000合适(默认为100)
2)Channel
采用Kafka Channel,省去了Sink,提高了效率。KafkaChannel数据存储在Kafka里面,所以数据是存储在磁盘中。
注意:在Flume1.7以前,Kafka Channel很少有人使用,因为parseAsFlumeEvent这个配置起不了作用。也就是无论parseAsFlumeEvent配置为true还是false,数据都会被转为Flume Event,造成的结果是始终会把Flume的headers信息混合着内容一起写入Kafka的消息中。而这里只需要把内容写入Kafka即可。
4.5.3 日志采集Flume配置
1)Flume配置分析
Flume直接读log日志的数据,log日志的格式是app.yyyy-mm-dd.log。
2)Flume的具体配置如下:
(1)在/opt/module/flume/conf目录下创建file-flume-kafka.conf文件
[hadoop@hadoop102 conf]$ vim file-flume-kafka.conf
在文件配置如下内容
#为各组件命名
a1.sources = r1
a1.channels = c1
#描述source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.ETLInterceptor$Builder
#描述channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false
#绑定source和channel以及sink和channel的关系
a1.sources.r1.channels = c1
注意:com.atguigu.flume.interceptor.ETLInterceptor是自定义的拦截器的全类名。需要根据用户自定义的拦截器做相应修改。
4.5.4 Flume拦截器
1)创建Maven工程flume-interceptor
2)创建包名:com.atguigu.flume.interceptor
3)在pom.xml文件中添加如下配置
<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.62</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
注意:scope中provided的含义是编译时使用该jar包,打包时不用。因为集群上已经存在flume的jar包,只是本地编译时用一下。
4)在com.atguigu.flume.interceptor包下创建JSONUtils类
package com.atguigu.flume.interceptor;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONException;
public class JSONUtils {
public static boolean isJSONValidate(String log){
try {
JSON.parse(log);
return true;
}catch (JSONException e){
return false;
}
}
}
5)在com.atguigu.flume.interceptor包下创建LogInterceptor类
package com.atguigu.flume.interceptor;
import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
public class ETLInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
byte[] body = event.getBody();
String log = new String(body, StandardCharsets.UTF_8);
if (JSONUtils.isJSONValidate(log)) {
return event;
} else {
return null;
}
}
@Override
public List<Event> intercept(List<Event> list) {
Iterator<Event> iterator = list.iterator();
while (iterator.hasNext()){
Event next = iterator.next();
if(intercept(next)==null){
iterator.remove();
}
}
return list;
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new ETLInterceptor();
}
@Override
public void configure(Context context) {
}
}
@Override
public void close() {
}
}
6)打包
7)需要先将打好的包放入到hadoop102的/opt/module/flume/lib文件夹下面。
[hadoop@hadoop102 lib]$ ls | grep interceptor
flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar
8)分发Flume到hadoop103、hadoop104
[hadoop@hadoop102 module]$ xsync flume/
9)分别在hadoop102、hadoop103上启动Flume
[hadoop@hadoop102 flume]$ bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &
[hadoop@hadoop103 flume]$ bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &
4.5.5 测试Flume-Kafka通道
(1)生成日志
[hadoop@hadoop102 ~]$ lg.sh
(2)消费Kafka数据,观察控制台是否有数据获取到
[hadoop@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --from-beginning --topic topic_log
说明:如果获取不到数据,先检查Kafka、Flume、Zookeeper是否都正确启动。再检查Flume的拦截器代码是否正常。
4.5.6 日志采集Flume启动停止脚本
(1)在/home/hadoop/bin目录下创建脚本f1.sh
[hadoop@hadoop102 bin]$ vim f1.sh
在脚本中填写如下内容
#! /bin/bash
case $1 in
"start"){
for i in hadoop102 hadoop103
do
echo " --------启动 $i 采集flume-------"
ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log1.txt 2>&1 &"
done
};;
"stop"){
for i in hadoop102 hadoop103
do
echo " --------停止 $i 采集flume-------"
ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk '{print \$2}' | xargs -n1 kill -9 "
done
};;
esac
说明1:nohup,该命令可以在你退出帐户/关闭终端之后继续运行相应的进程。nohup就是不挂起的意思,不挂断地运行命令。
说明2:awk 默认分隔符为空格
说明3:$2在双引号内部会被解析为脚本的第二个参数,但这里想表达的是awk的第二列,所以需要将其转义,写成\$2。
说明4:xargs 表示取出前面命令运行的结果,作为后面命令的输入参数。
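可以先单独执行管道中取PID的部分,确认awk取到的确实是进程号(示意):
# 第二列即为采集 Flume 进程的 PID
ps -ef | grep file-flume-kafka | grep -v grep | awk '{print $2}'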
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod u+x f1.sh
(3)f1集群启动脚本
[hadoop@hadoop102 module]$ f1.sh start
(4)f1集群停止脚本
[hadoop@hadoop102 module]$ f1.sh stop
4.6 消费Kafka数据Flume
集群规划
服务器hadoop102 | 服务器hadoop103 | 服务器hadoop104 |
---|---|---|
 | | Flume(消费Kafka) |
4.6.1 项目经验之Flume组件选型
1)FileChannel和MemoryChannel区别
MemoryChannel传输数据速度更快,但因为数据保存在JVM的堆内存中,Agent进程挂掉会导致数据丢失,适用于对数据质量要求不高的需求。
FileChannel传输速度相对于Memory慢,但数据安全保障高,Agent进程挂掉也可以从失败中恢复数据。
选型:
金融类公司、对钱要求非常准确的公司通常会选择FileChannel
传输的是普通日志信息(京东内部一天丢100万-200万条,这是非常正常的),通常选择MemoryChannel。
2)FileChannel优化
通过配置dataDirs指向多个路径,每个路径对应不同的硬盘,增大Flume吞吐量。
官方说明如下:
Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel peformance
checkpointDir和backupCheckpointDir也尽量配置在不同硬盘对应的目录中,保证checkpoint坏掉后,可以快速使用backupCheckpointDir恢复数据。
3)Sink:HDFS Sink
(1)HDFS存入大量小文件,有什么影响?
元数据层面:每个小文件都有一份元数据,其中包括文件路径,文件名,所有者,所属组,权限,创建时间等,这些信息都保存在Namenode内存中。所以小文件过多,会占用Namenode服务器大量内存,影响Namenode性能和使用寿命
计算层面:默认情况下MR会对每个小文件启用一个Map任务计算,非常影响计算性能。同时也影响磁盘寻址时间。
(2)HDFS小文件处理
官方默认的这三个参数配置写入HDFS后会产生小文件,hdfs.rollInterval、hdfs.rollSize、hdfs.rollCount
基于以上hdfs.rollInterval=3600,hdfs.rollSize=134217728,hdfs.rollCount =0几个参数综合作用,效果如下:
①文件在达到128M时会滚动生成新文件
②文件创建超3600秒时会滚动生成新文件
4.6.2 消费者Flume配置
1)Flume配置分析
2)Flume的具体配置如下:
(1)在hadoop104的/opt/module/flume/conf目录下创建kafka-flume-hdfs.conf文件
[hadoop@hadoop104 conf]$ vim kafka-flume-hdfs.conf
在文件配置如下内容
## 组件
a1.sources=r1
a1.channels=c1
a1.sinks=k1
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimeStampInterceptor$Builder
## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.round = false
#控制生成的小文件
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
## 控制输出文件是原生文件。
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
## 拼装
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
4.6.3 Flume时间戳拦截器
由于Flume默认会用Linux系统时间作为输出到HDFS路径的时间。如果数据是23:59分产生的,Flume消费Kafka里面的数据时,有可能已经是第二天了,那么这部分数据会被发往第二天的HDFS路径。我们希望的是根据日志里面的实际时间发往HDFS的路径,所以下面拦截器的作用是获取日志中的实际时间。
解决的思路:拦截json日志,通过fastjson框架解析json,获取实际时间ts。将获取的ts时间写入拦截器header头,header的key必须是timestamp,因为Flume框架会根据这个key的值识别为时间,写入到HDFS。
1)在com.atguigu.flume.interceptor包下创建TimeStampInterceptor类
package com.atguigu.flume.interceptor;
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
public class TimeStampInterceptor implements Interceptor {
private ArrayList<Event> events = new ArrayList<>();
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
Map<String, String> headers = event.getHeaders();
String log = new String(event.getBody(), StandardCharsets.UTF_8);
JSONObject jsonObject = JSONObject.parseObject(log);
String ts = jsonObject.getString("ts");
headers.put("timestamp", ts);
return event;
}
@Override
public List<Event> intercept(List<Event> list) {
events.clear();
for (Event event : list) {
events.add(intercept(event));
}
return events;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder {
@Override
public Interceptor build() {
return new TimeStampInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
2)重新打包
3)需要先将打好的包放入到hadoop102的/opt/module/flume/lib文件夹下面。
[hadoop@hadoop102 lib]$ ls | grep interceptor
flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar
4)分发Flume到hadoop103、hadoop104
[hadoop@hadoop102 module]$ xsync flume/
4.6.4 消费者Flume启动停止脚本
(1)在/home/hadoop/bin目录下创建脚本f2.sh
[hadoop@hadoop102 bin]$ vim f2.sh
在脚本中填写如下内容
#! /bin/bash
case $1 in
"start"){
for i in hadoop104
do
echo " --------启动 $i 消费flume-------"
ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log2.txt 2>&1 &"
done
};;
"stop"){
for i in hadoop104
do
echo " --------停止 $i 消费flume-------"
ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
done
};;
esac
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod u+x f2.sh
(3)f2集群启动脚本
[hadoop@hadoop102 module]$ f2.sh start
(4)f2集群停止脚本
[hadoop@hadoop102 module]$ f2.sh stop
4.6.5 项目经验之Flume内存优化
1)问题描述:如果启动消费Flume抛出如下异常
ERROR hdfs.HDFSEventSink: process failed
java.lang.OutOfMemoryError: GC overhead limit exceeded
2)解决方案步骤
(1)在hadoop102服务器的/opt/module/flume/conf/flume-env.sh文件中增加如下配置
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
(2)同步配置到hadoop103、hadoop104服务器
[hadoop@hadoop102 conf]$ xsync flume-env.sh
3)Flume内存参数设置及优化
JVM heap一般设置为4G或更高
-Xmx与-Xms最好设置一致,减少内存抖动带来的性能影响,如果设置不一致容易导致频繁fullgc。
-Xms表示JVM Heap(堆内存)最小尺寸,初始分配;-Xmx 表示JVM Heap(堆内存)最大允许的尺寸,按需分配。如果不设置一致,容易在初始化时,由于内存不够,频繁触发fullgc。
4.7 采集通道启动/停止脚本
(1)在/home/hadoop/bin目录下创建脚本cluster.sh
[hadoop@hadoop102 bin]$ vim cluster.sh
在脚本中填写如下内容
#!/bin/bash
case $1 in
"start"){
echo ================== 启动 集群 ==================
#启动 Zookeeper集群
zk.sh start
#启动 Hadoop集群
hdp.sh start
#启动 Kafka采集集群
kf.sh start
#启动 Flume采集集群
f1.sh start
#启动 Flume消费集群
f2.sh start
};;
"stop"){
echo ================== 停止 集群 ==================
#停止 Flume消费集群
f2.sh stop
#停止 Flume采集集群
f1.sh stop
#停止 Kafka采集集群
kf.sh stop
#停止 Hadoop集群
hdp.sh stop
#停止 Zookeeper集群
zk.sh stop
};;
esac
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod u+x cluster.sh
(3)cluster集群启动脚本
[hadoop@hadoop102 module]$ cluster.sh start
(4)cluster集群停止脚本
[hadoop@hadoop102 module]$ cluster.sh stop
常见问题及解决方案
5.1 2NN页面不能显示完整信息
1)问题描述
访问2NN页面http://hadoop104:9868,看不到详细信息
2)解决办法
(1)在浏览器上按F12,查看问题原因。定位bug在61行
(2)找到要修改的文件
[hadoop@hadoop102 static]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/hdfs/webapps/static
[hadoop@hadoop102 static]$ vim dfs-dust.js
:set nu
修改61行
return new Date(Number(v)).toLocaleString();
(3)分发dfs-dust.js
[hadoop@hadoop102 static]$ xsync dfs-dust.js
(4)在http://hadoop104:9868/status.html 页面强制刷新
2.5 安装MySql
2.5.1 检查是否安装mysql并卸载
[hadoop@hadoop102 mysql]$ rpm -qa | grep -i -E mysql\|mariadb | xargs -n1 sudo rpm -e --nodeps
2.5.2 安装依赖
[hadoop@hadoop102 mysql]$ sudo yum install -y libaio
[hadoop@hadoop102 mysql]$ sudo yum -y install autoconf
2.5.3 安装mysql
[hadoop@hadoop102 mysql]$ rpm -ivh 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm
warning: 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Permission denied)
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm
warning: 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:mysql-community-common-5.7.16-1.e################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm
warning: 02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:mysql-community-libs-5.7.16-1.el7################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm
warning: 03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:mysql-community-libs-compat-5.7.1################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 04_mysql-community-client-5.7.16-1.el7.x86_64.rpm
warning: 04_mysql-community-client-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:mysql-community-client-5.7.16-1.e################################# [100%]
[hadoop@hadoop102 mysql]$ sudo rpm -ivh 05_mysql-community-server-5.7.16-1.el7.x86_64.rpm
warning: 05_mysql-community-server-5.7.16-1.el7.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 5072e1f5: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:mysql-community-server-5.7.16-1.e################################# [100%]
2.5.4 启动服务
[hadoop@hadoop102 mysql]$ sudo systemctl start mysqld.service
2.5.5 查看服务状态
[hadoop@hadoop102 mysql]$ sudo systemctl status mysqld.service
● mysqld.service - MySQL Server
Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2021-10-06 16:28:44 CST; 13s ago
Process: 4932 ExecStart=/usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS (code=exited, status=0/SUCCESS)
Process: 4858 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited, status=0/SUCCESS)
Main PID: 4936 (mysqld)
CGroup: /system.slice/mysqld.service
└─4936 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
Oct 06 16:28:41 hadoop102 systemd[1]: Starting MySQL Server...
Oct 06 16:28:44 hadoop102 systemd[1]: Started MySQL Server.
2.5.6 查看初始密码
[hadoop@hadoop102 mysql]$ sudo cat /var/log/mysqld.log | grep password
2021-10-06T08:28:41.617251Z 1 [Note] A temporary password is generated for root@localhost: iu0TV/6TpLn4
2.5.7 修改密码
[hadoop@hadoop102 mysql]$ mysql -uroot -piu0TV/6TpLn4
# 修改密码规则
mysql> set global validate_password_length=4;
Query OK, 0 rows affected (0.00 sec)
mysql> set global validate_password_policy=0;
Query OK, 0 rows affected (0.00 sec)
# 修改密码
mysql> alter user 'root'@'localhost' identified by '000000';
Query OK, 0 rows affected (0.00 sec)
2.5.8 开启远程登录
mysql> grant all privileges on *.* to 'root'@'%' identified by '000000';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select user,host from user;
+-----------+-----------+
| user | host |
+-----------+-----------+
| root | % |
| mysql.sys | localhost |
| root | localhost |
+-----------+-----------+
3 rows in set (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
2.6 生成数据
将mock目录中业务目录下的gmall.sql导入到gmall数据库即可。
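导入SQL的方式示意如下(假设gmall.sql在当前目录,root密码为2.5节设置的000000):
# 先确保 gmall 数据库存在,再导入表结构和基础数据
mysql -uroot -p000000 -e "create database if not exists gmall;"
mysql -uroot -p000000 gmall < gmall.sql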
[hadoop@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-01-22.jar
2.7 安装sqoop
2.7.1 解压安装
[hadoop@hadoop102 software]$ tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module
2.7.2 重命名
[hadoop@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
2.7.3 配置
[hadoop@hadoop102 conf]$ mv sqoop-env-template.sh sqoop-env.sh
[hadoop@hadoop102 conf]$ vim sqoop-env.sh
# 添加配置
export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3
export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7
export ZOOCFGDIR=/opt/module/zookeeper-3.5.7/conf
2.7.4 添加驱动到lib目录
[hadoop@hadoop102 conf]$ mv /opt/software/mysql/mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/
2.7.5 验证是否安装成功
[hadoop@hadoop102 sqoop]$ bin/sqoop help
Warning: /opt/module/sqoop/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/module/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2021-10-06 18:07:14,693 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
2.7.6 测试是否可以成功连接数据库
[hadoop@hadoop102 sqoop]$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 000000
Warning: /opt/module/sqoop/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/module/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2021-10-06 18:08:05,762 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
2021-10-06 18:08:05,777 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2021-10-06 18:08:05,848 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
gmall
mysql
performance_schema
sys
2.7.7 测试sqoop是否可以正常导入hdfs
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table user_info \
--columns id,login_name \
--where "id>=10 and id<=30" \
--target-dir /test \
--delete-target-dir \
--fields-terminated-by '\t' \
--num-mappers 2 \
--split-by id
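导入完成后,可以用如下命令确认/test目录下生成的数据(示意,part-m-*为sqoop默认的输出文件名):
hadoop fs -ls /test
# 查看前几行,字段之间应以制表符分隔
hadoop fs -cat '/test/part-m-*' | head -n 5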
2.8 同步策略
数据同步策略的类型包括:全量同步、增量同步、新增及变化同步、特殊情况
- 全量表:存储完整的数据。
- 增量表:存储新增加的数据。
- 新增及变化表:存储新增加的数据和变化的数据。
- 特殊表:只需要存储一次。
某些特殊的表,可不必遵循上述同步策略。例如某些不会发生变化的表(地区表,省份表,民族表)可以只存一份固定值。
2.9 业务数据导入HDFS
2.9.1 分析表同步策略
在生产环境中,个别小公司为了简单处理,会将所有表全量导入。
中大型公司,由于数据量比较大,还是严格按照同步策略导入数据。
2.9.2 业务数据首日同步脚本
2.9.2.1 脚本编写
(1)在/home/hadoop/bin目录下创建脚本mysql_to_hdfs_init.sh
[hadoop@hadoop102 bin]$ vim mysql_to_hdfs_init.sh
添加如下内容:
#! /bin/bash
APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/$APP \
--username root \
--password 000000 \
--target-dir /origin_data/$APP/db/$1/$do_date \
--delete-target-dir \
--query "$2 where \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
}
import_order_info(){
import_data order_info "select
id,
total_amount,
order_status,
user_id,
payment_way,
delivery_address,
out_trade_no,
create_time,
operate_time,
expire_time,
tracking_no,
province_id,
activity_reduce_amount,
coupon_reduce_amount,
original_total_amount,
feight_fee,
feight_fee_reduce
from order_info"
}
import_coupon_use(){
import_data coupon_use "select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from coupon_use"
}
import_order_status_log(){
import_data order_status_log "select
id,
order_id,
order_status,
operate_time
from order_status_log"
}
import_user_info(){
import_data "user_info" "select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time
from user_info"
}
import_order_detail(){
import_data order_detail "select
id,
order_id,
sku_id,
sku_name,
order_price,
sku_num,
create_time,
source_type,
source_id,
split_total_amount,
split_activity_amount,
split_coupon_amount
from order_detail"
}
import_payment_info(){
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
payment_type,
trade_no,
total_amount,
subject,
payment_status,
create_time,
callback_time
from payment_info"
}
import_comment_info(){
import_data comment_info "select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time
from comment_info"
}
import_order_refund_info(){
import_data order_refund_info "select
id,
user_id,
order_id,
sku_id,
refund_type,
refund_num,
refund_amount,
refund_reason_type,
refund_status,
create_time
from order_refund_info"
}
import_sku_info(){
import_data sku_info "select
id,
spu_id,
price,
sku_name,
sku_desc,
weight,
tm_id,
category3_id,
is_sale,
create_time
from sku_info"
}
import_base_category1(){
import_data "base_category1" "select
id,
name
from base_category1"
}
import_base_category2(){
import_data "base_category2" "select
id,
name,
category1_id
from base_category2"
}
import_base_category3(){
import_data "base_category3" "select
id,
name,
category2_id
from base_category3"
}
import_base_province(){
import_data base_province "select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from base_province"
}
import_base_region(){
import_data base_region "select
id,
region_name
from base_region"
}
import_base_trademark(){
import_data base_trademark "select
id,
tm_name
from base_trademark"
}
import_spu_info(){
import_data spu_info "select
id,
spu_name,
category3_id,
tm_id
from spu_info"
}
import_favor_info(){
import_data favor_info "select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from favor_info"
}
import_cart_info(){
import_data cart_info "select
id,
user_id,
sku_id,
cart_price,
sku_num,
sku_name,
create_time,
operate_time,
is_ordered,
order_time,
source_type,
source_id
from cart_info"
}
import_coupon_info(){
import_data coupon_info "select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from coupon_info"
}
import_activity_info(){
import_data activity_info "select
id,
activity_name,
activity_type,
start_time,
end_time,
create_time
from activity_info"
}
import_activity_rule(){
import_data activity_rule "select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from activity_rule"
}
import_base_dic(){
import_data base_dic "select
dic_code,
dic_name,
parent_code,
create_time,
operate_time
from base_dic"
}
import_order_detail_activity(){
import_data order_detail_activity "select
id,
order_id,
order_detail_id,
activity_id,
activity_rule_id,
sku_id,
create_time
from order_detail_activity"
}
import_order_detail_coupon(){
import_data order_detail_coupon "select
id,
order_id,
order_detail_id,
coupon_id,
coupon_use_id,
sku_id,
create_time
from order_detail_coupon"
}
import_refund_payment(){
import_data refund_payment "select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
total_amount,
subject,
refund_status,
create_time,
callback_time
from refund_payment"
}
import_sku_attr_value(){
import_data sku_attr_value "select
id,
attr_id,
value_id,
sku_id,
attr_name,
value_name
from sku_attr_value"
}
import_sku_sale_attr_value(){
import_data sku_sale_attr_value "select
id,
sku_id,
spu_id,
sale_attr_value_id,
sale_attr_id,
sale_attr_name,
sale_attr_value_name
from sku_sale_attr_value"
}
case $1 in
"order_info")
import_order_info
;;
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"base_province")
import_base_province
;;
"base_region")
import_base_region
;;
"base_trademark")
import_base_trademark
;;
"activity_info")
import_activity_info
;;
"cart_info")
import_cart_info
;;
"comment_info")
import_comment_info
;;
"coupon_info")
import_coupon_info
;;
"coupon_use")
import_coupon_use
;;
"favor_info")
import_favor_info
;;
"order_refund_info")
import_order_refund_info
;;
"order_status_log")
import_order_status_log
;;
"spu_info")
import_spu_info
;;
"activity_rule")
import_activity_rule
;;
"base_dic")
import_base_dic
;;
"order_detail_activity")
import_order_detail_activity
;;
"order_detail_coupon")
import_order_detail_coupon
;;
"refund_payment")
import_refund_payment
;;
"sku_attr_value")
import_sku_attr_value
;;
"sku_sale_attr_value")
import_sku_sale_attr_value
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
import_base_region
import_base_province
import_base_trademark
import_activity_info
import_cart_info
import_comment_info
import_coupon_use
import_coupon_info
import_favor_info
import_order_refund_info
import_order_status_log
import_spu_info
import_activity_rule
import_base_dic
import_order_detail_activity
import_order_detail_coupon
import_refund_payment
import_sku_attr_value
import_sku_sale_attr_value
;;
esac
说明1:
[ -n 变量值 ] 判断变量的值是否为空:
- 变量的值非空,返回true
- 变量的值为空,返回false
说明2:
查看date命令的使用:
[hadoop@hadoop102 ~]$ date --help
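例如,后文的每日同步脚本默认取前一天日期,可以先用date验证输出格式(示意):
# -d '-1 day' 表示前一天,+%F 输出 yyyy-MM-dd 格式
date -d '-1 day' +%F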
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod +x mysql_to_hdfs_init.sh
2.9.2.2 脚本使用
[hadoop@hadoop102 bin]$ mysql_to_hdfs_init.sh all 2020-06-14
2.9.3 业务数据每日同步脚本
2.9.3.1 脚本编写
(1)在/home/hadoop/bin目录下创建脚本mysql_to_hdfs.sh
[hadoop@hadoop102 bin]$ vim mysql_to_hdfs.sh
添加如下内容:
#! /bin/bash
APP=gmall
sqoop=/opt/module/sqoop/bin/sqoop
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d '-1 day' +%F`
fi
import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/$APP \
--username root \
--password 000000 \
--target-dir /origin_data/$APP/db/$1/$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date
}
import_order_info(){
import_data order_info "select
id,
total_amount,
order_status,
user_id,
payment_way,
delivery_address,
out_trade_no,
create_time,
operate_time,
expire_time,
tracking_no,
province_id,
activity_reduce_amount,
coupon_reduce_amount,
original_total_amount,
feight_fee,
feight_fee_reduce
from order_info
where (date_format(create_time,'%Y-%m-%d')='$do_date'
or date_format(operate_time,'%Y-%m-%d')='$do_date')"
}
import_coupon_use(){
import_data coupon_use "select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time,
expire_time
from coupon_use
where (date_format(get_time,'%Y-%m-%d')='$do_date'
or date_format(using_time,'%Y-%m-%d')='$do_date'
or date_format(used_time,'%Y-%m-%d')='$do_date'
or date_format(expire_time,'%Y-%m-%d')='$do_date')"
}
import_order_status_log(){
import_data order_status_log "select
id,
order_id,
order_status,
operate_time
from order_status_log
where date_format(operate_time,'%Y-%m-%d')='$do_date'"
}
import_user_info(){
import_data "user_info" "select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time
from user_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"
}
import_order_detail(){
import_data order_detail "select
id,
order_id,
sku_id,
sku_name,
order_price,
sku_num,
create_time,
source_type,
source_id,
split_total_amount,
split_activity_amount,
split_coupon_amount
from order_detail
where DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'"
}
import_payment_info(){
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
payment_type,
trade_no,
total_amount,
subject,
payment_status,
create_time,
callback_time
from payment_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"
}
import_comment_info(){
import_data comment_info "select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time
from comment_info
where date_format(create_time,'%Y-%m-%d')='$do_date'"
}
import_order_refund_info(){
import_data order_refund_info "select
id,
user_id,
order_id,
sku_id,
refund_type,
refund_num,
refund_amount,
refund_reason_type,
refund_status,
create_time
from order_refund_info
where date_format(create_time,'%Y-%m-%d')='$do_date'"
}
import_sku_info(){
import_data sku_info "select
id,
spu_id,
price,
sku_name,
sku_desc,
weight,
tm_id,
category3_id,
is_sale,
create_time
from sku_info where 1=1"
}
import_base_category1(){
import_data "base_category1" "select
id,
name
from base_category1 where 1=1"
}
import_base_category2(){
import_data "base_category2" "select
id,
name,
category1_id
from base_category2 where 1=1"
}
import_base_category3(){
import_data "base_category3" "select
id,
name,
category2_id
from base_category3 where 1=1"
}
import_base_province(){
import_data base_province "select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from base_province
where 1=1"
}
import_base_region(){
import_data base_region "select
id,
region_name
from base_region
where 1=1"
}
import_base_trademark(){
import_data base_trademark "select
id,
tm_name
from base_trademark
where 1=1"
}
import_spu_info(){
import_data spu_info "select
id,
spu_name,
category3_id,
tm_id
from spu_info
where 1=1"
}
import_favor_info(){
import_data favor_info "select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from favor_info
where 1=1"
}
import_cart_info(){
import_data cart_info "select
id,
user_id,
sku_id,
cart_price,
sku_num,
sku_name,
create_time,
operate_time,
is_ordered,
order_time,
source_type,
source_id
from cart_info
where 1=1"
}
import_coupon_info(){
import_data coupon_info "select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from coupon_info
where 1=1"
}
import_activity_info(){
import_data activity_info "select
id,
activity_name,
activity_type,
start_time,
end_time,
create_time
from activity_info
where 1=1"
}
import_activity_rule(){
import_data activity_rule "select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from activity_rule
where 1=1"
}
import_base_dic(){
import_data base_dic "select
dic_code,
dic_name,
parent_code,
create_time,
operate_time
from base_dic
where 1=1"
}
import_order_detail_activity(){
import_data order_detail_activity "select
id,
order_id,
order_detail_id,
activity_id,
activity_rule_id,
sku_id,
create_time
from order_detail_activity
where date_format(create_time,'%Y-%m-%d')='$do_date'"
}
import_order_detail_coupon(){
import_data order_detail_coupon "select
id,
order_id,
order_detail_id,
coupon_id,
coupon_use_id,
sku_id,
create_time
from order_detail_coupon
where date_format(create_time,'%Y-%m-%d')='$do_date'"
}
import_refund_payment(){
import_data refund_payment "select
id,
out_trade_no,
order_id,
sku_id,
payment_type,
trade_no,
total_amount,
subject,
refund_status,
create_time,
callback_time
from refund_payment
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
or DATE_FORMAT(callback_time,'%Y-%m-%d')='$do_date')"
}
import_sku_attr_value(){
import_data sku_attr_value "select
id,
attr_id,
value_id,
sku_id,
attr_name,
value_name
from sku_attr_value
where 1=1"
}
import_sku_sale_attr_value(){
import_data sku_sale_attr_value "select
id,
sku_id,
spu_id,
sale_attr_value_id,
sale_attr_id,
sale_attr_name,
sale_attr_value_name
from sku_sale_attr_value
where 1=1"
}
case $1 in
"order_info")
import_order_info
;;
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"base_province")
import_base_province
;;
"activity_info")
import_activity_info
;;
"cart_info")
import_cart_info
;;
"comment_info")
import_comment_info
;;
"coupon_info")
import_coupon_info
;;
"coupon_use")
import_coupon_use
;;
"favor_info")
import_favor_info
;;
"order_refund_info")
import_order_refund_info
;;
"order_status_log")
import_order_status_log
;;
"spu_info")
import_spu_info
;;
"activity_rule")
import_activity_rule
;;
"base_dic")
import_base_dic
;;
"order_detail_activity")
import_order_detail_activity
;;
"order_detail_coupon")
import_order_detail_coupon
;;
"refund_payment")
import_refund_payment
;;
"sku_attr_value")
import_sku_attr_value
;;
"sku_sale_attr_value")
import_sku_sale_attr_value
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
import_base_trademark
import_activity_info
import_cart_info
import_comment_info
import_coupon_use
import_coupon_info
import_favor_info
import_order_refund_info
import_order_status_log
import_spu_info
import_activity_rule
import_base_dic
import_order_detail_activity
import_order_detail_coupon
import_refund_payment
import_sku_attr_value
import_sku_sale_attr_value
;;
esac
(2)增加脚本执行权限
[hadoop@hadoop102 bin]$ chmod +x mysql_to_hdfs.sh
2.9.3.2 脚本使用
[hadoop@hadoop102 bin]$ mysql_to_hdfs.sh all 2020-06-15
2.10 安装Hive
2.10.1 解压安装
[hadoop@hadoop102 hive]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
2.10.2 重命名
[hadoop@hadoop102 module]$ mv apache-hive-3.1.2-bin/ hive
2.10.3 配置
[hadoop@hadoop102 conf]$ vim hive-site.xml
添加内容
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>000000</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop102</value>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
</configuration>
2.10.4 配置环境变量
# 配置环境变量
[hadoop@hadoop102 conf]$ sudo vim /etc/profile.d/my_env.sh
添加内容
# HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
# 使生效
[hadoop@hadoop102 conf]$ source /etc/profile.d/my_env.sh
2.10.5 拷贝驱动
[hadoop@hadoop102 mysql]$ cp /opt/software/mysql/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
2.10.6 登陆mysql
[hadoop@hadoop102 mysql]$ mysql -uroot -p000000
2.10.7 创建元数据库
mysql> create database metastore;
Query OK, 1 row affected (0.00 sec)
mysql> quit
Bye
2.10.8 初始化元数据库
[hadoop@hadoop102 mysql]$ schematool -initSchema -dbType mysql -verbose
2.10.9 验证Hive是否安装成功
[hadoop@hadoop102 mysql]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 762f0e4f-4a2a-45d8-9a1b-51c2432ab65f
Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 89ed2aaf-1470-4b72-931c-f23063aa0ddf
hive (default)> show databases;
OK
database_name
default
Time taken: 0.614 seconds, Fetched: 1 row(s)
2.11 安装spark
2.11.1 区别
Hive引擎包括:默认MR、tez、spark
- Hive on Spark:Hive既作为存储元数据又负责SQL的解析优化,语法是HQL语法,执行引擎变成了Spark,Spark负责采用RDD执行。
- Spark on Hive : Hive只作为存储元数据,Spark负责SQL解析优化,语法是Spark SQL语法,Spark负责采用RDD执行。
2.11.2 安装
兼容性说明
注意:官网下载的Hive3.1.2和Spark3.0.0默认是不兼容的。因为Hive3.1.2支持的Spark版本是2.4.5,所以需要我们重新编译Hive3.1.2版本。
编译步骤:官网下载Hive3.1.2源码,修改pom文件中引用的Spark版本为3.0.0,如果编译通过,直接打包获取jar包。如果报错,就根据提示,修改相关方法,直到不报错,打包获取jar包。
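编译命令示意如下(假设已下载并解压Hive 3.1.2源码,且已把pom.xml中的spark.version改为3.0.0;实际编译可能还需按报错修改若干代码):
# 假设源码解压在 /opt/software/apache-hive-3.1.2-src
cd /opt/software/apache-hive-3.1.2-src
# 跳过测试直接打包,产物一般位于 packaging/target 目录下
mvn clean package -Pdist -DskipTests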
# 解压安装
[hadoop@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
# 重命名
[hadoop@hadoop102 module]$ mv spark-3.0.0-bin-hadoop3.2/ spark
# 添加环境变量
[hadoop@hadoop102 spark]$ sudo vim /etc/profile.d/my_env.sh
# 内容
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
# 使环境变量生效
[hadoop@hadoop102 spark]$ source /etc/profile.d/my_env.sh
2.11.3 配置
# 创建配置文件
[hadoop@hadoop102 module]$ vim /opt/module/hive/conf/spark-defaults.conf
# 添加内容
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
# 在hdfs创建/spark-history路径
[hadoop@hadoop102 module]$ hadoop dfs -mkdir /spark-history
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.
[hadoop@hadoop102 module]$ hadoop dfs -ls
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.
ls: `.': No such file or directory
[hadoop@hadoop102 module]$ hadoop dfs -ls /
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2021-10-06 20:00 /origin_data
drwxr-xr-x - hadoop supergroup 0 2021-10-07 00:58 /spark-history
drwxr-xr-x - hadoop supergroup 0 2021-10-06 18:14 /test
drwxrwx--- - hadoop supergroup 0 2021-10-06 21:29 /tmp
2.11.4 Upload the Spark "without hadoop" (pure) jars to HDFS
- Note 1: the regular Spark 3.0.0 distribution ships with Hive 2.3.7 support by default, which conflicts with the installed Hive 3.1.2. The "without hadoop" (pure) jars contain no Hadoop or Hive dependencies and avoid the conflict.
- Note 2: Hive jobs are ultimately executed by Spark, and Spark resources are scheduled by YARN, so a job may be assigned to any node in the cluster. The Spark dependencies therefore need to be uploaded to an HDFS path that every node can read; a quick check follows the upload commands below.
# Extract the without-hadoop archive
[hadoop@hadoop102 spark]$ tar -zxvf spark-3.0.0-bin-without-hadoop.tgz
# Create the /spark-jars directory on HDFS
[hadoop@hadoop102 spark]$ hdfs dfs -mkdir /spark-jars
# Upload the without-hadoop jars to HDFS
[hadoop@hadoop102 spark]$ hdfs dfs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
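Optional check (a quick sketch, not part of the original transcript): confirm that the jars landed in HDFS and are readable cluster-wide.
# List a few of the uploaded jars and count them
hdfs dfs -ls /spark-jars | head -n 5
hdfs dfs -count /spark-jars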
2.11.5 Modify hive-site.xml
# Edit the configuration file
[hadoop@hadoop102 spark]$ vim /opt/module/hive/conf/hive-site.xml
# Add the following
<!-- Location of the Spark dependencies (note: port 8020 must match the NameNode port) -->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
# Restart the cluster
[hadoop@hadoop102 ~]$ hdp.sh stop
=================== 关闭 hadoop集群 ===================
--------------- 关闭 historyserver ---------------
--------------- 关闭 yarn ---------------
Stopping nodemanagers
hadoop104: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop103: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
Stopping resourcemanager
--------------- 关闭 hdfs ---------------
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[hadoop@hadoop102 ~]$ hdp.sh start
=================== 启动 hadoop集群 ===================
--------------- 启动 hdfs ---------------
Starting namenodes on [hadoop102]
Starting datanodes
Starting secondary namenodes [hadoop104]
--------------- 启动 yarn ---------------
Starting resourcemanager
Starting nodemanagers
--------------- 启动 historyserver ---------------
2.11.6 Increase the ApplicationMaster resource ratio
The capacity scheduler limits the resources that concurrently running ApplicationMasters may occupy in each queue. The limit is controlled by yarn.scheduler.capacity.maximum-am-resource-percent, whose default is 0.1, meaning the ApplicationMasters in a queue may use at most 10% of that queue's resources. The purpose is to stop ApplicationMasters from taking most of the resources and starving the Map/Reduce tasks.
In production the default is usually fine. In a learning environment, however, total cluster resources are small; with only 10% reserved for ApplicationMasters, a single AM may already hit the cap, so only one job can run at a time. It is therefore reasonable to raise the value here.
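A rough illustration with assumed numbers (not measured on this cluster): suppose the default queue has 24 GB of memory in total (three NodeManagers with 8 GB each) and each ApplicationMaster needs about 1 GB.
# maximum-am-resource-percent = 0.1  ->  24 GB x 0.1 = 2.4 GB for AMs  ->  at most ~2 jobs can run at once
# maximum-am-resource-percent = 0.8  ->  24 GB x 0.8 = 19.2 GB for AMs ->  the AM limit is no longer the bottleneck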
# Edit the configuration file
[hadoop@hadoop102 spark]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml
# Add the following
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.8</value>
</property>
# Distribute to the other nodes
[hadoop@hadoop102 spark]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/capacity-scheduler.xml
==================== hadoop102 ====================
sending incremental file list
sent 77 bytes received 12 bytes 178.00 bytes/sec
total size is 8,260 speedup is 92.81
==================== hadoop103 ====================
sending incremental file list
capacity-scheduler.xml
sent 868 bytes received 107 bytes 1,950.00 bytes/sec
total size is 8,260 speedup is 8.47
==================== hadoop104 ====================
sending incremental file list
capacity-scheduler.xml
sent 868 bytes received 107 bytes 1,950.00 bytes/sec
total size is 8,260 speedup is 8.47
2.11.7 Test
[hadoop@hadoop102 spark]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/opt/module/kafka/bin:/opt/module/hive/bin:/opt/module/spark/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = a4481e73-37f7-42da-a8e8-3c2af926495b
Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = e71805fb-3229-4654-88ee-10cec1ff16e8
hive (default)> create table student(id int, name string);
OK
Time taken: 0.951 seconds
hive (default)> insert into table student values(1,'abc');
Query ID = hadoop_20211007011117_f6444aac-6394-4733-b1a3-61c6a45590b5
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Running with YARN Application = application_1633523263782_0055
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1633523263782_0055
Hive on Spark Session Web UI URL: http://hadoop103:40693
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 1 1 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.14 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 6.14 second(s)
Loading data to table default.student
OK
col1 col2
Time taken: 22.071 seconds
2.12 Install HBase
2.12.1 Extract to the target directory
# Extract
[hadoop@hadoop102 hbase]$ tar -zxvf hbase-2.0.5-bin.tar.gz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv hbase-2.0.5/ hbase
2.12.2 Configuration
[hadoop@hadoop102 hbase-2.0.5]$ vim /opt/module/hbase-2.0.5/conf/hbase-site.xml
# Add the following
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop102:8020/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop102,hadoop103,hadoop104</value>
</property>
# Configure HBase not to manage its own ZooKeeper
[hadoop@hadoop102 hbase-2.0.5]$ vim /opt/module/hbase-2.0.5/conf/hbase-env.sh
export HBASE_MANAGES_ZK=false
# Configure the region servers
[hadoop@hadoop102 hbase-2.0.5]$ vim /opt/module/hbase-2.0.5/conf/regionservers
hadoop102
hadoop103
hadoop104
2.12.3 Create symbolic links
[hadoop@hadoop102 module]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml /opt/module/hbase/conf/core-site.xml
[hadoop@hadoop102 module]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml /opt/module/hbase/conf/hdfs-site.xml
2.12.4 Configure environment variables
[hadoop@hadoop102 hbase]$ sudo vim /etc/profile.d/my_env.sh
# Add the following
# HBASE_HOME
export HBASE_HOME=/opt/module/hbase
export PATH=$PATH:$HBASE_HOME/bin
# Reload the environment variables
[hadoop@hadoop102 hbase]$ source /etc/profile.d/my_env.sh
2.12.5 Distribute
# Distribute the HBase installation
[hadoop@hadoop102 module]$ xsync hbase/
# Distribute the environment variables
[hadoop@hadoop102 hbase]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
==================== hadoop102 ====================
root@hadoop102's password:
root@hadoop102's password:
sending incremental file list
sent 48 bytes received 12 bytes 24.00 bytes/sec
total size is 548 speedup is 9.13
==================== hadoop103 ====================
root@hadoop103's password:
root@hadoop103's password:
sending incremental file list
my_env.sh
sent 643 bytes received 41 bytes 273.60 bytes/sec
total size is 548 speedup is 0.80
==================== hadoop104 ====================
root@hadoop104's password:
root@hadoop104's password:
sending incremental file list
my_env.sh
sent 643 bytes received 41 bytes 456.00 bytes/sec
total size is 548 speedup is 0.80
2.12.6 Start the service
[hadoop@hadoop102 hbase]$ start-hbase.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
running master, logging to /opt/module/hbase/logs/hbase-hadoop-master-hadoop102.out
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop103: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop103.out
hadoop104: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop104.out
hadoop102: running regionserver, logging to /opt/module/hbase/logs/hbase-hadoop-regionserver-hadoop102.out
hadoop103: SLF4J: Class path contains multiple SLF4J bindings.
hadoop103: SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop103: SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop103: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hadoop103: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hadoop104: SLF4J: Class path contains multiple SLF4J bindings.
hadoop104: SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop104: SLF4J: Found binding in [jar:file:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
hadoop104: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hadoop104: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[hadoop@hadoop102 hbase]$ xcall.sh jps
============hadoop102================
22785 NameNode
23218 NodeManager
22582 QuorumPeerMain
27078 HMaster
27590 Jps
27243 HRegionServer
22941 DataNode
23389 JobHistoryServer
23791 Kafka
============hadoop103================
30499 QuorumPeerMain
30979 NodeManager
30597 DataNode
31589 Kafka
30776 ResourceManager
32060 HRegionServer
32254 Jps
============hadoop104================
27393 Jps
25895 SecondaryNameNode
25784 DataNode
25980 NodeManager
26444 Kafka
26476 Application
25693 QuorumPeerMain
[hadoop@hadoop102 hbase]$
2.12.7 Access the web UI
http://hadoop102:16010/master-status
2.12.8 Problems that may occur when starting the service
[hadoop@hadoop102 hbase]$ bin/hbase-daemon.sh start master
[hadoop@hadoop102 hbase]$ bin/hbase-daemon.sh start regionserver
Note: if the node clocks in the cluster are not synchronized, the region servers will fail to start with a ClockOutOfSyncException.
How to fix it:
a. Synchronize the time service (a minimal manual-sync sketch follows the XML snippet below)
See the companion document 《尚硅谷大数据技术之Hadoop入门》 for details.
b. Set a larger value for the hbase.master.maxclockskew property
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
<description>Time difference of regionserver from master</description>
</property>
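For a one-off manual sync across the three nodes, a minimal sketch reusing the xcall.sh helper shown earlier (assumes ntpdate is installed on every node; substitute your preferred NTP server):
# run the same sync command on hadoop102/103/104 in one go
xcall.sh sudo ntpdate time.windows.com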
2.13 Install Solr
2.13.1 Create a system user
- hadoop102
[hadoop@hadoop102 hbase]$ sudo useradd solr
[hadoop@hadoop102 hbase]$ sudo echo solr | passwd --stdin solr
Only root can do that.
[hadoop@hadoop102 hbase]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.
- hadoop103
[hadoop@hadoop103 module]$ sudo useradd solr
[hadoop@hadoop103 module]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.
- hadoop104
[hadoop@hadoop104 conf]$ sudo echo solr | sudo passwd --stdin solr
Changing password for user solr.
passwd: all authentication tokens updated successfully.
2.13.2 Extract and rename
# Extract
[hadoop@hadoop102 solr]$ tar -zxvf solr-7.7.3.tgz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv solr-7.7.3/ solr
2.13.3 Distribute
[hadoop@hadoop102 module]$ xsync solr/
2.13.4 Change ownership to the solr user
[hadoop@hadoop102 module]$ sudo chown -R solr:solr solr/
[hadoop@hadoop103 module]$ sudo chown -R solr:solr solr/
[hadoop@hadoop104 module]$ sudo chown -R solr:solr solr/
2.13.5 Modify the configuration file
[hadoop@hadoop102 module]$ sudo vim solr/bin/solr.in.sh
# Add the following
ZK_HOST="hadoop102:2181,hadoop103:2181,hadoop104:2181"
2.13.6 Start the service
[hadoop@hadoop102 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] *** Your Max Processes Limit is currently 4096.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=28181). Happy searching!
[hadoop@hadoop103 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] *** Your Max Processes Limit is currently 4096.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.
Started Solr server on port 8983 (pid=32545). Happy searching!
[hadoop@hadoop104 module]$ sudo -i -u solr /opt/module/solr/bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] *** Your Max Processes Limit is currently 4096.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
NOTE: Please install lsof as this script needs it to determine if Solr is listening on port 8983.
Started Solr server on port 8983 (pid=27686). Happy searching!
Note: the warnings above mean that Solr recommends raising both the maximum number of open files and the maximum number of processes to 65000, while the system defaults are lower. The steps below show how to raise them; a re-login is required afterwards for the new limits to take effect, and the change can be skipped for now. A quick way to check the current limits follows these steps.
(1) Raise the open-file limit
Edit /etc/security/limits.conf and add the following:
* soft nofile 65000
* hard nofile 65000
(2) Raise the process limit
Edit /etc/security/limits.d/20-nproc.conf and add the following:
* soft nproc 65000
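After logging in again, the effective limits can be checked with a quick sketch like this:
# max open files for the current session
ulimit -n
# max user processes for the current session
ulimit -u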
2.13.7 Access the web UI
http://hadoop103:8983/solr/#/
2.14 Install Atlas
2.14.1 Extract and rename
[hadoop@hadoop102 atlas]$ tar -zxvf apache-atlas-2.1.0-bin.tar.gz -C /opt/module/
[hadoop@hadoop102 module]$ mv apache-atlas-2.1.0/ atlas
2.14.2 Configuration
# Configure HBase
[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following
atlas.graph.storage.hostname=hadoop102:2181,hadoop103:2181,hadoop104:2181
# Modify the environment script
[hadoop@hadoop102 atlas]$ vim conf/atlas-env.sh
# Add the following
export HBASE_CONF_DIR=/opt/module/hbase/conf
2.14.3 Configure Solr
[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=hadoop102:2181,hadoop103:2181,hadoop104:2181
# Create the collections
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create -c vertex_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'vertex_index' with 3 shard(s), 2 replica(s) with config-set 'vertex_index'
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create -c edge_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'edge_index' with 3 shard(s), 2 replica(s) with config-set 'edge_index'
[hadoop@hadoop102 atlas]$ sudo -i -u solr /opt/module/solr/bin/solr create -c fulltext_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
Created collection 'fulltext_index' with 3 shard(s), 2 replica(s) with config-set 'fulltext_index'
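To confirm that all three collections were created, the Solr Collections API can be queried (an optional sketch; any of the three Solr nodes works):
# optional check via the Solr Collections API
curl "http://hadoop102:8983/solr/admin/collections?action=LIST"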
2.14.4 Configure Kafka
[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following
atlas.notification.embedded=false
atlas.kafka.data=/opt/module/kafka/data
atlas.kafka.zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
atlas.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
2.14.5 Configure Atlas
[hadoop@hadoop102 atlas]$ vim conf/atlas-application.properties
# Add the following
######### Server Properties #########
atlas.rest.address=http://hadoop102:21000
# If enabled and set to true, this will run setup steps when the server starts
atlas.server.run.setup.on.start=false
######### Entity Audit Configs #########
atlas.audit.hbase.zookeeper.quorum=hadoop102:2181,hadoop103:2181,hadoop104:2181
# Configure logging
[hadoop@hadoop102 conf]$ vim atlas-log4j.xml
# Uncomment the following block
<appender name="perf_appender" class="org.apache.log4j.DailyRollingFileAppender">
<param name="file" value="${atlas.log.dir}/atlas_perf.log" />
<param name="datePattern" value="'.'yyyy-MM-dd" />
<param name="append" value="true" />
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d|%t|%m%n" />
</layout>
</appender>
<logger name="org.apache.atlas.perf" additivity="false">
<level value="debug" />
<appender-ref ref="perf_appender" />
</logger>
2.14.6 Kerberos-related configuration
If the Hadoop cluster has Kerberos authentication enabled, Atlas must authenticate with Kerberos before it can interact with the cluster. If Kerberos is not enabled, this section can be skipped.
1. Create a Kerberos principal for Atlas and generate its keytab file
[hadoop@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"addprinc -randkey atlas/hadoop102"
[hadoop@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"xst -k /etc/security/keytab/atlas.service.keytab atlas/hadoop102"
2. Add the following parameters to /opt/module/atlas/conf/atlas-application.properties
atlas.authentication.method=kerberos
atlas.authentication.principal=atlas/hadoop102@EXAMPLE.COM
atlas.authentication.keytab=/etc/security/keytab/atlas.service.keytab
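A quick way to confirm the principal and keytab work before starting Atlas (a sketch, assuming the principal and realm shown above):
# assumes the atlas principal and keytab created above
kinit -kt /etc/security/keytab/atlas.service.keytab atlas/hadoop102@EXAMPLE.COM
klist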
2.15 Integrate Atlas with Hive
1. Install the Hive Hook
1) Extract the Hive Hook
[hadoop@hadoop102 ~]# tar -zxvf apache-atlas-2.1.0-hive-hook.tar.gz
2) Copy the Hive Hook dependencies into the Atlas installation directory
[hadoop@hadoop102 ~]# cp -r apache-atlas-hive-hook-2.1.0/* /opt/module/atlas/
3) Modify /opt/module/hive/conf/hive-env.sh
Note: rename the template file first
[hadoop@hadoop102 ~]# mv hive-env.sh.template hive-env.sh
Add the following parameter:
export HIVE_AUX_JARS_PATH=/opt/module/atlas/hook/hive
2. Modify the Hive configuration: add the following property to /opt/module/hive/conf/hive-site.xml to register the Hive Hook.
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
3. Set the following parameters in /opt/module/atlas/conf/atlas-application.properties
######### Hive Hook Configs #######
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary
4. Copy the Atlas configuration file /opt/module/atlas/conf/atlas-application.properties to the /opt/module/hive/conf directory
[hadoop@hadoop102 ~]# cp /opt/module/atlas/conf/atlas-application.properties /opt/module/hive/conf/
- Start the service
[hadoop@hadoop102 atlas]$ bin/atlas_start.py
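Atlas can take a few minutes to become available after atlas_start.py returns. A hedged readiness check, assuming the default admin/admin credentials and the REST address configured above:
# default credentials are admin/admin unless they have been changed
curl -u admin:admin http://hadoop102:21000/api/atlas/admin/version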
2.16 Additional topics
2.16.1 Building Atlas from source
Install Maven
1) Download Maven: https://maven.apache.org/download.cgi
2) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory on the Linux host
3) Extract apache-maven-3.6.1-bin.tar.gz into /opt/module/
[root@hadoop102 software]# tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/module/
4) Rename apache-maven-3.6.1 to maven
[root@hadoop102 module]# mv apache-maven-3.6.1/ maven
5) Add environment variables to /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven
export PATH=$PATH:$MAVEN_HOME/bin
6) Verify the installation
[root@hadoop102 module]# source /etc/profile
[root@hadoop102 module]# mvn -v
7) Modify settings.xml to use the Aliyun mirror
[root@hadoop102 module]# cd /opt/module/maven/conf/
[root@hadoop102 maven]# vim settings.xml
<!-- Add the Aliyun mirror -->
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
<mirror>
<id>UK</id>
<name>UK Central</name>
<url>http://uk.maven.org/maven2</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>repo1</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo1.maven.org/maven2/</url>
</mirror>
<mirror>
<id>repo2</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo2.maven.org/maven2/</url>
</mirror>
Build the Atlas source
1) Upload apache-atlas-2.1.0-sources.tar.gz to the /opt/software directory on hadoop102
2) Extract apache-atlas-2.1.0-sources.tar.gz into /opt/module/
[root@hadoop102 software]# tar -zxvf apache-atlas-2.1.0-sources.tar.gz -C /opt/module/
3) Download the Atlas dependencies and build
[root@hadoop102 software]# export MAVEN_OPTS="-Xms2g -Xmx2g"
[root@hadoop102 software]# cd /opt/module/apache-atlas-sources-2.1.0/
[root@hadoop102 apache-atlas-sources-2.1.0]# mvn clean -DskipTests install
[root@hadoop102 apache-atlas-sources-2.1.0]# mvn clean -DskipTests package -Pdist
# This must be executed from ${atlas_home}
[root@hadoop102 apache-atlas-sources-2.1.0]# cd distro/target/
[root@hadoop102 target]# mv apache-atlas-2.1.0-server.tar.gz /opt/software/
[root@hadoop102 target]# mv apache-atlas-2.1.0-hive-hook.tar.gz /opt/software/
Note: the build takes a long time (roughly half an hour) and downloads many dependencies. If it fails, the cause is usually a network timeout; just retry.
Atlas memory configuration
If you plan to store tens of thousands of metadata objects, it is recommended to tune the JVM parameters for the best GC performance. Common server-side options are shown below.
1) Modify /opt/module/atlas/conf/atlas-env.sh
# Atlas server JVM options
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"
# Recommended heap settings for JDK 1.7
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=3072m -XX:PermSize=100M -XX:MaxPermSize=512m"
# Recommended heap settings for JDK 1.8
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"
# Additionally required on Mac OS
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Parameter note: -XX:SoftRefLRUPolicyMSPerMB is particularly useful for GC performance under query-heavy workloads with many concurrent users.
Configure the username and password
Atlas supports the following authentication methods: file, Kerberos, and LDAP.
Each of the three methods can be enabled or disabled in atlas-application.properties:
atlas.authentication.method.kerberos=true|false
atlas.authentication.method.ldap=true|false
atlas.authentication.method.file=true|false
If two or more authentication methods are set to true, authentication falls back to the later method when the earlier one fails. For example, if both Kerberos and LDAP authentication are set to true, LDAP authentication is used as a fallback for requests that have no Kerberos principal and keytab.
This document only covers changing the username and password with file-based authentication; see the official documentation for the other methods.
1) Open the /opt/module/atlas/conf/users-credentials.properties file
[hadoop@hadoop102 conf]$ vim users-credentials.properties
#username=group::sha256-password
admin=ADMIN::8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918
rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d
(1) admin is the username
(2) 8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918 is the SHA-256 hash of the password; the default password is admin.
2) Example: change the username to atguigu and the password to atguigu
(1) Compute the SHA-256 hash of the password atguigu
[hadoop@hadoop102 conf]$ echo -n "atguigu"|sha256sum
2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2
(2) Update the username and password
[hadoop@hadoop102 conf]$ vim users-credentials.properties
#username=group::sha256-password
atguigu=ADMIN::2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2
rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d
2.17 Install Kylin
2.17.1 Extract and install
[hadoop@hadoop102 kylin]$ tar -zxvf apache-kylin-3.0.2-bin.tar.gz -C /opt/module/
# Rename
[hadoop@hadoop102 module]$ mv apache-kylin-3.0.2-bin/ kylin
2.17.2 Modify the compatibility configuration
# Add the following on line 43
[hadoop@hadoop102 kylin]$ vim /opt/module/kylin/bin/find-spark-dependency.sh
# Content to add
! -name '*jackson*' ! -name '*metastore*'
# The line after the modification
spark_dependency=`find -L $spark_home/jars -name '*.jar' ! -name '*slf4j*' ! -name '*jackson*' ! -name '*metastore*' ! -name '*calcite*' ! -name '*doc*' ! -name '*test*' ! -name '*sources*' -printf '%p:' | sed 's/:$//'`
if [ -z "$spark_dependency" ]
2.17.3 Start
Hadoop, HBase, and ZooKeeper must already be running.
[hadoop@hadoop102 bin]$ ./kylin.sh start
2.17.4 Access the web management UI
http://hadoop102:7070/kylin