Big Data Development: Component Deployment and Data Extraction

With this done, I probably won't have to go work on a factory floor after all.



Contents

I. Big Data Cluster Component Deployment

1. Basic environment configuration

2. Hadoop cluster deployment

2.1 JDK installation and setup

2.2 Hadoop cluster configuration

3. Hive component deployment

3.1 MySQL database deployment

3.2 Hive deployment

4. Spark component deployment

5. Flink component deployment

6. Flume component deployment

7. ZooKeeper component deployment

8. Kafka component deployment

9. Maxwell component deployment

10. Redis component deployment

11. HBase component deployment

12. ClickHouse component deployment

II. Data Extraction

2.1 Real-time data extraction

2.1.1 Maxwell data extraction

2.1.2 Port log data extraction

2.2 Offline data extraction

Real-time data analysis

4.1 Linux environment preparation

4.2 Windows environment preparation

4.3 Real-time analysis code development

4.4 Compute the total sales per SKU and store the result in Redis

🥇Summary



I. Big Data Cluster Component Deployment

1. Basic environment configuration

Hostname    IP address        User    Password
master      192.168.1.100     root    password
slave1      192.168.1.101     root    password
slave2      192.168.1.102     root    password

Set the hostname on each of the three nodes.

master:

[root@master ~]# hostnamectl set-hostname master

slave1:

[root@localhost ~]# hostnamectl set-hostname slave1
[root@localhost ~]# bash
[root@slave1 ~]# 

slave2:

[root@localhost ~]# hostnamectl set-hostname slave2
[root@localhost ~]# bash
[root@slave2 ~]# 

Configure a static IP address on each node:

master:

Command:

[root@master ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens35

File contents:

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens35
DEVICE=ens35
ONBOOT=yes
IPADDR=192.168.1.100
NETMASK=255.255.255.0
GATEWAY=192.168.1.254

DNS1=114.114.114.114
DNS2=8.8.8.8

Restart the network service:

[root@master ~]# systemctl restart network

Check the IP address:

[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:69:e9:e8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.100/24 brd 192.168.1.255 scope global dynamic ens35
       valid_lft 1111sec preferred_lft 1111sec
    inet6 fe80::2859:3941:9736:7030/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN qlen 1000
    link/ether 52:54:00:00:28:5e brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
    link/ether 52:54:00:00:28:5e brd ff:ff:ff:ff:ff:ff
[root@master ~]# 

slave1:

Command:

[root@slave1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens35

File contents:

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens35
DEVICE=ens35
ONBOOT=yes
IPADDR=192.168.1.101
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=114.114.114.114
DNS2=8.8.8.8

Restart the network service:

[root@slave1 ~]# systemctl restart network

Check the IP address:

[root@slave1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:75:1f:92 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global dynamic ens35
       valid_lft 1507sec preferred_lft 1507sec
    inet6 fe80::f66b:758e:9c24:88b9/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN qlen 1000
    link/ether 52:54:00:4b:05:cc brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
    link/ether 52:54:00:4b:05:cc brd ff:ff:ff:ff:ff:ff
[root@slave1 ~]# 

slave2:

Command:

[root@slave2 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens35

File contents:

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens35
DEVICE=ens35
ONBOOT=yes
IPADDR=192.168.1.102
NETMASK=255.255.255.0
GATEWAY=192.168.1.254

DNS1=114.114.114.114
DNS2=8.8.8.8

Restart the network service:

[root@slave2 ~]# systemctl restart network

Check the IP address:

[root@slave2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:4b:92:6c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.102/24 brd 192.168.1.255 scope global dynamic ens35
       valid_lft 416sec preferred_lft 416sec
    inet6 fe80::2162:1337:a894:11ce/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN qlen 1000
    link/ether 52:54:00:4b:05:cc brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
    link/ether 52:54:00:4b:05:cc brd ff:ff:ff:ff:ff:ff
[root@slave2 ~]# 

On the master node, map the IP addresses to hostnames in /etc/hosts:

Command:

[root@master ~]# vi /etc/hosts

Contents:

192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2

Copy master's /etc/hosts to the /etc directory on slave1 and slave2:

[root@master ~]# scp /etc/hosts slave1:/etc/
The authenticity of host 'slave1 (192.168.1.101)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,192.168.1.101' (ECDSA) to the list of known hosts.
root@slave1's password: 
hosts                                             100%  221    64.7KB/s   00:00    
[root@master ~]# scp /etc/hosts slave2:/etc/
The authenticity of host 'slave2 (192.168.1.102)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave2,192.168.1.102' (ECDSA) to the list of known hosts.
root@slave2's password: 
hosts                                             100%  221   257.8KB/s   00:00    
[root@master ~]#

Note: scp -r copies a directory recursively; plain scp copies a single file.

Set up passwordless SSH login between the three nodes:

master:

[root@master ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:s3RzKkZ7vcTmll5uQW0ldRuil5NGgC34K/qe33+OgkI root@master
The key's randomart image is:
+---[RSA 2048]----+
|        . o..o oo|
|       . o .o = =|
|        . .. * +.|
|         .  o o o|
|        S + .. . |
|       E * *  .  |
|      o * + =... |
|     . o.+.=o+o. |
|      o+o. +*++. |
+----[SHA256]-----+
[root@master ~]# ssh-copy-id master
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'master (192.168.1.100)' can't be established.
ECDSA key fingerprint is SHA256:2lG2/rO51PcX3P7MHU9/WjpTuuAVJS6yYZiLDA+gUZ4.
ECDSA key fingerprint is MD5:c9:06:dc:69:31:20:01:4f:bb:26:db:2e:e0:da:92:94.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@master's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'master'"
and check to make sure that only the key(s) you wanted were added.

[root@master ~]# ssh-copy-id slave1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave1's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave1'"
and check to make sure that only the key(s) you wanted were added.

[root@master ~]# ssh-copy-id slave2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave2's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave2'"
and check to make sure that only the key(s) you wanted were added.

[root@master ~]# 

slave1:

[root@slave1 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:wXPBu3R+MWJQv7fnyGhgvCrsNyiHbzzzJzQB/kkMHn4 root@slave1
The key's randomart image is:
+---[RSA 2048]----+
|         ....    |
|      +.  o. .   |
|     + =+ .o  .  |
|      + E+o + o. |
|       +S= = ..o.|
|        = = . ...|
|     + o o o . ..|
|    o X + o .o o.|
|     *o*o= .. o .|
+----[SHA256]-----+
[root@slave1 ~]# ssh-copy-id master
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'master (192.168.1.100)' can't be established.
ECDSA key fingerprint is SHA256:2lG2/rO51PcX3P7MHU9/WjpTuuAVJS6yYZiLDA+gUZ4.
ECDSA key fingerprint is MD5:c9:06:dc:69:31:20:01:4f:bb:26:db:2e:e0:da:92:94.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@master's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'master'"
and check to make sure that only the key(s) you wanted were added.

[root@slave1 ~]# ssh-copy-id slave1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'slave1 (192.168.1.101)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave1's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave1'"
and check to make sure that only the key(s) you wanted were added.

[root@slave1 ~]# ssh-copy-id slave2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'slave2 (192.168.1.102)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave2's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave2'"
and check to make sure that only the key(s) you wanted were added.

[root@slave1 ~]# 

slave2:

[root@slave2 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:RCXC1qIT61bJFnA1ycNG5ET4zaQEv6duUL7y7UvLq+4 root@slave2
The key's randomart image is:
+---[RSA 2048]----+
|    .o+@Xo.      |
|    ..**Oo.      |
|     * B+*       |
|    + =.+.o      |
|   . + oS .      |
|    o . .o       |
|   .   ....      |
|      ..o+ .     |
|       *EoBo     |
+----[SHA256]-----+
[root@slave2 ~]# ssh-copy-id master
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'master (192.168.1.100)' can't be established.
ECDSA key fingerprint is SHA256:2lG2/rO51PcX3P7MHU9/WjpTuuAVJS6yYZiLDA+gUZ4.
ECDSA key fingerprint is MD5:c9:06:dc:69:31:20:01:4f:bb:26:db:2e:e0:da:92:94.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@master's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'master'"
and check to make sure that only the key(s) you wanted were added.

[root@slave2 ~]# ssh-copy-id slave1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'slave1 (192.168.1.101)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave1's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave1'"
and check to make sure that only the key(s) you wanted were added.

[root@slave2 ~]# ssh-copy-id slave2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'slave2 (192.168.1.102)' can't be established.
ECDSA key fingerprint is SHA256:4lVQUkkZo5DlZodLDGbEP3NZKrLvXNW/qeIGRch1eNI.
ECDSA key fingerprint is MD5:a4:e8:a1:a8:45:d1:69:f5:94:d3:9c:99:1f:7d:c2:d5.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@slave2's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'slave2'"
and check to make sure that only the key(s) you wanted were added.

[root@slave2 ~]# 
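
To confirm that passwordless login now works in every direction, an optional quick check (not part of the original steps) is to loop over the three hosts from any node; each hostname should print without a password prompt:

# run on master (or on any of the three nodes)
for h in master slave1 slave2; do ssh $h hostname; done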

master:

Stop the firewall:

[root@master ~]# systemctl stop firewalld

Disable the firewall at boot:

[root@master ~]# systemctl disable firewalld

slave1:

Stop the firewall:

[root@slave1 ~]# systemctl stop firewalld

Disable the firewall at boot:

[root@slave1 ~]# systemctl disable firewalld

slave2:

Stop the firewall:

[root@slave2 ~]# systemctl stop firewalld

Disable the firewall at boot:

[root@slave2 ~]# systemctl disable firewalld
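
As an optional sanity check that the firewall is stopped and disabled on all three nodes (this relies on the passwordless SSH configured above):

# run on master; expect "inactive" and "disabled" for every node
for h in master slave1 slave2; do
    echo "== $h =="
    ssh $h "systemctl is-active firewalld; systemctl is-enabled firewalld"
done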

2. Hadoop cluster deployment

2.1 JDK installation and setup

Create the directory to extract into:

[root@master softwares]# mkdir /opt/module
[root@master softwares]# ls /opt/
module  softwares
[root@master softwares]# 

Extract the JDK:

[root@master softwares]# tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

Rename the JDK directory:

[root@master softwares]# ls /opt/module/
jdk1.8.0_212
[root@master softwares]# cd /opt/module/
[root@master module]# ls
jdk1.8.0_212
[root@master module]# mv jdk1.8.0_212/ jdk
[root@master module]# ls
jdk
[root@master module]# 

Edit the environment variables:

[root@master module]# vi /etc/profile

Content to add:

# JAVA_HOME 
export JAVA_HOME=/opt/module/jdk
export PATH="$JAVA_HOME/bin:$PATH"

Run source to apply the configuration:

[root@master module]# source /etc/profile
[root@master module]# 

Run java -version; if the version is printed, the configuration took effect:

[root@master module]# java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
[root@master module]# 

Create the module directory on slave1 and slave2:

[root@slave1 ~]# mkdir /opt/module/
[root@slave1 ~]# 
[root@slave2 ~]# mkdir /opt/module

Copy the JDK to slave1 and slave2:

[root@master module]# scp -r /opt/module/jdk/ slave1:/opt/module/jdk
[root@master module]# scp -r /opt/module/jdk/ slave2:/opt/module/jdk

Configure the environment variables on slave1 and slave2 (the same JAVA_HOME block as on master; see the sketch after this listing):

[root@slave1 jdk]# vi /etc/profile
[root@slave1 jdk]# source /etc/profile
[root@slave1 jdk]# java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
[root@slave1 jdk]# 
 [root@slave2 ~]# vi /etc/profile
[root@slave2 ~]# source /etc/profile
[root@slave2 ~]# java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
[root@slave2 ~]# 
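
For reference, the block appended to /etc/profile on slave1 and slave2 is the same JAVA_HOME block used on master; a minimal sketch, run on each slave, looks like this:

# append the JDK environment variables and reload the profile
cat >> /etc/profile <<'EOF'
# JAVA_HOME
export JAVA_HOME=/opt/module/jdk
export PATH="$JAVA_HOME/bin:$PATH"
EOF
source /etc/profile
java -version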

2.2 Hadoop cluster configuration

  1. Extract Hadoop to /opt/module/:

    [root@master softwares]# tar -zxvf /opt/softwares/hadoop-3.1.3.tar.gz -C /opt/module/
  2. Rename the directory and configure the Hadoop environment variables:

    [root@master module]# mv hadoop-3.1.3/ hadoop
    [root@master module]# ls
    hadoop  jdk
    [root@master module]# vi /etc/profile

    Content to add:

    # HADOOP_HOME
    export HADOOP_HOME=/opt/module/hadoop
    export PATH="$HADOOP_HOME/bin:$PATH"

    Run source /etc/profile and check the Hadoop version:

    [root@master module]# hadoop version
    Hadoop 3.1.3
    Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
    Compiled by ztang on 2019-09-12T02:47Z
    Compiled with protoc 2.5.0
    From source with checksum ec785077c385118ac91aadde5ec9799
    This command was run using /opt/module/hadoop/share/hadoop/common/hadoop-common-3.1.3.jar
    [root@master module]# 
  3. Edit the Hadoop configuration files:

    Core configuration file

    Configure core-site.xml

    [root@master hadoop]# vi /opt/module/hadoop/etc/hadoop/core-site.xml 

    File contents:

    <configuration>
            <!-- NameNode address -->
            <property>
                    <name>fs.defaultFS</name>
                    <value>hdfs://master:9000</value>
            </property>
            <!-- Hadoop data storage directory -->
            <property>
                    <name>hadoop.tmp.dir</name>
                    <value>/opt/module/hadoop/data</value>
            </property>
    </configuration>

HDFS configuration file:

Configure hdfs-site.xml

[root@master hadoop]# vim /opt/module/hadoop/etc/hadoop/hdfs-site.xml 

File contents:

<configuration>
        <!-- NameNode web UI address -->
        <property>
                <name>dfs.namenode.http-address</name>
                <value>master:50070</value>
        </property>
        <!-- HDFS replication factor -->
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>

YARN configuration file:

Configure yarn-site.xml

[root@master hadoop]# vim /opt/module/hadoop/etc/hadoop/yarn-site.xml 

File contents:

<configuration>

<!-- Site specific YARN configuration properties -->
<!-- Make MapReduce use the shuffle service -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
<!-- ResourceManager address -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>master</value>
        </property>
</configuration>

MapReduce configuration file:

Configure mapred-site.xml

[root@master hadoop]# vim /opt/module/hadoop/etc/hadoop/mapred-site.xml 

File contents:

<configuration>
<!-- Run MapReduce jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
<!-- JobHistory server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
<!-- JobHistory web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>

Configure workers

[root@master hadoop]# vim /opt/module/hadoop/etc/hadoop/workers

File contents:

master
slave1
slave2

Configure hadoop-env.sh

[root@master hadoop]# vi /opt/module/hadoop/etc/hadoop/hadoop-env.sh

Content to add:

export JAVA_HOME=/opt/module/jdk

Configure mapred-env.sh

[root@master hadoop]# vi /opt/module/hadoop/etc/hadoop/mapred-env.sh 

Content to add:

export JAVA_HOME=/opt/module/jdk

Configure yarn-env.sh

[root@master hadoop]# vi /opt/module/hadoop/etc/hadoop/yarn-env.sh 

Content to add:

export JAVA_HOME=/opt/module/jdk

Copy Hadoop to slave1 and slave2:

[root@master hadoop]# scp -r /opt/module/hadoop/ slave1:/opt/module/hadoop/
[root@master hadoop]# scp -r /opt/module/hadoop/ slave2:/opt/module/hadoop/

Format the NameNode:

[root@master hadoop]# hdfs namenode -format

Configure the HDFS and YARN users in /etc/profile:

[root@master hadoop]# vi /etc/profile

Content to add:

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

Run source /etc/profile to apply the changes:

[root@master hadoop]# source /etc/profile

Start the Hadoop cluster:

[root@master hadoop]# pwd
/opt/module/hadoop
[root@master hadoop]# ./sbin/start-all.sh 

Use jps to check that the Hadoop processes have started:

[root@master hadoop]# jps
7472 DataNode
7297 NameNode
7682 SecondaryNameNode
7939 ResourceManager
8467 Jps
8091 NodeManager
[root@master hadoop]# 

slave1:

[root@slave1 hadoop]# jps
5194 NodeManager
5325 Jps
5087 DataNode
[root@slave1 hadoop]# 

slave2:

[root@slave2 hadoop]# jps
5065 NodeManager
5196 Jps
4958 DataNode
[root@slave2 hadoop]# 
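
As an optional smoke test of HDFS (not part of the original steps), write a small file and read it back; the file and directory names here are arbitrary examples:

[root@master hadoop]# echo "hello hadoop" > /root/hello.txt
[root@master hadoop]# hdfs dfs -mkdir -p /tmp/smoke
[root@master hadoop]# hdfs dfs -put /root/hello.txt /tmp/smoke/
[root@master hadoop]# hdfs dfs -cat /tmp/smoke/hello.txt      # should print: hello hadoop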

3. Hive component deployment

3.1 MySQL database deployment

Installation packages: /opt/softwares/Mysql

Username: root

Password: 123456

  1. Remove mariadb

[root@master hadoop]# rpm -qa | grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64
[root@master hadoop]# rpm -e --nodeps mariadb-libs 
[root@master hadoop]# 
  2. Install MySQL with rpm, in the following order:

Package location:

[root@master softwares]# cd Mysql/
[root@master Mysql]# ls
01_mysql-community-common-5.7.16-1.el7.x86_64.rpm
02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm
03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm
04_mysql-community-client-5.7.16-1.el7.x86_64.rpm
05_mysql-community-server-5.7.16-1.el7.x86_64.rpm
mysql-connector-java-5.1.27-bin.jar
[root@master Mysql]# pwd
/opt/softwares/Mysql
[root@master Mysql]# 

Install the MySQL packages in order with rpm:

[root@master Mysql]# rpm -ivh 01_mysql-community-common-5.7.16-1.el7.x86_64.rpm 
[root@master Mysql]# rpm -ivh 02_mysql-community-libs-5.7.16-1.el7.x86_64.rpm 
[root@master Mysql]# rpm -ivh 03_mysql-community-libs-compat-5.7.16-1.el7.x86_64.rpm 
[root@master Mysql]# rpm -ivh 04_mysql-community-client-5.7.16-1.el7.x86_64.rpm 
[root@master Mysql]# rpm -ivh 05_mysql-community-server-5.7.16-1.el7.x86_64.rpm 

Verify the installation:

[root@master Mysql]# rpm -qa | grep mysql
mysql-community-server-5.7.16-1.el7.x86_64
mysql-community-libs-5.7.16-1.el7.x86_64
mysql-community-libs-compat-5.7.16-1.el7.x86_64
mysql-community-client-5.7.16-1.el7.x86_64
mysql-community-common-5.7.16-1.el7.x86_64
[root@master Mysql]#

Start MySQL:

[root@master Mysql]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service

Find the temporary password and use it to log in to MySQL:

[root@master Mysql]# grep 'temporary password' /var/log/mysqld.log 
2022-11-14T13:46:27.335510Z 1 [Note] A temporary password is generated for root@localhost: 
+N_r2eOrOvk*

[root@master Mysql]# mysql -uroot -p+N_r2eOrOvk*
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.16

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

Change the MySQL password and enable remote access:

mysql> set global validate_password_policy=0;
Query OK, 0 rows affected (0.00 sec)

mysql> set global validate_password_length=1;
Query OK, 0 rows affected (0.00 sec)

mysql> alter user 'root'@'localhost' identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to 'root'@'%' identified by '123456'; 
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> exit;

After exiting, log in with the new password to verify:

[root@master Mysql]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.16 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 
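
To double-check that the grant for 'root'@'%' really allows access over TCP (an optional check, assuming the mysql client is available on the node it is run from):

# connect via TCP to the master host instead of the local socket
mysql -h master -uroot -p123456 -e "select user, host from mysql.user;"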

3.2 Hive deployment

Extract Hive, rename the directory to hive, and add the Hive environment variables to /etc/profile:

[root@master Mysql]# tar -zxvf /opt/softwares/apache-hive-3.1.2-bin.tar.gz -C /opt/module/
[root@master Mysql]# mv /opt/module/apache-hive-3.1.2-bin/ /opt/module/hive
[root@master Mysql]# ls /opt/module/
hadoop  hive  jdk
[root@master Mysql]# vi /etc/profile
[root@master Mysql]# source /etc/profile
[root@master Mysql]# 

Environment variable content:

# HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH="$HIVE_HOME/bin:$PATH"

Configure the Hive metastore to use MySQL:

Copy the JDBC driver into Hive's lib directory:

[root@master Mysql]# cp /opt/softwares/Mysql/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
[root@master Mysql]# 

In Hive's conf directory, create hive-site.xml:

[root@master Mysql]# vi /opt/module/hive/conf/hive-site.xml

Configuration file contents:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://master:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>master</value>
    </property>

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>

    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

Initialize the metastore database:

Log in to MySQL and create the Hive metastore database:

[root@master Mysql]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.7.16 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database metastore;
Query OK, 1 row affected (0.00 sec)

mysql> quit;
Bye
[root@master Mysql]# 

Initialize the Hive metastore schema:

[root@master Mysql]# schematool -initSchema -dbType mysql -verbose

Start the Hive client and list the databases:

[root@master Mysql]# hive
which: no hbase in (/opt/module/hadoop/bin:/opt/module/hive/bin:/opt/module/jdk/bin:/usr/local/python3/bin:/opt/module/hadoop/bin:/opt/module/jdk/bin:/usr/local/python3/bin:/opt/module/hadoop/bin:/opt/module/jdk/bin:/usr/local/python3/bin:/opt/module/jdk/bin:/usr/local/python3/bin:/usr/local/python3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 69642fdc-0c17-4637-8e11-3415e32bfd4d

Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 1ee7f81d-bfad-4ab6-aa30-6c4c26976021
hive (default)> show databases;
OK
database_name
default
Time taken: 0.543 seconds, Fetched: 1 row(s)
hive (default)> 
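
An optional smoke test of the metastore is to create a throwaway database and table from the command line; the names test_db and t_smoke are arbitrary:

hive -e "create database if not exists test_db;
         use test_db;
         create table if not exists t_smoke(id int, name string);
         show tables;"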

4. Spark component deployment

Extract the Spark package, rename the directory to spark, and configure the environment variables:

[root@master softwares]# tar -zxvf /opt/softwares/spark-3.1.3-bin-hadoop3.2.tgz -C /opt/module/
[root@master softwares]# mv /opt/module/spark-3.1.3-bin-hadoop3.2/ /opt/module/spark
[root@master softwares]# vi /etc/profile

Environment variable content:

# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH="$SPARK_HOME/bin:$PATH"

Run source to apply the environment variables:

[root@master softwares]# source /etc/profile

Rename the workers.template file and configure the worker nodes:

[root@master conf]# mv workers.template workers
[root@master conf]# vim workers 

File contents:

master
slave1
slave2

Edit the spark-env.sh file:

[root@master conf]# cp spark-env.sh.template spark-env.sh
[root@master conf]# vim spark-env.sh
[root@master conf]# 

Add the following:

export JAVA_HOME=/opt/module/jdk
export HADOOP_HOME=/opt/module/hadoop
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop/bin/hadoop classpath)
export SPARK_YARN_USER_ENV="CLASSPATH=/opt/module/hadoop/etc/hadoop"
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop

Copy Spark to the other two nodes:

[root@master conf]# scp -r /opt/module/spark/ slave1:/opt/module/
[root@master conf]# scp -r /opt/module/spark/ slave2:/opt/module/

Start the Spark cluster and check the processes:

[root@master spark]# /opt/module/spark/sbin/start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
master: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-master.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave1.out
slave2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
[root@master spark]# jps
7472 DataNode
7297 NameNode
7682 SecondaryNameNode
7939 ResourceManager
11715 Master
11909 Jps
8091 NodeManager
11854 Worker
[root@master spark]# 

Run the SparkPi example to test Spark:

[root@master spark]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.12-3.1.3.jar

Result:

2022-11-14 22:24:35,002 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
2022-11-14 22:24:35,010 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 2.131 s
2022-11-14 22:24:35,013 INFO scheduler.DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
2022-11-14 22:24:35,013 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
2022-11-14 22:24:35,014 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 2.189453 s
Pi is roughly 3.141475707378537
2022-11-14 22:24:35,046 INFO server.AbstractConnector: Stopped Spark@4fbb001b{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2022-11-14 22:24:35,048 INFO ui.SparkUI: Stopped Spark web UI at http://master:4040
2022-11-14 22:24:35,050 INFO cluster.StandaloneSchedulerBackend: Shutting down all executors
2022-11-14 22:24:35,050 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
2022-11-14 22:24:35,122 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[root@master spark]# 
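
Because spark-env.sh above also points Spark at the Hadoop configuration, the same example can be submitted to YARN instead of the standalone master; this is an optional variant and assumes the YARN cluster started earlier is still running:

[root@master spark]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client examples/jars/spark-examples_2.12-3.1.3.jar 10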

5. Flink component deployment

Extract Flink, rename the directory, and configure the environment variables:

[root@master spark]# tar -zxvf /opt/softwares/flink-1.14.6-bin-scala_2.12.tgz -C /opt/module/
[root@master spark]# mv /opt/module/flink-1.14.6/ /opt/module/flink/
[root@master spark]# vi /etc/profile

Content to add:

# FLINK_HOME
export FLINK_HOME=/opt/module/flink
export PATH="$FLINK_HOME/bin:$PATH"

Run source /etc/profile:

[root@master spark]# source /etc/profile
[root@master spark]# 

Edit the flink-conf.yaml configuration file:

[root@master conf]# vim /opt/module/flink/conf/flink-conf.yaml 

Change:

jobmanager.rpc.address: master

Edit the workers file:

[root@master conf]# vim /opt/module/flink/conf/workers

File contents:

master
slave1
slave2

Distribute Flink to slave1 and slave2:

[root@master conf]# scp -r /opt/module/flink/ slave1:/opt/module/
[root@master conf]# scp -r /opt/module/flink/ slave2:/opt/module/

Start the Flink cluster and check the processes:

[root@master flink]# ./bin/start-cluster.sh 
Starting cluster.
Starting standalonesession daemon on host master.
Starting taskexecutor daemon on host master.
Starting taskexecutor daemon on host slave1.
Starting taskexecutor daemon on host slave2.
[root@master flink]# jps
7472 DataNode
7297 NameNode
13617 TaskManagerRunner
7682 SecondaryNameNode
7939 ResourceManager
11715 Master
13732 Jps
8091 NodeManager
11854 Worker
[root@master flink]#

Submit the example jar to test Flink:

[root@master batch]# flink run -m master:8081 /opt/module/flink/examples/batch/WordCount.jar 

Note: stop the Spark standalone cluster before running Flink.

Test output:

(under,1)
(undiscover,1)
(unworthy,1)
(us,3)
(we,4)
(weary,1)
(what,1)
(when,2)
(whether,1)
(whips,1)
(who,2)
(whose,1)
(will,1)
(wish,1)
(with,3)
(would,2)
(wrong,1)
(you,1)
[root@master batch]# 

Flink web UI (http://192.168.1.100:8081/#/overview):
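
Besides the web UI, the JobManager can also be queried from the command line; an optional check that lists running and completed jobs:

[root@master flink]# flink list -m master:8081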

6. Flume component deployment

Extract Flume, rename the directory, and configure the environment variables:

[root@master module]# tar -zxvf /opt/softwares/apache-flume-1.9.0-bin.tar.gz -C /opt/module/
[root@master module]# mv /opt/module/apache-flume-1.9.0-bin/ /opt/module/flume
[root@master batch]# vim /etc/profile

Environment variable content:

# FLUME_HOME
export FLUME_HOME=/opt/module/flume
export PATH="$FLUME_HOME/bin:$PATH"

source /etc/profile:

[root@master batch]# source /etc/profile
[root@master batch]# 

Delete the jar in Flume that conflicts with Hadoop:

cd /opt/module/flume/lib
[root@master lib]# rm -rf /opt/module/flume/lib/guava-11.0.2.jar

Edit the log4j configuration file:

[root@master lib]# vim /opt/module/flume/conf/log4j.properties 

Change:

flume.log.dir=/opt/module/flume/logs
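
Flume has no long-running daemon to start at this point; as an optional check that the installation and the FLUME_HOME environment variable are correct:

[root@master lib]# flume-ng version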

7. ZooKeeper component deployment

Extract ZooKeeper, rename the directory, and configure the environment variables:

[root@master lib]# tar -zxvf /opt/softwares/apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
[root@master lib]# mv /opt/module/apache-zookeeper-3.5.7-bin/ /opt/module/zookeeper
[root@master lib]# vim /etc/profile

Environment variable content:

# ZOOKEEPER_HOME
export ZOOKEEPER_HOME=/opt/module/zookeeper
export PATH="$ZOOKEEPER_HOME/bin:$PATH"

Run source /etc/profile to apply the changes:

[root@master lib]# source /etc/profile
[root@master lib]#

Configure the server ID:

Create zkData under the zookeeper directory:

[root@master zookeeper]# mkdir zkData
[root@master zookeeper]# cd zkData/
[root@master zkData]# vim myid

Contents of myid:

2

Configure the zoo.cfg file

Rename zoo_sample.cfg to zoo.cfg and edit it:

cd ../conf
[root@master conf]# mv zoo_sample.cfg zoo.cfg
[root@master conf]# vim zoo.cfg 

Changes:

dataDir=/opt/module/zookeeper/zkData

# ZooKeeper server addresses and leader-election ports
server.2=master:2888:3888
server.3=slave1:2888:3888
server.4=slave2:2888:3888

Copy ZooKeeper to slave1 and slave2:

[root@master conf]# scp -r /opt/module/zookeeper/ slave1:/opt/module/
[root@master conf]# scp -r /opt/module/zookeeper/ slave2:/opt/module/

Change the myid values on slave1 and slave2:

[root@slave1 module]# vim /opt/module/zookeeper/zkData/myid 
3
[root@slave2 hadoop]# vim /opt/module/zookeeper/zkData/myid 
4

Start the ZooKeeper cluster:

[root@slave2 hadoop]# /opt/module/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@slave2 hadoop]# 
[root@slave1 module]# /opt/module/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@slave1 module]# 
[root@master conf]# /opt/module/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@master conf]# 

Check the ZooKeeper status on each node to verify:

[root@master conf]# /opt/module/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
[root@master conf]# 
[root@slave1 module]# /opt/module/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
[root@slave1 module]# 
[root@slave2 hadoop]# /opt/module/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: leader
[root@slave2 hadoop]# 
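
As a further optional check, the ZooKeeper CLI can query the ensemble directly; passing a command after the server address should run it and exit:

[root@master conf]# /opt/module/zookeeper/bin/zkCli.sh -server master:2181 ls /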

8. Kafka component deployment

Extract and rename Kafka, and configure the environment variables:

[root@master conf]# tar -zxvf /opt/softwares/kafka_2.12-3.0.0.tgz -C /opt/module/
[root@master conf]# mv /opt/module/kafka_2.12-3.0.0/ /opt/module/kafka
[root@master conf]# vim /etc/profile

Content to add:

# KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH="$KAFKA_HOME/bin:$PATH"

Edit Kafka's server.properties configuration file:

[root@master config]# vim /opt/module/kafka/config/server.properties 

Changes:

# unique per broker
broker.id=0
# created automatically when Kafka starts
log.dirs=/opt/module/kafka/logs
# ZooKeeper connection string
zookeeper.connect=master:2181,slave1:2181,slave2:2181/kafka

Copy Kafka to slave1 and slave2:

[root@master config]# scp -r /opt/module/kafka/ slave1:/opt/module/
[root@master config]# scp -r /opt/module/kafka/ slave2:/opt/module/

Change the broker.id on slave1 and slave2:

[root@slave1 module]# vim /opt/module/kafka/config/server.properties
broker.id=1
[root@slave2 hadoop]# vim /opt/module/kafka/config/server.properties 
broker.id=2

Start Kafka on all three nodes:

[root@master config]# /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties 
[root@slave1 module]# /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties 
[root@slave2 hadoop]# /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties 


Check the Kafka processes:

[root@master config]# jps
7472 DataNode
7297 NameNode
7682 SecondaryNameNode
7939 ResourceManager
15123 CliFrontend
18035 QuorumPeerMain
16388 StandaloneSessionClusterEntrypoint
18725 Kafka
16694 TaskManagerRunner
14170 CliFrontend
8091 NodeManager
18813 Jps
[root@master config]# 
[root@slave1 module]# jps
10032 Jps
9383 QuorumPeerMain
8841 TaskManagerRunner
5194 NodeManager
9963 Kafka
5087 DataNode
[root@slave1 module]# 
[root@slave2 hadoop]# jps
9202 QuorumPeerMain
9874 Jps
5065 NodeManager
8681 TaskManagerRunner
9804 Kafka
4958 DataNode
[root@slave2 hadoop]# 
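
An optional end-to-end check of the brokers: create a throwaway topic, produce one message, and consume it back (the topic name test is an arbitrary choice):

kafka-topics.sh --bootstrap-server master:9092 --create --topic test --partitions 3 --replication-factor 3
echo "hello kafka" | kafka-console-producer.sh --bootstrap-server master:9092 --topic test
kafka-console-consumer.sh --bootstrap-server master:9092 --topic test --from-beginning --max-messages 1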

9. Maxwell component deployment

Extract and rename Maxwell, and configure the environment variables:

[root@master softwares]# tar -zxvf /opt/softwares/maxwell-1.29.2.tar.gz -C /opt/module/
[root@master softwares]# mv /opt/module/maxwell-1.29.2/ /opt/module/maxwell/
[root@master softwares]# vim /etc/profile

Content to add:

# MAXWELL_HOME
export MAXWELL_HOME=/opt/module/maxwell
export PATH="$MAXWELL_HOME/bin:$PATH"

Modify the MySQL configuration:

Enable the MySQL binlog; Maxwell's data synchronization requires the binlog to be turned on first.

Edit the MySQL configuration file /etc/my.cnf:

[root@master ~]# vim /etc/my.cnf

Content to add:

# server id
server-id = 1
# enable the binlog; this value is used as the binlog file name prefix
log-bin=mysql-bin
# binlog format; Maxwell requires row format
binlog_format=row
# database to capture in the binlog; adjust to your setup
binlog-do-db=ds_pub

Create the ds_pub database in MySQL:

[root@master ~]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 30
Server version: 5.7.16 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore          |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

mysql> create database ds_pub;
Query OK, 1 row affected (0.00 sec)

mysql> exit;
Bye

Restart the MySQL service:

[root@master ~]# systemctl restart mysqld
[root@master ~]# 
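
After the restart it is worth confirming that the binlog settings took effect (an optional check):

mysql -uroot -p123456 -e "show variables like 'log_bin'; show variables like 'binlog_format'; show master status;"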

Create the database and user that Maxwell needs in MySQL:

[root@master ~]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.16-log MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE maxwell;
Query OK, 1 row affected (0.00 sec)

mysql> set global validate_password_policy=0;
Query OK, 0 rows affected (0.00 sec)

mysql> set global validate_password_length=4;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> exit;
Bye
[root@master ~]# 


10. Redis component deployment

Extract and rename Redis:

[root@master softwares]# tar -zxvf redis-6.2.6.tar.gz -C /opt/module/
[root@master softwares]# mv /opt/module/redis-6.2.6/ /opt/module/redis
[root@master softwares]# cd /opt/module/
[root@master module]# ls
flink  flume  hadoop  hive  jdk  kafka  maxwell  redis  spark  zookeeper
[root@master module]# 

(1) Check whether gcc is installed: which gcc (nothing found).

(2) Install gcc and the build tools: yum -y install gcc automake autoconf libtool make

Check the gcc compiler version:

[root@master redis]# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@master redis]# 

Enter the redis directory and run make:

[root@master redis]# cd /opt/module/redis/
[root@master redis]# make
[root@master redis]# make install

Copy redis.conf to the /root directory:

[root@master redis]# cp redis.conf /root/
[root@master redis]#

Configure Redis to run in the background:

[root@master ~]# vim redis.conf 

Changes:

bind 192.168.1.100 -::1

# run as a background daemon
daemonize yes

Rename redis.conf:

[root@master ~]# mv redis.conf my_redis.conf

Start Redis in the background:

[root@master ~]# redis-server /root/my_redis.conf 
[root@master ~]# 

Connect to Redis with the client:

[root@master ~]# redis-cli -h 192.168.1.100 -p 6379
192.168.1.100:6379> 
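
A minimal read/write test through the client (optional; demo_key is an arbitrary key name):

[root@master ~]# redis-cli -h 192.168.1.100 -p 6379 ping            # expect PONG
[root@master ~]# redis-cli -h 192.168.1.100 -p 6379 set demo_key hello
[root@master ~]# redis-cli -h 192.168.1.100 -p 6379 get demo_key    # expect "hello"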

11. HBase component deployment

Extract and rename HBase, and configure the environment:

[root@master hbase]# tar -zxvf hbase-2.3.3-bin.tar.gz -C /opt/module/
[root@master hbase]# mv hbase-2.3.3/ hbase
[root@master module]# ls
clickhouse  flink  flume  hadoop  hbase  hive  jdk  kafka  maxwell  Mysql  redis  spark  zookeeper

Configure the environment variables in /etc/profile:

[root@master module]# vi /etc/profile
# HBASE_HOME
export HBASE_HOME=/opt/module/hbase
export PATH="$HBASE_HOME/bin:$PATH"
[root@master module]# source /etc/profile

Edit the hbase-env.sh configuration file:

[root@master conf]# vi hbase-env.sh

export JAVA_HOME=/opt/module/jdk
export HBASE_MANAGES_ZK=false

Edit the hbase-site.xml file:

<configuration>
  <!--
    The following properties are set for running HBase as a single process on a
    developer workstation. With this configuration, HBase is running in
    "stand-alone" mode and without a distributed file system. In this mode, and
    without further configuration, HBase and ZooKeeper data are stored on the
    local filesystem, in a path under the value configured for `hbase.tmp.dir`.
    This value is overridden from its default value of `/tmp` because many
    systems clean `/tmp` on a regular basis. Instead, it points to a path within
    this HBase installation directory.

    Running against the `LocalFileSystem`, as opposed to a distributed
    filesystem, runs the risk of data integrity issues and data loss. Normally
    HBase will refuse to run in such an environment. Setting
    `hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
    permitting operation. This configuration is for the developer workstation
    only and __should not be used in production!__

    See also https://hbase.apache.org/book.html#standalone_dist
  -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>./tmp</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/base</value>
  </property>
 <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master,slave1,slave2</value>
  </property>
 <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
 <property>
    <name>hbase.master.maxclockskew</name>
    <value>30000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/module/zookeeper</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>
</configuration>

Edit the regionservers file:

[root@master conf]# vi regionservers

master
slave1
slave2

Start HBase:

[root@master hbase]# ./bin/start-hbase.sh

The node on which you run the start script becomes the HBase master. Use jps to check the processes:

[root@master hbase]# jps
2465 ResourceManager
1844 NameNode
5652 HRegionServer
6308 Jps
2792 NodeManager
3144 Worker
3001 Master
2219 SecondaryNameNode
3211 QuorumPeerMain
5435 HMaster
2015 DataNode
3583 Kafka
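
As an optional check, a couple of commands can be piped into the HBase shell non-interactively; status and list are built-in shell commands:

[root@master hbase]# echo -e "status\nlist" | /opt/module/hbase/bin/hbase shell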

12. ClickHouse component deployment

Install with rpm:

[root@master clickhouse]# rpm -ivh clickhouse-common-static-dbg-21.9.4.35-2.x86_64.rpm
警告:clickhouse-common-static-dbg-21.9.4.35-2.x86_64.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中...                          ################################# [100%]
正在升级/安装...
   1:clickhouse-common-static-dbg-21.9################################# [100%]
[root@master clickhouse]# rpm -ivh clickhouse-common-static-21.9.4.35-2.x86_64.rpm
警告:clickhouse-common-static-21.9.4.35-2.x86_64.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中...                          ################################# [100%]
正在升级/安装...
   1:clickhouse-common-static-21.9.4.3################################# [100%]
[root@master clickhouse]# rpm -ivh clickhouse-server-21.9.4.35-2.noarch.rpm
警告:clickhouse-server-21.9.4.35-2.noarch.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中...                          ################################# [100%]
正在升级/安装...
   1:clickhouse-server-21.9.4.35-2    ################################# [100%]
ClickHouse binary is already located at /usr/bin/clickhouse
Symlink /usr/bin/clickhouse-server already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-server to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-client to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-local to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-benchmark to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-copier already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-copier to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-obfuscator to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-git-import to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-compressor to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-format to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-extract-from-config already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-extract-from-config to /usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
 groupadd -r clickhouse
Creating clickhouse user if it does not exist.
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in /etc/security/limits.d/clickhouse.conf.
Creating config directory /etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory /etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Config file /etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
/etc/clickhouse-server/config.xml has /var/lib/clickhouse/ as data path.
/etc/clickhouse-server/config.xml has /var/log/clickhouse-server/ as log path.
Users config file /etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
 chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
Creating log directory /var/log/clickhouse-server/.
Creating data directory /var/lib/clickhouse/.
Creating pid directory /var/run/clickhouse-server.
 chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
 chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
 chown clickhouse:clickhouse '/var/lib/clickhouse/'
 groupadd -r clickhouse-bridge
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse-bridge clickhouse-bridge
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Enter password for default user: 
Password for default user is saved in file /etc/clickhouse-server/users.d/default-password.xml.
Setting capabilities for clickhouse binary. This is optional.

ClickHouse has been successfully installed.

Start clickhouse-server with:
 sudo clickhouse start

Start clickhouse-client with:
 clickhouse-client --password

Created symlink from /etc/systemd/system/multi-user.target.wants/clickhouse-server.service to /etc/systemd/system/clickhouse-server.service.
[root@master clickhouse]# ls
clickhouse-client-21.9.4.35-2.noarch.rpm         clickhouse-common-static-dbg-21.9.4.35-2.x86_64.rpm
clickhouse-common-static-21.9.4.35-2.x86_64.rpm  clickhouse-server-21.9.4.35-2.noarch.rpm
[root@master clickhouse]# rpm -ivh clickhouse-client-21.9.4.35-2.noarch.rpm
警告:clickhouse-client-21.9.4.35-2.noarch.rpm: 头V4 RSA/SHA1 Signature, 密钥 ID e0c56bd4: NOKEY
准备中...                          ################################# [100%]
正在升级/安装...
   1:clickhouse-client-21.9.4.35-2    ################################# [100%]
[root@master clickhouse]# 

Edit the configuration file:

[root@master module]# vi /etc/clickhouse-server/config.xml
<log>/data/clickhouse/log/clickhouse-server/clickhouse-server.log</log>

<errorlog>/data/clickhouse/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<!-- Path to data directory, with trailing slash. -->
<path>/data/clickhouse/</path>

<!-- Path to temporary data for processing hard queries. -->
<tmp_path>/data/clickhouse/tmp/</tmp_path>

<!-- Directory with user provided files that are accessible by 'file' table function. -->
<user_files_path>/data/clickhouse/user_files/</user_files_path>

Start the ClickHouse server in the foreground:

clickhouse-server --config-file=/etc/clickhouse-server/config.xml

Check the processes after startup: ps -aux | grep click

Start it in the background:

nohup clickhouse-server --config-file=/etc/clickhouse-server/config.xml >/dev/null 2>&1 &

Connect to the server with the client:

clickhouse-client
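
A quick connectivity check (optional); --password prompts for the default user's password that was entered during installation:

clickhouse-client --password --query "select version()"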

II. Data Extraction

2.1 Real-time data extraction

2.1.1 Maxwell data extraction

Edit the Maxwell configuration file config.properties:

[root@master ~]# cd /opt/module/maxwell/
[root@master maxwell]# ls
bin                                          LICENSE
config.md                                    log4j2.xml
config.properties.example                    quickstart.md
kinesis-producer-library.properties.example  README.md
lib
[root@master maxwell]# cp config.properties.example config.properties
[root@master maxwell]# vim config.properties

Changes:

log_level=info

producer=kafka
kafka.bootstrap.servers=master:9092,slave1:9092
kafka_topic=maxwell

# mysql login info
host=master
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai

Start Maxwell:

[root@master maxwell]# /opt/module/maxwell/bin/maxwell --config /opt/module/maxwell/config.properties --daemon
Redirecting STDOUT to /opt/module/maxwell/bin/../logs/MaxwellDaemon.out
Using kafka version: 1.0.0
[root@master maxwell]# 

Create the topic in Kafka:

[root@master maxwell]# kafka-topics.sh --bootstrap-server master:9092 --partitions 1 --replication-factor 1 --create --topic maxwell
Created topic maxwell.
[root@master maxwell]# 

Start a Kafka console consumer to consume the data:

[root@master conf]# kafka-console-consumer.sh --bootstrap-server master:9092 --topic maxwell

In MySQL, run the source command to load data into the database:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| ds_pub             |
| maxwell            |
| metastore          |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
7 rows in set (0.00 sec)

mysql> use ds_pub;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> source /opt/data/ds_pub.sql;

Result (the Kafka console consumer receives the data imported into MySQL):

{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":15,"data":{"id":17,"name":"辽宁","region_id":"3","area_code":"210000","iso_code":"CN-21"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":16,"data":{"id":18,"name":"陕西","region_id":"7","area_code":"610000","iso_code":"CN-61"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":17,"data":{"id":19,"name":"甘肃","region_id":"7","area_code":"620000","iso_code":"CN-62"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":18,"data":{"id":20,"name":"青海","region_id":"7","area_code":"630000","iso_code":"CN-63"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":19,"data":{"id":21,"name":"宁夏","region_id":"7","area_code":"640000","iso_code":"CN-64"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":20,"data":{"id":22,"name":"新疆","region_id":"7","area_code":"650000","iso_code":"CN-65"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":21,"data":{"id":23,"name":"河南","region_id":"4","area_code":"410000","iso_code":"CN-41"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":22,"data":{"id":24,"name":"湖北","region_id":"4","area_code":"420000","iso_code":"CN-42"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":23,"data":{"id":25,"name":"湖南","region_id":"4","area_code":"430000","iso_code":"CN-43"}}
{"database":"ds_pub","table":"base_province","type":"insert","ts":1668443406,"xid":1326,"xoffset":24,"data":{"id":26,"name":"广东","region_id":"5","area_code":"440000","iso_code":"CN-44"}}

2.1.2 Port Log Data Extraction

Write a Flume configuration file that defines the port to collect from.

An earlier configuration file, /opt/module/flume/conf/read_socket_write_kafka.conf, ran into problems; the file actually used below is socket_to_kafka.conf:

[root@master ~]# cd /opt/flume/conf
[root@master conf]# ls
flume-conf.properties.template  flume-env.sh.template  read_socket_write_kafka.conf
flume-env.ps1.template          log4j.properties       socket_to_kafka.conf
[root@master conf]# vim socket_to_kafka.conf 

File contents:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = nc master 26001

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.kafka.topic = order
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.acks = 1
# Use a channel that buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
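
For reference only: Flume also ships a netcat source that listens on a TCP port itself. The sketch below is a hedged alternative, not what this walkthrough uses; here the generator side runs nc -lk as the listener and the exec source connects to it, so switching would require the generator to connect to Flume as a client instead.

# Alternative sketch (not used above): let Flume listen on the port directly
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 26001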

Start Flume to collect the port log (run this in a second master terminal):

[root@master conf]# flume-ng agent --name a1 --conf /opt/module/flume/conf/ --conf-file /opt/module/flume/conf/read_socket_write_kafka.conf -Dflume.root.logger=INFO,console

With the working configuration file, the equivalent command is:

flume-ng agent --name a1 --conf /opt/flume/conf/ --conf-file /opt/flume/conf/socket_to_kafka.conf -Dflume.root.logger=INFO,console

Create the topic in Kafka (this was already created earlier):

[root@master ~]# kafka-topics.sh --bootstrap-server master:9092 --partitions 1 --replication-factor 1 --create -topic order
Created topic order.
[root@master ~]# 

Start the Kafka console consumer to consume data from the command line (in a third master terminal; the data should appear there directly):

[root@master ~]# kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --from-beginning --topic order

Start the log-port script to generate data on the port in real time (in the first master terminal):

[root@master ~]# cd /opt/data/
[root@master data]# ls
ds_pub.sql  order.sh  order.txt
[root@master data]# ll -a
total 304
drwxr-xr-x  2 root root     57 Nov 15 00:53 .
drwxr-xr-x. 5 root root     49 Nov 15 00:52 ..
-rw-r--r--  1 root root 187698 Nov 15 00:53 ds_pub.sql
-rw-r--r--  1 root root     72 Nov 15 00:52 order.sh
-rw-r--r--  1 root root 115027 Nov 15 00:53 order.txt
[root@master data]# chmod -R 777 /opt/data/
[root@master data]# sh order.sh | nc -lk 26001    

Result:

nc master 26001

D,"8815","3632","14","Dior迪奥口红唇膏送女友老婆礼物生日礼物 烈艳蓝金999+888两支装礼盒","http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh","496","2","2020-4-26 18:55:16","2402","56"
I,
D,"8816","3633","14","Dior迪奥口红唇膏送女友老婆礼物生日礼物 烈艳蓝金999+888两支装礼盒","http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh","496","1","2020-4-26 18:55:16","2401",
I,
D,"8817","3634","12","联想(Lenovo)拯救者Y7000 英特尔酷睿i7 2019新款 15.6英寸发烧游戏本笔记本电脑(i7-9750H 8GB 512GB SSD GTX1650 4G 高色域","http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh","6699","2","2020-4-26 18:55:16","2401",
I,
D,"8818","3635","4","小米Play 流光渐变AI双摄 4GB+64GB 梦幻蓝 全网通4G 双卡双待 小水滴全面屏拍照游戏智能手机","http://SXlkutIjYpDWWTEpNUiisnlsevOHVElrdngQLgyZ","1442","1","2020-4-26 18:55:16","2402","33"
I,
D,"8819","3636","3","小米(MI)电视 55英寸曲面4K智能WiFi网络液晶电视机4S L55M5-AQ 小米电视4S 55英寸 曲面","http://LzpblavcZQeYEbwbSjsnmsgAjtpudhDradqsRgdZ","3100","3","2020-4-26 18:55:16","2401",
I,
D,"8820","3637","15","迪奥(Dior)烈艳蓝金唇膏 口红 3.5g 999号 哑光-经典正红","http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh","252","2","2020-4-26 18:55:16","2401",


2. Offline Data Extraction

Tutorial references:

Hive (episodes 19-86): 尚硅谷大数据Hive教程(基于hive3.x丨hive3.1.2), Bilibili

Spark SQL (episodes 153-184): 尚硅谷大数据Spark教程从入门到精通, Bilibili

Data is extracted from MySQL into Hive with Spark; the Spark code is developed in IDEA.

Create a Maven project in IDEA:

Depending on your setup, configure the Aliyun mirror for Maven to speed up dependency downloads:

Open Maven's settings.xml from the IDEA project:

Add the following mirror entry inside settings.xml:

     <mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>central</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>

If there is no settings.xml file, create one and write the following configuration into it:

  <mirrors>
    <!-- mirror
     | Specifies a repository mirror site to use instead of a given repository. The repository that
     | this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
     | for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
     |
    <mirror>
      <id>mirrorId</id>
      <mirrorOf>repositoryId</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://my.repository.com/repo/path</url>
    </mirror>
     -->
     <mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>central</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>

  </mirrors>

After the project is created, configure its pom file:

    <properties>
        <scala.version>2.12</scala.version>
        <mysqlconnect.version>5.1.47</mysqlconnect.version>
        <spark.version>3.1.1</spark.version>
        <hive.version>2.3.6</hive.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysqlconnect.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.7</version>
        </dependency>

    </dependencies>

Copy Hive's hive-site.xml into the project's resources directory:

If Hadoop has not been set up for Windows yet, configure the Windows-side Hadoop environment first.

Reference: “Windows配置本地Hadoop运行环境” (走看看)

Once that is configured, start writing the Spark code:

Under the scala directory, create the package edu.jl.ods, and create hello.scala inside edu.jl.ods.

Select the Scala version for the file; if Scala is not available, install the Scala plugin from IDEA's Plugins settings:

Write the following into hello.scala to test whether the Scala environment runs correctly:

package edu.jl.ods

object hello {
  def main(args: Array[String]): Unit = {
    println("hello")
  }
}

Run it and confirm that "hello" is printed.

If the Scala version is wrong, it can be re-selected in the Project Structure settings.

Have Spark read the order_info table from the MySQL database ds_pub.

Create SparkReadMySQL in edu.jl.ods:

Write the following code in the file:

package edu.jl.ods

import org.apache.spark.sql.SparkSession

import java.util.Properties

object SparkReadMySQL {

  def main(args: Array[String]): Unit = {

    // Build the SparkSession
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkReadMySQL")
      .getOrCreate()

    // MySQL connection settings
    val MYSQLDBURL:String = "jdbc:mysql://192.168.1.100:3306/ds_pub?useUnicode=true&characterEncoding=utf-8&useSSL=false"

    val properties = new Properties()
    properties.setProperty("user","root")
    properties.setProperty("password","123456")
    properties.setProperty("driver","com.mysql.jdbc.Driver")

    sparkSession.read.jdbc(MYSQLDBURL,"order_info",properties).createTempView("order_info")

    val selectId = "select id,consignee from order_info"

    sparkSession.sql(selectId).show()

  }

}
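
As an optional variation (a sketch under the same connection settings, not part of the original walkthrough), the projection can be pushed down to MySQL by passing a subquery as the JDBC "dbtable", so only the two needed columns are transferred. The snippet would go inside the same main, after the SparkSession is built:

    // Hypothetical variation: let MySQL do the column pruning via a subquery dbtable
    val pushed = sparkSession.read
      .format("jdbc")
      .option("url", MYSQLDBURL)
      .option("dbtable", "(select id, consignee from order_info) t")
      .option("user", "root")
      .option("password", "123456")
      .option("driver", "com.mysql.jdbc.Driver")
      .load()
    pushed.show()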

Spark reads from MySQL and writes to Hive:

Create SparkReadMySQLToHive.scala in edu.jl.ods:

package edu.jl.ods

import org.apache.spark.sql.SparkSession

import java.util.Properties

object SparkReadMySQLtoHive {

  def main(args: Array[String]): Unit = {

    // Set the Hadoop user name to root
    System.setProperty("HADOOP_USER_NAME","root")

    // Build the SparkSession
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "hdfs://192.168.1.100:9000/user/hive/warehouse/")
      .appName("SparkReadMySQL")
      .enableHiveSupport()
      .getOrCreate()

    // MySQL connection settings
    val MYSQLDBURL: String = "jdbc:mysql://192.168.1.100:3306/ds_pub?useUnicode=true&characterEncoding=utf-8&useSSL=false"

    // MySQL connection properties
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    properties.setProperty("driver", "com.mysql.jdbc.Driver")

    // Read MySQL's order_info table over JDBC and register it as a temporary view named order_info
    sparkSession.read.jdbc(MYSQLDBURL, "order_info", properties).createTempView("order_info")

    // Query data from the order_info view
    val selectId = "select * from order_info"

    // Create the order_info table in Hive's ods database
    val create_order_info: String =
      """create table ods.order_info(
        | id                  int,
        | consignee           string,
        | consignee_tel       string,
        | final_total_amount  string,
        | order_status        string,
        | user_id             string,
        | delivery_address    string,
        | order_comment       string,
        | out_trade_no        string,
        | trade_body          string,
        | create_time         date,
        | operate_time        date,
        | expire_time         date,
        | tracking_no         String,
        | parent_order_id     int,
        | img_url             string,
        | province_id         int,
        | benefit_reduce_amount string,
        | original_total_amount string,
        | feight_fee           string
        |) row format delimited fields terminated by ',' stored as textfile;
        |""".stripMargin

    // Load the data queried from the order_info view into Hive's ods.order_info table
    val loadDataToHive: String =
      """
        |insert overwrite ods.order_info
        |select * from order_info
        |""".stripMargin

    // Query ods.order_info in Hive (to verify the data was written)
    val ods_order_info = "select * from ods.order_info"

    // Execute the SQL statements above on the SparkSession
//    sparkSession.sql(create_order_info).show()
//    sparkSession.sql("desc ods.order_info").show()

    sparkSession.sql(loadDataToHive)

    sparkSession.sql("select * from ods.order_info limit 5").show()

//    sparkSession.sql(selectId).show()

    // Close the SparkSession after the SQL has finished
    sparkSession.close()

  }

}
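
The same load can also be expressed with the DataFrame writer instead of SQL. A minimal sketch, assuming the ods database already exists and using a separate, purely illustrative table name:

    // Sketch only: write the JDBC DataFrame straight into Hive with the DataFrame API
    sparkSession.read.jdbc(MYSQLDBURL, "order_info", properties)
      .write
      .mode("overwrite")
      .saveAsTable("ods.order_info_df") // hypothetical table name for illustration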

Write into a partitioned Hive table, with the partition being yesterday's date (the code below hardcodes 20221114; a sketch after the code shows how to derive the date dynamically):

Create SparkReadMySQLToHivePartitionTable.scala in edu.jl.ods:

package edu.jl.ods

import org.apache.spark.sql.SparkSession

import java.util.Properties

object SparkReadMySQLtoHivePartitionTable {

  def main(args: Array[String]): Unit = {

    // Set the Hadoop user name to root
    System.setProperty("HADOOP_USER_NAME","root")

    // Build the SparkSession
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "hdfs://192.168.1.100:9000/user/hive/warehouse/")
      .appName("SparkReadMySQL")
      .enableHiveSupport()
      .getOrCreate()

    // MySQL connection settings
    val MYSQLDBURL: String = "jdbc:mysql://192.168.1.100:3306/ds_pub?useUnicode=true&characterEncoding=utf-8&useSSL=false"

    // MySQL connection properties
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    properties.setProperty("driver", "com.mysql.jdbc.Driver")

    // Read MySQL's order_info table over JDBC and register it as a temporary view named order_info
    sparkSession.read.jdbc(MYSQLDBURL, "order_info", properties).createTempView("order_info")

    // Query data from the order_info view
    val selectId = "select * from order_info"

    // Create the order_info_par table in Hive's ods database
    val create_order_info: String =
      """create table ods.order_info_par(
        | id                  int,
        | consignee           string,
        | consignee_tel       string,
        | final_total_amount  string,
        | order_status        string,
        | user_id             string,
        | delivery_address    string,
        | order_comment       string,
        | out_trade_no        string,
        | trade_body          string,
        | create_time         date,
        | operate_time        date,
        | expire_time         date,
        | tracking_no         String,
        | parent_order_id     int,
        | img_url             string,
        | province_id         int,
        | benefit_reduce_amount string,
        | original_total_amount string,
        | feight_fee           string
        |) partitioned by (etl_date string) row format delimited fields terminated by ',' stored as textfile;
        |""".stripMargin

    // Load the data queried from the order_info view into Hive's ods.order_info_par table
    val loadDataToHive: String =
      """
        |insert overwrite ods.order_info_par partition(etl_date="20221114")
        |select * from order_info
        |""".stripMargin

    // Query ods.order_info_par in Hive (to verify the data was written)
    val ods_order_info = "select * from ods.order_info_par limit 5"

    // Execute the SQL statements above on the SparkSession
//    sparkSession.sql(create_order_info).show()
//    sparkSession.sql("desc ods.order_info").show()

    sparkSession.sql(loadDataToHive)

    sparkSession.sql(ods_order_info).show()

//    sparkSession.sql(selectId).show()

    // Close the SparkSession after the SQL has finished
    sparkSession.close()

  }

}
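
The partition value above is hardcoded to 20221114. A minimal sketch for deriving yesterday's date at run time, assuming the same yyyyMMdd format, would replace the loadDataToHive definition:

    // Sketch: compute yesterday's date for the etl_date partition instead of hardcoding it
    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    val etlDate: String = LocalDate.now().minusDays(1).format(DateTimeFormatter.ofPattern("yyyyMMdd"))
    val loadDataToHive: String =
      s"""
         |insert overwrite ods.order_info_par partition(etl_date="$etlDate")
         |select * from order_info
         |""".stripMargin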

Result: query the partitions of the partitioned table:

hive (ods)> show partitions order_info_par;
OK
partition
etl_date=20221114
Time taken: 0.048 seconds, Fetched: 1 row(s)
hive (ods)> 

View the first five rows of the partitioned table:

hive (ods)> select * from order_info_par limit 5;
OK
order_info_par.id   order_info_par.consignee    order_info_par.consignee_tel    order_info_par.final_total_amount   order_info_par.order_status order_info_par.user_id  order_info_par.delivery_address order_info_par.order_comment    order_info_par.out_trade_no order_info_par.trade_body   order_info_par.create_time  order_info_par.operate_time order_info_par.expire_time  order_info_par.tracking_no  order_info_par.parent_order_id  order_info_par.img_url  order_info_par.province_id  order_info_par.benefit_reduce_amount    order_info_par.original_total_amount    order_info_par.feight_fee   order_info_par.etl_date
3443    严致  13207871570 1449.00 1005    2790    第4大街第5号楼4单元464门 描述345855    214537477223728 小米Play 流光渐变AI双摄 4GB+64GB 梦幻蓝 全网通4G 双卡双待 小水滴全面屏拍照游戏智能手机等1件商品 2020-04-25  2020-04-26  2020-04-25  NULL    NULL    http://img.gmall.com/117814.jpg 20  0.00    1442.00 7.00    20221114
3444    慕容亨 13028730359 17805.00    1005    2015    第9大街第26号楼3单元383门    描述948496    226551358533723 Apple iPhoneXSMax (A2104) 256GB 深空灰色 移动联通电信4G手机 双卡双待等2件商品   2020-04-25  2020-04-26  2020-04-25  NULL    NULL    http://img.gmall.com/353392.jpg 11  0.00    17800.00    5.00    20221114
3445    姚兰凤 13080315675 16180.00    1005    8263    第5大街第1号楼7单元722门 描述148518    754426449478474 联想(Lenovo)拯救者Y7000 英特尔酷睿i7 2019新款 15.6英寸发烧游戏本笔记本电脑(i7-9750H 8GB 512GB SSD GTX1650 4G 高色域等3件商品   2020-04-25  2020-04-26  2020-04-25  NULLNULL    http://img.gmall.com/478856.jpg 26  3935.00 20097.00    18.00   20221114
3446    柏锦黛 13487267342 4922.00 1005    7031    第17大街第40号楼2单元564门   描述779464    262955273144195 十月稻田 沁州黄小米 (黄小米 五谷杂粮 山西特产 真空装 大米伴侣 粥米搭档) 2.5kg等4件商品 2020-04-25  2020-04-26  2020-04-25  NULL    NULL    http://img.gmall.com/144444.jpg 30  0.00    4903.00 19.00   20221114
3447    计娴瑾 13208002474 6665.00 1005    5903    第4大街第25号楼6单元338门    描述396659    689816418657611 荣耀10青春版 幻彩渐变 2400万AI自拍 全网通版4GB+64GB 渐变蓝 移动联通电信4G全面屏手机 双卡双待等3件商品 2020-04-25  2020-04-25  2020-04-25  NULL    NULL    http://img.gmall.com/793265.jpg 29  0.00    6660.00 5.00    20221114
Time taken: 0.145 seconds, Fetched: 5 row(s)
hive (ods)> 

Real-Time Data Analysis

4.1 Linux-Side Environment Preparation

The data-generating port needs to be up; in the /opt/data directory, run the command that starts it:

[root@master data]# pwd
/opt/data
[root@master data]# sh order.sh | nc -lk 26001

If the port is already in use (Ncat: bind to :::26001: Address already in use. QUITTING.), find the PID occupying the port with netstat -ntlp (for example netstat -ntlp | grep 26001) and stop that process with kill -9 <PID>:

[root@master data]# netstat -ntlp
[root@master data]# kill -9 xxx

After killing the process that occupied the port, restart the data-generation script:

[root@master data]# sh order.sh | nc -lk 26001

Configure and start Flume:

[root@master ~]# flume-ng agent --name a1 --conf /opt/module/flume/conf/ --conf-file /opt/module/flume/conf/socket_to_kafka.conf -Dflume.root.logger=INFO,console

Contents of socket_to_kafka.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = nc master 26001

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.kafka.topic = order
a1.sinks.k1.kafka.producer.acks = 1

# a1.sinks.k1.type = logger

# Use a channel that buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

If Flume is still running in the background, find its PID with jps and stop it with kill -9:

[root@master conf]# kill -9 12618
[root@master conf]# jps
10720 Kafka
3315 NodeManager
2901 SecondaryNameNode
3157 ResourceManager
11669 ConsoleConsumer
3688 QuorumPeerMain
8008 RunJar
15080 Jps
2489 NameNode
2653 DataNode
11279 ConsoleConsumer
[1]+  Killed                  flume-ng agent --name a1 --conf /opt/module/flume/conf/ --conf-file /opt/module/flume/conf/socket_to_kafka.conf -Dflume.root.logger=INFO,console  (wd: ~)
(wd now: /opt/module/flume/conf)
[root@master conf]# 

The Kafka configuration needs to be modified and Kafka restarted.

Edit kafka/config/server.properties:

[root@master config]# vim /opt/module/kafka/config/server.properties 

Change:

advertised.listeners=PLAINTEXT://192.168.1.100:9092

Restart Kafka:

[root@master config]# jps
10720 Kafka
15106 Application
3315 NodeManager
2901 SecondaryNameNode
3157 ResourceManager
11669 ConsoleConsumer
3688 QuorumPeerMain
8008 RunJar
2489 NameNode
15370 Jps
2653 DataNode
11279 ConsoleConsumer
[root@master config]# kill -9 10720
[root@master config]# /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties 

Note: the steps above must be performed on all three Linux nodes; the per-node values are sketched below.
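
For reference, the per-node settings would look like this (assuming each node advertises its own IP from the cluster plan):

# on slave1
advertised.listeners=PLAINTEXT://192.168.1.101:9092
# on slave2
advertised.listeners=PLAINTEXT://192.168.1.102:9092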

On the master node, start kafka-console-consumer.sh to receive data and verify that the configuration works:

[root@master config]# kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic order
I,"3578","张娣","13586813843","26718","1005","8708","第12大街第14号楼3单元391门","描述732297","347956132393657","Apple iPhoneXSMax (A2104) 256GB 深空灰色 移动联通电信4G手机 双卡双待等3件商品","2020-4-26 18:55:01","2020-4-26 18:59:01","2020-4-26 19:10:01",,,"http://img.gmall.com/991685.jpg","21","0","26700","18"
D,"8756","3573","1","荣耀10青春版 幻彩渐变 2400万AI自拍 全网通版4GB+64GB 渐变蓝 移动联通电信4G全面屏手机 双卡双待","http://AOvKmfRQEBRJJllwCwCuptVAOtBBcIjWeJRsmhbJ","2220","3","2020-4-26 18:55:01","2401",
I,"3579","伏彩春","13316165573","7374","1005","1515","第16大街第14号楼7单元619门","描述349867","152178625166735","荣耀10 GT游戏加速 AIS手持夜景 6GB+64GB 幻影蓝全网通 移动联通电信等3件商品","2020-4-26 18:55:01","2020-4-26 19:03:49","2020-4-26 19:10:01",,,"http://img.gmall.com/648845.jpg","15","0","7356","18"

4.2 Windows-Side Environment Preparation

Create the project for Flink real-time data analysis:

Add the pom dependencies required for Flink development to the project:

<properties>
        <flink.version>1.14.0</flink.version>
        <scala.version>2.12</scala.version>
        <hive.version>2.3.7</hive.version>
        <mysqlconnect.version>5.1.47</mysqlconnect.version>
        <!--        <hdfs.version>2.7.2</hdfs.version>-->
        <spark.version>3.0.0</spark.version>
    </properties>

    <dependencies>
<!--        <dependency>-->
<!--            <groupId>org.scalanlp</groupId>-->
<!--            <artifactId>jblas</artifactId>-->
<!--            <version>1.2.1</version>-->
<!--        </dependency>-->

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
<!--        <dependency>-->
<!--            <groupId>org.apache.flink</groupId>-->
<!--            <artifactId>flink-table-planner-blink_2.12</artifactId>-->
<!--            <version>${flink.version}</version>-->
<!--        </dependency>-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
            <type>pom</type>
        </dependency>
<!--        <dependency>-->
<!--            <groupId>org.apache.flink</groupId>-->
<!--            <artifactId>flink-jdbc_2.12</artifactId>-->
<!--            <version>${flink.version}</version>-->
<!--        </dependency>-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-csv</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.12</artifactId>
            <version>1.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-avro</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-filesystem_2.12</artifactId>
            <version>1.10.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.12</artifactId>
            <version>1.10.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysqlconnect.version}</version>
        </dependency>

        <!-- Spark, used for offline processing -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.dom4j/dom4j -->
        <dependency>
            <groupId>org.dom4j</groupId>
            <artifactId>dom4j</artifactId>
            <version>2.1.3</version>
        </dependency>

    </dependencies>

    <build>
        <!--<sourceDirectory>src/main/scala</sourceDirectory>-->
        <resources>
            <resource>
                <directory>src/main/scala</directory>
            </resource>
            <resource>
                <directory>src/main/java</directory>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <recompileMode>incremental</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

If imports fail to resolve, check whether the corresponding jars are present in the local Maven repository; if they are, try switching the Maven configuration.

Turn off the firewall on Windows (so the port data is not blocked by it); one possible command is sketched below.
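
One possible way to do that from an administrator command prompt (a hedged convenience; the firewall can just as well be turned off in the Windows Settings UI):

netsh advfirewall set allprofiles state off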

4.3 Real-Time Data Analysis Code Development

Reference tutorial (episodes 1-114): 【尚硅谷】Flink1.13教程(Scala版), Bilibili

Create the package edu.jl, and create hello.scala under it:

Set the Scala SDK for hello.scala:

package edu.jl

object hello {
  def main(args: Array[String]): Unit = {
    println("hello")
  }
}

Run hello to confirm that the Scala environment works.

Create readKafka.scala under edu.jl:

Write the following in readKafka.scala:

package edu.jl

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

import java.util.Properties


object readKafka {

  def main(args: Array[String]): Unit = {
    // Get the Flink streaming execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism to 1

    // Kafka consumer properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.1.100:9092")

    // Use Kafka as the Flink source and create the data stream
    val ds: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("order", new SimpleStringSchema(), properties).setStartFromLatest())

    // Print the data received from Kafka
    ds.print()

    // Execute the streaming job
    env.execute()
  }
}
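
It is also common to give the consumer an explicit group id through the same properties object (the name below is purely illustrative); if desired, add this line before the consumer is constructed:

    properties.setProperty("group.id", "flink_order_consumer") // illustrative consumer group id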

After restarting the port script and Flume, run the code; the records show up in IDEA's output window, which confirms that the Flink environment on Windows can consume the data.

4.4 Aggregate the Total Amount per SKU and Store the Values in Redis

Start Redis:

[root@master ~]# redis-server /root/my_redis.conf 

Create readKafkaToRedis.scala and write the following into it:

package edu.jl

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

import java.util.Properties


object readKafkaToRedis {

  def main(args: Array[String]): Unit = {
    // Get the Flink streaming execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism to 1

    // Kafka consumer properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.1.100:9092")

    // Use Kafka as the Flink source and create the data stream
    val ds: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("order", new SimpleStringSchema(), properties).setStartFromLatest())

    // Filter out the header rows, producing a new stream filter_ds
    val filter_ds: DataStream[String] = ds.filter(!_.split(",")(1).equals("\"id\""))

    // Extract the fields needed for the analysis from filter_ds:
    // sum sku_price over records that share the same sku_id
    val map_ds = filter_ds.map(line => {
      val words: Array[String] = line.split(",")
      val dat: String = words(0)

      if (dat.trim == "D") {
        val sku_price: Double = words(6).split('"')(1).toDouble
        val sku_id: String = words(3).split('"')(1)
        (sku_id, sku_price)
      } else {
        ("报错", 1.0)
      }
    }).keyBy(_._1).sum(1)

    map_ds.print()
    map_ds.addSink(new RedisSink[(String, Double)](flinkJedisPoolConfig,new MyRedisMapper))

    // Execute the streaming job
    env.execute()
  }
  class MyRedisMapper extends RedisMapper[Tuple2[String,Double]] {

    // Define the Redis command used to save the data: SET stores each (key, value) pair directly
    override def getCommandDescription: RedisCommandDescription = new RedisCommandDescription(RedisCommand.SET)

    // Use the sku_id as the Redis key
    override def getKeyFromData(t: (String, Double)): String = t._1

    // Use the summed price as the value
    override def getValueFromData(t: (String, Double)): String = t._2.toString

  }

  private val flinkJedisPoolConfig: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
    .setHost("192.168.1.100")
    .setPort(6379)
    .build()
}
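
If all SKU totals should live under a single Redis hash instead of one top-level key per SKU, the connector's RedisCommandDescription accepts an additional key for hash commands. A sketch of the only change needed in MyRedisMapper, with a hypothetical hash name:

    // Sketch: switch from SET to HSET so every SKU total becomes a field of one hash named "sku_total"
    override def getCommandDescription: RedisCommandDescription =
      new RedisCommandDescription(RedisCommand.HSET, "sku_total")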

Run result:

Query in Redis (keys * lists all keys; get <key> fetches a specific value):

keys *   // list all keys
get key  // get the value for the given key
For example:
192.168.1.100:6379> get 4
"1442.0"

Result:

192.168.1.100:6379> keys *
 1) "10"
 2) "\"11\""
 3) "\"6\""
 4) "\"14\""
 5) "6"
 6) "\"7\""
 7) "\"4\""
 8) "\"8\""
 9) "\"15\""
10) "\"13\""
11) "\"12\""
12) "\"9\""
13) "\"2\""
14) "\"1\""
15) "\"5\""
16) "15"
17) "\"16\""
18) "\"10\""
19) "\xe6\x8a\xa5\xe9\x94\x99"
20) "14"
21) "3"
192.168.1.100:6379> get 3
"3100.0"
192.168.1.100:6379> get 14
"496.0"
192.168.1.100:6379> get 4
"1442.0"

That brings this article to a close. Thanks for reading.

It has been a long, long while since the last update.

Exams kept me busy.

Keep going, keep going, keep going!!!

/(ㄒoㄒ)/~~


🥇Summary

That wraps up this piece on deploying big data components and data extraction. Thanks for your support; there are surely shortcomings and even mistakes in places, and corrections are very welcome. 🙇‍(ง •_•)ง

我非轻舟

Second installment of 2024, keep pushing!!!

If you have good opinions or suggestions, feel free to message me, and let's keep improving together.


That is all for this article.

Follow me and like the post ~ learn something new every day!

1. Since you have read this far, please give it a like, a comment, and a bookmark; your support is what keeps me writing.

2. Follow me for daily content: front-end plugins, 3D effects, image and text effects, full-site and HTML templates, C++, data structures, Python, Java, web scraping, and more. There are plenty of developers here to discuss and learn with.

3. For technical questions about the above, feel free to reach out; the official account below shares more source code.

Want the source code? Message me, follow, like, bookmark, or reach out on WeChat.

👍+✏️+⭐️+🙇‍

If you need the source code, follow the WeChat official account "Enovo开发工厂" and let's talk!
