Installing the Full Hadoop Stack on Alibaba Cloud

For students, Alibaba Cloud's university program offers a 300 RMB voucher, which is a big help when learning Hadoop.
Alibaba Cloud university program entry link

A recommended development tool, mainly for newcomers who are not yet comfortable with vim:
Xterminal

😀 Preface:
This article is a basic tutorial for installing Hadoop on Alibaba Cloud servers.

Download address for the configuration files mentioned in this article

Alibaba Cloud link

Software | Version | Download link
Hadoop | 3.3.4 | https://hadoop.apache.org/release/3.3.6.html
Flume | 1.10.1 | https://hbase.apache.org/downloads.html
Zookeeper | 3.8.3 | https://zookeeper.apache.org/releases.html
JDK | 1.8 | https://www.oracle.com/cn/java/technologies/downloads/
Kafka | 2.12 | https://kafka.apache.org/
MySQL | 8.0.3 | https://dev.mysql.com/downloads/mysql/8.0.html
Hive | 3.1.3 | https://archive.apache.org/dist/hive/hive-3.1.3/
Flink | 1.17.1 | https://flink.apache.org/downloads/
HBase | 2.4.11 | https://hbase.apache.org/downloads.html
Spark | 3.1.3 | https://spark.apache.org/docs/3.1.3/
Spark-without-hadoop | 3.1.3 | https://mypikpak.com/s/VNvFw7s2Rkzu8k0qZ6XVulrYo1 , https://pan.baidu.com/s/1af9OveQl8bNuIoL0aiZXGQ?pwd=4m9w
datax | - | https://github.com/alibaba/DataX?tab=readme-ov-file
Maxwell | - | https://maxwells-daemon.io/

Server planning

hadoop102          | hadoop103       | hadoop104
NameNode           | ResourceManager | SecondaryNameNode
DataNode           | DataNode        | DataNode
NodeManager        | NodeManager     | NodeManager
HistoryServer      | -               | -
jdk                | jdk             | jdk
zk                 | zk              | zk
kafka              | kafka           | kafka
MySQL              | -               | -
Flume              | -               | -
Hive               | -               | -
flink (JobManager) | flink           | flink
TaskManager        | TaskManager     | TaskManager
HBase              | HBase           | HBase
Spark              | -               | -
datax              | -               | -
Maxwell            | -               | -

Purchasing the servers on Alibaba Cloud

(Screenshots of the ECS purchase steps.)

Configure the security group

(Screenshots of the security-group rules.)

You can add the rules by hand, but with this many it is easier to import them.

Setting up the environment

Logging in to the servers

xterminal

Note down the public IPs of the three machines.

Open the tool.

Test the connection.

Configure private-IP interconnection

vim /etc/hosts

192.168.1.224   hadoop102       hadoop102
192.168.1.223   hadoop104       hadoop104
192.168.1.225   hadoop103       hadoop103

All three machines need this.

Doing this on each machine one by one is tedious, so let's write a script.

Before writing it, we want to create a user to manage the rest of the Hadoop stack installation.

So create a hadoop user (a sketch for doing this on all three hosts follows):

useradd hadoop
passwd hadoop

Set the hadoop user's password to 1234.
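If you would rather not repeat this on each machine, a minimal sketch run as root on hadoop102 (assuming you can still SSH to the other nodes with password authentication at this point):

for host in hadoop103 hadoop104; do
  # -m creates the home directory; chpasswd sets the password non-interactively
  ssh "$host" "useradd -m hadoop && echo 'hadoop:1234' | chpasswd"
done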

Switch to the hadoop user to write the script.


vim xsync

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo "Not Enough Arguments!"
  exit;
fi

#2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
  echo ====================  $host  ====================
  #3. Loop over all files/directories and send them one by one
  for file in $@
  do
    #4. Check that the file exists
    if [ -e $file ]
    then
      #5. Get the parent directory
      pdir=$(cd -P $(dirname $file); pwd)
      #6. Get the file name
      fname=$(basename $file)
      ssh $host "mkdir -p $pdir"
      rsync -av $pdir/$fname $host:$pdir
    else
      echo "$file does not exist!"
    fi
  done
done

Please note

These three hostnames should be your own.

Make the file executable

chmod 777 xsync
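A note on where the script lives: the rest of the guide calls xsync (and the later helper scripts) by name, which works if it sits in a directory on PATH. A minimal sketch, assuming /home/hadoop/bin (the same directory used later for zk.sh), which CentOS-style images already add to the hadoop user's PATH via ~/.bash_profile:

mkdir -p /home/hadoop/bin
mv xsync /home/hadoop/bin/
# Log in again or source ~/.bash_profile so the new directory is picked up.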

Distribute the file

xsync /etc/hosts


Passwordless SSH between the nodes

Next comes the SSH interconnection, just like in a regular installation.

Both the root user and the hadoop user need it.

One advantage of this tool is that it can run commands on several sessions at once.

Let's set it up!

ssh-keygen -t rsa -b 4096


ssh-copy-id hadoop102
ssh-copy-id hadoop103
ssh-copy-id hadoop104

Do this once as root and once as the hadoop user.
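A quick way to confirm the keys took effect (run it as each of the two users): every command below should print the remote hostname without prompting for a password.

for host in hadoop102 hadoop103 hadoop104; do
  ssh "$host" hostname
done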

JDK 1.8

All of the following installation steps are best done as the hadoop user, working on hadoop102.

First, as root, give the hadoop user ownership of /opt:

chown -R hadoop:hadoop /opt

Upload directory:

mkdir /opt/software

Install directory:

mkdir /opt/module

Upload the jdk, hadoop, zookeeper, kafka, and flume packages.

Be patient, this takes a while.

Once the upload finishes, install the JDK:

tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

Rename it (by default all work is done in /opt/module; this won't be repeated later):

mv jdk1.8.0_212/ jdk

Environment variables

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
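Where these lines go is not spelled out here; since a later step runs xsync /etc/profile, a minimal sketch that matches that layout is to append them to /etc/profile and reload it:

# Run as root (or via sudo); the hadoop user cannot write /etc/profile.
cat >> /etc/profile <<'EOF'
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
EOF
source /etc/profile
java -version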

Verify

Distribute

Don't rush to check the other two machines yet; /etc/profile hasn't been distributed. It goes out together with Hadoop's environment variables.

Hadoop

Extract

tar -zxvf hadoop-3.3.4.tar.gz -C /opt/module/

Rename

mv hadoop-3.3.4/ hadoop

Configuration

cd hadoop/etc/hadoop/

The editor bundled with the tool (the xedit command) is recommended. The contents of the four site files listed next come from the configuration-file download at the top of the article; an illustrative core-site.xml sketch follows the list.

core-site.xml

hdfs-site.xml

mapred-site.xml

yarn-site.xml
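Purely as an illustration of what these files contain, here is a minimal core-site.xml for this three-node layout. The property names are standard Hadoop ones; the values are assumptions that match the hostnames and paths used in this guide, so prefer the downloaded files for the real setup.

cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- NameNode address; the Hive/HBase/Spark configs later point at hadoop102:8020 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop102:8020</value>
  </property>
  <!-- Base directory for HDFS data -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop/data</value>
  </property>
  <!-- User the web UIs act as when browsing HDFS -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF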

xedit workers 

hadoop102
hadoop103
hadoop104

xedit hadoop-env.sh

export JAVA_HOME=/opt/module/jdk
export HADOOP_HOME=/opt/module/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs

Environment variables

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Distribute the environment file

xsync /etc/profile

Distribute hadoop

xsync hadoop/

Check java

Check hadoop

Format the NameNode

cd /opt/module/hadoop/
bin/hdfs namenode -format

Startup script

vim hdp.sh

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi
case $1 in
"start")
        echo " =================== Starting the Hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== Stopping the Hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Make it executable
chmod 777 hdp.sh

A script to check jps on all nodes

vim xcall

#! /bin/bash
for i in hadoop102 hadoop103 hadoop104
do
    echo --------- $i ----------
    ssh $i "/opt/module/jdk/bin/jps $*"
done

chmod 777 xcall

Start and stop

hdp.sh start
hdp.sh stop


Zookeeper

Extract and rename

tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz  -C /opt/module/
mv apache-zookeeper-3.7.1-bin/ zookeeper

Configuration

Configure the server id

cd zookeeper/
mkdir zkData
cd zkData/
vim myid
2

Note that the id on hadoop102 is 2.

Edit the configuration file

cd conf/
mv zoo_sample.cfg zoo.cfg
xedit zoo.cfg 

dataDir=/opt/module/zookeeper/zkData
#######################cluster##########################
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888

zoo.cfg

Or simply replace the file's contents with the downloaded configuration.

Distribute

cd /opt/module
xsync zookeeper/

Change the myid on hadoop103 and hadoop104 (a one-line shortcut is sketched below):

hadoop103 corresponds to 3
hadoop104 corresponds to 4
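You can edit the two files by hand, or, as a small shortcut (assuming passwordless SSH as the hadoop user is already set up), overwrite them over SSH:

ssh hadoop103 'echo 3 > /opt/module/zookeeper/zkData/myid'
ssh hadoop104 'echo 4 > /opt/module/zookeeper/zkData/myid'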

Script

cd /home/hadoop/bin
vim zk.sh

#!/bin/bash

# Set JAVA_HOME and extend PATH
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin

# Check input arguments
if [ $# -ne 1 ]; then
    echo "Usage: $0 {start|stop|status}"
    exit 1
fi

# Run the requested action
case "$1" in
    start)
        echo "---------- Starting Zookeeper ------------"
        /opt/module/zookeeper/bin/zkServer.sh start
        ssh hadoop103 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh start"
        ssh hadoop104 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh start"
        ;;
    stop)
        echo "---------- Stopping Zookeeper ------------"
        /opt/module/zookeeper/bin/zkServer.sh stop
        ssh hadoop103 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh stop"
        ssh hadoop104 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh stop"
        ;;
    status)
        echo "---------- Zookeeper status ------------"
        /opt/module/zookeeper/bin/zkServer.sh status
        ssh hadoop103 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh status"
        ssh hadoop104 "export JAVA_HOME=/opt/module/jdk; export PATH=\$PATH:\$JAVA_HOME/bin; /opt/module/zookeeper/bin/zkServer.sh status"
        ;;
    *)
        echo "Unknown command: $1"
        echo "Usage: $0 {start|stop|status}"
        exit 2
        ;;
esac


Kafka

Extract and rename

tar -zxvf kafka_2.12-3.3.1.tgz  -C /opt/module/
mv kafka_2.12-3.3.1/ kafka

Configuration

xedit server.properties

Add:
advertised.listeners=PLAINTEXT://hadoop102:9092

Modify:
log.dirs=/opt/module/kafka/datas
zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181

Environment variables

#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin

Remember to source /etc/profile again.

Distribute

xsync kafka/

Modify the configuration file on hadoop103/104 (a sed shortcut is sketched after the two listings):

[hadoop@hadoop103 module]$ vim kafka/config/server.properties
Change:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
# The IP and port the broker advertises externally (set per node)
advertised.listeners=PLAINTEXT://hadoop103:9092

[hadoop@hadoop104 module]$ vim kafka/config/server.properties
Change:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=2
# The IP and port the broker advertises externally (set per node)
advertised.listeners=PLAINTEXT://hadoop104:9092
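If you prefer to patch the two files without opening an editor, a hedged sketch using sed over SSH (assuming broker.id on hadoop102 keeps its default of 0 and the paths match this guide's layout):

# Set a unique broker.id and point advertised.listeners at each node's own hostname.
ssh hadoop103 "sed -i 's/^broker.id=.*/broker.id=1/;s#^advertised.listeners=.*#advertised.listeners=PLAINTEXT://hadoop103:9092#' /opt/module/kafka/config/server.properties"
ssh hadoop104 "sed -i 's/^broker.id=.*/broker.id=2/;s#^advertised.listeners=.*#advertised.listeners=PLAINTEXT://hadoop104:9092#' /opt/module/kafka/config/server.properties"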

Script

Remember: Kafka can only start successfully when Zookeeper is already running.

vim kf.sh

#!/bin/bash

# Locations of Kafka, Zookeeper, and the JDK
KAFKA_HOME=/opt/module/kafka
ZOOKEEPER_HOME=/opt/module/zookeeper
JAVA_HOME=/opt/module/jdk

# Function that starts Kafka on every broker
start_kafka() {
    echo "Starting Kafka on hadoop102..."
    $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
    
    echo "Starting Kafka on hadoop104..."
    ssh hadoop104 "export JAVA_HOME=$JAVA_HOME; export KAFKA_HOME=$KAFKA_HOME; $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties"
    
    echo "Starting Kafka on hadoop103..."
    ssh hadoop103 "export JAVA_HOME=$JAVA_HOME; export KAFKA_HOME=$KAFKA_HOME; $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties"
}

# Function that stops Kafka on every broker
stop_kafka() {
    echo "Stopping Kafka on hadoop102..."
    $KAFKA_HOME/bin/kafka-server-stop.sh
    
    echo "Stopping Kafka on hadoop104..."
    ssh hadoop104 "export KAFKA_HOME=$KAFKA_HOME; $KAFKA_HOME/bin/kafka-server-stop.sh"
    
    echo "Stopping Kafka on hadoop103..."
    ssh hadoop103 "export KAFKA_HOME=$KAFKA_HOME; $KAFKA_HOME/bin/kafka-server-stop.sh"
}

# Function that checks Kafka status on every broker
check_status() {
    echo "Checking Kafka status on hadoop102..."
    ssh hadoop102 "jps | grep -i kafka"
    
    echo "Checking Kafka status on hadoop104..."
    ssh hadoop104 "jps | grep -i kafka"
    
    echo "Checking Kafka status on hadoop103..."
    ssh hadoop103 "jps | grep -i kafka"
}

# Handle the command-line argument
case "$1" in
    start)
        start_kafka
        ;;
    stop)
        stop_kafka
        ;;
    status)
        check_status
        ;;
    *)
        echo "Usage: $0 {start|stop|status}"
        exit 1
        ;;
esac


Flume

Extract and rename

tar -zxvf apache-flume-1.10.1-bin.tar.gz  -C /opt/module/
mv apache-flume-1.10.1-bin/ flume

Configuration

log4j2.xml

vim log4j2.xml

Modify:
<Property name="LOG_DIR">/opt/module/flume/log</Property>

Add:
<Root level="INFO">
      <AppenderRef ref="LogFile" />
      <AppenderRef ref="Console" />
    </Root>

Distribute

xsync flume/
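A quick sanity check after distributing; flume-ng is the launcher shipped with Flume, and exporting JAVA_HOME is needed because the remote non-login shell does not read /etc/profile:

for host in hadoop102 hadoop103 hadoop104; do
  echo "== $host =="
  ssh "$host" "export JAVA_HOME=/opt/module/jdk; /opt/module/flume/bin/flume-ng version"
done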

MySQL

MySQL download address (recommended)

Or use the link in the table above.

Extract the download.

Upload MySQL and Hive.

Install MySQL

cd /opt/software/MySQL/
sh install_mysql.sh

The root password is 000000.

Check that you can log in:

mysql -uroot -p000000

Installation successful.

Hive

Extract and rename

tar -zxvf apache-hive-3.1.3-bin.tar.gz -C /opt/module/

mv apache-hive-3.1.3-bin/ hive

Configuration

hive-env.sh

vim hive-env.sh

export HADOOP_HOME=/opt/module/hadoop
export HIVE_CONF_DIR=/opt/module/hive/conf
export HIVE_AUX_JARS_PATH=/opt/module/hive/lib

hive-site.xml

vim hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC URL of the MySQL database where Hive stores its metadata -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;allowPublicKeyRetrieval=true</value>
    </property>

    <!-- Fully qualified class name of the MySQL JDBC driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>

    <!-- Username Hive uses to connect to MySQL -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <!-- Password Hive uses to connect to MySQL -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>000000</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop102</value>
    </property>

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>

    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

Environment variables

#HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin

Resolve the logging jar conflict

cd /opt/module/hive/lib/
mv log4j-slf4j-impl-2.17.1.jar log4j-slf4j-impl-2.17.1.jar.bak

Copy the MySQL JDBC driver into Hive's lib directory

cp /opt/software/MySQL/mysql-connector-j-8.0.31.jar /opt/module/hive/lib/

Check
ll | grep mysql


Configure the metastore database

MySQL side

mysql -uroot -p000000
mysql> create database metastore;
mysql> quit;

Hive side

hdp.sh start
cd /opt/module/hive
bin/schematool -initSchema -dbType mysql -verbose


Change the metastore database character set

use metastore;
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;

Start

hive

show databases;


Configure the client

beeline

First create a logs folder under the hive directory:
mkdir -p /opt/module/hive/logs

# Start the metastore service first, then the hiveserver2 service
nohup bin/hive --service metastore >> logs/metastore.log 2>&1 &
nohup bin/hive --service hiveserver2 >> logs/hiveserver2.log 2>&1 &

bin/beeline
!connect jdbc:hive2://hadoop102:10000
hadoop
Press Enter (the password can be left blank).


DG configuration

Script

This script is only for starting metastore and hiveserver2, and for killing Hive.

hi.sh

#!/bin/bash

export HIVE_HOME=/opt/module/hive

case $1 in
  start)
    echo "Starting Hive..."
    nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/metastore.log 2>&1 &
    nohup $HIVE_HOME/bin/hive --service hiveserver2 >> $HIVE_HOME/logs/hiveserver2.log 2>&1 &
    echo "Hive started."
    ;;
  stop)
    echo "Stopping Hive..."
    runjar_pids=$(jps | grep 'RunJar' | awk '{print $1}')
    if [ -z "$runjar_pids" ]; then
      echo "No RunJar processes are running."
    else
      for pid in $runjar_pids; do
        echo "Killing process: $pid"
        kill -15 $pid  # first try to terminate gracefully
        sleep 1        # give the process a moment to respond
        kill -9 $pid   # force-kill if it is still running
      done
    fi
    echo "Hive stopped."
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac

exit 0

HBase

Before installing, make sure the Hadoop cluster and Zookeeper are running.

Extract and rename

tar -zxvf hbase-2.4.11-bin.tar.gz -C /opt/module/
mv hbase-2.4.11/ hbase

Configuration

Environment variables

#HBASE_HOME
export HBASE_HOME=/opt/module/hbase
export PATH=$PATH:$HBASE_HOME/bin

Remember to distribute it and source the file again.

hbase-env.sh

export HBASE_MANAGES_ZK=false

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop102,hadoop103,hadoop104</value>
    <description>The ZooKeeper quorum used by HBase.
    </description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop102:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>

  

  <property>
    <name>hbase.wal.provider</name>
    <value>filesystem</value>
  </property>
</configuration>

regionservers

hadoop102
hadoop103
hadoop104

Resolve the log4j compatibility issue between HBase and Hadoop

cd /opt/module/hbase/lib/client-facing-thirdparty/
mv slf4j-reload4j-1.7.33.jar  slf4j-reload4j-1.7.33.jar.bak

Distribute and start (a sketch of the commands is below).
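The commands are not spelled out here; a minimal sketch, assuming the layout above (start-hbase.sh only needs to run on the node that should become HMaster, hadoop102):

cd /opt/module
xsync hbase/
/opt/module/hbase/bin/start-hbase.sh
# Expect HMaster on hadoop102 and an HRegionServer on every node (check with the xcall script).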

Success screenshot: the HBase web UI is on port 16010.

If HBase complains that it cannot find the JDK, add the JDK path (JAVA_HOME) to hbase-env.sh.

High availability

touch conf/backup-masters
echo hadoop103 > conf/backup-masters
Finally, just distribute it.

Flink

Extract and rename

tar -zxvf flink-1.17.1-bin-scala_2.12.tgz -C /opt/module/
mv flink-1.17.1/ flink

Configuration

flink-conf.yaml

# JobManager node address.
jobmanager.rpc.address: hadoop102
jobmanager.bind-host: 0.0.0.0
rest.address: hadoop102
rest.bind-address: 0.0.0.0
# TaskManager node address; needs to be set to the current machine's hostname.
taskmanager.bind-host: 0.0.0.0
taskmanager.host: hadoop102

workers

hadoop102
hadoop103
hadoop104

masters

hadoop102:8081

Distribute

On hadoop103 and hadoop104, modify flink-conf.yaml respectively:

taskmanager.host: hadoop103

==============================================================================

taskmanager.host: hadoop104

YARN run mode

Add to the environment variables (a hedged submission example follows the exports):

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
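With those variables exported (and the Hadoop cluster running), a hedged example of bringing up a YARN session and submitting the bundled streaming WordCount; the paths assume this guide's layout:

cd /opt/module/flink
# Start a detached YARN session; the JobManager web address is printed in the output.
bin/yarn-session.sh -d
# Submit the example job to the running session.
bin/flink run examples/streaming/WordCount.jar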

Startup screenshot

If the page won't open, don't panic; first check whether port 8081 is open in the Alibaba Cloud security group.

Spark(Yarn)

Extract and rename

tar -zxvf spark-3.1.3-bin-hadoop3.2.tgz -C /opt/module/
 mv spark-3.1.3-bin-hadoop3.2/ spark-yarn

Configuration

Hadoop's yarn-site.xml

<!-- Whether to run a thread that checks each task's physical memory usage and kills any task that exceeds its allocation. Default is true. -->
<property>
     <name>yarn.nodemanager.pmem-check-enabled</name>
     <value>false</value>
</property>

<!-- Whether to run a thread that checks each task's virtual memory usage and kills any task that exceeds its allocation. Default is true, which wastes memory here. -->
<property>
     <name>yarn.nodemanager.vmem-check-enabled</name>
     <value>false</value>
</property>

Remember to distribute it.

Spark configuration

cd /opt/module/spark-yarn/conf
 mv spark-env.sh.template spark-env.sh
 mv spark-defaults.conf.template spark-defaults.conf

spark-env.sh

Add (the export below is a shell setting, so it belongs in spark-env.sh rather than spark-defaults.conf):

YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory 
-Dspark.history.retainedApplications=30"

spark-defaults.conf

Add:

spark.yarn.historyServer.address=hadoop102:18080
spark.history.ui.port=18080

Start and test

Start the history server

sbin/start-history-server.sh

mkdir -p /tmp/spark-events   (if you lack permission, grant it yourself)

Test

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
./examples/jars/spark-examples_2.12-3.1.3.jar \
10

Check the result.

Spark on Hive

Extract and rename

tar -zxvf spark-3.3.1-bin-without-hadoop.tgz -C /opt/module/
mv /opt/module/spark-3.3.1-bin-without-hadoop /opt/module/spark

Configuration

spark-env.sh

mv /opt/module/spark/conf/spark-env.sh.template /opt/module/spark/conf/spark-env.sh
vim /opt/module/spark/conf/spark-env.sh

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Environment variables

# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin

Remember to source the file again.

Create a Spark configuration file for Hive

vim /opt/module/hive/conf/spark-defaults.conf

Add the following:

spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:8020/spark-history
spark.executor.memory            1g
spark.driver.memory              1g

Create HDFS paths to store the history logs and the Spark jars

hadoop fs -mkdir /spark-history
hadoop fs -mkdir /spark-jars
hadoop fs -put /opt/module/spark/jars/* /spark-jars

Modify Hive's hive-site.xml

Add:

<!-- Location of the Spark jars (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
  
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>

YARN environment configuration

vim /opt/module/hadoop/etc/hadoop/capacity-scheduler.xml

Change:
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.8</value>
</property>


Test

nohup hive --service metastore &
hive
create table student(id int, name string);
insert into table student values(1,'abc');

The first run is slow because resources need to be allocated, but the second run is much faster.

Maxwell

Extract and rename

 cd /opt/software/
 tar -zxvf maxwell-1.29.2.tar.gz -C /opt/module/
 cd ../module/
 mv maxwell-1.29.2/ maxwell

Enable the MySQL binlog

sudo vim /etc/my.cnf

Add:

# server id
server-id = 1
# enable the binlog; this value is used as the binlog file name prefix
log-bin=mysql-bin
# binlog format; Maxwell requires row
binlog_format=row
# database(s) to log; adjust to your actual setup
binlog-do-db=gmall

Restart the MySQL service

sudo systemctl restart mysqld

Create the Maxwell database and user in MySQL

CREATE DATABASE maxwell;
CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
GRANT ALL ON maxwell.* TO 'maxwell'@'%';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';

Configure Maxwell

cd /opt/module/maxwell
cp config.properties.example config.properties
vim config.properties
# Where Maxwell sends data; options are stdout|file|kafka|kinesis|pubsub|sqs|rabbitmq|redis
producer=kafka
# Target Kafka cluster address
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
# Target Kafka topic; can be static, e.g. maxwell, or dynamic, e.g. %{database}_%{table}
kafka_topic=topic_db

# MySQL connection settings
host=hadoop102
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true

# Exclude the gmall.z_log table; it is a backup of log data and does not need to be captured
filter=exclude:gmall.z_log
# Partition Kafka data by primary key to avoid skew
producer_partition_by=primary_key

Script

#!/bin/bash

MAXWELL_HOME=/opt/module/maxwell

status_maxwell(){
    result=`ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l`
    return $result
}

start_maxwell(){
    status_maxwell
    if [[ $? -lt 1 ]]; then
        echo "启动Maxwell"
        $MAXWELL_HOME/bin/maxwell --config $MAXWELL_HOME/config.properties --daemon
    else
        echo "Maxwell正在运行"
    fi
}

stop_maxwell(){
    status_maxwell
    if [[ $? -gt 0 ]]; then
        echo "停止Maxwell"
        ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}' | xargs kill -9
    else
        echo "Maxwell未在运行"
    fi
}

case $1 in
    start )
        start_maxwell
    ;;
    stop )
        stop_maxwell
    ;;
    restart )
       stop_maxwell
       start_maxwell
    ;;
esac
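The script is not given a name in the original; assuming it is saved as mxw.sh in /home/hadoop/bin like the other helpers, usage would look like this (Kafka and MySQL must already be running):

chmod +x /home/hadoop/bin/mxw.sh
mxw.sh start
mxw.sh restart
mxw.sh stop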

datax

Extract

tar -zxvf datax.tar.gz -C /opt/module/

Self-test check (a hedged template-generation example follows the command):

python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
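Once the self-test passes, DataX jobs are JSON files. A hedged sketch of how to get a starting template for a MySQL-to-HDFS job: the -r/-w options print a skeleton for the named reader/writer plugins, and mysqlreader and hdfswriter ship with the standard DataX package.

# Print a job template for copying from MySQL to HDFS; copy the JSON part
# of the output into a file under /opt/module/datax/job/ and fill in the details.
python /opt/module/datax/bin/datax.py -r mysqlreader -w hdfswriter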

