Hadoop 3.x + Hive 3.1.2 Integration Manual

1) Install MySQL

  1. Remove the bundled MariaDB

    rpm -qa | grep mariadb
    sudo yum -y remove mariadb-libs-5.5.68-1.el7.x86_64
    
  2. Download and extract the files

    Download URL: https://downloads.mysql.com/archives/get/p/23/file/mysql-8.0.21-1.el7.x86_64.rpm-bundle.tar

    tar -xvf mysql-8.0.21-1.el7.x86_64.rpm-bundle.tar -C ~/
    
  3. Install dependency libraries

    sudo yum install -y libaio.x86_64 libaio-devel.x86_64 
    sudo yum install -y openssl-devel.x86_64 openssl.x86_64 
    sudo yum install -y perl.x86_64 perl-devel.x86_64 
    sudo yum install -y perl-JSON.noarch 
    sudo yum install -y autoconf
    sudo yum install -y net-tools
    
  4. Install the MySQL packages (the install order matters)

    sudo rpm -ivh mysql-community-common-8.0.21-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-libs-8.0.21-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-client-8.0.21-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-server-8.0.21-1.el7.x86_64.rpm
    # The following package is optional
    # sudo rpm -ivh mysql-community-devel-8.0.21-1.el7.x86_64.rpm
    
  5. Initialize the database

    sudo mysqld --initialize --console
    sudo chown -R mysql: /var/lib/mysql
    # View the generated initial root password
    sudo cat /var/log/mysqld.log | grep password
    
    # Check the database service status
    sudo service mysqld status
    # Start the database service
    sudo service mysqld start
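
    Optionally, mysqld can also be set to start automatically at boot; a small sketch, assuming a systemd-based EL7 host:

    sudo systemctl enable mysqld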
    
  6. Change the initial password and enable remote login

    mysql -u root -p
    # Change the password
    alter user 'root'@'localhost' identified by 'root123';
    
    # Allow root to connect remotely
    use mysql;
    update user set host='%' where user='root';
    
    # Grant the user full privileges and allow access from any host
    #grant all privileges ON *.* to 'root'@'%' with grant option;
    #flush privileges;
    
    # Create an administrator account
    create user 'admin'@'%' identified by 'admin123';
    grant all privileges on *.* to 'admin'@'%' with grant option;
    flush privileges;
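
    To confirm that remote access works, a quick check from another node (assuming the MySQL client is installed there and hadoop101 is the MySQL host):

    # run on another node, e.g. hadoop102
    mysql -h hadoop101 -u admin -p
    # enter admin123, then for example:
    # mysql> SELECT host, user FROM mysql.user;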
    

2) Install Hive

Main references:

https://blog.csdn.net/liuhuabing760596103/article/details/89175063

https://blog.csdn.net/weixin_45484707/article/details/108207329

1. Download and extract

Download the Hive tarball from the BFSU mirror: https://mirrors.bfsu.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

wget https://mirrors.bfsu.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar zxvf apache-hive-3.1.2-bin.tar.gz -C /app
cd /app
mv apache-hive-3.1.2-bin apache-hive-3.1.2
2. Set environment variables
sudo vi /etc/profile.d/env.sh
## Add HIVE_HOME
export HIVE_HOME=/app/apache-hive-3.1.2
export PATH=$PATH:$HIVE_HOME/bin

# Sync to all machines
sudo /home/hadoop/bin/xsync /etc/profile.d/env.sh
# Apply the new environment variables on every server
source /etc/profile
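# A quick check that the variables are visible on each node (values assume the paths used in this manual)
echo $HIVE_HOME    # should print /app/apache-hive-3.1.2
which hive         # should print /app/apache-hive-3.1.2/bin/hive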
3. Configure hive-env.sh
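hive-env.sh does not exist out of the box; it is created from the template shipped in the conf directory (a minimal sketch, assuming the install path above):

cd /app/apache-hive-3.1.2/conf
cp hive-env.sh.template hive-env.sh

Then add the following lines to hive-env.sh: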
HADOOP_HOME=/app/hadoop-3.2.2
export HIVE_CONF_DIR=/app/apache-hive-3.1.2/conf
export HIVE_AUX_JARS_PATH=/app/apache-hive-3.1.2/lib
4. Configure hive-site.xml (create $HIVE_CONF_DIR/hive-site.xml; all of the <property> elements below go inside a single <configuration> root element)
<!-- Store Hive metadata in MySQL -->
<property>
    <name>hive.metastore.db.type</name>
    <value>mysql</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop101:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
</property>
<!-- MySQL username and password -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>admin</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin123</value>
</property>

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>

<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
</property>

<!-- Local working directories used by HiveServer2 -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/data/hive/tmp/hiveuser</value>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/data/hive/tmp/${hive.session.id}_resources</value>
</property>

<!-- HiveServer2 operation log path -->
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/data/hive/tmp/operation_logs</value>
  </property>

<!-- Remote client connections -->
<property>
    <name>hive.server2.thrift.client.user</name>
    <value>hadoop</value>
    <description>Username to use against thrift client</description>
</property>
<property>
    <name>hive.server2.thrift.client.password</name>
    <value>hadoop123</value>
    <description>Password to use against thrift client</description>
</property>
<property> 
    <name>hive.server2.thrift.port</name> 
    <value>10000</value>
</property>
<!-- Fill in each machine's own IP or hostname here; 0.0.0.0 binds to all interfaces -->
<property> 
    <name>hive.server2.thrift.bind.host</name> 
    <value>0.0.0.0</value>
</property>
<property>
    <name>hive.server2.webui.host</name>
    <value>0.0.0.0</value>
</property>

<!-- Port of the HiveServer2 web UI -->
<property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
</property>
<property> 
    <name>hive.server2.long.polling.timeout</name> 
    <value>5000</value>                               
</property>
<property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
</property>
<property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
</property>
<property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
</property>

<property>
    <name>hive.execution.engine</name>
    <value>mr</value>
</property>

<!-- ZooKeeper-related settings -->
<property>
    <name>hive.zookeeper.quorum</name>
    <value>hadoop101,hadoop102,hadoop103</value>
</property>
<property>
    <name>hive.server2.support.dynamic.service.discovery</name>
    <value>true</value>
</property>
<property>
    <name>hive.server2.zookeeper.namespace</name>
    <value>hiveserver2</value>
</property>
<property>
    <name>hive.server2.zookeeper.publish.configs</name>
    <value>true</value>
</property>

<!-- Metastore high availability -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop101:9083,thrift://hadoop102:9083,thrift://hadoop103:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
    <name>hive.metastore.uri.selection</name>
    <value>RANDOM</value>
    <description>
        Expects one of [sequential, random].
        Determines the selection mechanism used by metastore client to connect to remote metastore.  SEQUENTIAL implies that the first valid metastore from the URIs specified as part of hive.metastore.uris will be picked.  RANDOM implies that the metastore will be picked randomly
    </description>
</property>

<!-- Authorization -->
<property>
    <name>hive.security.authorization.createtable.owner.grants</name>
    <value>ALL</value>
    <description>
        The privileges automatically granted to the owner whenever a table gets created.
        An example like "select,drop" will grant select and drop privilege to the owner
        of the table. Note that the default gives the creator of a table no access to the
        table (but see HIVE-8067).
    </description>
</property>
<!-- Keep this false so that DISTINCT over multiple columns is supported -->
<property>
    <name>hive.groupby.skewindata</name>
    <value>false</value>
    <description>Whether there is skew in data to optimize group by queries</description>
</property>

<!-- Enable UPDATE and DELETE support (ACID transactions) -->
<!-- Reference: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions -->
<!-- http://bcxw.net/article/202.html -->
<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
    <description>
        Whether Hive supports concurrency control or not. 
        A ZooKeeper instance must be up and running when using zookeeper Hive lock manager 
    </description>
</property>
<!-- Dynamic partitioning (must be enabled for transactions) -->
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
    <description>
        In strict mode, the user must specify at least one static partition
        in case the user accidentally overwrites all partitions.
        In nonstrict mode all partitions are allowed to be dynamic.
    </description>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    <description>
        Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of turning on Hive
        transactions, which also requires appropriate settings for hive.compactor.initiator.on,
        hive.compactor.worker.threads, hive.support.concurrency (true),
        and hive.exec.dynamic.partition.mode (nonstrict).
        The default DummyTxnManager replicates pre-Hive-0.13 behavior and provides
        no transactions.
    </description>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
    <description>
        Whether to run the initiator and cleaner threads on this metastore instance or not.
        Set this to true on one instance of the Thrift metastore service as part of turning
        on Hive transactions. For a complete list of parameters required for turning on
        transactions, see hive.txn.manager.
    </description>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
    <description>
        How many compactor worker threads to run on this metastore instance. Set this to a
        positive number on one or more instances of the Thrift metastore service as part of
        turning on Hive transactions. For a complete list of parameters required for turning
        on transactions, see hive.txn.manager.
        Worker threads spawn MapReduce jobs to do compactions. They do not do the compactions
        themselves. Increasing the number of worker threads will decrease the time it takes
        tables or partitions to be compacted once they are determined to need compaction.
        It will also increase the background load on the Hadoop cluster as more MapReduce jobs
        will be running in the background.
    </description>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>


<!-- Small-file merging (optional) -->
<property>
    <name>hive.merge.size.per.task</name>
    <value>268435456</value>
    <description>Size of merged files at the end of the job</description>
</property>
<property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16777216</value>
    <description>
        When the average output file size of a job is less than this number, Hive will start an additional 
        map-reduce job to merge the output files into bigger files. This is only done for map-only jobs 
        if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
    </description>
</property>
<!-- Merge small files produced by MapReduce jobs -->
<property>
    <name>hive.merge.mapredfiles</name>
    <value>true</value>
    <description>Merge small files at the end of a map-reduce job</description>
</property>
<!-- Merge small files produced by Tez jobs -->
<property>
    <name>hive.merge.tezfiles</name>
    <value>true</value>
    <description>Merge small files at the end of a Tez DAG</description>
</property>
<!-- Merge small files produced by Spark jobs -->
<property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
    <description>Merge small files at the end of a Spark DAG Transformation</description>
</property>
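
The HDFS and local directories referenced above must exist and be writable before Hive is used; a minimal sketch (assuming HDFS is running and the hadoop user runs Hive):

# HDFS directories referenced by hive-site.xml
hdfs dfs -mkdir -p /user/hive/warehouse /user/hive/tmp /user/hive/log
hdfs dfs -chmod g+w /user/hive/warehouse /user/hive/tmp
# Local scratch directories used by HiveServer2
mkdir -p /data/hive/tmp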
5. Configure Hadoop's core-site.xml

Absolute path: /app/hadoop-3.2.2/etc/hadoop/core-site.xml (no access control has been set up, so only the proxy-user settings below are configured for now)

<!-- Proxy-user settings: hadoop.proxyuser.{your own username}.hosts -->
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
6. Configure Hadoop's hdfs-site.xml

Configuration file: /app/hadoop-3.2.2/etc/hadoop/hdfs-site.xml

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
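
After changing core-site.xml and hdfs-site.xml, distribute them to every node and restart HDFS so the proxy-user and WebHDFS settings take effect; a sketch using the xsync script referenced earlier (assumed to exist at /home/hadoop/bin/xsync, with Hadoop's sbin on the PATH):

/home/hadoop/bin/xsync /app/hadoop-3.2.2/etc/hadoop/core-site.xml
/home/hadoop/bin/xsync /app/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
# Restart HDFS to pick up the new settings
stop-dfs.sh && start-dfs.sh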
7. Upload the MySQL JDBC driver
  1. Download page: https://downloads.mysql.com/archives/c-j/

    Select version 8.0.22, operating system: Platform Independent

    Reference: https://blog.csdn.net/qq_41950447/article/details/90085170

  2. Upload the downloaded mysql-connector-java-8.0.22.jar to the /app/apache-hive-3.1.2/lib directory (a sketch follows)
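
    If the Platform Independent tar.gz archive was downloaded, the JAR can be extracted and copied into Hive's lib directory; a sketch assuming the archive sits in the home directory:

    tar zxvf ~/mysql-connector-java-8.0.22.tar.gz -C ~/
    cp ~/mysql-connector-java-8.0.22/mysql-connector-java-8.0.22.jar /app/apache-hive-3.1.2/lib/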

8. Replace the guava JAR

Replace the guava JAR shipped with Hive with the one from the Hadoop installation (Hive 3.1.2 bundles guava-19.0, which conflicts with Hadoop 3.x's guava-27.0 and typically fails with a NoSuchMethodError at startup)

cd /app/apache-hive-3.1.2/lib
# Optional: back it up instead of deleting, e.g. mv guava-19.0.jar /app/guava-19.0.jar.bak
rm -rf guava-19.0.jar
cp /app/hadoop-3.2.2/share/hadoop/common/lib/guava-27.0-jre.jar .
9. Initialize the metastore schema
  1. Preparation: before initializing, temporarily change two settings in hive-site.xml so that schema verification is skipped and the schema is created automatically

    <!-- Automatically create the metastore schema -->
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <!-- Do not verify the schema -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    
  2. Run the initialization (verified in step 4 below)

    cd /app/apache-hive-3.1.2/bin
    ./schematool -initSchema -dbType mysql
    
  3. Restore hive-site.xml

    <!-- Turn automatic schema creation back off -->
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>false</value>
    </property>
    <!-- Re-enable schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>true</value>
    </property>
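
  4. Verify the initialization

    To confirm that the initialization in step 2 succeeded, the metastore tables created in MySQL can be listed (a quick check, using the admin account created in section 1):

    mysql -h hadoop101 -u admin -p -e 'USE hive; SHOW TABLES;'
    # Tables such as DBS, TBLS and VERSION should appear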
    
10. Remove the duplicate log4j binding (it conflicts with the SLF4J binding already provided by Hadoop)
# mv /app/apache-hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar /app/apache-hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar.bak
rm -rf /app/apache-hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar

3) Start Hive

  1. Start the Hive services; if problems occur during startup, check the log file /tmp/hadoop/hive.log

    The metastore and HiveServer2 need to run as background processes; the commands below are recommended (see the notes after this block)

    nohup hive --service metastore >/applogs/hive/metastore.log 2>&1 &
    nohup hive --service hiveserver2 >/applogs/hive/hiveserver2.log 2>&1 &
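
    The log directory /applogs/hive must exist before the commands above are run; afterwards both services should be listening on their default ports (9083 for the metastore, 10000 for HiveServer2, as configured above):

    mkdir -p /applogs/hive     # run before the nohup commands above
    # Give the services a minute to come up, then check the ports
    netstat -nltp | grep -E '9083|10000'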
    
  2. Verify HA registration in ZooKeeper

    /app/apache-zookeeper-3.6.3/bin/zkCli.sh -server hadoop101
    # Then check whether hiveserver2 instances are registered
    ls /hiveserver2
    
  3. Verify the HiveServer2 service through beeline

    # Two ways to connect
    # Option 1
    $ beeline
    > !connect jdbc:hive2://hadoop101:2181,hadoop102:2181,hadoop103:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2 hadoop hadoop123
    # Option 2
    beeline -u 'jdbc:hive2://hadoop101:2181,hadoop102:2181,hadoop103:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' -n hadoop -p 'hadoop123'
    
    jdbc:hive2://hadoop101:2181,hadoop102:2181> create table test01(id int);
    jdbc:hive2://hadoop101:2181,hadoop102:2181> insert into test01 values(1),(2),(3),(4);
    # Exit after verification is done
    > !quit
    

  4. Manage the services with a wrapper script

    #!/bin/bash
    HIVE_LOG_DIR=/applogs/hive
    
    mkdir -p $HIVE_LOG_DIR
    
    # Check whether a process is running; arg 1 is the process name, arg 2 is its port
    function check_process()
    {
        pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
        ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
        echo $pid
        [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
    }
    
    function hive_start()
    {
        metapid=$(check_process HiveMetastore 9083)
        cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
        #cmd=$cmd" sleep 5; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
        #cmd=$cmd" sleep 60"
        [ -z "$metapid" ] && eval $cmd || echo "Metastroe服务已启动"
    
        sleep 5
        server2pid=$(check_process HiveServer 10000)
        cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
        [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2服务已启动"
    }
    
    function hive_stop()
    {
        metapid=$(check_process HiveMetastore 9083)
        [ "$metapid" ] && kill $metapid || echo "Metastore服务未启动"
    
        server2pid=$(check_process HiveServer 10000)
        [ "$server2pid" ] && kill $server2pid || echo "HiveServer2服务未启动"
    }
    
    case $1 in
    "start")
        hive_start
        ;;
    "stop")
        hive_stop
        ;;
    "restart")
        hive_stop
        sleep 2
        hive_start
        ;;
    "status")
        check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running normally" || echo "Metastore service is not running properly"
        check_process HiveServer 10000 >/dev/null && echo "HiveServer2 service is running normally" || echo "HiveServer2 service is not running properly"
        ;;
    *)
        echo Invalid Args!
        echo 'Usage: '$(basename $0)' start|stop|restart|status'
        ;;
    esac
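
    Save the script, for example as /home/hadoop/bin/hiveservices.sh (an assumed path), make it executable, and manage both services with it:

    chmod +x /home/hadoop/bin/hiveservices.sh
    /home/hadoop/bin/hiveservices.sh start
    /home/hadoop/bin/hiveservices.sh status
    /home/hadoop/bin/hiveservices.sh stop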
    