Hive Study Notes

DataPeak

Category: Big Data. Tags: hive, big data, hadoop

Article link: https://blog.csdn.net/MagicalProgrammer/article/details/120273665

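Notes from the Lagou Education big data bootcamp course.

Overview

Hive was open-sourced by Facebook. It was originally built to compute statistics over massive volumes of structured log data.

It is a data warehouse tool built on MapReduce.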

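The existing Hadoop platform:

HDFS - storage for massive data

MapReduce - processing and analysis of massive data

Yarn - cluster resource management and job scheduling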

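Why Hive was created: it addresses pain points of the Hadoop platform:

MapReduce has a steep learning curve and is hard to develop for $\to$ Hive provides a SQL query interface and translates SQL queries into MapReduce jobs behind the scenes
HDFS has no fields, no data types, and no concept of a table, so data is awkward to manage $\to$ Hive maps structured data onto tables and provides data types
Long project cycles $\to$ developing in SQL greatly shortens the development cycle and lowers development cost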
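
The second point, mapping structured files onto tables with typed columns, can be illustrated with a small Python sketch. This is not Hive code. The column names, the tab delimiter, and the sample lines are all hypothetical, and the only idea carried over is that a schema is applied to raw text when the data is read.

```python
# Illustration only: mimic how a table definition gives raw delimited
# text named, typed columns. The schema and the sample data are made up.
SCHEMA = [("user_id", int), ("page", str), ("duration_ms", int)]


def parse_line(line, schema=SCHEMA, delimiter="\t"):
    """Turn one raw text line into a dict keyed by column name."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for (name, cast), raw in zip(schema, fields):
        try:
            row[name] = cast(raw)
        except ValueError:
            # A value that does not fit the declared type becomes None
            # (an assumption of this sketch, similar in spirit to NULL).
            row[name] = None
    return row


raw_lines = ["1\t/home\t120", "2\t/cart\toops"]
rows = [parse_line(line) for line in raw_lines]
for row in rows:
    print(row)
```

The files themselves stay plain text. Only the table definition decides how each field is named and typed.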

The essence of Hive:

It translates SQL into MapReduce jobs
HDFS provides the data storage

Hive can be thought of as a tool that turns SQL into MapReduce jobs.
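
As a rough sketch of what that translation means, the Python below simulates in memory how a query such as `SELECT page, COUNT(*) FROM logs GROUP BY page` breaks down into map, shuffle, and reduce steps. The table `logs`, its `page` column, and the function names are hypothetical. Real Hive generates jobs that run on a cluster, and this sketch only mirrors the data flow.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical rows of the logs table.
logs = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reduce logic.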

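Data warehouse: a subject-oriented, integrated, relatively stable collection of data that reflects historical change and supports management decision-making. The concept was proposed by Bill Inmon in 1991.

Purpose: build an analysis-oriented, integrated collection of data that supports business decisions
A data warehouse does not produce data itself; its data comes from external sources such as user systems and transaction systems.
It stores large volumes of data

For more on data warehousing, see the book *The Data Warehouse Toolkit*.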

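Hive vs. RDBMS (relational databases)

Query language - highly similar (HQL and SQL), so developers who know SQL can get started quickly
Data scale - Hive handles much larger datasets: it runs on a cluster and, with the MR execution engine, translates HQL into MapReduce jobs that run in parallel. An RDBMS typically supports smaller data volumes
Execution engine - Hive can run on MR, Tez, or Spark (selected with hive.execution.engine), while each RDBMS has its own execution engine. Flink can read and write Hive tables through its own Hive integration, but it is not one of Hive's execution engines
Data storage - Hive stores data on HDFS; an RDBMS stores data on the local file system or on raw devices.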

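Raw device - a raw device, also called a raw partition, is a special character device file that has not been formatted and is not read by Unix through a file system. The application itself is responsible for reading and writing it, bypassing the file system's buffering. Because the operating system's file system layer is skipped, I/O is more efficient, and many databases can use raw devices as their storage medium to improve I/O efficiency.

However, support for raw devices was removed in Linux 5.14, so they are now a thing of the past.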
Execution speed -
Scalability -
Data updates -
Transaction support -

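Pros and Cons

Pros

Lower learning cost than MapReduce; developers get productive quickly
Handles massive datasets; the underlying work runs as MapReduce jobs
Strong horizontal scalability: adding machines increases performance, ideally close to linearly (double the machines, roughly double the performance)
Extensible functionality - user-defined functions
High fault tolerance, inherited from HDFS
Unified metadata management - table definitions (table name, columns, column types) and data locations

Cons

HQL has limited expressive power
Iterative computation cannot be expressed - for example, machine learning
Execution efficiency is not high (with the MR-based execution engine)
The automatically generated MR jobs are not always smart, so performance can be low
Tuning is difficult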

Architecture

[Figure: Hive architecture diagram]

Installation and Configuration

MySQL

This problem occurs because the validate_password plugin is NOT activated by default. You can fix it with these commands:

mysql> select plugin_name, plugin_status from information_schema.plugins where plugin_name like 'validate%';

mysql> install plugin validate_password soname 'validate_password.so';

mysql> select plugin_name, plugin_status from information_schema.plugins where plugin_name like 'validate%';

+-------------------+---------------+
| plugin_name       | plugin_status |
+-------------------+---------------+
| validate_password | ACTIVE        |
+-------------------+---------------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'validate_password%';
+--------------------------------------+--------+
| Variable_name                        | Value  |
+--------------------------------------+--------+
| validate_password_check_user_name    | OFF    |
| validate_password_dictionary_file    |        |
| validate_password_length             | 8      |
| validate_password_mixed_case_count   | 1      |
| validate_password_number_count       | 1      |
| validate_password_policy             | MEDIUM |
| validate_password_special_char_count | 1      |
+--------------------------------------+--------+
7 rows in set (0.00 sec)

mysql> SET GLOBAL validate_password_policy=LOW;
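
To see what switching the policy from MEDIUM to LOW changes, here is a simplified Python sketch of the checks implied by the variables listed above. It is an approximation for illustration and not MySQL's implementation. The function name `check_password` is hypothetical, and the STRONG policy and the dictionary file are not modelled. The sample password is the one used in the Hive configuration later in these notes.

```python
def check_password(password, policy="MEDIUM", length=8,
                   mixed_case_count=1, number_count=1,
                   special_char_count=1):
    """Approximate the LOW and MEDIUM validate_password policies."""
    # Both policies enforce the minimum length.
    if len(password) < length:
        return False
    # LOW stops at the length check.
    if policy == "LOW":
        return True
    # MEDIUM also requires lowercase, uppercase, digits, and specials.
    lower = sum(c.islower() for c in password)
    upper = sum(c.isupper() for c in password)
    digits = sum(c.isdigit() for c in password)
    special = sum(not c.isalnum() for c in password)
    return (lower >= mixed_case_count
            and upper >= mixed_case_count
            and digits >= number_count
            and special >= special_char_count)


medium_ok = check_password("dont4get", policy="MEDIUM")
low_ok = check_password("dont4get", policy="LOW")
print(medium_ok, low_ok)
```

Under these assumptions, an all-lowercase password with one digit fails MEDIUM (no uppercase letter, no special character) but passes LOW, which only checks length.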

Configuration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Where the Hive metadata is stored -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://h3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <!-- JDBC driver class -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <!-- User name for the database connection -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <!-- Password for the database connection -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>dont4get</value>
    <description>password to use against metastore database</description>
  </property>
  <!-- Default storage location for data (on HDFS) -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <!-- Show the current database in the command-line prompt -->
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
  </property>
  <!-- Print column headers in command-line query output -->
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <!-- Use local mode for small datasets to improve efficiency -->
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
    <description>Let Hive determine whether to run in local mode automatically</description>
  </property>
</configuration>
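
A quick way to sanity-check a file like this is to parse it and list the name/value pairs. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened inline copy of the configuration. The helper name `load_properties` is hypothetical, and for a real file you would read its contents first (typically `hive-site.xml`, which is an assumption here since the notes do not name the file).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above, inlined for the example.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://h3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a dict of property name -> value from Hadoop-style XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


props = load_properties(SAMPLE)
for name, value in props.items():
    print(f"{name} = {value}")
```

Note that the parser turns `&amp;` back into a plain `&`. That is why the JDBC URL must be escaped inside the XML file while tools that read the file show it with a bare `&`.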

Initialization

[root@h3 opt]# schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.3.7/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:	 jdbc:mysql://h3:3306/hivemetadata?createDatabaseIfNotExist=true&useSSL=false
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed

Starting Hive

[root@h3 opt]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.8.0_291/bin:/opt/hadoop-2.9.2/bin:/opt/hadoop-2.9.2/sbin:/opt/hive-2.3.7/bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.3.7/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive-2.3.7/lib/hive-common-2.3.7.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 6.999 seconds, Fetched: 1 row(s)
hive> create database test1;
OK
Time taken: 0.199 seconds
hive> show databases;
OK
default
test1
Time taken: 0.033 seconds, Fetched: 2 row(s)
hive>