Hive Summary

youngxuebo

Article link: https://blog.csdn.net/qq_32783151/article/details/115438903

1 Hive Basics

1.1 What Is Hive

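Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a SQL-like query language (HQL), translating each statement into MapReduce jobs for execution. It was open-sourced by Facebook to compute statistics over massive volumes of structured logs.

In essence, Hive translates HQL/SQL into MapReduce programs:

1) The data Hive processes is stored in HDFS.
2) The underlying engine Hive uses to analyze data is MapReduce.
3) The jobs run on YARN.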
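
One way to see this translation is to ask Hive for the execution plan of a query. The following is a minimal sketch, not taken from the original article: it assumes a table named student with a name column already exists, and it only prints the intended command when hive is not installed.

```shell
#!/usr/bin/env bash
# Assumption: a table named student with a name column already exists.
QUERY='EXPLAIN SELECT name, count(*) AS cnt FROM student GROUP BY name;'

if command -v hive >/dev/null 2>&1; then
  # EXPLAIN prints the stage plan instead of running the query.
  hive -e "$QUERY"
else
  echo "hive is not on PATH; the command would be: hive -e \"$QUERY\""
fi
```

With MapReduce as the execution engine, the plan for a GROUP BY like this one normally contains a stage with both a map and a reduce operator tree.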

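1.2 Advantages and Disadvantages of Hive

Advantages:

1) The interface uses SQL-like syntax, which makes development fast and easy to pick up.
2) Developers do not have to write MapReduce code, which lowers the learning cost.
3) Hive is suited to offline data analysis where real-time response is not required, because its execution latency is high.
4) Hive's strength is processing large data sets; it offers no advantage on small ones, since the latency of the underlying MapReduce jobs dominates.
5) Hive supports user-defined functions, so users can implement functions for their own needs.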

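Disadvantages:

1) HQL has limited expressive power
(1) Iterative algorithms cannot be expressed.
(2) It is not well suited to data mining.

2) Hive is relatively inefficient
(1) The MapReduce jobs Hive generates automatically are usually not well optimized.
(2) Hive is hard to tune, and the tuning granularity is coarse.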

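Features:

1) Scalability

A Hive cluster (HDFS) can be scaled freely, usually without restarting the service.

2) Extensibility

Hive supports user-defined functions, so users can implement functions for their own needs.

3) Fault tolerance

Hive tolerates failures well: a SQL statement can still complete when a node fails.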

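1.3 Hive Architecture

[Figure: Hive architecture]
Hive receives user commands (SQL) through a set of client interfaces. Its Driver, together with the metadata in the Metastore, translates these commands into MapReduce jobs and submits them to Hadoop. The results are then returned to the client interface.
1) User interfaces: Client
CLI (hive shell), JDBC/ODBC (programmatic access, for example from Java), and Web UI (access from a browser).
2) Metadata: Metastore
The metadata includes table names, the database each table belongs to (default by default), table owners, column and partition fields, table types (for example, whether a table is external), and the directory that holds each table's data.
By default the metadata is stored in the embedded Derby database; storing the Metastore in MySQL is recommended.
3) Hadoop
HDFS provides storage and MapReduce provides computation.
4) Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, and then analyzes the AST semantically, checking for example whether the tables and columns exist and whether the statement is semantically valid.
(2) Compiler: compiles the AST into a logical execution plan.
(3) Optimizer (Query Optimizer): optimizes the logical execution plan.
(4) Executor (Execution): converts the logical execution plan into a runnable physical plan. For Hive this means MapReduce or Spark jobs.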

2 Hive Installation

2.1 Hive Download Locations

1) Hive website: http://hive.apache.org/
2) Documentation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
3) Downloads: http://archive.apache.org/dist/hive/
4) GitHub: https://github.com/apache/hive

2.2 Installing and Deploying Hive

1) Install and configure Hive

(1) Upload apache-hive-2.3.4-bin.tar.gz to /opt/software on the Linux host.
(2) Extract apache-hive-2.3.4-bin.tar.gz into /opt/module/:

[root@bigdata03 software]# tar -zxvf apache-hive-2.3.4-bin.tar.gz -C /opt/module/

(3) Rename apache-hive-2.3.4-bin to hive-2.3.4:

[root@bigdata03 module]# mv apache-hive-2.3.4-bin hive-2.3.4

(4) In /opt/module/hive-2.3.4/conf, rename hive-env.sh.template to hive-env.sh and hive-log4j2.properties.template to hive-log4j2.properties:

[root@bigdata03 conf]# mv hive-env.sh.template hive-env.sh
[root@bigdata03 conf]# mv hive-log4j2.properties.template hive-log4j2.properties

(5) Configure hive-env.sh

Set the HADOOP_HOME path:
export HADOOP_HOME=/opt/module/hadoop-2.8.4
Set the HIVE_CONF_DIR path:
export HIVE_CONF_DIR=/opt/module/hive-2.3.4/conf
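
The hive-env.sh step can also be scripted so that it is safe to repeat. This is a sketch under stated assumptions: CONF_DIR and the add_line helper are illustrative names that do not come from the article, and CONF_DIR defaults to a scratch directory so nothing in a real installation is touched unless you point it at /opt/module/hive-2.3.4/conf.

```shell
#!/usr/bin/env bash
set -eu
# Assumption: CONF_DIR defaults to a scratch directory so the sketch is safe to try.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
HIVE_DIR="/opt/module/hive-2.3.4"

# Start from the template when it exists, otherwise from an empty file.
if [ -f "$CONF_DIR/hive-env.sh.template" ] && [ ! -f "$CONF_DIR/hive-env.sh" ]; then
  cp "$CONF_DIR/hive-env.sh.template" "$CONF_DIR/hive-env.sh"
fi
touch "$CONF_DIR/hive-env.sh"

# Hypothetical helper: append a line to a file only if that exact line is missing.
add_line() {
  grep -qxF "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

add_line "export HADOOP_HOME=/opt/module/hadoop-2.8.4" "$CONF_DIR/hive-env.sh"
add_line "export HIVE_CONF_DIR=$HIVE_DIR/conf" "$CONF_DIR/hive-env.sh"

cat "$CONF_DIR/hive-env.sh"
```

Because add_line checks for an exact match first, running the script twice leaves a single copy of each export in the file.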

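(6) Change the log directory in hive-log4j2.properties:
property.hive.log.dir = /opt/module/hive-2.3.4/logs

Note: by default Hive writes its log to /tmp/<user name>/hive.log, where <user name> is the current user (for example /tmp/itstar/hive.log for the user itstar).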

2) Hadoop cluster configuration

(1) HDFS and YARN must be running:

[root@bigdata01 conf]# start-dfs.sh
[root@bigdata01 conf]# start-yarn.sh

(2) Create the /tmp and /user/hive/warehouse directories on HDFS and make them group-writable:

[root@bigdata01 hadoop-2.8.4]# bin/hadoop fs -mkdir -p /tmp
[root@bigdata01 hadoop-2.8.4]# bin/hadoop fs -mkdir -p /user/hive/warehouse
[root@bigdata01 hadoop-2.8.4]# bin/hadoop fs -chmod g+w /tmp
[root@bigdata01 hadoop-2.8.4]# bin/hadoop fs -chmod g+w /user/hive/warehouse

3) Store the Hive metadata in MySQL

2.5.1 Copy the driver

Upload mysql-connector-java-5.1.27-bin.jar to /opt/module/hive-2.3.4/lib/

2.5.2 Configure the Metastore to use MySQL
1) Create hive-site.xml in /opt/module/hive-2.3.4/conf:

[root@bigdata03 conf]# vi hive-site.xml

2) Following the official documentation, copy the parameters below into hive-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://bigdata03:3306/metastore03?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>root</value>
          <description>username to use against metastore database</description>
        </property>

        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>root</value>
          <description>password to use against metastore database</description>
        </property>        
</configuration>

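3) Run ./schematool -dbType mysql -initSchema in Hive's bin directory to initialize the metastore schema. On success, the metastore schema is created in MySQL.
[Figure: schematool output after a successful initialization]
4) If Hive then fails at startup with the exception below, the metastore schema has most likely not been initialized or the metastore database cannot be reached. Check the MySQL connection settings and rerun the initialization in step 3:
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient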
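
When this exception appears, two quick checks narrow down the cause: whether the MySQL JDBC driver is in Hive's lib directory, and whether schematool can read the schema version from the metastore database. The sketch below is an illustration rather than the author's procedure. The has_mysql_driver helper is a hypothetical name, and HIVE_HOME falls back to the install path used in this article.

```shell
#!/usr/bin/env bash
# Assumption: HIVE_HOME falls back to the install path used in this article.
HIVE_HOME="${HIVE_HOME:-/opt/module/hive-2.3.4}"

# Hypothetical helper: succeeds when at least one MySQL connector jar
# exists in the given lib directory.
has_mysql_driver() {
  ls "$1"/mysql-connector-java-*.jar >/dev/null 2>&1
}

if has_mysql_driver "$HIVE_HOME/lib"; then
  echo "MySQL JDBC driver found in $HIVE_HOME/lib"
else
  echo "MySQL JDBC driver missing from $HIVE_HOME/lib"
fi

# schematool -info reports the schema version stored in the metastore database.
# An error here usually means the schema was never initialized or MySQL is unreachable.
if [ -x "$HIVE_HOME/bin/schematool" ]; then
  "$HIVE_HOME/bin/schematool" -dbType mysql -info
else
  echo "schematool not found under $HIVE_HOME/bin; skipping schema check"
fi
```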

4) Configure environment variables

vi /etc/profile

Add the HIVE_HOME environment variable:

#HIVE_HOME
export HIVE_HOME=/opt/module/hive-2.3.4
export PATH=$PATH:$HIVE_HOME/bin

source /etc/profile
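
A quick way to confirm that the variables took effect is to test whether the Hive bin directory is on PATH. The snippet below is a small sketch; path_contains is a hypothetical helper name, and the guard around the export is an optional refinement that keeps PATH from growing each time /etc/profile is sourced.

```shell
#!/usr/bin/env bash
export HIVE_HOME=/opt/module/hive-2.3.4

# Hypothetical helper: succeeds when directory $1 is one of the
# colon-separated entries of $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Append the bin directory only when it is missing.
if ! path_contains "$HIVE_HOME/bin" "$PATH"; then
  export PATH="$PATH:$HIVE_HOME/bin"
fi

echo "HIVE_HOME=$HIVE_HOME"
path_contains "$HIVE_HOME/bin" "$PATH" && echo "PATH contains $HIVE_HOME/bin"
```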

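5) Other configuration

2.5.1 Configuring the Hive warehouse location

1) The default warehouse location is /user/hive/warehouse on HDFS.
2) No folder is created under the warehouse directory for the default database. A table that belongs to the default database gets its folder directly under the warehouse directory.
3) To change the default warehouse location, copy the following property from hive-default.xml.template into hive-site.xml: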

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive-2.3.4/warehouse</value>
<description>location of default database for the warehouse</description>
</property>

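Create the directory if it does not exist and give the group write permission:
bin/hdfs dfs -mkdir -p /user/hive-2.3.4/warehouse
bin/hdfs dfs -chmod g+w /user/hive-2.3.4/warehouse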

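2.5.2 Configuring query output display

1) Add the following properties to hive-site.xml to show the current database in the prompt and print column headers in query results: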

<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

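2) Restart Hive and compare the behavior before and after the change.

(1) Before: the prompt is hive> and query results are printed without column headers.
(2) After: the prompt shows the current database, for example hive (default)>, and query results start with a header row of column names.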

3 Basic Hive Operations

3.1 Common Hive Interactive Commands

1) hive -help

[root@bigdata03 conf]# hive -help
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

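2) -e: run a SQL statement without entering the Hive interactive shell. (Several transcripts in this section were captured on a Hive 1.2.1 installation, hence the hive-1.2.1 paths in their output.)

[root@bigdata03 conf]# hive -e "select id from student;"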

Logging initialized using configuration in file:/opt/module/hive-1.2.1/conf/hive-log4j.properties
OK
id
1001
1002
1003
Time taken: 1.608 seconds, Fetched: 3 row(s)
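
The -e option combines well with two other flags from the help output above: -S (silent mode) and --hivevar (variable substitution). The sketch below is illustrative only. The run_hive wrapper, DRY_RUN, and TABLE are assumed names, not part of Hive or of the original article, and by default the wrapper only prints the command so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Assumptions: the student table from this article exists; DRY_RUN=1 (the default
# here) only prints the command instead of calling hive.
DRY_RUN="${DRY_RUN:-1}"
TABLE="student"

# Single quotes keep ${hivevar:tbl} away from the shell so that Hive,
# not bash, performs the substitution.
QUERY='select id from ${hivevar:tbl};'

# Hypothetical wrapper: print the hive command in dry-run mode, run it otherwise.
run_hive() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'hive'
    printf ' %s' "$@"
    printf '\n'
  else
    hive "$@"
  fi
}

# -S suppresses the log lines, --hivevar defines tbl, -e runs the quoted query.
run_hive -S --hivevar "tbl=$TABLE" -e "$QUERY"
```

Setting DRY_RUN=0 in the environment makes the same script call hive for real.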

3) -f: run the SQL statements in a script file
(1) Create hivef.sql (here in /opt/module/hive-1.2.1) containing a valid SQL statement:

[root@bigdata01 hive-1.2.1]# echo "select * from student;" > hivef.sql

(2) Run the SQL statements in the file:

[root@bigdata01 hive-1.2.1]# bin/hive -f /opt/module/hive-1.2.1/hivef.sql

Logging initialized using configuration in file:/opt/module/hive-1.2.1/conf/hive-log4j.properties
OK
student.id	student.name
1001	zhangshan
1002	lishi
1003	zhaoliu
Time taken: 1.447 seconds, Fetched: 3 row(s)

(3) Run the SQL statements in the file and write the result to a file. Only the query result is redirected; log lines such as OK and Time taken go to stderr and still appear on the console. If the script path is wrong, Hive reports: Could not open input file for reading. (File file:/opt/module/datas/hivef.sql does not exist)

[root@bigdata01 hive-1.2.1]# bin/hive -f /opt/module/hive-1.2.1/hivef.sql > ./hive_result.txt

Logging initialized using configuration in file:/opt/module/hive-1.2.1/conf/hive-log4j.properties
OK
Time taken: 1.194 seconds, Fetched: 3 row(s)
[root@bigdata01 hive-1.2.1]# ls
bin  conf  examples  hcatalog  hivef.sql  hive_result.txt  lib  LICENSE  logs  NOTICE  README.txt  RELEASE_NOTES.txt  scripts
[root@bigdata01 hive-1.2.1]# cat hive_result.txt
student.id	student.name
1001	zhangshan
1002	lishi
1003	zhaoliu

4) Browse the HDFS file system from the Hive CLI

hive (default)> dfs -ls /;
Found 8 items
drwxr-xr-x   - root supergroup          0 2019-11-30 09:02 /Practice
-rw-r--r--   3 root supergroup        159 2019-11-30 09:06 /README.html
drwxr-xr-x   - root supergroup          0 2019-12-12 08:55 /Word
-rw-r--r--   3 root supergroup        136 2019-12-12 08:51 /Word.txt
-rw-r--r--   3 root supergroup       1908 2019-11-30 10:10 /dependency.txt
-rw-r--r--   3 root supergroup  246123562 2019-12-01 06:23 /hadoop-2.8.4.tar.gz
drwx-w----   - root supergroup          0 2021-03-16 18:20 /tmp
drwxr-xr-x   - root supergroup          0 2021-04-05 02:51 /user

5) Browse the local Linux file system from the Hive CLI

hive (default)> ! ls /opt/module/;
hadoop-2.8.4
hive-2.3.4
jdk1.8.0_144
mysql
zookeeper-3.4.10

6) View the history of commands entered in Hive (stored in .hivehistory in the current user's home directory)

[root@bigdata01 ~]# cat /root/.hivehistory

3.2 Hive Operations and DDL

(1) Start Hive

[root@bigdata03 conf]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/module/jdk1.8.0_144/bin:/opt/module/hadoop-2.8.4/bin:/opt/module/hadoop-2.8.4/sbin:/opt/module/zookeeper-3.4.10/bin:/opt/module/hive-2.3.4/bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive-2.3.4/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/opt/module/hive-2.3.4/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

(2) List the databases

hive (default)> show databases;
OK
database_name
db_hive
default
Time taken: 4.002 seconds, Fetched: 2 row(s)

(3) Switch to the default database

hive (db_hive)> use default;
OK
Time taken: 0.021 seconds

(4) List the tables in the default database

hive (default)> show tables;
OK
tab_name
student
student_partition

(5) Create a table

hive (db_hive)> create table student(id int, name string) ;
OK
Time taken: 0.615 seconds
hive (db_hive)> desc student;
OK
col_name	data_type	comment
id                  	int                 	                    
name                	string              	                    
Time taken: 0.082 seconds, Fetched: 2 row(s)

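# Create the student table with tab ('\t') as the field delimiter. Without this clause Hive uses its default delimiter (Ctrl-A), so a tab-separated file would not split into columns. If student already exists, drop it first.
hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

(6) Load the local file /opt/module/hive/mydata/student.txt into the student table.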

hive> load data local inpath '/opt/module/hive/mydata/student.txt' into table student;
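
The article does not show the contents of student.txt. The sketch below builds a small tab-separated file that matches the table definition, using the three rows that appear in the earlier query output, and checks that every line has exactly two fields. DATA_FILE is an assumed variable that defaults to a scratch path.

```shell
#!/usr/bin/env bash
set -eu
# Assumption: DATA_FILE defaults to a scratch path; the article loads
# /opt/module/hive/mydata/student.txt.
DATA_FILE="${DATA_FILE:-$(mktemp -d)/student.txt}"

# One row per line, fields separated by a single tab to match
# FIELDS TERMINATED BY '\t' in the table definition.
printf '%s\t%s\n' 1001 zhangshan 1002 lishi 1003 zhaoliu > "$DATA_FILE"

# Sanity check: count the rows and the lines that do not have exactly two fields.
rows=$(awk 'END {print NR}' "$DATA_FILE")
bad_lines=$(awk -F '\t' 'NF != 2 {n++} END {print n+0}' "$DATA_FILE")
echo "rows: $rows, malformed rows: $bad_lines"

# The statement to run inside Hive afterwards.
echo "load data local inpath '$DATA_FILE' into table student;"
```

Checking the field count before loading is worthwhile because a row with a missing or extra delimiter loads without an error and only shows up later as NULL or shifted column values.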

(7) List the tables in the database

hive (db_hive)> show tables;
OK
tab_name
student
teacher
Time taken: 0.023 seconds, Fetched: 2 row(s)

(8) View the table structure

hive (db_hive)> desc student;
OK
col_name	data_type	comment
id                  	int                 	                    
name                	string              	                    
Time taken: 0.082 seconds, Fetched: 2 row(s)

(9) Insert data into the table

hive (db_hive)> insert into student values(1001,"张三");
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210405050718_83d8968a-1256-4f10-a8ad-60cc53a5a8e8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1617608859684_0001, Tracking URL = http://bigdata01:8088/proxy/application_1617608859684_0001/
Kill Command = /opt/module/hadoop-2.8.4/bin/hadoop job  -kill job_1617608859684_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-04-05 05:07:33,238 Stage-1 map = 0%,  reduce = 0%
2021-04-05 05:07:41,779 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.93 sec
MapReduce Total cumulative CPU time: 930 msec
Ended Job = job_1617608859684_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://bigdata01:9000/user/hive-2.3.4/warehouse/db_hive.db/student/.hive-staging_hive_2021-04-05_05-07-18_214_8767971089156879512-1/-ext-10000
Loading data to table db_hive.student
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 0.93 sec   HDFS Read: 4241 HDFS Write: 83 SUCCESS
Total MapReduce CPU Time Spent: 930 msec
OK
_col0	_col1
Time taken: 25.434 seconds

(10) Query the data in the table

hive (db_hive)> select * from student;
OK
student.id	student.name
1001	张三
Time taken: 0.229 seconds, Fetched: 1 row(s)

(11) Exit Hive (either command works)
hive> quit;
hive> exit;
