Hive
Hive Overview
http://hive.apache.org/
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect users to Hive.
Why Hive exists: it lets people who are not Java programmers run MapReduce over data in HDFS by writing SQL statements.
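For instance, an aggregation that would normally require a hand-written Java MapReduce job can be expressed as a single HiveQL statement (a minimal sketch; the orders table and its dt column are hypothetical names, not part of the setup below):
-- count the rows per day in a hypothetical orders table;
-- Hive compiles this statement into a MapReduce job and runs it on the cluster
select dt, count(*) as order_cnt
from orders
group by dt;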
| Comparison | Hive | Relational database |
|---|---|---|
| Query language | HQL | SQL |
| Data storage | HDFS | Local FS or raw device |
| Execution engine | MapReduce | Executor |
| Insert | Bulk load and single-row insert | Bulk load and single-row insert |
| Update / delete | Append only, no row-level update or delete | Row-level update and delete |
| Data volume | Large | Small |
| Query latency | High | Low |
| Partitioning | Supported | Supported |
| Indexes | Supported since v0.8 | Supported |
| Data loading | Schema on read (fast) | Schema on write (slow) |
| Scalability | High | Low |
| Typical use | Queries over massive datasets | Real-time queries |
Hive Architecture
• Hive architecture
– (1) There are three main user interfaces: CLI, JDBC/ODBC, and the WUI. The CLI is the most commonly used; when the CLI starts, it also starts a Hive instance. The Client is the Hive client, which connects to a Hive Server; when starting in client mode you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
– (2) Hive stores its metadata in a database such as MySQL or Derby (the embedded database that ships with Hive). The metadata includes table names, columns and partitions with their properties, table properties (such as whether the table is external), and the directory where the table data lives.
– (3) The parser (SQL Parser), compiler (Compiler), and optimizer (Optimizer) perform lexical analysis, syntax analysis, compilation, optimization, and query-plan generation for an HQL statement. The generated plan is stored in HDFS and is later invoked by the executor (Executor) as MapReduce jobs.
– (4) Hive data is stored in HDFS, and most queries and computation are carried out by MapReduce (queries that only select all columns, such as select * from tbl, do not generate a MapReduce job).
– select id,name from psn;
• Hive architecture (continued)
– The compiler translates a Hive SQL statement into operators
– An operator is Hive's smallest processing unit
– Each operator represents either one HDFS operation or one MapReduce job
• Operator
– An Operator is a processing step defined by Hive
– Every Operator defines:
– protected List <Operator<? extends Serializable >> childOperators;
– protected List <Operator<? extends Serializable >> parentOperators;
– protected boolean done; // initialized to false
Typical operators include the select operator, table scan operator, limit operator, and file output operator.
HQL is parsed with the ANTLR lexer/parser generator.
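You can inspect the operator plan that the compiler produces for a statement with EXPLAIN (a small sketch; psn is the table created later in this document, and the exact plan output depends on your Hive version):
-- prints the stages and the operator tree (TableScan, Select, Limit, File Output, ...)
explain select id, name from psn;
-- EXPLAIN EXTENDED adds more detail about each operator
explain extended select id, name from psn;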
The Three Metastore Modes: Differences and Setup
Hive supports three modes for the metastore (metadata storage):
a) Embedded Derby mode
b) Local mode (direct database connection)
c) Remote metastore server mode
Embedded Derby mode
This mode connects to an embedded Derby database. It is generally used only for unit tests (and even there it is rarely used now), runs on a single machine, and is essentially never used in practice.
This is the simplest storage option; it only requires the following configuration in hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
Note: with the Derby storage mode, running hive creates a Derby log file and a metastore_db directory in the current working directory. The drawback of this mode is that, from a given directory, only one Hive client can use the metastore at a time; otherwise you get an error like the following:
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Local (direct MySQL) mode
Connects over the network to a database (typically MySQL); this is the most commonly used mode.
Remote metastore server mode
Used when non-Java clients need to access the metastore: a MetaStoreServer is started on the server side, and clients access the metastore through it using the Thrift protocol. This decouples clients from the backing database; switching to a different database only requires changing the MetaStoreServer configuration.
Installation
Reading the official documentation
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Building Hive from Source
Running Hive
Running Hive CLI
Running HiveServer2 and Beeline
Metadata is stored in an embedded Derby database whose disk location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml). Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server mode, see Hive Using Derby in Server Mode. To configure a database other than Derby for the Hive metastore, see Hive Metastore Administration.
We will use MySQL for the metastore, so install MySQL first.
MySQL Installation
Take snapshots of the four virtual machines first, then start the installation.
Install on node1:
Download the four MySQL RPM packages:
mysql-community-client-5.7.18-1.el6.x86_64.rpm
mysql-community-common-5.7.18-1.el6.x86_64.rpm
mysql-community-libs-5.7.18-1.el6.x86_64.rpm
mysql-community-server-5.7.18-1.el6.x86_64.rpm
1. List any installed MySQL packages: rpm -qa | grep -i mysql
2. Remove every MySQL-related package found in the previous step:
rpm -e --nodeps mysql-libs*   # the * stands for the actual package name; type the full name for your system
rpm -e mysql-server*   (if present)
rpm -e mysql-client*   (if present)
3. Check for and delete the development headers, libraries, and data left over from older MySQL versions
Commands:
rm -rf /usr/lib/mysql
rm -fr /usr/include/mysql
rm -f /etc/my.cnf
rm -fr /var/lib/mysql
rm -fr /usr/share/mysql
Note: uninstalling does not remove the data under /var/lib/mysql or the /etc/my.cnf file; delete them manually once you are sure they are no longer needed.
4. Install perl
yum install perl -y
5. Download numactl-2.0.9-2.el6.x86_64.rpm and install it
rpm -ivh numactl-2.0.9-2.el6.x86_64.rpm
6. Install MySQL (the order matters)
rpm -ivh mysql-community-common-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-client-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-server-5.7.18-1.el6.x86_64.rpm
Or run this single command, which resolves the install order automatically:
rpm -ivh mysql-community-*
7. Edit /etc/my.cnf to set the database character set:
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
character_set_server=utf8
9. If an error occurs, check the error log file specified in /etc/my.cnf.
10. Start the service: service mysqld start
11. Find the temporary random password:
/var/log/mysqld.log contains a line "A temporary password is generated for root@localhost:"; the random password follows it.
# mysql -uroot -p"yAe7QGVJ;HlR"   # log in with the temporary password
12. Change the default password:
set global validate_password_policy=0;
set global validate_password_length=6;
set password for 'root'@'localhost'=password('123456');
Done.
13. Check the character set:
mysql> show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql>
14. Grant root remote login privileges:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
FLUSH PRIVILEGES;
15. Enable MySQL to start on boot:
[root@node1 ~]# chkconfig mysqld on                      # start MySQL on boot
[root@node1 ~]# ./startdfs.sh                            # start the Hadoop cluster
[root@node3 ~]# start-yarn.sh                            # start the NodeManagers and the ResourceManager on node3
[root@node4 ~]# yarn-daemon.sh start resourcemanager     # start the ResourceManager on node4
Hive Installation: Remote Mode (metastore server and client on the same node)
Upload apache-hive-1.2.1-bin.tar.gz and mysql-connector-java-5.1.32-bin.jar to the /opt/apps directory on node2.
[root@node2 apps]# tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt
[root@node2 apps]# cd /opt/
[root@node2 opt]# ls
apache-hive-1.2.1-bin hadoop-2.6.5 zookeeper-3.4.6.tar.gz
apps zookeeper-3.4.6
[root@node2 opt]# mv apache-hive-1.2.1-bin/ hive-1.2.1
[root@node2 opt]# ls
apps hive-1.2.1 zookeeper-3.4.6.tar.gz
hadoop-2.6.5 zookeeper-3.4.6
[root@node2 opt]# vim /etc/profile
export HIVE_HOME=/opt/hive-1.2.1
export PATH=$PATH:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
[root@node2 hive-1.2.1]# source /etc/profile
[root@node2 hive-1.2.1]# cd conf/
[root@node2 conf]# ls
beeline-log4j.properties.template hive-exec-log4j.properties.template
hive-default.xml.template hive-log4j.properties.template
hive-env.sh.template ivysettings.xml
[root@node2 conf]# cp hive-default.xml.template hive-site.xml
[root@node2 conf]# vim hive-site.xml
This storage mode requires a MySQL server running on a remote host, and the metastore service is started on the Hive server.
Here a MySQL test server is used (ip 192.168.20.61); create a database named hive on it with the latin1 character set.
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.20.91:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
[root@node2 conf]# vim hive-site.xml
Inside the file, move the cursor to the first line after the opening <configuration> tag (the line beginning with <!-- WARNING!!! This file is auto generated for ...), then run the vim command
:.,$-1d
which deletes everything from the cursor line down to the line just before the last one. Paste in the property settings listed above, then save and quit.
Then start hive:
[root@node2 conf]# hive
It throws an error:
The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH   (the MySQL driver class cannot be found)
[root@node2 conf]# cp /opt/apps/mysql-connector-java-5.1.32-bin.jar /opt/hive-1.2.1/lib/
[root@node2 conf]# hive
It throws another error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
The cause is a version conflict between jline jars.
[root@node2 lib]# pwd
/opt/hive-1.2.1/lib
[root@node2 lib]# ls jline*
jline-2.12.jar
[root@node2 lib]# cd /opt/hadoop-2.6.5/share/hadoop/yarn/lib
[root@node2 lib]# pwd
/opt/hadoop-2.6.5/share/hadoop/yarn/lib
[root@node2 lib]# ls jline*
jline-0.9.94.jar
Delete the lower-version jline jar:
[root@node2 lib]# rm -f jline-0.9.94.jar
[root@node2 lib]# ls jline*
ls: cannot access jline*: No such file or directory
[root@node2 lib]# cp /opt/hive-1.2.1/lib/jline-2.12.jar ./
[root@node2 lib]# hive
Logging initialized using configuration in jar:file:/opt/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
This shows the installation and configuration are working.
hive> show tables;
OK
Time taken: 1.015 seconds
hive> create table test(id int ,age int);
OK
Time taken: 0.343 seconds
hive> show tables;
OK
test
Time taken: 0.038 seconds, Fetched: 1 row(s)
Visit the active NameNode web UI:
192.168.20.51:50070
or 192.168.20.52:50070
The table directory contains no data yet.
Run an insert statement:
hive> insert into test values(1,1);
http://node3:8088/cluster
If Hadoop is configured correctly, you can watch the MapReduce job run here.
Connect to MySQL on node1 and inspect the metastore:
[root@node1 ~]# mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
......
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| test |
+--------------------+
4 rows in set (0.00 sec)
mysql> use hive
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| DBS |           -- stores database (schema) information
| COLUMNS_V2 |    -- stores column information
....
| TBLS |          -- stores table information
| VERSION |
+---------------------------+
29 rows in set (0.00 sec)
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
| 1 | 1580565872 | 1 | 0 | root | 0 | 1 | test | MANAGED_TABLE | NULL | NULL |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
1 row in set (0.01 sec)
mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
| 1 | NULL | age | int | 1 |
| 1 | NULL | id | int | 0 |
+-------+---------+-------------+-----------+-------------+
2 rows in set (0.00 sec)
mysql>
Note: here the Hive server and client are on the same machine; they can also be split onto separate machines.
Question: how does Hive connect to HDFS?
Hive Installation: Remote Mode (metastore server and client on separate nodes)
Question: how does Hive connect to HDFS?
Answer: through environment variables (the Hadoop configuration is picked up from the environment).
The server side only needs the five parameters configured above.
Plan:
node3: server side
node4: client side
Installation and configuration steps:
1. Copy /opt/hive-1.2.1 from node2 to node3 and node4
[root@node2 opt]# scp -r /opt/hive-1.2.1/ node3:/opt/
[root@node2 opt]# scp -r /opt/hive-1.2.1/ node4:/opt/
2. Configure the Hive environment variables on node3:
[root@node3 opt]# vim /etc/profile
export HIVE_HOME=/opt/hive-1.2.1
export PATH=$PATH:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
[root@node3 opt]# source /etc/profile
3. Configure the Hive environment variables on node4:
[root@node4 opt]# vim /etc/profile
export HIVE_HOME=/opt/hive-1.2.1
export PATH=$PATH:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
[root@node4 opt]# source /etc/profile
4. Edit hive-site.xml on node3 (the properties below are the content to add):
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.20.61:3306/hive_remote?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
5. Edit hive-site.xml on node4
[root@node4 ~]# cd /opt/hive-1.2.1/conf/
[root@node4 conf]# vim hive-site.xml
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://192.168.20.93:9083</value>
</property>
</configuration>
6. Start the Hive metastore server on node3
Check which ports are already in use:
[root@node3 conf]# ss -nal     # or: netstat -nlpt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 :::22 :::*
LISTEN 0 128 *:22 *:*
LISTEN 0 100 ::1:25 :::*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:50010 *:*
LISTEN 0 50 :::58170 :::*
LISTEN 0 128 *:50075 *:*
LISTEN 0 128 *:8480 *:*
LISTEN 0 50 ::ffff:192.168.20.53:2881 :::*
LISTEN 0 128 *:50020 *:*
LISTEN 0 128 *:8485 *:*
LISTEN 0 50 :::2181 :::*
LISTEN 0 50 ::ffff:192.168.20.53:3881 :::*
Port 9083 is not yet in use. Now run the following command:
[root@node3 conf]# hive --service metastore
Starting Hive Metastore Server
This command blocks; to check the result, open another terminal session:
[root@node3 ~]# ss -nal
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 :::22 :::*
LISTEN 0 128 *:22 *:*
LISTEN 0 100 ::1:25 :::*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:50010 *:*
LISTEN 0 50 :::58170 :::*
LISTEN 0 50 *:9083 *:*
LISTEN 0 128 *:50075 *:*
LISTEN 0 128 *:8480 *:*
LISTEN 0 50 ::ffff:192.168.20.53:2881 :::*
LISTEN 0 128 *:50020 *:*
LISTEN 0 128 *:8485 *:*
LISTEN 0 50 :::2181 :::*
LISTEN 0 50 ::ffff:192.168.20.53:3881 :::*
Port 9083 is now in use (the metastore server is listening).
7. On node4, the client simply runs the hive command
[root@node4 conf]# hive
Logging initialized using configuration in jar:file:/opt/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Delete the conflicting jline jar:
[root@node4 conf]# cd /opt/hadoop-2.6.5/share/hadoop/yarn/lib/
[root@node4 lib]# pwd
/opt/hadoop-2.6.5/share/hadoop/yarn/lib
[root@node4 lib]# rm -f jline-0.9.94.jar
[root@node4 lib]# cp /opt/hive-1.2.1/lib/jline-2.12.jar ./
[root@node4 lib]# hive
Logging initialized using configuration in jar:file:/opt/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
8. Verification
A. On node1, check whether a new database named hive_remote now exists
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| hive_remote |
| mysql |
| test |
+--------------------+
5 rows in set (0.00 sec)
B. Create a table named psn on node4
Before creating the table (the HDFS warehouse directory has no psn subdirectory yet):
hive> create table psn(id int,age int);
After creating the table (a psn directory appears under the warehouse path):
C. Insert data on node4
hive> insert into psn values(1,1);
Query ID = root_20200201233940_fdd57c2a-33d6-49d5-98a8-1895064944f2
Total jobs = 3
Launching Job 1 out of 3
[root@node3 opt]# hdfs dfs -cat /user/hive_remote/warehouse/psn/*
11
[root@node3 opt]# hdfs dfs -get /user/hive_remote/warehouse/psn/*
[root@node3 opt]# ls
000000_0 hadoop-2.6.5 zookeeper-3.4.6
apps hive-1.2.1 zookeeper-3.4.6.tar.gz
[root@node3 opt]# cat -A 000000_0
1^A1$
^A is the default field delimiter. How can it be changed to a custom one? A custom delimiter can be specified when the table is created, which the following sections demonstrate.
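As a preview of the answer: the delimiters are declared in the ROW FORMAT clause when the table is created, exactly as the person table later in this document does (a sketch; the table name person_preview is made up for illustration):
-- replaces the default ^A / ^B / ^C delimiters with ',', '-' and ':'
create table person_preview (
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';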
Hive SQL
Database operations
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Hive Data Types
data_type
: primitive_type
| array_type
| map_type
| struct_type      -- struct type (like a class that has only fields and no methods)
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
Table Operations
Table basics
Full CREATE TABLE DDL syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| JSONFILE -- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
: [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
: [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ]
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]   -- partitioning
[CLUSTERED BY (col_name, col_name, ...)   -- bucketing
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]   -- e.g. row format delimited fields terminated by '<delimiter>'
[STORED AS file_format]
[LOCATION hdfs_path]
Primitive data types
Complex data types
- Accessing elements of an array field:
  - elements are accessed by index, and the index starts at 0
  - for example, array[0] returns the first element
- Accessing elements of a map field:
  - values are accessed by key
  - for example, map['a'] returns the value stored under key a
- Accessing elements of a struct field:
  - if a column c is defined as struct{a int; b string}, the values of a and b are read as c.a and c.b
  - this type can be thought of as an object with fields
create table complex(
col1 array<int>,
col2 map<string,int>,
col3 struct<a:string,b:int,c:double>
)
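A query against this table accesses each complex column exactly as described above (a sketch; it assumes some data has already been loaded into complex):
-- array element by index, map value by key, struct members by field name
select col1[0], col2['a'], col3.a, col3.b
from complex;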
Hive CREATE TABLE example
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
Viewing a table description
DESC [EXTENDED|FORMATTED] table_name
hive> desc formatted psn;
OK
# col_name data_type comment
id int
age int
# Detailed Table Information
Database: default
Owner: root
CreateTime: Sat Feb 01 23:37:46 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/user/hive_remote/warehouse/psn
Table Type: MANAGED_TABLE
Table Parameters:    (table parameters)
COLUMN_STATS_ACCURATE true
numFiles 1    (number of files)
numRows 1    (number of rows)
rawDataSize 3    (raw data size)
totalSize 4    (total size)
transient_lastDdlTime 1580571583
# Storage Information    (storage information)
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:    (storage descriptor parameters)
serialization.format 1
Time taken: 0.144 seconds, Fetched: 32 row(s)
Drop the table:
hive> drop table psn;
hive> desc formatted person;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Detailed Table Information
Database: default
Owner: root
CreateTime: Sun Feb 02 23:02:29 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/user/hive_remote/warehouse/person
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1580655749
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim -
field.delim ,
line.delim \n
mapkey.delim :
serialization.format ,
Time taken: 0.193 seconds, Fetched: 34 row(s)
Data:
id, name, hobbies, address
1,小明1,lol-book-movie,beijing:xisanqi-shanghai:pudong
2,小明2,lol-book-movie,beijing:xisanqi-shanghai:pudong
3,小明3,lol-book-movie,beijing:xisanqi-shanghai:pudong
4,小明4,lol-book-movie,beijing:xisanqi-shanghai:pudong
5,小明5,lol-movie,beijing:xisanqi-shanghai:pudong
6,小明6,lol-book-movie,beijing:xisanqi-shanghai:pudong
7,小明7,lol-book,beijing:xisanqi-shanghai:pudong
8,小明8,lol-book,beijing:xisanqi-shanghai:pudong
9,小明9,lol-book-movie,beijing:xisanqi-shanghai:pudong
Loading files into tables
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Open another terminal connection to node4 and run:
[root@node4 ~]# mkdir data
[root@node4 ~]# cd data
[root@node4 data]# vim person01.txt
1,小明1,lol-book-movie,beijing:xisanqi-shanghai:pudong
2,小明2,lol-book-movie,beijing:xisanqi-shanghai:pudong
3,小明3,lol-book-movie,beijing:xisanqi-shanghai:pudong
4,小明4,lol-book-movie,beijing:xisanqi-shanghai:pudong
5,小明5,lol-movie,beijing:xisanqi-shanghai:pudong
6,小明6,lol-book-movie,beijing:xisanqi-shanghai:pudong
7,小明7,lol-book,beijing:xisanqi-shanghai:pudong
8,小明8,lol-book,beijing:xisanqi-shanghai:pudong
9,小明9,lol-book-movie,beijing:xisanqi-shanghai:pudong
In the node4 window that has the Hive client, run:
hive> load data local inpath '/root/data/person01.txt' into table person;
Loading data to table default.person
Table default.person stats: [numFiles=1, totalSize=496]
OK
Time taken: 0.439 seconds
hive> select * from person;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"}
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"}
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
Time taken: 0.086 seconds, Fetched: 9 row(s)
select id,name,likes[0],address['beijing'] from person;
+-----+-------+------+----------+--+
| id | name | _c2 | _c3 |
+-----+-------+------+----------+--+
| 1 | 小明1 | lol | xisanqi |
| 2 | 小明2 | lol | xisanqi |
| 3 | 小明3 | lol | xisanqi |
Demonstration of the default delimiters:
node4 hive>
create table person2
(
id int,
name string,
likes array<string>,
address map<string,string>
);
[root@node4 data]#
[root@node4 data]# cp person01.txt person02.txt
[root@node4 data]# vim person02.txt
1^A小明1^Alol^Bbook^Bmovie^Abeijing^Cxisanqi^Bshanghai^Cpudong
Note: to type ^A in vim, press Ctrl+V then Ctrl+A; ^B and ^C are entered the same way.
node4 hive>
hive> load data local inpath '/root/data/person02.txt' into table person2;
Loading data to table default.person2
Table default.person2 stats: [numFiles=1, totalSize=57]
OK
Time taken: 0.604 seconds
hive> select * from person2;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
node4 hive>
create table person3
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003';
hive> desc formatted person3;
......
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim \u0002
field.delim \u0001
mapkey.delim \u0003
serialization.format \u0001
Time taken: 0.087 seconds, Fetched: 32 row(s)
hive> load data local inpath '/root/data/person02.txt' into table person3;
Loading data to table default.person3
Table default.person3 stats: [numFiles=1, totalSize=57]
OK
Time taken: 0.532 seconds
hive> select * from person3;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
Time taken: 0.102 seconds, Fetched: 1 row(s)
hive>
Internal (Managed) Tables vs. External Tables
(1) Hive internal (managed) tables
CREATE TABLE [IF NOT EXISTS] table_name
When the table is dropped, both the metadata and the table data are deleted.
Table Type: MANAGED_TABLE    (internal / managed table)
(2) Hive external tables
CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
Dropping an external table only deletes the metadata in the metastore; the table data in HDFS is not deleted.
Table Type: EXTERNAL_TABLE    (external table)
First upload the person01.txt file to the /usr directory on HDFS:
[root@node4 ~]# hdfs dfs -mkdir /usr
[root@node4 ~]# hdfs dfs -put /root/data/person01.txt /usr
Create the external table:
create external table person4
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';
hive> desc formatted person4;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Detailed Table Information
Database: default
Owner: root
CreateTime: Mon Feb 03 00:08:07 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/usr
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1580659687
hive> select * from person4;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"}
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"}
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
Time taken: 0.108 seconds, Fetched: 9 row(s)
Differences between internal and external tables:
1. At creation time, an internal table is stored directly under the default HDFS warehouse path, while an external table requires you to specify its own location.
2. When dropped, an internal table's data and metadata are both deleted, while dropping an external table removes only the metadata and leaves the data in place.
hive> drop table person4;
OK
Time taken: 0.148 seconds
Check HDFS: the corresponding directory and files still exist.
hive> drop table person;
OK
Time taken: 0.226 seconds
Check HDFS: the corresponding directory and files are gone.
Note: Hive validates data on read (schema on read), which decouples loading from validation and makes data loading fast.
Relational databases validate data on write (schema on write).
Demonstration:
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
hive> create table person(
> id int,
> name string,
> likes array<string>,
> address map<string,string>
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY '-'
> MAP KEYS TERMINATED BY ':'
> LINES TERMINATED BY '\n';
OK
Time taken: 0.152 seconds
hive> load data local inpath '/root/install.log' into table person;
Loading data to table default.person
Table default.person stats: [numFiles=1, totalSize=8815]
OK
Time taken: 0.627 seconds
hive> select * from person;
OK
NULL NULL NULL NULL
NULL key ID c105b9de: NOKEY NULL NULL
NULL NULL NULL NULL
NULL NULL NULL NULL
Table Partitions
Hive partitioning (partition)
The partition columns must be declared when the table is defined.
a. Single-partition table:
create table day_table (id int, content string) partitioned by (dt string);
A single-partition table partitioned by day; the table exposes three columns: id, content, and dt.
Data is separated into directories by the value of dt.
b. Two-level-partition table (e.g. dt=20200905/hour=11):
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
A two-level-partition table partitioned by day and hour; the dt and hour columns are added to the table.
Data is first separated into directories by dt, then into subdirectories by hour.
In theory there can be any number of partition levels, but one and two levels are the most common.
Note where the partition definition is placed relative to the row format clause.
Single partition
create table person5
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
hive> create table person5
> (
> id int,
> name string,
> likes array<string>,
> address map<string,string>
> )
> partitioned by(age int)
> row format delimited
> fields terminated by ','
> collection items terminated by '-'
> map keys terminated by ':';
OK
Time taken: 1.007 seconds
hive> desc formatted person5;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Partition Information
# col_name data_type comment
age int
Note: the partition column must not also appear in the table's regular column list.
Load data:
hive> load data local inpath '/root/data/person01.txt' into table person5 partition(age=10);
Loading data to table default.person5 partition (age=10)
Partition default.person5{age=10} stats: [numFiles=1, numRows=0, totalSize=496, rawDataSize=0]
OK
Time taken: 0.892 seconds
hive> select * from person5;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
Time taken: 0.305 seconds, Fetched: 9 row(s)
hive> load data local inpath '/root/data/person01.txt' into table person5 partition(age=20);
Loading data to table default.person5 partition (age=20)
Partition default.person5{age=20} stats: [numFiles=1, numRows=0, totalSize=496, rawDataSize=0]
OK
Time taken: 0.609 seconds
hive> select * from person5;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20
Two-level partition
create table person6
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
hive> create table person6
> (
> id int,
> name string,
> likes array<string>,
> address map<string,string>
> )
> partitioned by(age int,sex string)
> row format delimited
> fields terminated by ','
> collection items terminated by '-'
> map keys terminated by ':';
OK
Time taken: 0.09 seconds
hive> desc formatted person6;
OK
# col_name data_type comment
......
# Partition Information
# col_name data_type comment
age int
sex string
hive> load data local inpath '/root/data/person01.txt' into table person6 partition(age=20);
FAILED: SemanticException [Error 10006]: Line 1:78 Partition not found '20'
hive> load data local inpath '/root/data/person01.txt' into table person6 partition(age=20,sex='man');
Loading data to table default.person6 partition (age=20, sex=man)
Partition default.person6{age=20, sex=man} stats: [numFiles=1, numRows=0, totalSize=496, rawDataSize=0]
OK
Time taken: 0.581 seconds
hive> select * from person6;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
Time taken: 0.09 seconds, Fetched: 9 row(s)
hive> load data local inpath '/root/data/person01.txt' into table person6 partition(sex='man',age=10);
Loading data to table default.person6 partition (age=10, sex=man)
Partition default.person6{age=10, sex=man} stats: [numFiles=1, numRows=0, totalSize=496, rawDataSize=0]
OK
Time taken: 0.704 seconds
hive> select * from person6;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 10 man
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
2 小明2 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
3 小明3 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
4 小明4 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
5 小明5 ["lol","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
6 小明6 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
7 小明7 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
8 小明8 ["lol","book"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
9 小明9 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"} 20 man
Time taken: 0.122 seconds, Fetched: 18 row(s)
When loading data, the partition columns can be listed in either order, but their names must be spelled correctly.
hive> load data local inpath '/root/data/person01.txt' into table person6 partition(sex='man',age=10);
This command does two things (see the check below):
- it creates the corresponding directory on HDFS;
- it uploads the data file into that directory.
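You can confirm the partition directories that the loads created by listing the table's partitions (standard HiveQL; the exact output depends on which loads you ran):
-- prints one line per partition, e.g. age=10/sex=man and age=20/sex=man
show partitions person6;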
Adding partitions to an existing table
(The table must already exist; you are adding partition values to it. Note that you can only add values for the partition columns that were defined when the table was created.)
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Example:
hive> ALTER TABLE person6 ADD PARTITION (age=30);
FAILED: ValidationFailureSemanticException partition spec {age=30} doesn't contain all (2) partition columns
hive> ALTER TABLE person6 ADD PARTITION (age=30,sex='man');
OK
Time taken: 0.309 second
The table was defined with two partition columns, so when adding a partition you cannot give just one value; both partition values must be specified.
Adding a partition only creates the directory; it contains no data. If the partition being added already exists, an AlreadyExistsException is thrown.
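To make such a script re-runnable without hitting the AlreadyExistsException, add IF NOT EXISTS, which is part of the ADD PARTITION syntax shown above (a sketch):
-- silently skips the partition if it is already registered in the metastore
alter table person6 add if not exists partition (age=30, sex='man');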
Dropping partitions
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
For a two-level-partition table, you can drop a partition by specifying both partition values:
hive> alter table person6 drop partition(age=30,sex='man');
Dropped the partition age=30/sex=man
OK
Time taken: 0.85 seconds
You can also drop by specifying only one partition value, which drops every partition that matches it:
hive> alter table person6 drop partition(sex='man');
Dropped the partition age=10/sex=man
Dropped the partition age=20/sex=man
OK
Time taken: 0.178 seconds
Repairing partitions
First clean out everything under the /usr directory on HDFS:
[root@node4 ~]# hdfs dfs -ls /usr
Found 1 items
-rw-r--r-- 2 root supergroup 496 2020-02-03 10:39 /usr/person01.txt
[root@node4 ~]# hdfs dfs -rm /usr/person01.txt
Deleted /usr/person01.txt
Create the directory /usr/age=10:
[root@node4 ~]# hdfs dfs -mkdir /usr/age=10
Upload person01.txt into /usr/age=10:
[root@node4 ~]# hdfs dfs -put /root/data/person01.txt /usr/age=10
Create the external table person7:
create external table person7
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';
After the table is created, the partition information is not automatically synchronized into the metastore in MySQL.
Recover Partitions (MSCK REPAIR TABLE)
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:
Syntax:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
hive> msck repair table person7;
OK
Partitions not in metastore: person7:age=10
Repair: Added partition to metastore person7:age=10
Time taken: 0.206 seconds, Fetched: 2 row(s)
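After the repair you can verify that the metastore now knows about the partition (a sketch; the output depends on the directories actually present under /usr):
-- the age=10 partition that was created directly on HDFS should now be listed
show partitions person7;
select * from person7 where age=10;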
Creating a table from an existing table (extension)
a. Create Table Like:
CREATE TABLE empty_key_value_store LIKE key_value_store;
This only creates an empty table with the same structure; no data is copied.
hive> select * from person3;
OK
1 小明1 ["lol","book","movie"] {"beijing":"xisanqi","shanghai":"pudong"}
Time taken: 0.271 seconds, Fetched: 1 row(s)
hive> create table person8 like person3;
OK
Time taken: 0.122 seconds
hive> select * from person8;
OK
Time taken: 0.084 seconds
hive> desc formatted person8;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Detailed Table Information
Database: default
Owner: root
CreateTime: Mon Feb 03 15:28:26 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/user/hive_remote/warehouse/person8
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1580714906
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim \u0002
field.delim \u0001
mapkey.delim \u0003
serialization.format \u0001
Time taken: 0.08 seconds, Fetched: 32 row(s)
b. Create Table As Select (CTAS)
CREATE TABLE new_key_value_store
AS
SELECT columnA, columnB FROM key_value_store;
hive> create table person9 as select id,name from person3;
Query ID = root_20200203153150_4fa5b05c-dedf-4f12-9040-3b54136022b3
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1580696778004_0001, Tracking URL = http://node3:8088/proxy/application_1580696778004_0001/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job -kill job_1580696778004_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-02-03 15:31:59,674 Stage-1 map = 0%, reduce = 0%
2020-02-03 15:32:05,056 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.27 sec
MapReduce Total cumulative CPU time: 1 seconds 270 msec
Ended Job = job_1580696778004_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive_remote/warehouse/.hive-staging_hive_2020-02-03_15-31-50_028_2423222847610020074-1/-ext-10001
Moving data to: hdfs://mycluster/user/hive_remote/warehouse/person9
Table default.person9 stats: [numFiles=1, numRows=1, totalSize=10, rawDataSize=9]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.27 sec HDFS Read: 3100 HDFS Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 270 msec
OK
Time taken: 17.476 seconds
hive> select * from person9;
OK
1 小明1
Time taken: 0.075 seconds, Fetched: 1 row(s)
hive> desc formatted person9;
OK
# col_name data_type comment
id int
name string
# Detailed Table Information
Database: default
Owner: root
CreateTime: Mon Feb 03 15:32:07 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/user/hive_remote/warehouse/person9
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 1
rawDataSize 9
totalSize 10
transient_lastDdlTime 1580715127
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.09 seconds, Fetched: 32 row(s)
Deleting data
hive> delete from person;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive> truncate table person;
OK
Time taken: 0.164 seconds
hive> select * from person;
OK
Time taken: 0.069 seconds
delete goes through the transaction manager; truncate does not.
Adding data
Loading files into tables
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
copy: when loading a LOCAL file, the file is copied into the table's location.
move: when loading a file that is already on HDFS (no LOCAL keyword), the file is moved; if it were copied instead, unnecessary duplicate files would accumulate on HDFS.
[root@node4 hive]# hdfs dfs -mkdir /test
[root@node4 hive]# hdfs dfs -put person1.txt /test/
hive> LOAD DATA INPATH '/test/person1.txt' INTO TABLE person5 PARTITION (age=30);
Inserting data into Hive Tables from queries
Query Results can be inserted into tables by using the insert clause.
Syntax
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
The form above is not recommended here (each statement scans the source table again).
Use case: splitting a large table's data into several smaller tables.
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
The multi-insert form scans the source table only once and is the recommended approach.
INSERT OVERWRITE: overwrites the existing data
INSERT INTO: appends to the existing data
create table person2_1(
id int,
name string
);
create table person2_2(
id int,
likes array<string>
);
from person
insert overwrite table person2_1
select id,name
insert into person2_2
select id,likes;
Usage scenario: given a wide person table (say 50 columns), write some of its columns into person2 and other columns into person3; the source is scanned only once, reducing disk I/O.
update: mentioned only for completeness; it is essentially never used in Hive.