Table of Contents
I. Installing Hive
Hive configuration steps
- step1
- step2
- step3
- step4
- step5
Starting Hive
- Start the metastore service
- Start the hiveserver service
- Start the Hive client
- Query the metadata in the MySQL database
II. Basic Hive Usage
1. Enter the Hive console
2. Create a table
Create a new table
Load t_hive.txt into the t_hive table
3. View tables and data
View tables
Match table names with a regex
View table data
View table structure
4. Alter a table
- Add a column
- Rename a table
5. Drop a table
III. Hive Interactive Mode
IV. Importing Data
1. Load data from the local file system (LOCAL)
- Find the newly imported data in HDFS
- Load data from HDFS
V. Exporting Data
1. Copy from HDFS to another HDFS location
2. Export to the local file system through Hive
3. Check the local file system
VI. Querying with HiveQL
1. Basic queries: ordering, column aliases, nested subqueries
2. Join queries: JOIN
3. Aggregate queries 1: count, avg
4. Aggregate queries 2: count, distinct
5. Aggregate queries 3: GROUP BY, HAVING
VII. Hive Views
- Drop a view
VIII. Hive Partitioned Tables
1. Create the data
2. Create the data table
3. Create the partitioned table
4. Import the data
5. View the partitions
6. Query the data
Preface
Hive is a programming interface on top of Hadoop. Its SQL-like syntax lets data analysts get up to speed quickly, makes the Java world feel simple and lightweight, and opens Hadoop up to people beyond programmers.
Starting with Hive, analysts can work with big data too.
I. Installing Hive
System environment
With a working Hadoop environment in place, we can install Hive on the NameNode machine (c1).
For setting up Hadoop, see the series "Running Hadoop in the Cloud" and "RHadoop in Practice, Part 1: Setting Up a Hadoop Environment".
Download: apache-hive-1.2.2-bin.tar.gz
Extract it under /home/cos/ (pick whatever directory you like).
Hive configuration
1. step1
(1) Extract the package:
tar -zxvf apache-hive-1.2.2-bin.tar.gz
(2) Add the environment variables:
[user@user ~]$ vim .bashrc
Add the following:
export HIVE_HOME=<path where you extracted Hive>
export PATH=$PATH:$HIVE_HOME/bin
Before installing Hive, MySQL needs to be in place. Hive uses the embedded Derby database for its metastore by default, but because Derby is limited to a single session, we use MySQL as the metastore here.
Check whether MySQL is already installed:
netstat -tap | grep mysql
If there is output, MySQL is already present; otherwise it needs to be installed.
2. step2
Install MySQL (skip this step if MySQL is already installed):
[user@user home]# wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
[user@user home]# rpm -ivh mysql-community-release-el7-5.noarch.rpm
[user@user home]# yum install mysql-community-server
Loaded plugins: fastestmirror, langpacks
base | 3.6 kB 00:00
extras | 3.4 kB 00:00
mysql-connectors-community | 2.5 kB 00:00
mysql-tools-community | 2.5 kB 00:00
mysql56-community | 2.5 kB 00:00
updates | 3.4 kB 00:00
(1/4): mysql-connectors-community/x86_64/primary_db | 14 kB 00:03
(2/4): mysql-tools-community/x86_64/primary_db | 33 kB 00:03
(3/4): mysql56-community/x86_64/primary_db | 168 kB 00:04
(4/4): updates/7/x86_64/primary_db | 4.8 MB 00:14
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* extras: mirrors.aliyun.com
* updates: mirrors.aliyun.com
Resolving dependencies
--> Running transaction check
......
......
Verifying : mysql-community-libs-5.6.36-2.el7.x86_64 11/11
Installed:
mysql-community-server.x86_64 0:5.6.36-2.el7
Installed as dependencies:
mysql-community-client.x86_64 0:5.6.36-2.el7
mysql-community-common.x86_64 0:5.6.36-2.el7
mysql-community-libs.x86_64 0:5.6.36-2.el7
perl-Compress-Raw-Bzip2.x86_64 0:2.061-3.el7
perl-Compress-Raw-Zlib.x86_64 1:2.061-4.el7
perl-DBI.x86_64 0:1.627-4.el7
perl-Data-Dumper.x86_64 0:2.145-3.el7
perl-IO-Compress.noarch 0:2.061-2.el7
perl-Net-Daemon.noarch 0:0.48-5.el7
perl-PlRPC.noarch 0:0.2020-14.el7
Complete!
3. step3
Start the MySQL service and log in:
[user@localhost ~]$ service mysqld start
[user@localhost ~]$ mysql -u root
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.6.36 MySQL Community Server (GPL)
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
//set a password for the root user
mysql> update user set password = password('123456') where user='root';
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4 Changed: 4 Warnings: 0
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql> exit
Bye
[user@localhost ~]$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.6.36 MySQL Community Server (GPL)
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
Create the hive user, the hive database, and grant privileges:
mysql> use mysql;
mysql> GRANT USAGE ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'hive' WITH GRANT OPTION;
mysql> grant all on hive.* to hive@'%' identified by 'hive';
mysql> grant all on hive.* to hive@'localhost' identified by 'hive';
mysql> create database hive;
mysql> flush privileges;
mysql> ALTER DATABASE hive CHARACTER SET latin1;
4. step4
Configure Hive's files:
[user@localhost conf]$ sudo cp hive-default.xml.template hive-site.xml
[sudo] password for user:
[user@localhost conf]$ ls
beeline-log4j.properties.template hive-log4j.properties.template
hive-default.xml.template hive-site.xml
hive-env.sh.template ivysettings.xml
hive-exec-log4j.properties.template
[user@localhost conf]$ sudo vim hive-site.xml
[sudo] password for user:
The main parameters to change:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive</value> <!-- hive is the name of the database you created -->
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value> <!-- the JDBC driver -->
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value> <!-- the database password -->
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value> <!-- the database user name -->
<description>Username to use against metastore database</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hdpsrc/hive/iotmp</value> <!-- a directory of your choice -->
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/hdpsrc/hive/iotmp</value> <!-- a directory of your choice -->
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hdpsrc/hive/iotmp</value> <!-- a directory of your choice -->
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/home/hdpsrc/hive/iotmp/operation_logs</value> <!-- a directory of your choice; the operation_logs folder must be created manually -->
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
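After editing, it is easy to misplace a tag or leave a value stale. As a quick sanity check, the <property> name/value pairs of a Hadoop-style config file such as hive-site.xml can be listed with a few lines of Python (a sketch; the inline XML snippet here is illustrative, not a full hive-site.xml):

```python
# Sketch: list the <property> name/value pairs of a hive-site.xml-style
# configuration file, to verify the edits above took effect.
import xml.etree.ElementTree as ET

# Illustrative snippet; in practice read the real file:
# ET.parse("hive-site.xml").getroot()
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""

def hive_properties(xml_text):
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name", "").strip(): p.findtext("value", "").strip()
            for p in root.iter("property")}

props = hive_properties(SAMPLE)
for name, value in props.items():
    print(f"{name} = {value}")
```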
[user@localhost conf]$ sudo cp hive-log4j.properties.template hive-log4j.properties
[sudo] password for user:
[user@localhost conf]$ sudo vim hive-log4j.properties
#log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
After configuring the files above, copy the mysql-connector JAR into Hive's lib directory. Do not skip this step.
5. step5
Start Hadoop:
[user@localhost hadoop-2.7.3]$ ./sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: namenode running as process 3931. Stop it first.
localhost: datanode running as process 4055. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 4255. Stop it first.
starting yarn daemons
resourcemanager running as process 4451. Stop it first.
localhost: nodemanager running as process 4566. Stop it first.
[user@localhost hadoop-2.7.3]$ jps
4451 ResourceManager
4566 NodeManager
4055 DataNode
3931 NameNode
10061 Jps
4255 SecondaryNameNode
[user@localhost apache-hive-1.2.2-bin]$ hive
hive> show databases;
OK
default
Time taken: 1.926 seconds, Fetched: 1 row(s)
Possible problems:
1. While installing MySQL: /var/run/yum.pid is locked, another program with PID 1610 is running.
Fix: rm -f /var/run/yum.pid
After deleting the file, yum works again.
2. After Hadoop is up, running hive throws a RuntimeException.
Try a different version of the mysql-connector JAR. Dragging the JAR from Windows straight into the VM can corrupt it; download it inside the VM instead.
Next, create the directories on HDFS:
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/warehouse
# if this fails, run: hadoop fs -mkdir -p /user/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp            # grant group write permission
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/warehouse # grant group write permission
Starting Hive
Start the metastore service
~ bin/hive --service metastore &
Output: Starting Hive Metastore Server...
Start the hiveserver service
~ bin/hive --service hiveserver2 &
Output: Starting Hive Thrift Server...
Start the Hive client
~ bin/hive shell
hive> show tables;
OK
Query the metadata in the MySQL database
~ mysql -u root -p
mysql> use hive;
Database changed
mysql> show tables;
+-------------------------+
| Tables_in_hive |
+-------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| IDXS |
| INDEX_PARAMS |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_PRIVS |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
+-------------------------+
23 rows in set (0.00 sec)
Hive is now installed; the rest of this guide covers how to use it.
II. Basic Hive Usage
1. Enter the Hive console
~ cd /home/apache-hive-1.2.2-bin/
~ bin/hive shell
Output: Logging initialized using configuration in...
2. Create a table
Create the data (fields separated by tabs):
sudo vim /home/hive/t_hive.txt  # create t_hive.txt wherever you like
Enter the data with a tab character between the numbers (pasting it in directly and then uploading to Hive may cause problems):
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Create a new table
hive> CREATE TABLE t_hive (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.489 seconds
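ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' just tells Hive to split each line of the data file on tab characters and cast the pieces to the declared column types. A minimal sketch of that parsing, outside Hive, for intuition only:

```python
# Sketch: how one tab-delimited line maps to the (a int, b int, c int)
# schema declared above -- split on '\t', cast each field to int.
def parse_row(line, types=(int, int, int)):
    fields = line.rstrip("\n").split("\t")
    return tuple(cast(f) for cast, f in zip(types, fields))

print(parse_row("16\t2\t3"))  # (16, 2, 3)
```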
Load t_hive.txt into the t_hive table
hive> LOAD DATA LOCAL INPATH '/home/hive/t_hive.txt' OVERWRITE INTO TABLE t_hive;
Loading data to table default.t_hive
OK
Time taken: 0.397 seconds
3. View tables and data
View tables
hive> show tables;
OK
t_hive
Time taken: 0.099 seconds
Match table names with a regex
hive> show tables '*t*';
OK
t_hive
Time taken: 0.065 seconds
View table data
hive> select * from t_hive;
OK
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Time taken: 0.264 seconds
View table structure
hive> desc t_hive;
OK
a int
b int
c int
Time taken: 0.1 seconds
4. Alter a table
Add a column
hive> ALTER TABLE t_hive ADD COLUMNS (new_col String);
OK
Time taken: 0.186 seconds
hive> desc t_hive;
OK
a int
b int
c int
new_col string
Time taken: 0.086 seconds
Rename a table
~ ALTER TABLE t_hive RENAME TO t_hadoop;
OK
Time taken: 0.45 seconds
hive> show tables;
OK
t_hadoop
Time taken: 0.07 seconds
5. Drop a table
hive> DROP TABLE t_hadoop;
OK
Time taken: 0.767 seconds
hive> show tables;
OK
Time taken: 0.064 seconds
III. Hive Interactive Mode
• quit, exit: leave the interactive shell
• reset: reset the configuration to its defaults
• set <key>=<value>: set the value of a particular variable (note: a misspelled variable name does not raise an error)
• set: print the Hive configuration variables overridden by the user
• set -v: print all Hadoop and Hive configuration variables
• add FILE[S] *, add JAR[S] *, add ARCHIVE[S] *: add one or more files, JARs, or archives to the distributed cache
• list FILE[S], list JAR[S], list ARCHIVE[S]: print the resources already added to the distributed cache
• list FILE[S] *, list JAR[S] *, list ARCHIVE[S] *: check whether the given resources have been added to the distributed cache
• delete FILE[S] *, delete JAR[S] *, delete ARCHIVE[S] *: remove the given resources from the distributed cache
• ! <command>: run a shell command from the Hive shell
• dfs <dfs command>: run a dfs command from the Hive shell
• <query string>: run a Hive query and print the result to standard output
• source FILE <filepath>: run a Hive script file inside the CLI
IV. Importing Data
Again using t_hive as the example.
Create the table structure:
hive> CREATE TABLE t_hive (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
1. Load data from the local file system (LOCAL)
hive> LOAD DATA LOCAL INPATH '/home/hive/t_hive.txt' OVERWRITE INTO TABLE t_hive ;
Loading data to table default.t_hive
OK
Time taken: 0.612 seconds
Find the newly imported data in HDFS
[user@localhost hadoop-2.7.3]$ cd bin/
[user@localhost bin]$ hadoop fs -cat /user/warehouse/t_hive/t_hive.txt
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Load data from HDFS
Create table t_hive2:
hive> CREATE TABLE t_hive2 (a int, b int, c int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load the data from HDFS:
hive> LOAD DATA INPATH '/user/warehouse/t_hive/t_hive.txt' OVERWRITE INTO TABLE t_hive2;
Loading data to table default.t_hive2
Table default.t_hive2 stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
OK
Time taken: 2.649 seconds
hive>
View the data
hive> select * from t_hive2;
OK
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Time taken: 0.287 seconds
Import data from another table
hive> INSERT OVERWRITE TABLE t_hive2 SELECT * FROM t_hive ;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0002, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0002
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:32:41,979 Stage-1 map = 0%, reduce = 0%
2013-07-16 10:32:48,034 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:49,050 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:50,068 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:51,082 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:52,093 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:53,102 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
2013-07-16 10:32:54,112 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.03 sec
MapReduce Total cumulative CPU time: 1 seconds 30 msec
Ended Job = job_201307131407_0002
Ended Job = -314818888, job is filtered out (removed at runtime).
Moving data to: hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-32-31_323_5732404975764014154/-ext-10000
Loading data to table default.t_hive2
Deleted hdfs://c1.wtmart.com:9000/user/hive/warehouse/t_hive2
Table default.t_hive2 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 56, raw_data_size: 0]
7 Rows loaded to t_hive2
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.03 sec HDFS Read: 273 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 30 msec
OK
Time taken: 23.227 seconds
hive> select * from t_hive2;
OK
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Time taken: 0.134 seconds
Create a table and populate it from another table
First drop the table:
hive> DROP TABLE t_hive;
Then create it from a query:
hive> CREATE TABLE t_hive AS SELECT * FROM t_hive2 ;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0003, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0003
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:36:48,612 Stage-1 map = 0%, reduce = 0%
2013-07-16 10:36:54,648 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:55,657 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:56,666 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:57,673 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:58,683 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
2013-07-16 10:36:59,691 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.13 sec
MapReduce Total cumulative CPU time: 1 seconds 130 msec
Ended Job = job_201307131407_0003
Ended Job = -670956236, job is filtered out (removed at runtime).
Moving data to: hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-36-39_986_1343249562812540343/-ext-10001
Moving data to: hdfs://c1.wtmart.com:9000/user/hive/warehouse/t_hive
Table default.t_hive stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 56, raw_data_size: 0]
7 Rows loaded to hdfs://c1.wtmart.com:9000/tmp/hive-cos/hive_2013-07-16_10-36-39_986_1343249562812540343/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.13 sec HDFS Read: 272 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 130 msec
OK
Time taken: 20.13 seconds
hive> select * from t_hive;
OK
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Time taken: 0.109 seconds
Copy only the table structure, without the data
hive> CREATE TABLE t_hive3 LIKE t_hive;
hive> select * from t_hive3;
OK
Time taken: 0.077 seconds
Importing data from MySQL
We will cover this when introducing Sqoop.
V. Exporting Data
Copy from HDFS to another HDFS location
~ hadoop fs -cp /user/warehouse/t_hive/
~ hadoop fs -ls /t_hive
Found 1 items
-rw-r--r-- 1 cos supergroup 56 2013-07-16 10:41 /t_hive/000000_0
~ hadoop fs -cat /t_hive/000000_0
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
Export to the local file system through Hive
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/t_hive' SELECT * FROM t_hive;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201307131407_0005, Tracking URL = http://c1.wtmart.com:50030/jobdetails.jsp?jobid=job_201307131407_0005
Kill Command = /home/cos/toolkit/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://c1.wtmart.com:9001 -kill job_201307131407_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-07-16 10:46:24,774 Stage-1 map = 0%, reduce = 0%
2013-07-16 10:46:30,823 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:31,833 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:32,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:33,856 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:34,865 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:35,873 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2013-07-16 10:46:36,884 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.87 sec
MapReduce Total cumulative CPU time: 870 msec
Ended Job = job_201307131407_0005
Copying data to local directory /tmp/t_hive
Copying data to local directory /tmp/t_hive
7 Rows loaded to /tmp/t_hive
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.87 sec HDFS Read: 271 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 870 msec
OK
Time taken: 23.369 seconds
Check the local file system
hive> ! cat /tmp/t_hive/000000_0;
16 2 3
61 12 13
41 2 31
17 21 3
71 2 31
1 12 34
11 2 34
VI. Querying with HiveQL
Note: the map/reduce log output is omitted from the listings below.
Basic queries: ordering, column aliases, nested subqueries
hive> FROM (
SELECT b,c as c2 FROM t_hive
) t
SELECT t.b, t.c2
WHERE b>2
LIMIT 2;
12 13
21 3
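For intuition, the same nested query can be mimicked in plain Python over t_hive's seven rows (a sketch only; Hive actually runs this as a MapReduce job, and the row order here simply follows the file, matching the listing above):

```python
# Sketch: FROM (SELECT b, c AS c2 FROM t_hive) t
#         SELECT t.b, t.c2 WHERE b > 2 LIMIT 2
t_hive = [(16, 2, 3), (61, 12, 13), (41, 2, 31), (17, 21, 3),
          (71, 2, 31), (1, 12, 34), (11, 2, 34)]

inner = [(b, c) for a, b, c in t_hive]              # the subquery: b, c AS c2
result = [(b, c2) for b, c2 in inner if b > 2][:2]  # WHERE b > 2 LIMIT 2
print(result)  # [(12, 13), (21, 3)]
```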
Join queries: JOIN
hive> SELECT t1.a,t1.b,t2.a,t2.b
FROM t_hive t1 JOIN t_hive2 t2 on t1.a=t2.a
WHERE t1.c>10;
1 12 1 12
11 2 11 2
41 2 41 2
61 12 61 12
71 2 71 2
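The join can likewise be emulated in Python (a sketch; at this point t_hive and t_hive2 hold identical rows, so every row joins with itself on column a):

```python
# Sketch: SELECT t1.a, t1.b, t2.a, t2.b
#         FROM t_hive t1 JOIN t_hive2 t2 ON t1.a = t2.a
#         WHERE t1.c > 10
t_hive = [(16, 2, 3), (61, 12, 13), (41, 2, 31), (17, 21, 3),
          (71, 2, 31), (1, 12, 34), (11, 2, 34)]
t_hive2 = list(t_hive)

joined = sorted(
    (a1, b1, a2, b2)
    for a1, b1, c1 in t_hive
    for a2, b2, c2 in t_hive2
    if a1 == a2 and c1 > 10        # ON t1.a = t2.a, WHERE t1.c > 10
)
for row in joined:
    print(*row)
```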
Aggregate queries 1: count, avg
hive> SELECT count(*), avg(a) FROM t_hive;
7 31.142857142857142
Aggregate queries 2: count, distinct
hive> SELECT count(DISTINCT b) FROM t_hive;
3
Aggregate queries 3: GROUP BY, HAVING
GROUP BY
hive> SELECT avg(a),b,sum(c) FROM t_hive GROUP BY b,c;
16.0 2 3
56.0 2 62
11.0 2 34
61.0 12 13
1.0 12 34
17.0 21 3
HAVING
hive> SELECT avg(a),b,sum(c) FROM t_hive GROUP BY b,c HAVING sum(c)>30;
56.0 2 62
11.0 2 34
1.0 12 34
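The GROUP BY/HAVING pair can be emulated the same way (a sketch; Hive may emit the groups in a different order than the sorted order used here):

```python
# Sketch: SELECT avg(a), b, sum(c) FROM t_hive
#         GROUP BY b, c HAVING sum(c) > 30
from collections import defaultdict

t_hive = [(16, 2, 3), (61, 12, 13), (41, 2, 31), (17, 21, 3),
          (71, 2, 31), (1, 12, 34), (11, 2, 34)]

groups = defaultdict(list)
for a, b, c in t_hive:
    groups[(b, c)].append(a)                  # GROUP BY b, c

# c is constant within a group (it is part of the key), so sum(c) = c * count
result = sorted(
    (sum(as_) / len(as_), b, c * len(as_))    # avg(a), b, sum(c)
    for (b, c), as_ in groups.items()
    if c * len(as_) > 30                      # HAVING sum(c) > 30
)
for row in result:
    print(*row)
```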
VII. Hive Views
A Hive view is the same concept as a view in any other database; again we use t_hive as the example.
hive> CREATE VIEW v_hive AS SELECT a,b FROM t_hive where c>30;
hive> select * from v_hive;
41 2
71 2
1 12
11 2
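A view is just a stored query: v_hive materializes nothing, and every SELECT against it re-runs the defining query. In Python terms (a sketch for intuition only):

```python
# Sketch: a view as a stored query -- nothing is materialized,
# the defining SELECT runs each time the view is read.
t_hive = [(16, 2, 3), (61, 12, 13), (41, 2, 31), (17, 21, 3),
          (71, 2, 31), (1, 12, 34), (11, 2, 34)]

def v_hive():
    """CREATE VIEW v_hive AS SELECT a, b FROM t_hive WHERE c > 30"""
    return [(a, b) for a, b, c in t_hive if c > 30]

print(v_hive())  # [(41, 2), (71, 2), (1, 12), (11, 2)]
```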
Drop the view
hive> DROP VIEW IF EXISTS v_hive;
OK
Time taken: 0.495 seconds
VIII. Hive Partitioned Tables
Partitioned tables are a basic database concept, and with small data volumes they are often unnecessary. Hive, however, is OLAP data-warehouse software that deals with very large volumes, so partitioning matters a great deal in this setting!
Below we define a new table structure: t_hft
Create the data
~ sudo vim /home/cos/demo/t_hft_20130627.csv
000001,092023,9.76
000002,091947,8.99
000004,092002,9.79
000005,091514,2.2
000001,092008,9.70
000001,092059,9.45
~ vi /home/cos/demo/t_hft_20130628.csv
000001,092023,9.76
000002,091947,8.99
000004,092002,9.79
000005,091514,2.2
000001,092008,9.70
000001,092059,9.45
Create the data table
DROP TABLE IF EXISTS t_hft;
CREATE TABLE t_hft(
SecurityID STRING,
tradeTime STRING,
PreClosePx DOUBLE
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Create the partitioned table
Following the business need, design the partitions by trading day (tradeDate):
DROP TABLE IF EXISTS t_hft;
CREATE TABLE t_hft(
SecurityID STRING,
tradeTime STRING,
PreClosePx DOUBLE
) PARTITIONED BY (tradeDate INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Import the data
20130627:
hive> LOAD DATA LOCAL INPATH '/home/cos/demo/t_hft_20130627.csv' OVERWRITE INTO TABLE t_hft PARTITION (tradeDate=20130627);
Copying data from file:/home/cos/demo/t_hft_20130627.csv
Copying file: file:/home/cos/demo/t_hft_20130627.csv
Loading data to table default.t_hft partition (tradedate=20130627)
20130628:
hive> LOAD DATA LOCAL INPATH '/home/cos/demo/t_hft_20130628.csv' OVERWRITE INTO TABLE t_hft PARTITION (tradeDate=20130628);
Copying data from file:/home/cos/demo/t_hft_20130628.csv
Copying file: file:/home/cos/demo/t_hft_20130628.csv
Loading data to table default.t_hft partition (tradedate=20130628)
View the partitions
hive> SHOW PARTITIONS t_hft;
tradedate=20130627
tradedate=20130628
Time taken: 0.082 seconds
Query the data
hive> select * from t_hft where securityid='000001';
000001 092023 9.76 20130627
000001 092008 9.7 20130627
000001 092059 9.45 20130627
000001 092023 9.76 20130628
000001 092008 9.7 20130628
000001 092059 9.45 20130628
hive> select * from t_hft where tradedate=20130627 and PreClosePx<9;
000002 091947 8.99 20130627
000005 091514 2.2 20130627
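Under the hood each partition is a subdirectory of the table (e.g. .../t_hft/tradedate=20130627/), so a predicate on the partition column lets Hive read only the matching directories instead of scanning the whole table. A sketch of that pruning, with illustrative paths:

```python
# Sketch: partition pruning -- each partition value maps to its own
# directory, and a filter on tradedate selects whole directories
# instead of scanning every file.  Paths here are illustrative.
partitions = {
    20130627: "/user/warehouse/t_hft/tradedate=20130627",
    20130628: "/user/warehouse/t_hft/tradedate=20130628",
}

def prune(pred):
    """Return the partition directories whose tradedate satisfies pred."""
    return [path for d, path in sorted(partitions.items()) if pred(d)]

# WHERE tradedate=20130627 touches only one directory:
print(prune(lambda d: d == 20130627))
```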
That completes basic Hive usage; these are the day-to-day operations. Later I will continue with HiveQL optimization and Hive operations.
Possible problems
1. hive --service hiveserver &
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.service.HiveServer
(The old hiveserver was removed in Hive 1.x; use hive --service hiveserver2 instead.)
2. ALTER TABLE fails. The original table structure:
hive> desc t_hive;
OK
a int
b int
c int
Time taken: 3.201 seconds, Fetched: 3 row(s)
Trying to add a new column fails:
hive> ALTER TABLE t_hive ADD COLUMNS (new_col String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table.
For direct MetaStore DB connections, we don't support retries at the client level.
Contents of hive.log:
2017-05-19 14:53:41,036 ERROR [main]: ql.Driver (SessionState.java:printError(960)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. For direct MetaStore DB connections, we don't support retries at the client level.
2017-05-19 14:53:41,036 INFO [main]: metadata.Hive (Hive.java:logDumpPhase(3294)) - Dumping metastore api call timing information for : execution phase
2017-05-19 14:53:41,037 INFO [main]: metadata.Hive (Hive.java:dumpAndClearMetaCallTiming(3285)) - Total time spent in this metastore function was greater than 1000ms : getTable_(String, String, )=2108
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) -
2017-05-19 14:53:41,037 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
Fix: replace the MySQL driver JAR in Hive's lib directory with a different version; 5.1.41 worked for me.