hive实战

1. 安装hive

下载hive,下载地址http://mirror.bjtu.edu.cn/apache//hive/,解压该文件:

xuqiang@ubuntu:~/hadoop/src/hive$ tar zxvf hive-0.7.0-bin.tar.gz

设置环境变量:

xuqiang@ubuntu:~/hadoop/src/hive$ cd hive-0.7.0-bin/

xuqiang@ubuntu:~/hadoop/src/hive/hive-0.7.0-bin$ export HIVE_HOME=`pwd`添加HIVE_HOME到环境变量PATH中:

xuqiang@ubuntu:~/hadoop/src/hive$ export PATH=$HIVE_HOME/bin:$PATH;

在运行hive之前,请确保变量HADOOP_HOME已经设置,如果没有设置,可以使用export命令设置该变量。

然后需要在hdfs上创建如下的目录来保存hive相关的数据。

xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

此时运行hive的环境已经准备好了,在命令行中键入如下命令开始运行hive:

xuqiang@ubuntu:~/hadoop/src/hive/hive-0.7.0-bin$ $HIVE_HOME/bin/hive

<2>. Hive实战

这里我们将完成这样的一个过程,首先创建一个表,从本机上加载数据到该表中,查询该表,得到我们感兴趣的数据。

首先创建表(具体语法将在下面给出):

hive> create table cite(citing INT, cited INT)

> row format delimited

> fields terminated by ','

> stored as textfile;

创建完表之后,我们可以使用show tables命令查看新建的表:

hive> show tables;

OK

cite

Time taken: 1.257 seconds

查看新建表的结构:

hive> describe cite;

OK

citing int

cited int

Time taken: 0.625 seconds

我们加载本地数据到该表中去:

hive> load data local inpath '/home/xuqiang/hadoop/data/cite75_99.txt'

> overwrite into table cite;

Copying data from file:/home/xuqiang/hadoop/data/cite75_99.txt

Copying file: file:/home/xuqiang/hadoop/data/cite75_99.txt

Loading data to table default.cite

Deleted hdfs://localhost:9000/user/hive/warehouse/cite

OK

Time taken: 89.766 seconds

查询前10行数据:

hive> select * from cite limit 10;

OK

NULL NULL

3858241 956203

3858241 1324234

3858241 3398406

3858241 3557384

3858241 3634889

3858242 1515701

3858242 3319261

3858242 3668705

3858242 3707004

Time taken: 0.778 seconds

查询该文件中存在多少条数据,这时hive将执行一个map-reduce的过程来计算该值:

hive> select count(1) from cite;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=

In order to set a constant number of reducers:

set mapred.reduce.tasks=

Starting Job = job_201106150005_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201106150005_0004

Kill Command = /home/xuqiang/hadoop/src/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201106150005_0004

2011-06-15 05:33:20,724 Stage-1 map = 0%, reduce = 0%

2011-06-15 05:33:46,325 Stage-1 map = 2%, reduce = 0%

2011-06-15 05:33:49,827 Stage-1 map = 3%, reduce = 0%

2011-06-15 05:33:53,208 Stage-1 map = 4%, reduce = 0%

2011-06-15 05:33:55,259 Stage-1 map = 7%, reduce = 0%

2011-06-15 05:34:40,450 Stage-1 map = 9%, reduce = 0%

2011-06-15 05:34:52,706 Stage-1 map = 48%, reduce = 0%

2011-06-15 05:34:57,961 Stage-1 map = 50%, reduce = 0%

2011-06-15 05:35:28,420 Stage-1 map = 50%, reduce = 17%

2011-06-15 05:35:36,653 Stage-1 map = 58%, reduce = 17%

2011-06-15 05:35:40,844 Stage-1 map = 61%, reduce = 17%

2011-06-15 05:35:49,131 Stage-1 map = 62%, reduce = 17%

2011-06-15 05:35:56,428 Stage-1 map = 67%, reduce = 17%

2011-06-15 05:36:34,380 Stage-1 map = 90%, reduce = 17%

2011-06-15 05:36:52,601 Stage-1 map = 100%, reduce = 17%

2011-06-15 05:37:10,299 Stage-1 map = 100%, reduce = 67%

2011-06-15 05:37:16,471 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201106150005_0004

OK

16522439

Time taken: 274.531 seconds

好的,最后我们删除刚刚新建的表:

hive> drop table cite;

OK

Time taken: 5.724 seconds

<3>. 存储模型

通常情况下hive将数据存储到hadoop上/user/hive/warehouse目录下,关系型数据库使用索引index去加快查询速度,而hive使用的是以恶搞所谓的partition columns的概念,例如比如说存在某一行叫做state,可以根据state中存储的数据值,将state分为50个partitions。如果存在date列的话,那么通常按照时间进行partition,hive在对分区的列上进行查询的速度会比较快,原因是hadoop在数据存储上将不同的分区存储在了不同的目录文件下。例如对于上面的列state和date,可能的存储模型如下:

clip_image002

当然每个分区内的数据文件可能还是比较大,幸好在hive中存在一个所谓的buckets的概念,buckets根据hash值将数据分割成更小的数据文件,还是上面的例子,如果使用buckets的话,可能的存储模型如下:

clip_image004

<4>. 深入hql查询语言

我们将通过实际hql语句来分析hql的语法。

clip_image006

该条语句创建表page_view,表中有5列,同时在见表语句中指出了各个列的数据类型,在hive中内建支持的数据类型如下:

clip_image008

这些类型的层次结构如下:

clip_image010

层次结构中允许从子类型隐式的转换成父类型。

接着回到上面的见表语句,在列ip中添加了注释:

Ip STRING COMMENT (“Ip address of user”)

然后建表语句中添加了表的分区:

Partitioned by (dt string, country string)

注意的是这里的两列并不是表中的列,实际上这里的分区的两列dt和country仅仅是为了分区,实际上可能并不存储这些数据。

Additionally the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. When specified in this way

然后建表语句指定buckets大小,这里是32。

最后指定数据源的文件格式。

好的现在我们给出更多的表操作的例子,由于sql极其相似,这里仅仅给出了语法,并没有给出解释。

clip_image012

描述表page_view结构。

clip_image014

改变表的名字

clip_image016

新增加一行

clip_image018

删除分区

clip_image020

删除表

clip_image022

显示所有的表

clip_image024

从文件page_view.txt文件将数据加载到表page_view中,注意的是这里的路径可以使用绝对路径或者相对路径。这里使用了local,表明数据文件的来源是本地,而不是在hdfs上。

Hql上的查询语句和sql很类似,这里没有一一给出,下面仅仅将给出hql中运算符和内建聚类函数。

运算符:

clip_image026

内建聚类函数:

clip_image028

<5>. 参考资料

http://wiki.apache.org/hadoop/Hive/GettingStarted

Dive into the world of SQL on Hadoop and get the most out of your Hive data warehouses. This book is your go-to resource for using Hive: authors Scott Shaw, Ankur Gupta, David Kjerrumgaard, and Andreas Francois Vermeulen take you through learning HiveQL, the SQL-like language specific to Hive, to analyze, export, and massage the data stored across your Hadoop environment. From deploying Hive on your hardware or virtual machine and setting up its initial configuration to learning how Hive interacts with Hadoop, MapReduce, Tez and other big data technologies, Practical Hive gives you a detailed treatment of the software. In addition, this book discusses the value of open source software, Hive performance tuning, and how to leverage semi-structured and unstructured data. What You Will Learn Install and configure Hive for new and existing datasets Perform DDL operations Execute efficient DML operations Use tables, partitions, buckets, and user-defined functions Discover performance tuning tips and Hive best practices Who This Book Is For Developers, companies, and professionals who deal with large amounts of data and could use software that can efficiently manage large volumes of input. It is assumed that readers have the ability to work with SQL. Table of Contents Chapter 1: Setting the Stage for Hive: Hadoop Chapter 2: Introducing Hive Chapter 3: Hive Architecture Chapter 4: Hive Tables DDL Chapter 5: Data Manipulation Language (DML) Chapter 6: Loading Data into Hive Chapter 7: Querying Semi-Structured Data Chapter 8: Hive Analytics Chapter 9: Performance Tuning: Hive Chapter 10: Hive Security Chapter 11: The Future of Hive Appendix A: Building a Big Data Team Appendix B: Hive Functions
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值