快学Big Data -- Hive(十四)

Hive 总结

 

官网:http://hive.apache.org/

 

概述

Hive是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张数据库表,并提供类SQL查询功能。

主要用途:用来做离线数据分析,比直接用mapreduce开发效率更高;利用HDFS作为存储系统,利用mapreduce作为运算引擎。

Hive默认使用内嵌的derby数据库存储元数据,占用资源小,但只支持单个会话,数据也不够稳定,生产环境一般改用MySQL存储元数据。

Hive 2.0 的版本号是从hive 1.2.x 直接跳过来的;从hive 2.0开始官方已不再推荐使用mapReduce作为执行引擎(更推荐Tez、Spark)。

外部表与内部表的区别

1、先来说下Hive中内部表与外部表的区别:

    Hive 创建内部表时,会将数据移动到数据仓库指向的路径;若创建外部表,仅记录数据所在的路径,不对数据的位置做任何改变。在删除表的时候,内部表的元数据和数据会被一起删除,而外部表只删除元数据,不删除数据。这样外部表相对来说更加安全些,数据组织也更加灵活,方便共享源数据。

     需要注意的是传统数据库对表数据验证是 schema on write(写时模式),而 Hive 在load时是不检查数据是否符合schema的,hive 遵循的是 schema on read(读时模式),只有在读的时候hive才检查、解析具体的数据字段、schema。读时模式的优势是load data 非常迅速,因为它不需要读取数据进行解析,仅仅进行文件的复制或者移动。写时模式的优势是提升了查询性能,因为预先解析之后可以对列建立索引,并压缩,但这样也会花费更多的加载时间。

2、内部表直接使用create语句创建;外部表创建时使用external关键字并通过location指向已有的数据路径,两者的数据最终都保存在HDFS里面。
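下面用一个简单的例子对比两者(仅为示意,表名 t_in、t_ext 与路径 /data/ext_demo 均为假设):

hive> create table t_in(id int, name string) row format delimited fields terminated by ',';

hive> create external table t_ext(id int, name string) row format delimited fields terminated by ',' location '/data/ext_demo';

hive> drop table t_in;      -- 元数据和HDFS上的数据目录一起被删除

hive> drop table t_ext;     -- 只删除元数据,/data/ext_demo 下的数据文件仍然保留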

特性

1-1)、 可扩展 

Hive可以自由的扩展集群的规模,一般情况下不需要重启服务。

1-2)、延展性 

Hive支持用户自定义函数,用户可以根据自己的需求来实现自己的函数。

 

1-3)、容错 

良好的容错性,节点出现问题SQL仍可完成执行。

 

安装Hive

1-1)、安装

[root@hadoop1 local]# tar -zxvf apache-hive-1.2.1-bin.tar.gz

[root@hadoop1 local]# mv apache-hive-1.2.1-bin   hive

1-2)、修改快捷方式

[root@hadoop1 hive]# vi /etc/profile

加入以下配置

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

[root@hadoop1 hive]# source /etc/profile

1-3)、解决JAR冲突的问题

将hive的lib目录中的jline-2.12.jar复制到/usr/local/hadoop-2.6.4/share/hadoop/yarn/lib中,替换其中旧版本的jline,否则启动hive时会报jline版本冲突的错误。

1-4)、本地储存元数据

A)、复制配置文件

将hadoop下etc/hadoop/中的以下文件复制到hive的conf下

[root@hadoop1 conf]# cp core-site.xml  hdfs-site.xml  /opt/hive-1.2/conf/

B)、运行Hive

启动查看hive查看是否报错

[root@hadoop1 bin]# ./hive

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

 

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive> show databases;

OK

default

Time taken: 2.263 seconds, Fetched: 1 row(s)

hive> create database hive;

OK

Time taken: 0.408 seconds

hive> use hive;

OK

Time taken: 0.025 seconds

hive> show tables;

OK

Time taken: 0.081 seconds

hive> create table test_hive(id int,name string);

OK

Time taken: 0.618 seconds

hive> create table test_hive(id int,name string);

OK

Time taken: 0.618 seconds

hive> insert into test_hive values(1,"xiaozhang");

Query ID = root_20160924004917_77b8b458-1552-4855-8485-853c649291e2

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2016-09-24 00:49:24,162 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local927918446_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hive.db/test_hive/.hive-staging_hive_2016-09-24_00-49-17_955_4462960730369907542-1/-ext-10000

Loading data to table hive.test_hive

Table hive.test_hive stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 12 HDFS Write: 94 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 7.403 seconds

hive> select * from test_hive;

OK

1 xiaozhang

Time taken: 0.131 seconds, Fetched: 1 row(s)

 

C)、查看默认的保存路径

查看hive数据库下的表的信息

[root@hadoop1 bin]# hadoop fs -ls /user/hive/warehouse/hive.db/

 

 

 

1-5)、MySql储存元数据

A)、前提是需要安装Hadoop集群与Mysql

B)、在/etc/profile中配置路径

C)、修改配置文件

1-1)、修改hive-site.xml

在hive的conf目录下新建hive-site.xml并加入以下内容:

 

[root@hadoop1 conf]# vi hive-site.xml

<configuration>

<!-- 配置连接URL -->

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=true</value>

<description>JDBC connect string for a JDBC metastore</description>

</property>

 

<!-- 配置连接驱动 -->

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

<description>Driver class name for a JDBC metastore</description>

</property>

 

<!-- 连接数据库的用户名 -->

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>root</value>

<description>username to use against metastore database</description>

</property>

 

<!-- 连接数据库的密码 -->

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>123456</value>

<description>password to use against metastore database</description>

</property>

</configuration>

1-2)、修改Hive-env.sh

[root@hadoop1 conf]# mv hive-env.sh.template  hive-env.sh

[root@hadoop1 conf]# vi hive-env.sh

加入以下配置:

export HADOOP_HEAPSIZE=2048

 

HADOOP_HOME=/usr/local/hadoop-2.6.4

 

export HIVE_CONF_DIR=/opt/hive-1.2/conf

export JAVA_HOME=/opt/jdk1.7

D)、运行Hive

 [root@hadoop1 bin]# ./hive

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

 

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive> show databases;

OK

default

Time taken: 2.71 seconds, Fetched: 1 row(s)

hive> create database hiveTest;

OK

Time taken: 0.249 seconds

hive> use hiveTest;

OK

Time taken: 0.073 seconds

hive> show tables;

OK

Time taken: 0.081 seconds

hive> create table hive(id int,name string);

OK

Time taken: 0.469 seconds

hive> insert into hive values(1,"xiaozhang");

Query ID = root_20160924184519_6bf172ec-bf83-43e2-a722-802b7d1b791f

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2016-09-24 18:45:25,965 Stage-1 map = 0%,  reduce = 0%

2016-09-24 18:45:26,988 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local1476697233_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/hive/.hive-staging_hive_2016-09-24_18-45-19_555_3195082767256194605-1/-ext-10000

Loading data to table hivetest.hive

Table hivetest.hive stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 12 HDFS Write: 93 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 8.335 seconds

hive> select * from hive;

OK

1 xiaozhang

Time taken: 0.219 seconds, Fetched: 1 row(s)

E)、在MySql 上查看元数据的信息                                 

 

 

 

 

 

 

 

 

 

 

 

 

可以看出hive保存元数据时,表的信息与字段信息是分开存放的,真正的数据保存在HDFS里面,默认的路径为hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/hive。在元数据库中,TBLS表保存表的信息,COLUMNS_V2表保存字段的信息,DBS表保存数据库及其在HDFS上的路径信息。
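如果想直接在MySQL中验证,可以查询元数据库里的这几张表(以下假设元数据库名为 hive):

mysql> use hive;

mysql> select DB_ID, NAME, DB_LOCATION_URI from DBS;

mysql> select TBL_ID, DB_ID, TBL_NAME, TBL_TYPE from TBLS;

mysql> select CD_ID, COLUMN_NAME, TYPE_NAME from COLUMNS_V2;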

 

F)、导入mysql包 

请把mysql-connector-java-5.0.3-bin.jar放在hive安装目录下的lib目录中,下载地址:http://download.csdn.net/download/dly1911618/1035034

Hive的初级应用

1-1)、插入普通的数据

hive>  create table t_test(id int ,name string, password string) row format delimited fields terminated by ',';

OK

Time taken: 0.196 seconds

 

hive> insert into t_test(id,name,password) values('1','323','sfdf');

Query ID = root_20160924191209_07686e5a-177e-4194-88a5-58536f15698a

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2016-09-24 19:12:13,483 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local956434617_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test/.hive-staging_hive_2016-09-24_19-12-09_727_5241291650450368428-1/-ext-10000

Loading data to table hivetest.t_test

Table hivetest.t_test stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 104 HDFS Write: 186 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 4.957 seconds

hive> select * from t_test;

OK

1 323 sfdf

Time taken: 0.14 seconds, Fetched: 1 row(s)

 

[root@hadoop1 bin]# hadoop fs -cat /user/hive/warehouse/hivetest.db/t_test/000000_0

1,323,sfdf

 

注意:在默认的情况下,数据库和表都会被当做HDFS上的文件夹处理,只有数据才会被当做文件,默认的文件名是000000_0,如果有多个文件则会依次累加(000001_0、000002_0 …)。

 


1-2)、 SQL执行方式

A)、查看帮助

# hive -help

log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

usage: hive

 -d,--define <key=value>          Variable subsitution to apply to hive

                                  commands. e.g. -d A=B or --define A=B

    --database <databasename>     Specify the database to use

 -e <quoted-query-string>         SQL from command line

 -f <filename>                    SQL from files

 -H,--help                        Print help information

    --hiveconf <property=value>   Use value for given property

    --hivevar <key=value>         Variable subsitution to apply to hive

                                  commands. e.g. --hivevar A=B

 -i <filename>                    Initialization SQL file

 -S,--silent                      Silent mode in interactive shell

 -v,--verbose                     Verbose mode (echo executed SQL to the

                                  console)

B)、hive  -e 命令的使用

[root@hadoop3 hive]# hive -e  "show databases"

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

 

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties

OK

default

hivetest

Time taken: 2.623 seconds, Fetched: 2 row(s)

C)、静音模式

[root@hadoop3 hive]# hive -S  -e  "show databases"

在静音的模式下不会显示mapreduce的操作过程
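静音模式常用来把查询结果直接重定向到文件,例如(输出路径为示例):

[root@hadoop3 hive]# hive -S -e "select * from hivetest.t_test" > /tmp/t_test_result.txt

[root@hadoop3 hive]# cat /tmp/t_test_result.txt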

 

D)、附加文件的形式

# vi show-databases.sql

show databases;

 

# hive -S -f show-databases.sql

log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Default

 

E)、env命名空间可作为向Hive传递变量的一个可选的方式

# export TRAINSVAR=trains_var

# hive -S -e "select * from ${TRAINSVAR} limit 10"

****************************
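需要说明的是,上面的写法其实是shell在双引号内先把${TRAINSVAR}展开后再交给hive执行;如果想让Hive自己通过env命名空间读取环境变量,可以用单引号把变量留给Hive解析(示意写法):

# export TRAINSVAR=trains_var

# hive -S -e 'select * from ${env:TRAINSVAR} limit 10'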

1-3)、外部表数据(使用hdfs上传数据)

准备数据:

[root@hadoop1 /]# vi /test.bat

1,123,ffgg

2,sfgf,eferg

3,dvf,dvrfg

A)、查看表的字段信息

hive> desc t_test;

OK

id                   int                                      

name                 string                                   

password             string                                   

Time taken: 0.21 seconds, Fetched: 3 row(s)

B)、查看执行结果

[root@hadoop1 /]# hadoop fs -put test.bat /user/hive/warehouse/hivetest.db/t_test

hive> select * from t_test;

OK

1 323 sfdf

1 123 ffgg

2 sfgf eferg

3 dvf dvrfg

Time taken: 0.125 seconds, Fetched: 4 row(s)

 

[root@hadoop1 /]# hadoop fs -cat  /user/hive/warehouse/hivetest.db/t_test/test.bat

1,123,ffgg

2,sfgf,eferg

3,dvf,dvrfg

 

注意:要上传的数据文件后缀可以是任意格式。如果数据中字段的类型与表定义的类型不一致,会被当做NULL处理;多出的字段会被自动丢弃,缺少的字段则用NULL补齐。

1-4)、内部表数据(使用hive 的命令导入与导出数据)

A)、查看表的结构

hive> desc t_test;

OK

id                   int                                      

name                 string                                   

password             string                                   

Time taken: 0.21 seconds, Fetched: 3 row(s)

B)、准备数据

[root@hadoop1 /]# vi t_test.bat

4,dsfef,fgrg

8,drfg,fgrg

9,dgr,rgr

10,dgfrg,rgr

11,dgr,rgrgr

12,dsfe,sfe

C)、本地和HDFS把数据导入到Hive中

1-1)、上传之前

[root@hadoop1 testData]# ls

test.bat  t_test.bat

hive>  load data local inpath '/usr/local/hive/testData/t_test.bat' into table t_test;

Loading data to table hivetest.t_test

Table hivetest.t_test stats: [numFiles=3, numRows=0, totalSize=120, rawDataSize=0]

OK

Time taken: 1.059 seconds

1-2)、上传之后

[root@hadoop3 testData]# ls

test.bat  t_test.bat

1-3)、HDFS上传

[root@hadoop3 testData]# ls

test.bat  t_test.bat  t_hadoop.bat

[root@hadoop3 testData]# hadoop fs -put t_hadoop.bat  /user

 

 

HDFS上传到表:

hive> load  data  inpath  '/user/t_hadoop.bat' into table t_test;

Loading data to table hivetest.t_test

Table hivetest.t_test stats: [numFiles=4, numRows=0, totalSize=182, rawDataSize=0]

OK

Time taken: 0.728 seconds

 

 

查看HDFS上的文件:

[root@hadoop3 testData]# hadoop fs -ls /user

Found 2 items

drwxr-xr-x   - root supergroup          0 2016-09-24 00:46 /user/hive

drwxr-xr-x   - root supergroup          0 2016-09-02 05:25 /user/root

 

 

 

注意:使用本地文件导入数据时,会先把数据复制一份到HDFS的临时目录,复制完成后再把数据移动到表对应的目录,本地原文件仍然保留;从HDFS导入时则是直接移动文件,原路径下的文件会消失。本地导入需要在inpath前加local关键字。

D)、把数据导出的本地

[root@hadoop1 testData]# ls

[root@hadoop1 testData]#

hive>  insert overwrite local directory '/usr/local/hive/testData/t_test.text'  row format delimited fields terminated by '\t'  select * from t_test;

Query ID = root_20160924201517_fda8eced-7f74-4aa1-9b1f-cc6fdc153064

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2016-09-24 20:15:19,324 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local296856771_0005

Copying data to local directory /usr/local/hive/testData/t_test.text

Copying data to local directory /usr/local/hive/testData/t_test.text

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 1274 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 1.692 seconds

 

 

[root@hadoop1 testData]# cd t_test.text/

[root@hadoop1 t_test.text]# ls

000000_0

[root@hadoop3 t_test.text]# cat 000000_0

1323sfdf

345dffefer

78efefeefefeg

1323fegfeefegeg

3545dfdgdgfg

4dsfeffgrg

8drfgfgrg

9dgrrgr

10dgfrgrgr

11dgrrgrgr

12dsfesfe

1123ffgg

2sfgfeferg

3dvfdvrfg

 

E)、把数据导出到HDFS

[root@hadoop1 testData]# ls

[root@hadoop1 testData]#

hive>  insert overwrite directory '/usr/local/hive/testData/t_test.text'  row format delimited fields terminated by '\t'   select * from t_test;

 

F)、导出到另外一张表中

hive> insert into table t_test_table select * from t_test;

 

1-5)、HIVE 加载SQL文件的形式执行脚本

A)、写配置文件

[root@hadoop1 testDate]# vi common.property

set hivevar:db.default=default;

set hivevar:db.allinfo=allinfo;

set hivevar:db.xiaoxu=xiaoxu;

set hivevar:hdfs.url=hdfs://hadoop1:9000;

set mapred.job.queue.name=pms;

set hive.exec.reducers.max=20;

set mapred.reduce.tasks=200;

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=20;

set mapred.job.name=extract_trfc_page_kpi;

 

hivevar:命名空间用于向SQL文件内部传递参数。

B)、写SQL文件

[root@hadoop1 testDate]# vi textTable.sql

-- 设置MR的task的个数

set mapred.reduce.tasks = 50;

 

-- 设置HDFS的路径

set hdfs.url=${hivevar:hdfs.url};

 

-- 设置获取配置文件的初始化数据库的信息

set source.db.name.default=${hivevar:db.default};

set source.db.name.allinfo=${hivevar:db.allinfo};

set out.db.name=${hivevar:db.xiaoxu};

 

--  创建数据库

CREATE DATABASE IF NOT  EXISTS ${hiveconf:source.db.name.allinfo};

CREATE DATABASE IF NOT  EXISTS ${hiveconf:out.db.name};

 

 

-- 创建表结构

CREATE TABLE IF NOT EXISTS ${hiveconf:out.db.name}.textTable(

id                   bigint,                                   

name                 string,                                   

age                  int

);

 

-- 执行逻辑

CREATE TABLE IF NOT EXISTS ${hiveconf:out.db.name}.textTable AS

SELECT ps.id,ps.name,ps.age FROM ${hiveconf:source.db.name.default}.person ps;

 

-- 使用内存保存数据

WITH memoryData1 AS ( SELECT ps.id,ps.name,ps.age FROM ${hiveconf:out.db.name}.textTable  ps WHERE age='23'),

memoryData2 AS ( SELECT ps.id,ps.name,ps.age FROM ${hiveconf:out.db.name}.textTable ps WHERE age='24' )

SELECT * FROM  memoryData1 UNION ALL SELECT * FROM  memoryData2;

 

hiveconf:命名空间用于接收脚本内部set的参数,与hivevar配合使用。其中WITH关键字(CTE)是把中间查询结果命名后加以复用,建议这部分数据量不要过大。

C)、测试结果

[root@hadoop1 testDate]# hive -i common.property -f textTable.sql

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties

OK

Time taken: 1.642 seconds

OK

Time taken: 0.045 seconds

OK

Time taken: 0.247 seconds

Query ID = root_20170310190113_12f2d1a9-6f30-4991-b09a-11f2cb82b043

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2017-03-10 19:01:19,218 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local546671993_0001

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 130 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1 xiaozhang 23

2 xiaowang 24

9 daye 24

Time taken: 6.281 seconds, Fetched: 3 row(s)

 

以上可以看出执行的过程,包括map阶段和reduce阶段的执行情况。其中hive -i用于加载初始化配置文件,-f用于加载SQL文件,详细的参数请使用hive -help命令查看

[root@hadoop1 testDate]# hive -help

usage: hive

 -d,--define <key=value>          Variable subsitution to apply to hive

                                  commands. e.g. -d A=B or --define A=B

    --database <databasename>     Specify the database to use

 -e <quoted-query-string>         SQL from command line

 -f <filename>                    SQL from files

 -H,--help                        Print help information

    --hiveconf <property=value>   Use value for given property

    --hivevar <key=value>         Variable subsitution to apply to hive

                                  commands. e.g. --hivevar A=B

 -i <filename>                    Initialization SQL file

 -S,--silent                      Silent mode in interactive shell

 -v,--verbose                     Verbose mode (echo executed SQL to the

                                  console)

1-6)、查看创建表的属性信息

hive> show create table result;

OK

CREATE TABLE `result`(

  `id` bigint,

  `name` string,

  `age` int)

ROW FORMAT SERDE

  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

  'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

  'hdfs://hadoop1:9000/user/hive/warehouse/xiaoxu.db/result'

TBLPROPERTIES (

  'COLUMN_STATS_ACCURATE'='true',

  'numFiles'='1',

  'numRows'='10',

  'rawDataSize'='120',

  'totalSize'='130',

  'transient_lastDdlTime'='1489208970')

Time taken: 0.287 seconds, Fetched: 19 row(s)

 

在以上可以看出result表的结构信息,以及HDFS保存的路径信息。

 

1-7)、Hive  命令外部传参

A)、传一个参数实例

[root@hadoop1 testDate]# cat hiveVar.sql

set ouput.date=${hivevar:db.test};

select * from ${hiveconf:ouput.date}.person where age=${hivevar:age}

 

 

[root@hadoop1 testDate]# hive -i common.property  --hivevar age='23' -f hiveVar.sql

****************************

OK

Time taken: 9.516 seconds

OK

1 xiaozhang 23

Time taken: 1.028 seconds, Fetched: 1 row(s)

 

B)、带多个参数的实例

[root@hadoop1 testDate]# vi hiveVar.sql

set ouput.date=${hivevar:db.test};

select * from ${hiveconf:ouput.date}.person where age between ${hivevar:age} and ${hivevar:age1};

 

 

[root@hadoop1 testDate]# hive -i common.property  --hivevar age='10' --hivevar age1='25' -f hiveVar.sql

 

Time taken: 7.524 seconds

OK

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

 

1-8)、Insert into 与Insert  overwrite的区别

A)、查看元数据

hive> select * from default.person;

OK

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

 

B)、overwrite 插入数据

hive> insert overwrite table default.hiveVar select * from default.person;

 

hive> select * from default.hiveVar;

OK

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

 

执行上面的insert overwrite多次后,查询到的数据个数保持不变,可以看出overwrite每次都会先清空原有数据再写入新的查询结果。

C)、into 插入数据

hive> insert into table default.hiveVar select * from default.person;

 

hive> select * from default.hiveVar;

OK

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

1 xiaozhang 23

2 xiaowang 24

3 xiaoli 25

4 xiaoxiao 26

5 xiaoxiao 27

6 xiaolizi 39

7 xiaodaye 10

8 dageda 12

9 daye 24

10 dada 25

 

 

执行上面的insert into多次后,查询到的数据个数是原来的若干倍。可以看出into是在尾部追加数据,之前的数据仍然保留。

 

D)、两者的区别

insert overwrite 会先清空表(或分区)中已经存在的数据,再写入本次查询的结果,所以无论之前有多少条相同的数据,插入后只保留本次插入的内容;

 

insert into 只是简单地追加插入,不做重复性校验,如果插入前已经有N条和要插入的数据一样的记录,那么再插入一条后会变成N+1条;

 

1-9)、在当前Hive模式下查看HDFS中文件的信息

hive> dfs -ls /;

Found 14 items

drwx-wx-wx   - root          supergroup          0 2016-11-05 02:58 /tmp

drwxr-xr-x   - root          supergroup          0 2016-11-05 03:14 /user

drwxr-xr-x   - root          supergroup          0 2016-11-06 08:02 /usr

hive> dfs -ls /user/;

Found 1 items

drwxr-xr-x   - root supergroup          0 2016-11-05 03:14 /user/hive

 

Hive 高级应用

1-1)、表分区

A)、准备数据

[root@hadoop1 testData]# vi t_part.bat

132213,dfggr

35,dfdfggfr

234,fhththtg

78,ddvbfbfb

90,dwwnrrg

213,qazxwe

B)、创建表分区

hive> create table t_part(id int,name string) partitioned by(country string) row format delimited  fields terminated by ',';

OK

Time taken: 0.191 seconds  

 

 

create  table if not exists result.table(

statistics_date         string,

statist_day             string

)

stored as parquetfile;

 

 

表示每行的字段以 , 分割,并指定了分区字段的名字为country;stored as 用来指定文件的存储格式(如parquetfile)。

C)、上传数据

hive> load data local inpath '/usr/local/hive/testData/t_part.bat' into table t_part  partition(country='CHA');

Loading data to table hivetest.t_part partition (country=CHA)

Partition hivetest.t_part{country=CHA} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]

OK

Time taken: 1.479 seconds

 

hive> select * from t_part;

OK

132213 dfggr CHA

35 dfdfggfr CHA

234 fhththtg CHA

78 ddvbfbfb CHA

90 dwwnrrg CHA

213 qazxwe CHA

Time taken: 0.156 seconds, Fetched: 6 row(s)

   

hive> load data local inpath '/usr/local/hive/testData/t_part.bat' into table t_part  partition(country='USA');

Loading data to table hivetest.t_part partition (country=USA)

Partition hivetest.t_part{country=USA} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]

OK

Time taken: 1.22 seconds

hive> select * from t_part;

OK

132213 dfggr CHA

35 dfdfggfr CHA

234 fhththtg CHA

78 ddvbfbfb CHA

90 dwwnrrg CHA

213 qazxwe CHA

 

132213 dfggr USA

35 dfdfggfr USA

234 fhththtg USA

78 ddvbfbfb USA

90 dwwnrrg USA

213 qazxwe USA

Time taken: 0.147 seconds, Fetched: 12 row(s)

 

每个分区中含有不同的数据。

D)、查看数据

hive> select COUNT(*) from t_part where country="CHA";

Query ID = root_20160924203813_13857cea-0688-4a3c-a6d8-9ba308fc3b16

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

2016-09-24 20:38:15,758 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local1484897114_0007

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 3632 HDFS Write: 288 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

6

Time taken: 1.85 seconds, Fetched: 1 row(s)

 

 

hive> select * from t_part where country="CHA";

OK

132213 dfggr CHA

35 dfdfggfr CHA

234 fhththtg CHA

78 ddvbfbfb CHA

90 dwwnrrg CHA

213 qazxwe CHA

Time taken: 0.413 seconds, Fetched: 6 row(s)

 

注意:分区相当于给表增加了一个虚拟字段,便于按分区过滤数据,可以把较大的数据集划分成不同的小块,查询时只扫描相关分区,速度比较快。分区非常重要,在实际开发中很常用。
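分区还可以通过alter table单独增加或删除,下面是一个示意(分区值 JPN 为假设):

hive> alter table t_part add partition(country='JPN');

hive> show partitions t_part;

hive> alter table t_part drop partition(country='JPN');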

E)、查看HDFS结果

[root@hadoop1  testData]# hadoop fs -ls /user/hive/warehouse/hivetest.db/t_part

Found 2 items

drwxr-xr-x   - root supergroup          0 2016-09-24 20:28 /user/hive/warehouse/hivetest.db/t_part/country=CHA

drwxr-xr-x   - root supergroup          0 2016-09-24 20:30 /user/hive/warehouse/hivetest.db/t_part/country=USA

 

 

[root@hadoop1 alertDate]# hadoop fs -ls  /user/hive/warehouse/test.db/t_part_1

Found 2 items

drwxr-xr-x   - root supergroup  0 2016-07-29 00:12 /user/hive/warehouse/test.db/t_part_1/country=CHA

drwxr-xr-x   - root supergroup  0 2016-07-29 00:13 /user/hive/warehouse/test.db/t_part_1/country=USA

 

[root@hadoop1 alertDate]# hadoop fs -ls  /user/hive/warehouse/test.db/t_part_1/country=CHA

Found 1 items

-rwxr-xr-x 3 root supergroup 105 2016-07-29 00:12 /user/hive/warehouse/test.db/t_part_1/country=CHA/test.text

 

[root@hadoop1 testData]# hadoop fs -cat  /user/hive/warehouse/hivetest.db/t_part/country=CHA/t_part.bat

132213,dfggr

35,dfdfggfr

234,fhththtg

78,ddvbfbfb

90,dwwnrrg

213,qazxwe

 

F)、查看分区

查看表t_part的分区

hive> show partitions  t_part;

OK

country=CHA

Time taken: 0.475 seconds, Fetched: 1 row(s)

1-2)、创建外部表

[root@hadoop1 testData]# hadoop fs -mkdir  /data/exp

       

hive> CREATE EXTERNAL TABLE IF NOT EXISTS td_ext_a(id string, name string,password string) row format delimited fields terminated by ',' location '/data/exp';

OK

Time taken: 7.838 seconds

 

[root@hadoop1  testData]# hadoop fs -put text.text  /data/exp

[root@hadoop1 testData]# hadoop fs -cat /data/exp/text.text

A,2015-01,5

A,2015-01,15

B,2015-01,5

A,2015-01,8

B,2015-01,25

A,2015-01,5

A,2015-02,4

A,2015-02,6

 

hive> select * from td_ext_a;

OK

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

Time taken: 0.987 seconds, Fetched: 8 row(s)

 

注意:location 指向的是HDFS的文件夹的路径,创建表后需要上传数据,然后再从hive中查询数据即可。

1-3)、数据的压缩

hive> create  table td_ext_b(id string, name string,password string) row format delimited fields terminated by ',' STORED AS textfile;

OK

Time taken: 0.612 seconds

 

hive> load data local inpath '/usr/local/hive/testData/text.text' into table td_ext_b;

Loading data to table hivetest.td_ext_b

Table hivetest.td_ext_b stats: [numFiles=1, totalSize=104]

OK

Time taken: 1.914 seconds

 

hive>  select * from td_ext_b;

OK

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

Time taken: 0.22 seconds, Fetched: 8 row(s)

 

这里使用STORED AS textfile指定了文本文件存储格式(也是默认格式),它本身并不压缩数据;如需压缩可以改用sequencefile、orc、parquet等格式。

1-4)、复制表

把td_ext_b表里的数据快速复制到td_ext_c表中

hive> create table  td_ext_c as  select  *  from  td_ext_b;

Query ID = root_20160925051042_6957d79c-306f-4c6f-8511-7fb854cb6caf

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2016-09-25 05:10:46,127 Stage-1 map = 0%,  reduce = 0%

2016-09-25 05:10:47,147 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local616044424_0008

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/.hive-staging_hive_2016-09-25_05-10-42_436_3642217413855049913-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/td_ext_c

Table hivetest.td_ext_c stats: [numFiles=1, numRows=8, totalSize=104, rawDataSize=96]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 2128 HDFS Write: 425 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 6.796 seconds

 

可以看出,即使只是用create table ... as select复制一张表,也要走mapreduce把数据逐条写入新表,比较耗时。

 

hive> select * from td_ext_c;

OK

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

Time taken: 0.139 seconds, Fetched: 8 row(s)
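如果只想复制表结构而不需要数据,可以考虑用like建表,这种方式只操作元数据、不走mapreduce(td_ext_d 为示例表名):

hive> create table td_ext_d like td_ext_b;

hive> select * from td_ext_d;     -- 只有和 td_ext_b 一样的表结构,没有数据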

1-5)、创建分桶表

Hive采用对列值哈希来组织数据的方式, 称之为分桶, 适合采样和map-join.

A)、创建分桶表

hive> create table stu__buck(id string,name string,password string) clustered  by (id) into 4 buckets row format delimited  fields terminated by ',';

OK

Time taken: 0.332 seconds

 

clustered  by (id) into 4 buckets  分桶命令

B)、插入数据

使用overwrite 的方式插入数据

hive>  insert  overwrite  table stu__buck  select * from td_ext_a  cluster by (id);

Query ID = root_20160925051355_dc812af1-f359-4246-a1ad-507370b66540

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

2016-09-25 05:13:59,078 Stage-1 map = 0%,  reduce = 0%

2016-09-25 05:14:02,167 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local1833084919_0009

Loading data to table hivetest.stu__buck

Table hivetest.stu__buck stats: [numFiles=1, numRows=8, totalSize=104, rawDataSize=96]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 4818 HDFS Write: 1028 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 7.47 seconds

 

cluster by (id) 是手动指定按id分桶后再插入分桶表;如果先执行 set hive.enforce.bucketing=true; 则插入时不需要再写cluster by,例如:hive> insert overwrite table stu__buck select * from td_ext_a;
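开启自动分桶后的完整写法大致如下(仅为示意):

hive> set hive.enforce.bucketing=true;

hive> insert overwrite table stu__buck select * from td_ext_a;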

 

C)、查看数据

hive> select * from stu__buck  cluster by(id);

Query ID = root_20160925051436_323b6d35-176b-415c-89cc-fe3041b2f383

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

2016-09-25 05:14:38,270 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local1558510236_0010

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 5174 HDFS Write: 1206 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

A 2015-01 5

A 2015-01 15

A 2015-01 8

A 2015-01 5

A 2015-02 4

A 2015-02 6

B 2015-01 5

B 2015-01 25

Time taken: 1.55 seconds, Fetched: 8 row(s)

1-6)、创建表常用命令

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

   [(col_name data_type [COMMENT col_comment], ...)]

   [COMMENT table_comment]

   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

   [CLUSTERED BY (col_name, col_name, ...)

   [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

   [ROW FORMAT row_format]

   [STORED AS file_format]

   [LOCATION hdfs_path]

 

说明:

A)、CREATE TABLE 

CREATE TABLE  创建一个指定名字的表。如果相同名字的表已经存在,则抛出异常;用户可以用 IF NOT EXISTS 选项来忽略这个异常。

B)、EXTERNAL

EXTERNAL关键字可以让用户创建一个外部表,在建表的同时指定一个指向实际数据的路径(LOCATION),Hive 创建内部表时,会将数据移动到数据仓库指向的路径;若创建外部表,仅记录数据所在的路径,不对数据的位置做任何改变。在删除表的时候,内部表的元数据和数据会被一起删除,而外部表只删除元数据,不删除数据。

1-1)、LIKE 允许用户复制现有的表结构,但是不复制数据。

1-2)、ROW FORMAT

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

    用户在建表的时候可以自定义 SerDe 或者使用自带的 SerDe。如果没有指定 ROW FORMAT 或者 ROW FORMAT DELIMITED,将会使用自带的 SerDe。在建表的时候,用户还需要为表指定列,用户在指定表的列的同时也会指定自定义的 SerDe,Hive通过 SerDe 确定表的具体的列的数据。

C)、STORED

STORED AS SEQUENCEFILE|TEXTFILE|RCFILE

如果文件数据是纯文本,可以使用 STORED AS TEXTFILE。如果数据需要压缩,使用 STORED AS SEQUENCEFILE。

D)、CLUSTERED BY

CLUSTERED BY对于每一个表(table)或者分区, Hive可以进一步组织成桶,也就是说桶是更为细粒度的数据范围划分。Hive也是 针对某一列进行桶的组织。Hive采用对列值哈希,然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

把表(或者分区)组织成桶(Bucket)有两个理由:

(1)获得更高的查询处理效率。桶为表加上了额外的结构,Hive 在处理有些查询时能利用这个结构。具体而言,连接两个在(包含连接列的)相同列上划分了桶的表,可以使用 Map 端连接(Map-side join)高效地实现。比如JOIN操作:如果两个表在相同的连接列上都做了分桶,那么只需将保存相同列值的桶进行JOIN操作即可,可以大大减少JOIN的数据量。

(2)使取样(sampling)更高效。在处理大规模数据集时,在开发和修改查询的阶段,如果能在数据集的一小部分数据上试运行查询,会带来很多方便。
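针对第(2)点,分桶表可以配合tablesample做按桶取样,例如只取4个桶中的第1个桶(示意写法):

hive> select * from stu__buck tablesample(bucket 1 out of 4 on id);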

1-7)、DDL 其他操作

 A)、增加/删除分区

 语法结构

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...

partition_spec:

: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

 

ALTER TABLE table_name DROP partition_spec, partition_spec,...

ALTER TABLE table_name ADD PARTITION (part='a') PARTITION (part='b');

 B)、重新命名

ALTER TABLE table_name RENAME TO new_table_name

 

hive> alter table a rename to a_a;

OK

Time taken: 0.76 seconds

 C)、增加/更新列

   语法结构

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

 

注:ADD是代表新增一字段,字段位置在所有列后面(partition列前),REPLACE则是表示替换表中所有字段。

 

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
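下面是增加列与修改列的简单示例(新增的 age 字段及注释均为假设):

hive> alter table t_test add columns (age int comment '年龄');

hive> alter table t_test change column age age_new int comment '重命名后的年龄列';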

 

1-8)、DML 其他操作

A)Load 的使用

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO

TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

 

hive> load data local inpath '/home/alertDate/text.dat' into table b;

Loading data to table t_test.b

Table t_test.b stats: [numFiles=4, numRows=0, totalSize=143, rawDataSize=0]

OK

Time taken: 0.586 seconds   

 

hive> load  data  inpath  '/home/alertDate' into table t_test5;

hive> load data local inpath '/home/alertDate/test.dat' into table t_test5;

Loading data to table test.t_test5

Table test.t_test5 stats: [numFiles=1, totalSize=105]

OK

 Time taken: 0.905 seconds

 

注意local inpath 与inpath的区别

B)、INSERT 的使用

利用查询语句,将查询结果插入新的表

INSERT OVERWRITE [INTO] TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement

 

插入一条数据

INSERT INTO TABLE tablename VALUES(XX,YY,ZZ);

C)、SELECT 的使用

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list [HAVING condition]]

[CLUSTER BY col_list 

| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]

]

[LIMIT number]

 

注:1、order by 会对输入做全局排序,因此只有一个reducer,会导致当输入规模较大时,需要较长的计算时间。

2、sort by不是全局排序,其在数据进入reducer前完成排序。因此,如果用sort by进行排序,并且设置mapred.reduce.tasks>1,则sort by只保证每个reducer的输出有序,不保证全局有序。

3、distribute by(字段)根据指定的字段将数据分到不同的reducer,且分发算法是hash散列。

4、Cluster by(字段) 除了具有Distribute by的功能外,还会对该字段进行排序。

 

因此,如果分桶和sort字段是同一个时,此时,cluster by = distribute by + sort by
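可以用下面两条查询直观对比(示意):

hive> select * from t_part distribute by id sort by id;

hive> select * from t_part cluster by id;     -- 与上一条等价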

 

 

分桶表的作用:最大的作用是用来提高join操作的效率;

(思考这个问题:

select a.id,a.name,b.addr from a join b on a.id = b.id;

如果a表和b表已经是分桶表,而且分桶的字段是id字段

做这个join操作时,还需要全表做笛卡尔积吗?)

1-9)、查看函数命令

查看数据库

hive> show databases;

OK

default

hivetest

Time taken: 0.079 seconds, Fetched: 2 row(s)   

 

查看表

hive> show tables;

OK

hive

stu__buck

t_part

t_partition

t_test

td_ext_a

td_ext_b

td_ext_c

Time taken: 0.077 seconds, Fetched: 8 row(s)

 

查看分区表

hive> show partitions t_part;

OK

country=CHA

country=USA

Time taken: 0.219 seconds, Fetched: 2 row(s)

 

查看函数

hive> show functions;

OK

!

!=

%

&

*

+

-

/

<

<=

<=>

<>

=

==

>

>=

^

abs

acos

add_months

and

array

array_contains

ascii

asin

assert_true

atan

avg

base64

between

bin

case

cbrt

ceil

ceiling

coalesce

collect_list

collect_set

compute_stats

concat

concat_ws

context_ngrams

conv

corr

cos

count

covar_pop

covar_samp

create_union

cume_dist

current_database

current_date

current_timestamp

current_user

date_add

date_format

date_sub

datediff

day

dayofmonth

decode

degrees

dense_rank

div

e

elt

encode

ewah_bitmap

ewah_bitmap_and

ewah_bitmap_empty

ewah_bitmap_or

exp

explode

factorial

field

find_in_set

first_value

floor

format_number

from_unixtime

from_utc_timestamp

get_json_object

greatest

hash

hex

histogram_numeric

hour

if

in

in_file

index

initcap

inline

instr

isnotnull

isnull

java_method

json_tuple

lag

last_day

last_value

lcase

lead

least

length

levenshtein

like

ln

locate

log

log10

log2

lower

lpad

ltrim

map

map_keys

map_values

matchpath

max

min

minute

month

months_between

named_struct

negative

next_day

ngrams

noop

noopstreaming

noopwithmap

noopwithmapstreaming

not

ntile

nvl

or

parse_url

parse_url_tuple

percent_rank

percentile

percentile_approx

pi

pmod

posexplode

positive

pow

power

printf

radians

rand

rank

reflect

reflect2

regexp

regexp_extract

regexp_replace

repeat

reverse

rlike

round

row_number

rpad

rtrim

second

sentences

shiftleft

shiftright

shiftrightunsigned

sign

sin

size

sort_array

soundex

space

split

sqrt

stack

std

stddev

stddev_pop

stddev_samp

str_to_map

struct

substr

substring

sum

tan

to_date

to_unix_timestamp

to_utc_timestamp

translate

trim

trunc

ucase

unbase64

unhex

unix_timestamp

upper

var_pop

var_samp

variance

weekofyear

when

windowingtablefunction

xpath

xpath_boolean

xpath_double

xpath_float

xpath_int

xpath_long

xpath_number

xpath_short

xpath_string

year

|

~

Time taken: 0.033 seconds, Fetched: 216 row(s)

 

以上的函数建议多记忆一下,对以后的使用有较大的帮助。

 

 

查看一个表扩展相关的信息

hive> desc extended t_test;

OK

id                   int                                      

name                 string                                   

password             string                                   

   

Detailed Table Information Table(tableName:t_test, dbName:hivetest, owner:root, createTime:1474769518, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:password, type:string, comment:null)], location:hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=4, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1474772423, numRows=0, totalSize=182, rawDataSize=0}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)

Time taken: 0.181 seconds, Fetched: 5 row(s)

 

查看表的格式化的信息

hive> desc formatted  t_test;

OK

# col_name             data_type            comment             

   

id                   int                                      

name                 string                                   

password             string                                   

   

# Detailed Table Information    

Database:            hivetest              

Owner:               root                  

CreateTime:          Sat Sep 24 19:11:58 PDT 2016  

LastAccessTime:      UNKNOWN               

Protect Mode:        None                  

Retention:           0          

Location:            hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test  

Table Type:          MANAGED_TABLE         

Table Parameters:    

COLUMN_STATS_ACCURATE true                

numFiles             4                   

numRows              0                   

rawDataSize          0                   

totalSize            182                 

transient_lastDdlTime 1474772423          

   

# Storage Information    

SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  

InputFormat:         org.apache.hadoop.mapred.TextInputFormat  

OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  

Compressed:          No                    

Num Buckets:         -1                    

Bucket Columns:      []                    

Sort Columns:        []                    

Storage Desc Params:    

field.delim          ,                   

serialization.format ,                   

Time taken: 0.27 seconds, Fetched: 34 row(s)

 

注意:show partitions table_name 需要表已经创建了分区;desc extended table_name; 以及 desc formatted table_name; 可以查看表更详细的信息。

保存select查询结果的几种方式

1-1)、将查询结果保存到一张新的hive表中

hive> create table t_table  as  select * from td_ext_a;

Query ID = root_20160729132825_745c6f57-93da-4563-bed3-788fd781598e

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1469693904965_0008, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0008/

Kill Command = /home/hadoop-2.6.4/bin/hadoop job  -kill job_1469693904965_0008

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2016-07-29 13:28:40,662 Stage-1 map = 0%,  reduce = 0%

 

使用AS可以快速的复制一张表出来。

 

hive> select * from t_table;

OK

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

Time taken: 0.093 seconds, Fetched: 8 row(s)

 

1-2)、将查询结果保存到一张已经存在的hive表中

hive> insert into table t_table select * from t_table;

Query ID = root_20160729133109_7e7e140c-13e5-470c-828e-e789b8e410f0

Total jobs = 3

Launching Job 1 out of 3n

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1469693904965_0009, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0009/

Kill Command = /home/hadoop-2.6.4/bin/hadoop job  -kill job_1469693904965_0009

 

hive> select * from t_table;

OK

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

A 2015-01 5

A 2015-01 15

B 2015-01 5

A 2015-01 8

B 2015-01 25

A 2015-01 5

A 2015-02 4

A 2015-02 6

Time taken: 0.079 seconds, Fetched: 16 row(s)

 

 

用查询结果建表或向已有表插入数据都要走mapreduce,比较耗时。

1-3)、将查询结果保存到指定的文件目录(可以是本地,也可以是hdfs)

hive> insert overwrite local directory '/data/test.text' select * from t_table;

Query ID = root_20160729133824_dff25a98-38af-4a0a-b93e-02aa42356a45

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1469693904965_0011, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0011/

Kill Command = /home/hadoop-2.6.4/bin/hadoop job  -kill job_1469693904965_0011

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2016-07-29 13:38:39,662 Stage-1 map = 0%,  reduce = 0%

 

在语句中local 表示把数据放到本地,否则会放到HDFS上

 

[root@hadoop2 alertDate]# hadoop fs -ls /data/testFile

[root@hadoop2 alertDate]#

 

 

注意:将查询结果保存到HDFS目录上同样要走mapreduce。可以发现上面几种保存方式都要走mapreduce,使用时需要注意开销。

 

JOIN的使用

1-1)、数据的准备  

hive> select * from a;                         

OK                                             

1 2222 243434                                 

2 sdff sdfff                                  

3 sdff sdfff                                  

Time taken: 0.1 seconds, Fetched: 3 row(s)     

                                               

                                               

hive> select * from b;                         

OK                                             

3 sdff sdfff                                  

5 sdff sdfff                                  

1 sdff sdfff                                  

Time taken: 0.078 seconds, Fetched: 3 row(s)   

1-2)、数据的查询

Inner  的查询

hive> select * from a inner join b on a.id = b.id;                                                                                                                                                                                                   

Query ID = root_20160729141536_fe7d3f21-e31f-49b2-9401-6c2685151cd0                                                                                                                                                                                  

Total jobs = 1                                                                                                                                                                                                                                       

Execution log at: /tmp/root/root_20160729141536_fe7d3f21-e31f-49b2-9401-6c2685151cd0.log                                                                                                                                                             

2016-07-29 14:15:50 Starting to launch local task to process map join; maximum memory = 518979584                                                                                                                                                   

2016-07-29 14:15:53 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-15-36_256_2258338164921734285-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable

2016-07-29 14:15:53 Uploaded 1 File to: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-15-36_256_2258338164921734285-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (350 bytes)                              

2016-07-29 14:15:53 End of local task; Time Taken: 2.802 sec.                                                                                                                                                                                        

Execution completed successfully                                                                                                                                                                                                                     

MapredLocal task succeeded                                                                                                                                                                                                                           

Launching Job 1 out of 1                                                                                                                                                                                                                             

OK                                                                                                                                                                                                                                                   

1 2222 243434 1 sdff sdfff                                                                                                                                                                                                                        

3 sdff sdfff 3 sdff sdfff                                                                                                                                                                                                                          

 

 

Left  join查询

 

hive> select * from a left join b on a.id = b.id;                                                                                                                                                                                                      

Query ID = root_20160729141725_8b017579-c074-4128-90ea-7d9a424b44b6                                                                                                                                                                                    

Total jobs = 1                                                                                                                                                                                                                                         

Execution log at: /tmp/root/root_20160729141725_8b017579-c074-4128-90ea-7d9a424b44b6.log                                                                                                                                                               

2016-07-29 14:17:37 Starting to launch local task to process map join; maximum memory = 518979584                                                                                                                                                     

2016-07-29 14:17:39 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-17-25_723_6872206778159417490-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable  

2016-07-29 14:17:39 Uploaded 1 File to: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-17-25_723_6872206778159417490-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (350 bytes)                                

2016-07-29 14:17:39 End of local task; Time Taken: 1.933 sec.                                                                                                                                                                                          

OK                                                                                                                                                                                                                                                     

1 2222 243434 1 sdff sdfff                                                                                                                                                                                                                          

2 sdff sdfff  NULL NULL NULL                                                                                                                                                                                                                         

3 sdff sdfff 3 sdff sdfff

 

 

right join 查询

hive> select * from a right join b on a.id = b.id;

Query ID = root_20160729154358_38dd1b7a-b8ac-49b5-9a56-37f22636fc11

Total jobs = 1

Execution log at: /tmp/root/root_20160729154358_38dd1b7a-b8ac-49b5-9a56-37f22636fc11.log

2016-07-29 15:44:10 Starting to launch local task to process map join; maximum memory = 518979584

2016-07-29 15:44:13 Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/root/f4a99bd2-d473-48ac-bb84-4321ff29e9c5/hive_2016-07-29_15-43-58_345_9096767528169264921-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable

******    

Total MapReduce CPU Time Spent: 3 seconds 430 msec

OK

3 sdff sdfff 3 sdff sdfff

1 2222 243434 1 sdff sdfff

Time taken: 53.886 seconds, Fetched: 2 row(s)

 

Hive 常用函数的使用

1-1)、Split 的使用

 hive> select * from a;

OK

1 2222 243434

2 sdff sdfff

3 sdff sdfff

Time taken: 0.095 seconds, Fetched: 3 row(s)

 

 

hive> select split(name,2) as  name from a;

OK

["","","","",""]

["sdff"]

["sdff"]

Time taken: 0.099 seconds, Fetched: 3 row(s)

 

hive> select split(name,1) as  name from a;

OK

["2222"]

["sdff"]

["sdff"]

Time taken: 0.33 seconds, Fetched: 3 row(s)

 

hive> select split(name,0) as  name from a;

OK

["2222"]

["sdff"]

["sdff"]

Time taken: 0.13 seconds, Fetched: 3 row(s)

 

注意:split的第二个参数是分隔符(正则表达式),不是索引。上面分别是按字符"2"、"1"、"0"去切分:name为2222时按"2"切分得到多个空串;按"1"、"0"切分时由于字符串中不含该字符,返回的数组只有原字符串本身,所以0和1的结果相同。
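split的正确用法是传入分隔符(支持正则表达式),返回数组后可以按下标取值,例如(数据为示例):

hive> select split('a,b,c', ',');        -- 返回 ["a","b","c"]

hive> select split('a,b,c', ',')[1];     -- 返回 b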

 

1-2)、Get_json_object的使用

hive> select * from a;

OK

1 2222 243434

2 sdff sdfff

3 sdff sdfff

6 {"name":"xiaozhang"} sdff

 

 

hive> select get_json_object(name,'$.name') as name from a;

OK

NULL

NULL

NULL

xiaozhang

Time taken: 0.366 seconds, Fetched: 4 row(s)
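get_json_object也支持嵌套字段和数组下标的取值,例如(JSON内容为示例):

hive> select get_json_object('{"store":{"fruit":["apple","pear"],"owner":"xiaozhang"}}', '$.store.owner');     -- 返回 xiaozhang

hive> select get_json_object('{"store":{"fruit":["apple","pear"]}}', '$.store.fruit[0]');     -- 返回 apple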

1-3)、更多函数的使用

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Hive 查找需要的配置信息

A)、查找带有关键字warehouse的属性信息

# hive -S -e "SET" | grep warehouse

log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

hive.metastore.warehouse.dir=/apps/hive/warehouse

hive.warehouse.subdir.inherit.perms=true


 

B)、查找带有关键字thread的属性信息

# hive -S -e "set" | grep "thread"

log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

dfs.client.file-block-storage-locations.num-threads=10

dfs.datanode.block.id.layout.upgrade.threads=12

dfs.datanode.directoryscan.threads=1

dfs.datanode.fsdatasetcache.max.threads.per.volume=4

dfs.datanode.max.transfer.threads=16384

hive.compactor.worker.threads=0

hive.exec.parallel.thread.number=8

hive.metastore.fshandler.threads=15

hive.metastore.server.max.threads=100000

hive.metastore.server.min.threads=200

hive.mv.files.thread=15

hive.orc.compute.splits.num.threads=10

hive.server2.async.exec.threads=100

hive.server2.thrift.max.worker.threads=500

hive.server2.thrift.min.worker.threads=5

hive.spark.client.rpc.threads=8

hive.stats.gather.num.threads=10

hive.txn.heartbeat.threadpool.size=5

javax.jdo.option.Multithreaded=true

 

 

Hive 参数优化说明

对数据进行均衡分配

set hive.groupby.skewindata=true

 

开启map join自动转换(提高大表关联小表的速度)

 set  hive.auto.convert.join=true

同一个SQL允许并行任务的最大线程数(默认是8,可以根据集群的规模以及性能调整,调整后会占用map与reduce的资源),需要先执行 set hive.exec.parallel=true 才会生效

set hive.exec.parallel.thread.number=8

 

设置开启动态分区开关

set hive.exec.dynamic.partition=true;

 

插入数据时设定对输出结果进行压缩

SET hive.exec.compress.output=true;

 

当使用select 查询数据时候,显示的结果会带有表的字段名称

hive (default)> set  hive.cli.print.header=true;

 

更多的配置请查看:http://blog.csdn.net/xfg0218/article/details/61417668

 

 

或者使用命令查看优化的项:

hive> set;

_hive.hdfs.session.path=/tmp/hive/root/bd4f450f-2f3f-460d-8539-5ee573701e59

_hive.local.session.path=/tmp/root/bd4f450f-2f3f-460d-8539-5ee573701e59

_hive.tmp_table_space=/tmp/hive/root/bd4f450f-2f3f-460d-8539-5ee573701e59/_tmp_space.db

datanucleus.autoCreateSchema=true

datanucleus.autoStartMechanismMode=checked

datanucleus.cache.level2=false

datanucleus.cache.level2.type=none

datanucleus.connectionPoolingType=BONECP

datanucleus.fixedDatastore=false

datanucleus.identifierFactory=datanucleus1

datanucleus.plugin.pluginRegistryBundleCheck=LOG

datanucleus.rdbms.useLegacyNativeValueStrategy=true

datanucleus.storeManagerType=rdbms

datanucleus.transactionIsolation=read-committed

datanucleus.validateColumns=false

datanucleus.validateConstraints=false

datanucleus.validateTables=false

dfs.block.access.key.update.interval=600

dfs.block.access.token.enable=false

dfs.block.access.token.lifetime=600

 

***********************************************

 

具体的请查看:http://blog.csdn.net/xfg0218/article/details/61930858

 

Hive 性能优化

1-1)、有小表连接大表时

set hive.auto.convert.join=true;

 

这样设置,hive就会自动的识别比较小的表,继而用mapJoin来实现两个表的联合。

本质上mapjoin省去了shuffle和reduce阶段,小表被加载到各个map任务的内存中,在map端就完成了两个表的联合。

1-2)、对分桶的表

前提是对表进行分桶处理

 

set hive.optimize.bucketmapjoin = true;

 

可见mapjoin和使用bucket进行mapjoin区别很大,当map端join小表仍然不能全部放进内存时,对两个表在join的key上都做hash bucket, 之后可以用bucket mapjoin, 因为bucket划分数据后,相同hash值的key都放在了同一个bucket中, 在join时, 小表将会被复制到所有节点,map加载小表的bucket到内存hashtable, 与大表对应的bucket文件进行join。减少了单台机器执行的时间。
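使用bucket mapjoin的大致步骤如下(表名、桶数均为假设,仅作示意):

hive> create table big_t(id int, name string) clustered by (id) into 16 buckets row format delimited fields terminated by ',';

hive> create table small_t(id int, addr string) clustered by (id) into 16 buckets row format delimited fields terminated by ',';

hive> set hive.enforce.bucketing=true;

hive> set hive.optimize.bucketmapjoin=true;

hive> select /*+ MAPJOIN(small_t) */ b.id, b.name, s.addr from big_t b join small_t s on b.id = s.id;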

 

Hive Shell 参数的使用

1-1)、Hive 的参数

  语法结构

hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]

说明:

  1. -i 从文件初始化HQL。
  2. -e从命令行执行指定的HQL
  3. -f 执行HQL脚本
  4. -v 输出执行的HQL语句到控制台
  5. -p <port> connect to Hive Server on port number
  6. -hiveconf x=y Use this to set hive/hadoop configuration variables.

 

 

[root@hadoop1 testDate]# hive  -i  common.property  -f  provinceNameByIdNo.sql

 

[root@hadoop1 testDate]# hive -e "show databases";

 

1-2)、添加以下的参数

set mapred.reduce.tasks = 80;

set hive.exec.reducers.bytes.per.reducer=5000000;

 

set  hive.execution.engine=mr;   --会直接显示MR的进度,建议不要开启会降低执行效率

set  hive.mapred.reduce.tasks.speculative.execution=false;

set  mapred.max.split.size=1933812999;

set  mapred.min.split.size.per.node=1933812999;

set  mapred.min.split.size.per.rack=1933812999;

set  hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set  hive.groupby.skewindata=true;

 

1933812999 comes from taking the total size of the table's files on HDFS, dividing it by 6, and replacing the last two digits with 99. For example, a table of roughly 11,602,877,700 bytes divided by 6 gives 1,933,812,950; replacing the last two digits with 99 yields 1933812999.

 

Introduction to Hive Columnar Storage

The main open-source columnar formats are Parquet (https://parquet.apache.org/) and ORC (https://orc.apache.org/).

 

Create a test table

#  create  table xiaoxu_test(id int,name string,age int);

# insert into xiaoxu_test values(1,"xiaozhang",1);

# insert into xiaoxu_test values(2,"xiaozhang",2);

 

1-1)、ORC in detail

A)、Create the table and insert data

# create  table orc_xiaoxu( id int,name string,age int ) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");

# INSERT OVERWRITE TABLE orc_xiaoxu SELECT * FROM xiaoxu_test;

B)、Query the data

hive> select * from orc_xiaoxu;

OK

2 xiaozhang 2

1 xiaozhang 1

Time taken: 0.052 seconds, Fetched: 2 row(s)

C)、Inspect the data on HDFS

# hadoop fs -cat  /apps/hive/warehouse/xiaoxu.db/orc_xiaoxu/000000_0 
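
Since ORC is a binary format, cat output is not human-readable. Hive ships an ORC dump tool that prints the file's metadata and statistics instead; a minimal sketch, reusing the path from the example above:

# hive --orcfiledump /apps/hive/warehouse/xiaoxu.db/orc_xiaoxu/000000_0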

 

1-2)、Parquet in detail

A)、Create the table and insert data

# hive

# use xiaoxu;

# create table orc_xiaoxu( id int,name string,age int );

# insert into orc_xiaoxu values(1,"xiaowang",1);

# insert into orc_xiaoxu values(2,"xiaoli",2);

B)、Create a Parquet-format table

# CREATE TABLE parquet_xiaoxu(id int,name string,age int)STORED AS PARQUET;

# INSERT OVERWRITE TABLE parquet_xiaoxu SELECT * FROM orc_xiaoxu;

hive> select * from parquet_xiaoxu;

OK

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

2 xiaozhang 2

1 xiaowang 1

Time taken: 0.39 seconds, Fetched: 2 row(s)

C)、Inspect the data on HDFS

# hadoop fs -cat  /apps/hive/warehouse/xiaoxu.db/parquet_xiaoxu/000000_0

 

 

Hive Row-Column Conversion (explode)

1-1)、Requirement

Transform the following data:

a       b       1,2,3

c       d       4,5,6

 

into:

a       b       1

a       b       2

a       b       3

c       d       4

c       d       5

c       d       6

1-2)、Implementation

A)、Create the table

drop table test_jzl_20140701_test;

create table test_jzl_20140701_test(

col1 string,

col2 string,

col3 string

)

row format delimited fields terminated by ' '

stored as textfile;

B)、Load the data

load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test;

select * from test_jzl_20140701_test;

C)、Use explode to split the column into rows

select col1,col2,name

from test_jzl_20140701_test  

lateral view explode(split(col3,',')) col3 as name;

 

The result is as follows:

a       b       1

a       b       2

a       b       3

c       d       4

c       d       5

c       d       6
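
Going the other way — collapsing the rows back into a single comma-separated column — can be sketched with collect_list and concat_ws over the same table:

select col1, col2, concat_ws(',', collect_list(name)) as col3
from (
  select col1, col2, name
  from test_jzl_20140701_test
  lateral view explode(split(col3, ',')) col3 as name
) t
group by col1, col2;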

 

Installing the Hive Web Interface (HWI)

1-1)、Download the Hive source

http://mirrors.hust.edu.cn/apache/hive/

 

1-2)、Edit the configuration file

[root@hadoop1 conf]# vi hive-site.xml

<property>

<name>hive.hwi.listen.host</name>

<value>192.168.132.140</value>

</property>

<property>

<name>hive.hwi.listen.port</name>

<value>9999</value>

</property>

<property>

<name>hive.hwi.war.file</name>

<value>/lib/hive-hwi-1.2.1.war</value>

</property>

 

1-3)、Add the required JAR

[root@hadoop1 conf]#  ln -s $JAVA_HOME/lib/tools.jar $HIVE_HOME/lib/

 

1-4)、Start the HWI service

[root@hadoop1 conf]#  hive  --service hwi 2> /tmp/hwi2.log &

 

1-5)、Open the web UI

http://hadoop1:9999/hwi/

 

 

Using Hive with Tez

1-1)、Install the software required to build Tez

Use the Tez version that matches your Hive version; this example uses Hive 1.2, so Tez 0.5.4 is chosen. Tez is officially distributed as source code, so protobuf and Maven are also needed to compile it. If you want to build it yourself but have a slow network, download the .m2 JAR bundle, or simply use the Tez 0.5.4 package already built by the author. All materials can be downloaded here:

Link: http://pan.baidu.com/s/1c27ri6W  Password: 6ya4

A)、Install protobuf

[root@hadoop1 opt]# tar -zxvf protobuf-2.5.0.tar.gz

[root@hadoop1 protobuf-2.5.0]# cd protobuf-2.5.0

[root@hadoop1 protobuf-2.5.0]# ./configure

[root@hadoop1 protobuf-2.5.0]# make && make install

 

B)、Install Tez

[root@hadoop1 opt]# tar -zxvf apache-tez-0.5.4-src.tar.gz

[root@hadoop1 opt]# mv apache-tez-0.5.4-src tez-0.5.4

[root@hadoop1 opt]# cd tez-0.5.4/

Enter the directory and change hadoop.version in pom.xml to 2.6.4:

[root@hadoop1 tez-0.5.4]# vi pom.xml

 

C)、Install Maven

[root@hadoop1 opt]# chmod a+x apache-maven-3.1.0-bin.zip

[root@hadoop1 opt]# unzip apache-maven-3.1.0-bin.zip

[root@hadoop1 opt]# mv apache-maven-3.1.0  maven-3.1.0

Add the environment variables:

[root@hadoop1 maven-3.1.0]# vi /etc/profile
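
The entries added to /etc/profile are not shown in the original; a minimal sketch, assuming Maven was unpacked to /opt/maven-3.1.0 as above:

export MAVEN_HOME=/opt/maven-3.1.0
export PATH=$PATH:$MAVEN_HOME/bin

[root@hadoop1 maven-3.1.0]# source /etc/profile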

Check the version:

[root@hadoop1 conf]# mvn -version

Apache Maven 3.1.0 (893ca28a1da9d5f51ac03827af98bb730128f9f2; 2013-06-27 19:15:32-0700)

Maven home: /opt/maven-3.1.0

Java version: 1.7.0_76, vendor: Oracle Corporation

Java home: /opt/jdk1.7/jre

Default locale: en_US, platform encoding: UTF-8

OS name: "linux", version: "2.6.32-642.15.1.el6.x86_64", arch: "amd64", family: "unix"

1-2)、Build and package Tez

In the Tez source directory, run the following command to compile and package:

[root@hadoop1 tez-0.5.4]# mvn clean package -Dtar -DskipTests=true -Dmaven.javadoc.skip=true

 

1-3)、Upload the package to HDFS

After the build succeeds, go to the tez-0.5.4/tez-dist/target directory and upload tez-0.5.4.tar.gz to the tez directory on HDFS:

 

[root@hadoop1 target]# hadoop  fs  -put  tez-0.5.4.tar.gz   /tez/

 

1-4)、Modify Hive's hive-site.xml

Add the following to the hive-site.xml configuration file:

<property>

     <name>tez.lib.uris</name>

     <value>hdfs://hadoop1:9000/tez/tez-0.5.4.tar.gz</value>

</property>

<property>

     <name>hive.execution.engine</name>

     <value>tez</value>

</property>

 

1-5)、Extract the compiled package

[root@hadoop1 opt]# mkdir tez-0.5.4-compiling

[root@hadoop1 tez-0.5.4-compiling]# tar -zxvf tez-0.5.4.tar.gz

 

1-6)、Configure hive-env.sh

Append the following to the hive-env.sh configuration file:

export TEZ_HOME=/opt/tez-0.5.4-compiling/

for jar in `ls ${TEZ_HOME}|grep jar`;do

   export HADOOP_CLASSPATH=${TEZ_HOME}/$jar:${HADOOP_CLASSPATH}

done

for jar in `ls ${TEZ_HOME}/lib/|grep jar`;do

   export HADOOP_CLASSPATH=${TEZ_HOME}/lib/$jar:${HADOOP_CLASSPATH}

done

 

1-7)、Set up the related JARs

The tez-0.5.4 directory needs the Tez JARs at runtime, so copy all of the JARs from tez-0.5.4.tar.gz into the lib directory under tez-0.5.4.

 

1-8)、Add the LZO JAR via bin/hive-config.sh

The LZO JAR can be downloaded from the materials linked above.

 

1-9)、Add the following to bin/hive-config.sh

export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/common/lib/hadoop-lzo-0.4.3.jar

 

Alternatively, create an auxlib directory under ${HIVE_HOME} and copy hadoop-lzo-0.4.20.jar into it. The LZO JAR version must match the one used by Hadoop.

 

1-10)、Verify the result
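
The original does not include the verification steps; a minimal sketch of checking that Tez is active and exercising it with a query (the table name reuses the earlier example):

hive> set hive.execution.engine;
-- the value printed should be hive.execution.engine=tez
hive> select count(*) from xiaoxu_test;
-- the console should now show a Tez DAG instead of MapReduce stages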

 

Connecting to Hive with Beeline

1-1)、Modify hive-site.xml

[root@hadoop1 conf]# cat hive-site.xml

Add the following to the existing configuration.

root here is the user that runs the service; if it runs under the hadoop user instead, configure these properties as:

hadoop.proxyuser.hadoop.hosts

hadoop.proxyuser.hadoop.groups

 

<property>

<name>hadoop.proxyuser.root.hosts</name>

<value>*</value>

<description>xiugai root jiaose</description>

</property>

<property>

<name>hadoop.proxyuser.root.groups</name>

<value>*</value>

</property>

 

<property>

<name>hive.server2.transport.mode</name>

<value>binary</value>

<description>

      Expects one of [binary, http].

      Transport mode of HiveServer2.

</description>

</property>

 

<property>

    <name>hive.hwi.listen.host</name>

    <value>0.0.0.0</value>

    <description>This is the host address the Hive Web Interface will listen on</description>

  </property>

 

 <property>

    <name>hive.server2.webui.host</name>

    <value>0.0.0.0</value>

    <description>The host address the HiveServer2 WebUI will listen on</description>

  </property>

1-2)、启动hiveserver2  

Start the HiveServer2 service. By default it runs on port 10000, with a minimum of 100 worker threads and a maximum of 2147483647 worker threads.

 

[root@hadoop1 root]# hive --service hiveserver2 &

[1] 6332

[root@hadoop1 root]# SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

17/03/16 07:00:06 WARN conf.HiveConf: HiveConf of name hive.server2.webui.host does not exist

 

The console will appear to hang at this point and stop printing; that means the service has started successfully. To keep it running in the background, you can close this shell window.

1-3)、Check the listening port

[root@hadoop1 start_sh]# netstat -nltp|grep 10000

tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      6332/java

 

If the process is listed, the service has started successfully.

1-4)、Connect with the Beeline embedded client

[root@hadoop1 bin]# beeline

Beeline version 1.6.2 by Apache Hive

beeline> !connect jdbc:hive2://hadoop1:10000/default

Connecting to jdbc:hive2://hadoop1:10000/default

Enter username for jdbc:hive2://hadoop1:10000/default: root

Enter password for jdbc:hive2://hadoop1:10000/default: ******

log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Connected to: Apache Hive (version 1.2.1)

Driver: Spark Project Core (version 1.6.2)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://hadoop1:10000/default> show databases;

+----------------+--+

| database_name  |

+----------------+--+

| allinfo        |

| default        |

+----------------+--+

3 rows selected (0.47 seconds)

1-5)、Connect remotely with Beeline

[root@hadoop1 bin]# beeline  -u  jdbc:hive2://hadoop1:10000   -n root   -p 123456  -e "show databases"
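
Beeline can also run a script file non-interactively; a minimal sketch (the script path below is only an example):

[root@hadoop1 bin]# beeline -u jdbc:hive2://hadoop1:10000 -n root -p 123456 -f /opt/testData/provinceNameByIdNo.sql   # script path is a placeholder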

DBeaver 4.0 as a Hive Client

Official site: https://dbeaver.jkiss.org/download/

DBeaver can connect to most databases; here we only show how to connect to Hive.

1-1)、Download the software

Link: http://pan.baidu.com/s/1geQ6KeB  Password: w5ld. If the download is unavailable, contact the author.

1-2)、Connect to Hive

Select Apache Hive as the connection type.

 

 

 

Fill in the connection details. HiveServer2 must already be running; you can test it with Beeline first.

 

Select the driver

 

 

 

The required JAR files are under the lib directory of the installation folder; add them manually.

 

1-3)、Browse the Hive metadata

 

 

 

Common Hive Errors

1-1)、Character-set problem

hive> create table userName(id int,name string);

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)

 

Solution:

Run the following statements in MySQL:

 

set global binlog_format='MIXED';
alter database hive character set latin1;

 

Then replace the MySQL driver under Hive's lib directory with a 5.1.x version.

 

1-2)、Spark assembly JAR not found

The root cause: the hive startup script contains the snippet below, which loads the Spark assembly JAR whenever Spark is present. Since Spark 2.0.0, the single large assembly JAR under lib has been replaced by many smaller JARs, so the spark-assembly JAR can no longer be found. That is where the problem lies.

 

[root@hadoop1 bin]# hive

ls: cannot access /opt/spark2.0/lib/spark-assembly-*.jar: No such file or directory

(remaining startup output omitted)

 

[root@hadoop1 bin]# vi hive

# add Spark assembly jar to the classpath

if [[ -n "$SPARK_HOME" ]]

then

  sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`

  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"

fi

 

 

Change sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar` to sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`.
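
After the edit, the block in bin/hive should read roughly as follows:

# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" ]]
then
  sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`
  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi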

 

1-3)、Garbled Chinese when viewing the table definition

Chinese text in column comments is garbled in show create table output, but displays correctly with desc.

 

Solution: https://www.iteblog.com/archives/1687.html

Official issue: https://issues.apache.org/jira/browse/HIVE-11837

 

Hive Data Export and Import

1-1)、Hive to HBase

For the detailed procedure, see the summary "Several ways to load data into HBase".

 

1-2)、Hive to MySQL

A)、Export data with Sqoop

1-1)、Partial export (selected columns)

sqoop export --connect  jdbc:mysql://hadoop1:3306/dbname  --username root  --password 123456 --table sp_address_code  --columns "ID,PLACE_TYPE,PLACE_CODE,PLACE_NAME,UP_PLACE_CODE"  --hcatalog-database xiaoxu --hcatalog-table xiaoxu

1-2)、Full export

sqoop export --connect  jdbc:mysql://hadoop1:3306/xiaoxu  --username root --password 123456 --table sp_address_code  --hcatalog-database xiaoxu  --hcatalog-table xiaoxu
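
After the Sqoop job finishes, the row counts in Hive and MySQL can be compared as a quick sanity check; a minimal sketch (run the second statement in the MySQL client):

hive> select count(*) from xiaoxu.xiaoxu;
mysql> select count(*) from sp_address_code;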

1-3)、Hive to Redis

Before running the script below, HiveServer2 must be running on port 10000. The MySQL approach works along the same lines.

A)、Write the Hive extraction script

[root@hadoop1 testData]# vi hive-redis.sh

source /etc/profile

 

hadoop fs -rmr /hive-redis

 

beeline -u jdbc:hive2://hadoop1:10000  -n root  -p 123456 -e "

insert overwrite directory '/hive-redis'  SELECT CONCAT_ws('\t',redis_cmd,redis_key,hkey,hval)

FROM (

select

'HSET' as redis_cmd,

'sp_address' AS redis_key,

CONCAT_ws('_',PLACE_CODE,PLACE_TYPE) as hkey,

CONCAT_ws(',',PLACE_CODE,PLACE_NAME,PLACE_TYPE,UP_PLACE_CODE) as hval

from xiaoxu.sp_address limit 1234

) as sp_address;"

 

hadoop fs -cat  /hive-redis/*  |redis-cli

 

Run it and redirect the log output:

[root@hadoop1 testData]# sh hive-redis.sh  >>  /opt/testData/hive-redis.log  2>&1  &

B)、Check the exported result on HDFS

[root@hadoop1 testData]# hadoop fs -cat   /hive-redis/000000_0|head

HSET sp_address 320205_03 320205,锡山区,03,320200

HSET sp_address 320204_03 320204,北塘区,03,320200

HSET sp_address 320203_03 320203,南长区,03,320200

HSET sp_address 320202_03 320202,崇安区,03,320200

HSET sp_address 320201_03 320201,市辖区,03,320200

HSET sp_address 320125_03 320125,高淳县,03,320100

HSET sp_address 320124_03 320124,溧水县,03,320100

HSET sp_address 320116_03 320116,六合区,03,320100

HSET sp_address 320115_03 320115,江宁区,03,320100

HSET sp_address 320114_03 320114,雨花台区,03,320100

cat: Unable to write to output stream.

C)、Check the result in Redis

127.0.0.1:6379> KEYS *

1) "sp_address"

 

 

View all of the data:

127.0.0.1:6379> HGETALL "sp_address"

2464) "320700,\xe8\xbf\x9e\xe4\xba\x91\xe6\xb8\xaf\xe5\xb8\x82,02,320000"

2465) "340100_02"

2466) "340100,\xe5\x90\x88\xe8\x82\xa5\xe5\xb8\x82,02,340000"

2467) "331100_02"

2468) "331100,\xe4\xb8\xbd\xe6\xb0\xb4\xe5\xb8\x82,02,330000"

 

 

View a single entry:

127.0.0.1:6379> HGET "sp_address" "340100_02"

"340100,\xe5\x90\x88\xe8\x82\xa5\xe5\xb8\x82,02,340000"

 

D)、Check the output log

For the detailed run, see: http://blog.csdn.net/xfg0218/article/details/70493195

 

1-4)、Hive to Elasticsearch

A)、Download the corresponding JAR

Link: http://pan.baidu.com/s/1bp9vU2N  Password: 1nbz
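
Before creating the external table in the next step, the elasticsearch-hadoop connector JAR downloaded above needs to be registered with Hive; a minimal sketch (the path and version below are placeholders):

hive> add jar /opt/testData/elasticsearch-hadoop-2.3.4.jar;
-- the JAR path and version here are examples only; use the file you actually downloaded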

 

B)、Create the Hive-to-ES external table

hive>

CREATE EXTERNAL TABLE sp_address_es (

    ID      STRING,

    PLACE_TYPE STRING,

PLACE_NAME STRING,

UP_PLACE_CODE STRING

)

STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'

TBLPROPERTIES('es.resource' = 'sp_address/sp_address', 'es.nodes'='hadoop1','es.port'='9003','es.mapping.id' = 'ID');

 

C)、Insert data

hive> insert  into  table  sp_address_es  select ID,PLACE_TYPE,PLACE_NAME,UP_PLACE_CODE  from  xiaoxu.es_sp_address;
