Hive Summary
Overview
Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and provides SQL-like query capabilities.
Main use: offline data analysis. It is more productive than writing MapReduce jobs directly; Hive uses HDFS as the storage system and MapReduce as the execution engine.
Hive can use the embedded Derby database for its metastore. Derby has a small footprint, but it only supports a single session, so the metadata is not reliable for shared use.
Hive 2.0 was released directly after the Hive 1.2.x line, and from Hive 2.0 onward MapReduce is no longer the preferred execution engine.
Differences between internal and external tables
1. First, the differences between internal (managed) and external tables in Hive:
When Hive creates an internal table, it moves the data into the path pointed to by the warehouse; when it creates an external table, it only records where the data lives and does not move the data at all. When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table removes only the metadata and leaves the data in place. External tables are therefore somewhat safer, allow more flexible data organization, and make it easier to share the source data.
Note that traditional databases validate table data on write (schema on write), whereas Hive does not check whether the data conforms to the schema at load time; Hive follows schema on read and only checks and parses the fields and the schema when the data is read. The advantage of schema on read is that LOAD DATA is very fast, because the data does not have to be read and parsed; the file is simply copied or moved. The advantage of schema on write is better query performance, because columns can be indexed and compressed after being parsed up front, at the cost of longer load times.
2. An internal table is created directly with a CREATE statement and data is then loaded into it; an external table typically points at files that are loaded from outside (for example uploaded to HDFS). In both cases the data lives in HDFS.
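A minimal sketch of the two table types (the table names and the /data/t_external path here are illustrative, not taken from the examples below):
-- Managed (internal) table: DROP TABLE removes both the metadata and the data
create table t_managed(id int, name string)
row format delimited fields terminated by ',';
-- External table: DROP TABLE removes only the metadata; the files under LOCATION are kept
create external table t_external(id int, name string)
row format delimited fields terminated by ','
location '/data/t_external';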
Features
1-1)、Scalability
Hive can scale the cluster freely, and in most cases no service restart is needed.
1-2)、Extensibility
Hive supports user-defined functions, so users can implement functions that fit their own needs.
1-3)、Fault tolerance
Good fault tolerance: a SQL query can still finish even if a node runs into problems.
Installing Hive
1-1)、Installation
[root@hadoop1 local]# tar -zxvf apache-hive-1.2.1-bin.tar.gz
[root@hadoop1 local]# mv apache-hive-1.2.1-bin hive
1-2)、Configure environment variables
[root@hadoop1 hive]# vi /etc/profile
Add the following configuration:
export HIVE_HOME=/usr/local/hive
[root@hadoop1 hive]# source /etc/profile
1-3)、Resolving the JAR conflict
Copy jline-2.12.jar from Hive's lib directory into /usr/local/hadoop-2.6.4/share/hadoop/yarn/lib.
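A command sketch for this step (the exact JAR file name depends on the Hive release; jline-2.12.jar is assumed here):
[root@hadoop1 local]# cp /usr/local/hive/lib/jline-2.12.jar /usr/local/hadoop-2.6.4/share/hadoop/yarn/lib/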
1-4)、Storing metadata locally
Copy the following files from Hadoop's etc/hadoop/ directory into Hive's conf directory:
[root@hadoop1 conf]# cp core-site.xml hdfs-site.xml /opt/hive-1.2/conf/
Start Hive and check whether any errors are reported:
[root@hadoop1 bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 2.263 seconds, Fetched: 1 row(s)
hive> create database hive;
OK
Time taken: 0.408 seconds
hive> use hive;
OK
Time taken: 0.025 seconds
hive> show tables;
OK
Time taken: 0.081 seconds
hive> create table test_hive(id int,name string);
OK
Time taken: 0.618 seconds
hive> insert into test_hive values(1,"xiaozhang");
Query ID = root_20160924004917_77b8b458-1552-4855-8485-853c649291e2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-24 00:49:24,162 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local927918446_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hive.db/test_hive/.hive-staging_hive_2016-09-24_00-49-17_955_4462960730369907542-1/-ext-10000
Loading data to table hive.test_hive
Table hive.test_hive stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 7.403 seconds
hive> select * from test_hive;
OK
1 xiaozhang
Time taken: 0.131 seconds, Fetched: 1 row(s)
View the table directories under the hive database:
[root@hadoop1 bin]# hadoop fs -ls /user/hive/warehouse/hive.db/
1-5)、Storing metadata in MySQL
1-1)、Edit hive-site.xml
Create hive-site.xml in the Hive conf directory and add the following content:
[root@hadoop1 conf]# vi hive-site.xml
<configuration>
<!-- Metastore JDBC connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- Metastore JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- Metastore database username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<!-- Metastore database password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
1-2)、Edit hive-env.sh
[root@hadoop1 conf]# mv hive-env.sh.template hive-env.sh
[root@hadoop1 conf]# cat hive-env.sh
Add the following configuration:
export HADOOP_HEAPSIZE=2048
HADOOP_HOME=/usr/local/hadoop-2.6.4
export HIVE_CONF_DIR=/opt/hive-1.2/conf
export JAVA_HOME=/opt/jdk1.7
[root@hadoop1 bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 2.71 seconds, Fetched: 1 row(s)
hive> create database hiveTest;
OK
Time taken: 0.249 seconds
hive> use hiveTest;
OK
Time taken: 0.073 seconds
hive> show tables;
OK
Time taken: 0.081 seconds
hive> create table hive(id int,name string);
OK
Time taken: 0.469 seconds
hive> insert into hive values(1,"xiaozhang");
Query ID = root_20160924184519_6bf172ec-bf83-43e2-a722-802b7d1b791f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-24 18:45:25,965 Stage-1 map = 0%, reduce = 0%
2016-09-24 18:45:26,988 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1476697233_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/hive/.hive-staging_hive_2016-09-24_18-45-19_555_3195082767256194605-1/-ext-10000
Loading data to table hivetest.hive
Table hivetest.hive stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12 HDFS Write: 93 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 8.335 seconds
hive> select * from hive;
OK
1 xiaozhang
Time taken: 0.219 seconds, Fetched: 1 row(s)
E)、Inspecting the metadata in MySQL
You can see that when Hive stores metadata, the table names and column definitions are kept in separate metastore tables, while the table data itself is stored in HDFS; the default path here is hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/hive. In the metastore database, the TBLS table holds table-level information and the DBS table records each database's location on HDFS.
Put mysql-connector-java-5.0.3-bin.jar into the lib directory under the Hive installation; it can be downloaded from: http://download.csdn.net/download/dly1911618/1035034
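A quick way to verify this, assuming the metastore database is named hive as configured above (TBLS and DBS are tables of the Hive 1.2 metastore schema):
mysql> use hive;
mysql> select TBL_NAME, TBL_TYPE from TBLS;
mysql> select NAME, DB_LOCATION_URI from DBS;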
Basic Hive usage
1-1)、Inserting ordinary data
hive> create table t_test(id int ,name string, password string) row format delimited fields terminated by ',';
OK
Time taken: 0.196 seconds
hive> insert into t_test(id,name,password) values('1','323','sfdf');
Query ID = root_20160924191209_07686e5a-177e-4194-88a5-58536f15698a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-24 19:12:13,483 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local956434617_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test/.hive-staging_hive_2016-09-24_19-12-09_727_5241291650450368428-1/-ext-10000
Loading data to table hivetest.t_test
Table hivetest.t_test stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 104 HDFS Write: 186 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 4.957 seconds
hive> select * from t_test;
OK
1 323 sfdf
Time taken: 0.14 seconds, Fetched: 1 row(s)
[root@hadoop1 bin]# hadoop fs -cat /user/hive/warehouse/hivetest.db/t_test/000000_0
1,323,sfdf
Note: by default, databases and tables are treated as directories and only the data itself is stored as files. The default file name is 000000_0; if there are multiple files, the suffix increases for each additional file.
View the database stored locally:
1-2)、Ways to run SQL
# hive -help
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
usage: hive
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
B)、Using the hive -e command
[root@hadoop3 hive]# hive -e "show databases"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
default
hivetest
Time taken: 2.623 seconds, Fetched: 2 row(s)
[root@hadoop3 hive]# hive -S -e "show databases"
In silent mode, the MapReduce progress output is not displayed.
# vi show-databases.sql
show databases;
# hive -S -f show-databases.sql
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Default
E)、Environment variables are an alternative way to pass values into Hive
# export TRAINSVAR=trains_var
# hive -S -e "select * from ${TRAINSVAR} limit 10"
****************************
1-3)、External table data (uploading data via HDFS)
Prepare the data:
[root@hadoop1 /]# vi /test.bat
1,123,ffgg
2,sfgf,eferg
3,dvf,dvrfg
hive> desc t_test;
OK
id int
name string
password string
Time taken: 0.21 seconds, Fetched: 3 row(s)
[root@hadoop1 /]# hadoop fs -put test.bat /user/hive/warehouse/hivetest.db/t_test
hive> select * from t_test;
OK
1 323 sfdf
1 123 ffgg
2 sfgf eferg
3 dvf dvrfg
Time taken: 0.125 seconds, Fetched: 4 row(s)
[root@hadoop1 /]# hadoop fs -cat /user/hive/warehouse/hivetest.db/t_test/test.bat
1,123,ffgg
2,sfgf,eferg
3,dvf,dvrfg
Note: the uploaded file can have any extension. If a field's value does not match the column type of the table, it is treated as NULL; extra fields are discarded and missing fields are filled with NULL.
1-4)、Internal table data (importing and exporting data with Hive commands)
hive> desc t_test;
OK
id int
name string
password string
Time taken: 0.21 seconds, Fetched: 3 row(s)
[root@hadoop1 /]# vi t_test.bat
4,dsfef,fgrg
8,drfg,fgrg
9,dgr,rgr
10,dgfrg,rgr
11,dgr,rgrgr
12,dsfe,sfe
1-1)、Before loading
[root@hadoop1 testData]# ls
test.bat t_test.bat
hive> load data local inpath '/usr/local/hive/testData/t_test.bat' into table t_test;
Loading data to table hivetest.t_test
Table hivetest.t_test stats: [numFiles=3, numRows=0, totalSize=120, rawDataSize=0]
OK
Time taken: 1.059 seconds
1-2)、After loading
[root@hadoop3 testData]# ls
test.bat t_test.bat
1-3)、Loading from HDFS
[root@hadoop3 testData]# ls
test.bat t_test.bat t_hadoop.bat
[root@hadoop3 testData]# hadoop fs -put t_hadoop.bat /user
Load from HDFS into the table:
hive> load data inpath '/user/t_hadoop.bat' into table t_test;
Loading data to table hivetest.t_test
Table hivetest.t_test stats: [numFiles=4, numRows=0, totalSize=182, rawDataSize=0]
OK
Time taken: 0.728 seconds
Check the files on HDFS:
[root@hadoop3 testData]# hadoop fs -ls /user
Found 2 items
drwxr-xr-x - root supergroup 0 2016-09-24 00:46 /user/hive
drwxr-xr-x - root supergroup 0 2016-09-02 05:25 /user/root
Note: when loading data from the local filesystem, a temporary copy is first written to HDFS, in a staging directory alongside the target, and moved into the table directory once the upload completes. When loading from HDFS, the file is simply moved. Loading local data requires the LOCAL keyword before INPATH.
[root@hadoop1 testData]# ls
[root@hadoop1 testData]#
hive> insert overwrite local directory '/usr/local/hive/testData/t_test.text' row format delimited fields terminated by '\t' select * from t_test;
Query ID = root_20160924201517_fda8eced-7f74-4aa1-9b1f-cc6fdc153064
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-24 20:15:19,324 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local296856771_0005
Copying data to local directory /usr/local/hive/testData/t_test.text
Copying data to local directory /usr/local/hive/testData/t_test.text
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 1274 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.692 seconds
[root@hadoop1 testData]# cd t_test.text/
[root@hadoop1 t_test.text]# ls
000000_0
[root@hadoop3 t_test.text]# cat 000000_0
1323sfdf
345dffefer
78efefeefefeg
1323fegfeefegeg
3545dfdgdgfg
4dsfeffgrg
8drfgfgrg
9dgrrgr
10dgfrgrgr
11dgrrgrgr
12dsfesfe
1123ffgg
2sfgfeferg
3dvfdvrfg
[root@hadoop1 testData]# ls
[root@hadoop1 testData]#
hive> insert overwrite directory '/usr/local/hive/testData/t_test.text' row format delimited fields terminated by '\t' select * from t_test;
hive> insert into table t_test_table select * from t_test;
1-5)、Running a SQL script file with Hive
[root@hadoop1 testDate]# vi common.property
set hivevar:db.default=default;
set hivevar:db.allinfo=allinfo;
set hivevar:db.xiaoxu=xiaoxu;
set hivevar:hdfs.url=hdfs://hadoop1:9000;
set mapred.job.queue.name=pms;
set hive.exec.reducers.max=20;
set mapred.reduce.tasks=200;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=20;
set mapred.job.name=extract_trfc_page_kpi;
hivevar: defines variables that are referenced inside the SQL file.
[root@hadoop1 testDate]# vi textTable.sql
-- set the number of reduce tasks
set mapred.reduce.tasks = 50;
-- set the HDFS path
set hdfs.url=${hivevar:hdfs.url};
-- read the database names initialized from the configuration file
set source.db.name.default=${hivevar:db.default};
set source.db.name.allinfo=${hivevar:db.allinfo};
set out.db.name=${hivevar:db.xiaoxu};
-- create the databases
CREATE DATABASE IF NOT EXISTS ${hiveconf:source.db.name.allinfo};
CREATE DATABASE IF NOT EXISTS ${hiveconf:out.db.name};
-- create the table structure
CREATE TABLE IF NOT EXISTS ${hiveconf:out.db.name}.textTable(
id bigint,
name string,
age int
);
-- main logic
CREATE TABLE IF NOT EXISTS ${hiveconf:out.db.name}.textTable AS
SELECT ps.id,ps.name,ps.age FROM ${hiveconf:source.db.name.default}.person ps;
-- keep intermediate data in memory
WITH memoryData1 AS ( SELECT ps.id,ps.name,ps.age FROM ${hiveconf:out.db.name}.textTable ps WHERE age='23'),
memoryData2 AS ( SELECT ps.id,ps.name,ps.age FROM ${hiveconf:out.db.name}.textTable ps WHERE age='24' )
SELECT * FROM memoryData1 UNION ALL SELECT * FROM memoryData2;
hiveconf: receives the parameters set inside the script and works together with hivevar. The WITH clause keeps the referenced result set available for the query, so avoid loading too much data this way.
[root@hadoop1 testDate]# hive -i common.property -f textTable.sql
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
Time taken: 1.642 seconds
OK
Time taken: 0.045 seconds
OK
Time taken: 0.247 seconds
Query ID = root_20170310190113_12f2d1a9-6f30-4991-b09a-11f2cb82b043
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2017-03-10 19:01:19,218 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local546671993_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 130 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 xiaozhang 23
2 xiaowang 24
9 daye 24
Time taken: 6.281 seconds, Fetched: 3 row(s)
The output above shows the execution flow, including the map and reduce stages. hive -i loads an initialization file and -f runs a SQL file; run hive -help for details.
[root@hadoop1 testDate]# hive -help
usage: hive
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
1-6)、Viewing a table's creation properties
hive> show create table result;
OK
CREATE TABLE `result`(
`id` bigint,
`name` string,
`age` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hadoop1:9000/user/hive/warehouse/xiaoxu.db/result'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='10',
'rawDataSize'='120',
'totalSize'='130',
'transient_lastDdlTime'='1489208970')
Time taken: 0.287 seconds, Fetched: 19 row(s)
The output above shows the structure of the result table as well as the HDFS path where it is stored.
1-7)、Passing parameters to Hive from the command line
[root@hadoop1 testDate]# cat hiveVar.sql
set output.date=${hivevar:db.test};
select * from ${hiveconf:output.date}.person where age=${hivevar:age};
[root@hadoop1 testDate]# hive -i common.property --hivevar age='23' -f hiveVar.sql
****************************
OK
Time taken: 9.516 seconds
OK
1 xiaozhang 23
Time taken: 1.028 seconds, Fetched: 1 row(s)
[root@hadoop1 testDate]# vi hiveVar.sql
set output.date=${hivevar:db.test};
select * from ${hiveconf:output.date}.person where age between ${hivevar:age} and ${hivevar:age1};
[root@hadoop1 testDate]# hive -i common.property --hivevar age='10' --hivevar age1='25' -f hiveVar.sql
Time taken: 7.524 seconds
OK
1 xiaozhang 23
2 xiaowang 24
3 xiaoli 25
4 xiaoxiao 26
5 xiaoxiao 27
6 xiaolizi 39
7 xiaodaye 10
8 dageda 12
9 daye 24
10 dada 25
1-8)、The difference between INSERT INTO and INSERT OVERWRITE
hive> select * from default.person;
OK
1 xiaozhang 23
2 xiaowang 24
3 xiaoli 25
4 xiaoxiao 26
5 xiaoxiao 27
6 xiaolizi 39
7 xiaodaye 10
8 dageda 12
9 daye 24
10 dada 25
hive> insert overwrite table default.hiveVar select * from default.person;
hive> select * from default.hiveVar;
OK
1 xiaozhang 23
2 xiaowang 24
3 xiaoli 25
4 xiaoxiao 26
5 xiaoxiao 27
6 xiaolizi 39
7 xiaodaye 10
8 dageda 12
9 daye 24
10 dada 25
After running the INSERT OVERWRITE above several times, the query still returns the same number of rows, because each OVERWRITE replaces the table's existing contents with the query result.
hive> insert into table default.hiveVar select * from default.person;
hive> select * from default.hiveVar;
OK
1 xiaozhang 23
2 xiaowang 24
3 xiaoli 25
4 xiaoxiao 26
5 xiaoxiao 27
6 xiaolizi 39
7 xiaodaye 10
8 dageda 12
9 daye 24
10 dada 25
1 xiaozhang 23
2 xiaowang 24
3 xiaoli 25
4 xiaoxiao 26
5 xiaoxiao 27
6 xiaolizi 39
7 xiaodaye 10
8 dageda 12
9 daye 24
10 dada 25
After running the INSERT INTO above several times, the number of rows is the original count multiplied by the number of inserts: INTO appends the new data and keeps what was already there.
INSERT OVERWRITE replaces whatever data the table (or partition) already holds with the query result, so no matter how often it is run, only the latest result set remains.
INSERT INTO is a simple append with no duplicate checking: if N rows identical to the inserted row already exist, there will be N+1 such rows afterwards.
1-9)、Viewing HDFS files from within the Hive shell
hive> dfs -ls /;
Found 14 items
drwx-wx-wx - root supergroup 0 2016-11-05 02:58 /tmp
drwxr-xr-x - root supergroup 0 2016-11-05 03:14 /user
drwxr-xr-x - root supergroup 0 2016-11-06 08:02 /usr
hive> dfs -ls /user/;
Found 1 items
drwxr-xr-x - root supergroup 0 2016-11-05 03:14 /user/hive
Advanced Hive usage
1-1)、Table partitions
[root@hadoop1 testData]# vi t_part.bat
132213,dfggr
35,dfdfggfr
234,fhththtg
78,ddvbfbfb
90,dwwnrrg
213,qazxwe
hive> create table t_part(id int,name string) partitioned by(country string) row format delimited fields terminated by ',';
OK
Time taken: 0.191 seconds
create table if not exists result.result_table(
statistics_date string,
statist_day string
)
stored as parquetfile;
The statements above show that each row's fields are separated by commas, that the partition column is named country, and that STORED AS applies a compressed storage format to the data.
hive> load data local inpath '/usr/local/hive/testData/t_part.bat' into table t_part partition(country='CHA');
Loading data to table hivetest.t_part partition (country=CHA)
Partition hivetest.t_part{country=CHA} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 1.479 seconds
hive> select * from t_part;
OK
132213 dfggr CHA
35 dfdfggfr CHA
234 fhththtg CHA
78 ddvbfbfb CHA
90 dwwnrrg CHA
213 qazxwe CHA
Time taken: 0.156 seconds, Fetched: 6 row(s)
hive> load data local inpath '/usr/local/hive/testData/t_part.bat' into table t_part partition(country='USA');
Loading data to table hivetest.t_part partition (country=USA)
Partition hivetest.t_part{country=USA} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 1.22 seconds
hive> select * from t_part;
OK
132213 dfggr CHA
35 dfdfggfr CHA
234 fhththtg CHA
78 ddvbfbfb CHA
90 dwwnrrg CHA
213 qazxwe CHA
132213 dfggr USA
35 dfdfggfr USA
234 fhththtg USA
78 ddvbfbfb USA
90 dwwnrrg USA
213 qazxwe USA
Time taken: 0.147 seconds, Fetched: 12 row(s)
Each partition contains its own data.
hive> select COUNT(*) from t_part where country="CHA";
Query ID = root_20160924203813_13857cea-0688-4a3c-a6d8-9ba308fc3b16
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-09-24 20:38:15,758 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1484897114_0007
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 3632 HDFS Write: 288 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
6
Time taken: 1.85 seconds, Fetched: 1 row(s)
hive> select * from t_part where country="CHA";
OK
132213 dfggr CHA
35 dfdfggfr CHA
234 fhththtg CHA
78 ddvbfbfb CHA
90 dwwnrrg CHA
213 qazxwe CHA
Time taken: 0.413 seconds, Fetched: 6 row(s)
Note: a partition is exposed as a pseudo-column, which makes querying convenient; it splits a large dataset into smaller pieces so that queries filtering on the partition column run faster. Partitioning is important and very commonly used in real-world development.
[root@hadoop1 testData]# hadoop fs -ls /user/hive/warehouse/hivetest.db/t_part
Found 2 items
drwxr-xr-x - root supergroup 0 2016-09-24 20:28 /user/hive/warehouse/hivetest.db/t_part/country=CHA
drwxr-xr-x - root supergroup 0 2016-09-24 20:30 /user/hive/warehouse/hivetest.db/t_part/country=USA
[root@hadoop1 alertDate]# hadoop fs -ls /user/hive/warehouse/test.db/t_part_1
Found 2 items
drwxr-xr-x - root supergroup 0 2016-07-29 00:12 /user/hive/warehouse/test.db/t_part_1/country=CHA
drwxr-xr-x - root supergroup 0 2016-07-29 00:13 /user/hive/warehouse/test.db/t_part_1/country=USA
[root@hadoop1 alertDate]# hadoop fs -ls /user/hive/warehouse/test.db/t_part_1/country=CHA
Found 1 items
-rwxr-xr-x 3 root supergroup 105 2016-07-29 00:12 /user/hive/warehouse/test.db/t_part_1/country=CHA/test.text
[root@hadoop1 testData]# hadoop fs -cat /user/hive/warehouse/hivetest.db/t_part/country=CHA/t_part.bat
132213,dfggr
35,dfdfggfr
234,fhththtg
78,ddvbfbfb
90,dwwnrrg
213,qazxwe
View the partitions of table t_part:
hive> show partitions t_part;
OK
country=CHA
Time taken: 0.475 seconds, Fetched: 1 row(s)
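When the partition value should come from the data itself rather than be written by hand, a dynamic-partition insert can be used. A minimal sketch, where t_part_dyn is a hypothetical target table with the same layout as t_part above (the two set options are standard Hive settings):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table t_part_dyn(id int, name string) partitioned by(country string)
row format delimited fields terminated by ',';
-- the partition value is taken from the last column of the SELECT
insert overwrite table t_part_dyn partition(country)
select id, name, country from t_part;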
1-2)、Creating an external table
[root@hadoop1 testData]# hadoop fs -mkdir /data/exp
hive> CREATE EXTERNAL TABLE IF NOT EXISTS td_ext_a(id string, name string,password string) row format delimited fields terminated by ',' location '/data/exp';
OK
Time taken: 7.838 seconds
[root@hadoop1 testData]# hadoop fs -put text.text /data/exp
[root@hadoop1 testData]# hadoop fs -cat /data/exp/text.text
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
hive> select * from td_ext_a;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
Time taken: 0.987 seconds, Fetched: 8 row(s)
Note: LOCATION points to an HDFS directory. After creating the table, upload the data into that directory and then simply query it from Hive.
1-3)、Data storage and compression
hive> create table td_ext_b(id string, name string,password string) row format delimited fields terminated by ',' STORED AS textfile;
OK
Time taken: 0.612 seconds
hive> load data local inpath '/usr/local/hive/testData/text.text' into table td_ext_b;
Loading data to table hivetest.td_ext_b
Table hivetest.td_ext_b stats: [numFiles=1, totalSize=104]
OK
Time taken: 1.914 seconds
hive> select * from td_ext_b;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
Time taken: 0.22 seconds, Fetched: 8 row(s)
The table uses the TEXTFILE storage format (plain text, without columnar compression).
1-4)、Copying a table
Quickly copy the data from table td_ext_b into table td_ext_c:
hive> create table td_ext_c as select * from td_ext_b;
Query ID = root_20160925051042_6957d79c-306f-4c6f-8511-7fb854cb6caf
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-25 05:10:46,127 Stage-1 map = 0%, reduce = 0%
2016-09-25 05:10:47,147 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local616044424_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/.hive-staging_hive_2016-09-25_05-10-42_436_3642217413855049913-1/-ext-10001
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/td_ext_c
Table hivetest.td_ext_c stats: [numFiles=1, numRows=8, totalSize=104, rawDataSize=96]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 2128 HDFS Write: 425 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 6.796 seconds
As you can see, even copying a table with CREATE TABLE ... AS SELECT runs a MapReduce job, because the rows have to be copied as well; the cost grows with the amount of data.
hive> select * from td_ext_c;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
Time taken: 0.139 seconds, Fetched: 8 row(s)
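If only the table structure is needed and not the data, CREATE TABLE ... LIKE (described in the DDL notes below) avoids the MapReduce job entirely. A sketch, where td_ext_d is just an illustrative name:
hive> create table td_ext_d like td_ext_b;    -- copies only the schema, no data and no MR job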
1-5)、Creating a bucketed table
Hive can organize data by hashing a column's values, which is called bucketing; it is well suited to sampling and map joins.
hive> create table stu__buck(id string,name string,password string) clustered by (id) into 4 buckets row format delimited fields terminated by ',';
OK
Time taken: 0.332 seconds
clustered by (id) into 4 buckets is the bucketing clause.
Insert data using INSERT OVERWRITE:
hive> insert overwrite table stu__buck select * from td_ext_a cluster by (id);
Query ID = root_20160925051355_dc812af1-f359-4246-a1ad-507370b66540
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-09-25 05:13:59,078 Stage-1 map = 0%, reduce = 0%
2016-09-25 05:14:02,167 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1833084919_0009
Loading data to table hivetest.stu__buck
Table hivetest.stu__buck stats: [numFiles=1, numRows=8, totalSize=104, rawDataSize=96]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 4818 HDFS Write: 1028 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 7.47 seconds
cluster by (id) manually distributes the rows into buckets on insert. If set hive.enforce.bucketing=true; is enabled, CLUSTER BY is not needed, for example: hive> insert overwrite table stu__buck select * from td_ext_a;
hive> select * from stu__buck cluster by(id);
Query ID = root_20160925051436_323b6d35-176b-415c-89cc-fe3041b2f383
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-09-25 05:14:38,270 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1558510236_0010
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 5174 HDFS Write: 1206 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
A 2015-01 5
A 2015-01 15
A 2015-01 8
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-01 5
B 2015-01 25
Time taken: 1.55 seconds, Fetched: 8 row(s)
1-6)、Common CREATE TABLE syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Notes:
A)、CREATE TABLE
CREATE TABLE creates a table with the given name. If a table of the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses it.
B)、EXTERNAL
The EXTERNAL keyword creates an external table and lets you point it at the actual data with LOCATION. When Hive creates an internal table it moves the data into the warehouse path; for an external table it only records where the data lives and does not move it. On DROP, an internal table's metadata and data are deleted together, while an external table loses only its metadata and keeps the data.
1-1)、LIKE lets you copy an existing table's structure without copying its data.
1-2)、ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Users can define their own SerDe or use a built-in one when creating a table. If no ROW FORMAT or ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table the user also specifies the columns (and, with a custom SerDe, its properties); Hive uses the SerDe to work out how the table's columns map onto the data.
C)、STORED
STORED AS SEQUENCEFILE|TEXTFILE|RCFILE
If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
D)、CLUSTERED BY
With CLUSTERED BY, Hive can further organize any table or partition into buckets, a finer-grained subdivision of the data. Bucketing is applied to a particular column: Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record goes into.
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Buckets add extra structure that Hive can exploit for certain queries. In particular, a join between two tables bucketed on the same (join) column can be executed as an efficient map-side join: only buckets holding the same column values have to be joined with each other, which greatly reduces the amount of data involved in the JOIN.
(2) More efficient sampling. When working with very large datasets, being able to try a query on a small slice of the data is very convenient while developing and refining queries; see the sampling sketch below.
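A sampling sketch against the stu__buck table created earlier (TABLESAMPLE on a bucketed table is standard HiveQL):
hive> select * from stu__buck tablesample(bucket 1 out of 4 on id);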
1-7)、Other DDL operations
A)、Adding/dropping partitions
Syntax
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
ALTER TABLE table_name DROP partition_spec, partition_spec,...
ALTER TABLE table_name ADD PARTITION (part='a') PARTITION (part='b')
B)、Renaming a table
ALTER TABLE table_name RENAME TO new_table_name
hive> alter table a rename to a_a;
OK
Time taken: 0.76 seconds
C)、Adding/replacing columns
Syntax
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Note: ADD appends a new column after all existing columns (but before the partition columns), while REPLACE replaces all of the table's columns.
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
1-8)、Other DML operations
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
hive> load data local inpath '/home/alertDate/text.dat' into table b;
Loading data to table t_test.b
Table t_test.b stats: [numFiles=4, numRows=0, totalSize=143, rawDataSize=0]
OK
Time taken: 0.586 seconds
hive> load data inpath '/home/alertDate' into table t_test5;
hive> load data local inpath '/home/alertDate/test.dat' into table t_test5;
Loading data to table test.t_test5
Table test.t_test5 stats: [numFiles=1, totalSize=105]
OK
Time taken: 0.905 seconds
Note the difference between LOCAL INPATH and INPATH.
Insert the result of a query into another table:
INSERT OVERWRITE [INTO] TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
Insert a single row:
INSERT INTO TABLE tablename VALUES (xx, yy, zz);
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
Notes: 1. ORDER BY performs a global sort of the input, so only one reducer is used, which can take a long time when the input is large.
2. SORT BY is not a global sort; it sorts the data before it enters each reducer. With SORT BY and mapred.reduce.tasks > 1, each reducer's output is sorted, but the overall result is not globally ordered.
3. DISTRIBUTE BY (column) routes rows to reducers based on the specified column, using a hash of its value.
4. CLUSTER BY (column) has the functionality of DISTRIBUTE BY and additionally sorts by that column.
So when the bucketing column and the sort column are the same, CLUSTER BY = DISTRIBUTE BY + SORT BY, as the sketch after this list shows.
The main benefit of bucketed tables is more efficient JOIN operations.
(Think about this:
select a.id,a.name,b.addr from a join b on a.id = b.id;
If tables a and b are already bucketed, and the bucketing column is id,
does this join still need a full cross product of both tables?)
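A small sketch of the CLUSTER BY equivalence, reusing the default.person table from the earlier examples (the query shape itself is illustrative):
-- these two statements distribute and order the rows in the same way
hive> select id, name, age from default.person distribute by id sort by id;
hive> select id, name, age from default.person cluster by id;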
1-9)、Inspection commands
List the databases:
hive> show databases;
OK
default
hivetest
Time taken: 0.079 seconds, Fetched: 2 row(s)
List the tables:
hive> show tables;
OK
hive
stu__buck
t_part
t_partition
t_test
td_ext_a
td_ext_b
td_ext_c
Time taken: 0.077 seconds, Fetched: 8 row(s)
List a table's partitions:
hive> show partitions t_part;
OK
country=CHA
country=USA
Time taken: 0.219 seconds, Fetched: 2 row(s)
List the functions:
hive> show functions;
OK
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
cbrt
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
create_union
cume_dist
current_database
current_date
current_timestamp
current_user
date_add
date_format
date_sub
datediff
day
dayofmonth
decode
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
factorial
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
greatest
hash
hex
histogram_numeric
hour
if
in
in_file
index
initcap
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_day
last_value
lcase
lead
least
length
levenshtein
like
ln
locate
log
log10
log2
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
min
minute
month
months_between
named_struct
negative
next_day
ngrams
noop
noopstreaming
noopwithmap
noopwithmapstreaming
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
shiftleft
shiftright
shiftrightunsigned
sign
sin
size
sort_array
soundex
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
trunc
ucase
unbase64
unhex
unix_timestamp
upper
var_pop
var_samp
variance
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~
Time taken: 0.033 seconds, Fetched: 216 row(s)
It is worth becoming familiar with the functions above; they will be very helpful later on.
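To check what a specific function does, DESCRIBE FUNCTION can be used, for example:
hive> describe function extended concat;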
View a table's extended information:
hive> desc extended t_test;
OK
id int
name string
password string
Detailed Table Information Table(tableName:t_test, dbName:hivetest, owner:root, createTime:1474769518, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:password, type:string, comment:null)], location:hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=4, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1474772423, numRows=0, totalSize=182, rawDataSize=0}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.181 seconds, Fetched: 5 row(s)
View a table's formatted information:
hive> desc formatted t_test;
OK
# col_name data_type comment
id int
name string
password string
# Detailed Table Information
Database: hivetest
Owner: root
CreateTime: Sat Sep 24 19:11:58 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/hivetest.db/t_test
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 4
numRows 0
rawDataSize 0
totalSize 182
transient_lastDdlTime 1474772423
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.27 seconds, Fetched: 34 row(s)
Note: show partitions table_name requires the table to have been created with partitions; desc extended table_name;
and desc formatted table_name; show more detailed information about a table.
Ways to save SELECT query results
1-1)、Save the query result into a new Hive table
hive> create table t_table as select * from td_ext_a;
Query ID = root_20160729132825_745c6f57-93da-4563-bed3-788fd781598e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1469693904965_0008, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0008/
Kill Command = /home/hadoop-2.6.4/bin/hadoop job -kill job_1469693904965_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-07-29 13:28:40,662 Stage-1 map = 0%, reduce = 0%
CREATE TABLE ... AS SELECT is a quick way to copy a table.
hive> select * from t_table;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
Time taken: 0.093 seconds, Fetched: 8 row(s)
1-2)、Save the query result into an existing Hive table
hive> insert into table t_table select * from t_table;
Query ID = root_20160729133109_7e7e140c-13e5-470c-828e-e789b8e410f0
Total jobs = 3
Launching Job 1 out of 3n
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1469693904965_0009, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0009/
Kill Command = /home/hadoop-2.6.4/bin/hadoop job -kill job_1469693904965_0009
hive> select * from t_table;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
Time taken: 0.079 seconds, Fetched: 16 row(s)
Both copying a table this way and bulk-inserting data are fairly time-consuming, because they run MapReduce jobs.
1-3)、Save the query result into a file directory (local or HDFS)
hive> insert overwrite local directory '/data/test.text' select * from t_table;
Query ID = root_20160729133824_dff25a98-38af-4a0a-b93e-02aa42356a45
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1469693904965_0011, Tracking URL = http://hadoop2:8088/proxy/application_1469693904965_0011/
Kill Command = /home/hadoop-2.6.4/bin/hadoop job -kill job_1469693904965_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-07-29 13:38:39,662 Stage-1 map = 0%, reduce = 0%
In this statement, LOCAL writes the data to the local filesystem; without it the output goes to HDFS.
[root@hadoop2 alertDate]# hadoop fs -ls /data/testFile
[root@hadoop2 alertDate]#
Note: saving query results to an HDFS directory also runs MapReduce. All of these operations go through MapReduce, so use them with that cost in mind.
Using JOIN
1-1)、Preparing the data
hive> select * from a;
OK
1 2222 243434
2 sdff sdfff
3 sdff sdfff
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive> select * from b;
OK
3 sdff sdfff
5 sdff sdfff
1 sdff sdfff
Time taken: 0.078 seconds, Fetched: 3 row(s)
1-2)、Querying the data
Inner join:
hive> select * from a inner join b on a.id = b.id;
Query ID = root_20160729141536_fe7d3f21-e31f-49b2-9401-6c2685151cd0
Total jobs = 1
Execution log at: /tmp/root/root_20160729141536_fe7d3f21-e31f-49b2-9401-6c2685151cd0.log
2016-07-29 14:15:50 Starting to launch local task to process map join; maximum memory = 518979584
2016-07-29 14:15:53 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-15-36_256_2258338164921734285-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-07-29 14:15:53 Uploaded 1 File to: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-15-36_256_2258338164921734285-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (350 bytes)
2016-07-29 14:15:53 End of local task; Time Taken: 2.802 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
OK
1 2222 243434 1 sdff sdfff
3 sdff sdfff 3 sdff sdfff
Left join:
hive> select * from a left join b on a.id = b.id;
Query ID = root_20160729141725_8b017579-c074-4128-90ea-7d9a424b44b6
Total jobs = 1
Execution log at: /tmp/root/root_20160729141725_8b017579-c074-4128-90ea-7d9a424b44b6.log
2016-07-29 14:17:37 Starting to launch local task to process map join; maximum memory = 518979584
2016-07-29 14:17:39 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-17-25_723_6872206778159417490-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2016-07-29 14:17:39 Uploaded 1 File to: file:/tmp/root/6143ddd7-42f9-4958-bbdc-bd3a22a640b1/hive_2016-07-29_14-17-25_723_6872206778159417490-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (350 bytes)
2016-07-29 14:17:39 End of local task; Time Taken: 1.933 sec.
OK
1 2222 243434 1 sdff sdfff
2 sdff sdfff NULL NULL NULL
3 sdff sdfff 3 sdff sdfff
Right join:
hive> select * from a right join b where a.id=b.id;
Query ID = root_20160729154358_38dd1b7a-b8ac-49b5-9a56-37f22636fc11
Total jobs = 1
Execution log at: /tmp/root/root_20160729154358_38dd1b7a-b8ac-49b5-9a56-37f22636fc11.log
2016-07-29 15:44:10 Starting to launch local task to process map join; maximum memory = 518979584
2016-07-29 15:44:13 Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/root/f4a99bd2-d473-48ac-bb84-4321ff29e9c5/hive_2016-07-29_15-43-58_345_9096767528169264921-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
******
Total MapReduce CPU Time Spent: 3 seconds 430 msec
OK
3 sdff sdfff 3 sdff sdfff
1 2222 243434 1 sdff sdfff
Time taken: 53.886 seconds, Fetched: 2 row(s)
Common Hive functions
1-1)、Using split
hive> select * from a;
OK
1 2222 243434
2 sdff sdfff
3 sdff sdfff
Time taken: 0.095 seconds, Fetched: 3 row(s)
hive> select split(name,2) as name from a;
OK
["","","","",""]
["sdff"]
["sdff"]
Time taken: 0.099 seconds, Fetched: 3 row(s)
hive> select split(name,1) as name from a;
OK
["2222"]
["sdff"]
["sdff"]
Time taken: 0.33 seconds, Fetched: 3 row(s)
hive> select split(name,0) as name from a;
OK
["2222"]
["sdff"]
["sdff"]
Time taken: 0.13 seconds, Fetched: 3 row(s)
Note: the second argument of split() is a delimiter (a regular expression), not an index. split(name, 2) splits on the character "2", which is why "2222" turns into an array of empty strings, while "0" and "1" do not occur in the other values, so the whole string comes back unchanged and those two results look the same.
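A sketch with an explicit delimiter, using a comma-separated value like the ones loaded into t_test earlier:
hive> select split('1,123,ffgg', ',');    -- returns ["1","123","ffgg"]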
1-2)、Using get_json_object
hive> select * from a;
OK
1 2222 243434
2 sdff sdfff
3 sdff sdfff
6 {"name":"xiaozhang"} sdff
hive> select get_json_object(name,'$.name') as name from a;
OK
NULL
NULL
NULL
xiaozhang
Time taken: 0.366 seconds, Fetched: 4 row(s)
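To pull several keys out of the same JSON string, the built-in json_tuple (listed in the function output above) is an alternative; a sketch against the same table:
hive> select json_tuple(name, 'name') from a;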
1-3)、More functions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Finding Hive configuration properties
A)、Find properties containing the keyword warehouse
# hive -S -e "SET" | grep warehouse
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
hive.metastore.warehouse.dir=/apps/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
B)、Find properties containing the keyword thread
# hive -S -e "set" | grep "thread"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
dfs.client.file-block-storage-locations.num-threads=10
dfs.datanode.block.id.layout.upgrade.threads=12
dfs.datanode.directoryscan.threads=1
dfs.datanode.fsdatasetcache.max.threads.per.volume=4
dfs.datanode.max.transfer.threads=16384
hive.compactor.worker.threads=0
hive.exec.parallel.thread.number=8
hive.metastore.fshandler.threads=15
hive.metastore.server.max.threads=100000
hive.metastore.server.min.threads=200
hive.mv.files.thread=15
hive.orc.compute.splits.num.threads=10
hive.server2.async.exec.threads=100
hive.server2.thrift.max.worker.threads=500
hive.server2.thrift.min.worker.threads=5
hive.spark.client.rpc.threads=8
hive.stats.gather.num.threads=10
hive.txn.heartbeat.threadpool.size=5
javax.jdo.option.Multithreaded=true
Hive tuning parameters
Distribute data evenly when the group-by key is skewed:
set hive.groupby.skewindata=true
Automatically convert a join with a small table into a map join (speeds up joining a large table with a small one):
set hive.auto.convert.join=true
Maximum number of parallel threads allowed for a single SQL statement (default 8; adjust it according to cluster size and capacity; raising it uses more map and reduce resources):
set hive.exec.parallel.thread.number=8
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
Compress the output when inserting data:
SET hive.exec.compress.output=true;
When querying with SELECT, show column names in the result:
hive (default)> set hive.cli.print.header=true;
For more configuration options, see: http://blog.csdn.net/xfg0218/article/details/61417668
Or list the settings with the following command:
hive> set;
_hive.hdfs.session.path=/tmp/hive/root/bd4f450f-2f3f-460d-8539-5ee573701e59
_hive.local.session.path=/tmp/root/bd4f450f-2f3f-460d-8539-5ee573701e59
_hive.tmp_table_space=/tmp/hive/root/bd4f450f-2f3f-460d-8539-5ee573701e59/_tmp_space.db
datanucleus.autoCreateSchema=true
datanucleus.autoStartMechanismMode=checked
datanucleus.cache.level2=false
datanucleus.cache.level2.type=none
datanucleus.connectionPoolingType=BONECP
datanucleus.fixedDatastore=false
datanucleus.identifierFactory=datanucleus1
datanucleus.plugin.pluginRegistryBundleCheck=LOG
datanucleus.rdbms.useLegacyNativeValueStrategy=true
datanucleus.storeManagerType=rdbms
datanucleus.transactionIsolation=read-committed
datanucleus.validateColumns=false
datanucleus.validateConstraints=false
datanucleus.validateTables=false
dfs.block.access.key.update.interval=600
dfs.block.access.token.enable=false
dfs.block.access.token.lifetime=600
***********************************************
For the full list, see: http://blog.csdn.net/xfg0218/article/details/61930858
Hive performance optimization
1-1)、Joining a small table with a large table
set hive.auto.convert.join=true;
With this setting, Hive automatically recognizes the smaller table and uses a map join to join the two tables.
Essentially a map join skips the shuffle/reduce phase: the small table is loaded into memory and the two tables are joined on the map side.
The prerequisite for a bucket map join is that the tables are bucketed:
set hive.optimize.bucketmapjoin = true;
A plain map join and a bucket map join are quite different. When the small table still cannot fit entirely into memory, both tables can be hash-bucketed on the join key and a bucket map join used: because bucketing puts keys with the same hash value into the same bucket, at join time the buckets of the small table are copied to the nodes, each map task loads the corresponding small-table bucket into an in-memory hashtable, and joins it with the matching bucket file of the large table. This reduces the amount of work done on any single machine.
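A configuration sketch for a bucket map join, assuming tables a and b from the JOIN examples above are bucketed on id (the column names selected here are illustrative; the MAPJOIN hint names the smaller table):
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(b) */ a.id, a.name, b.name
from a join b on a.id = b.id;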
Hive shell parameters
1-1)、Hive parameters
Syntax
hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]
Options:
- -i Initialize HQL from the given file.
- -e Execute the given HQL from the command line.
- -f Execute an HQL script file.
- -v Echo the executed HQL statements to the console.
- -p <port> connect to Hive Server on port number
- -hiveconf x=y Use this to set hive/hadoop configuration variables.
[root@hadoop1 testDate]# hive -i common.property -f provinceNameByIdNo.sql
[root@hadoop1 testDate]# hive -e "show databases";
1-2)、Useful parameters to set
set mapred.reduce.tasks = 80;
set hive.exec.reducers.bytes.per.reducer=5000000;
set hive.execution.engine=mr; -- shows the MR progress directly; enabling this is not recommended, as it lowers execution efficiency
set hive.mapred.reduce.tasks.speculative.execution=false;
set mapred.max.split.size=1933812999;
set mapred.min.split.size.per.node=1933812999;
set mapred.min.split.size.per.rack=1933812999;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.groupby.skewindata=true;
The value 1933812999 is obtained by taking the table's file size on HDFS, dividing it by 6, and replacing the last two digits with 99.
Introduction to columnar storage in Hive
The main open-source columnar formats are https://parquet.apache.org/ and https://orc.apache.org/
Create a table:
# create table xiaoxu_test(id int,name string,age int);
# insert into xiaoxu_test values(1,"xiaozhang",1);
# insert into xiaoxu_test values(2,"xiaozhang",2);
1-1)、ORC in detail
# create table orc_xiaoxu( id int,name string,age int ) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");
# INSERT OVERWRITE TABLE orc_xiaoxu SELECT * FROM xiaoxu_test;
hive> select * from orc_xiaoxu;
OK
2 xiaozhang 2
1 xiaozhang 1
Time taken: 0.052 seconds, Fetched: 2 row(s)
C)、View the data on HDFS
# hadoop fs -cat /apps/hive/warehouse/xiaoxu.db/orc_xiaoxu/000000_0
1-2)、Parquet in detail
# hive
# use xiaoxu;
# create table orc_xiaoxu( id int,name string,age int );
# insert into orc_xiaoxu values(1,"xiaowang",1);
# insert into orc_xiaoxu values(2,"xiaoli",2);
B)、Create a Parquet-format table
# CREATE TABLE parquet_xiaoxu(id int,name string,age int)STORED AS PARQUET;
# INSERT OVERWRITE TABLE parquet_xiaoxu SELECT * FROM orc_xiaoxu;
hive> select * from parquet_xiaoxu;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2 xiaozhang 2
1 xiaowang 1
Time taken: 0.39 seconds, Fetched: 2 row(s)
C)、View the data on HDFS
# hadoop fs -cat /apps/hive/warehouse/xiaoxu.db/parquet_xiaoxu/000000_0
Hive: splitting one row into multiple rows
Transform the following data
a b 1,2,3
c d 4,5,6
into:
a b 1
a b 2
a b 3
c d 4
c d 5
c d 6
drop table test_jzl_20140701_test;
create table test_jzl_20140701_test(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by ' '
stored as textfile;
load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test;
select * from test_jzl_20140701_test
select col1,col2,name
from test_jzl_20140701_test
lateral view explode(split(col3,',')) col3 as name;
The result is as follows:
a b 1
a b 2
a b 3
c d 4
c d 5
c d 6
Installing the Hive web UI (HWI)
1-1)、Download the Hive source
http://mirrors.hust.edu.cn/apache/hive/
1-2)、Edit the configuration file
[root@hadoop1 conf]# vi hive-site.xml
<property>
<name>hive.hwi.listen.host</name>
<value>192.168.132.140</value>
</property>
<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>/lib/hive-hwi-1.2.1.war</value>
</property>
1-3)、Add the JAR
[root@hadoop1 conf]# ln -s $JAVA_HOME/lib/tools.jar $HIVE_HOME/lib/
1-4)、Start command
[root@hadoop1 conf]# hive --service hwi 2> /tmp/hwi2.log &
1-5)、Open the web UI (http://192.168.132.140:9999/hwi, using the host and port configured above)
Using Hive with Tez
1-1)、Install the software needed to build Tez
Use the Tez version that matches your Hive version; this example uses Hive 1.2, so Tez 0.5.4 is used. Tez is distributed by the project as source code, so protobuf and Maven are needed to build it. If you want to build it yourself but have a poor network connection, download the .m2 JAR bundle, or simply use the Tez 0.5.4 package the author has already built. All of the materials can be downloaded here:
Link: http://pan.baidu.com/s/1c27ri6W  password: 6ya4
[root@hadoop1 opt]# tar -zxvf protobuf-2.5.0.tar.gz
[root@hadoop1 protobuf-2.5.0]# cd protobuf-2.5.0
[root@hadoop1 protobuf-2.5.0]# ./configure
[root@hadoop1 protobuf-2.5.0]# make && make install
[root@hadoop1 opt]# tar -zxvf apache-tez-0.5.4-src.tar.gz
[root@hadoop1 opt]# mv apache-tez-0.5.4-src tez-0.5.4
[root@hadoop1 opt]# cd tez-0.5.4/
Enter the directory and change hadoop.version in pom.xml to 2.6.4:
[root@hadoop1 tez-0.5.4]# vi pom.xml
[root@hadoop1 opt]# chmod a+x apache-maven-3.1.0-bin.zip
[root@hadoop1 opt]# unzip apache-maven-3.1.0-bin.zip
[root@hadoop1 opt]# mv apache-maven-3.1.0 maven-3.1.0
Add the environment variables:
[root@hadoop1 maven-3.1.0]# vi /etc/profile
Check the version:
[root@hadoop1 conf]# mvn -version
Apache Maven 3.1.0 (893ca28a1da9d5f51ac03827af98bb730128f9f2; 2013-06-27 19:15:32-0700)
Maven home: /opt/maven-3.1.0
Java version: 1.7.0_76, vendor: Oracle Corporation
Java home: /opt/jdk1.7/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-642.15.1.el6.x86_64", arch: "amd64", family: "unix"
1-2)、Build and package Tez
In the Tez source directory, run the following command to build it:
[root@hadoop1 tez-0.5.4]# mvn clean package -Dtar -DskipTests=true -Dmaven.javadoc.skip=true
1-3)、Upload the package to HDFS
After the build succeeds, go to tez-0.5.4/tez-dist/target and upload tez-0.5.4.tar.gz to the tez directory in HDFS:
[root@hadoop1 target]# hadoop fs -put tez-0.5.4.tar.gz /tez/
1-4)、Edit Hive's hive-site.xml
Add the following to the hive-site.xml configuration file:
<property>
<name>tez.lib.uris</name>
<value>hdfs://hadoop1:9000/tez/tez-0.5.4.tar.gz</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
1-5)、Extract the built package
[root@hadoop1 opt]# mkdir tez-0.5.4-compiling
[root@hadoop1 tez-0.5.4-compiling]# tar -zxvf tez-0.5.4.tar.gz
1-6)、Add the following to hive-env.sh
Add the following to the hive-env.sh configuration file:
export TEZ_HOME=/opt/tez-0.5.4-compiling/
for jar in `ls ${TEZ_HOME}|grep jar`;do
export HADOOP_CLASSPATH=${TEZ_HOME}/$jar:${HADOOP_CLASSPATH}
done
for jar in `ls ${TEZ_HOME}/lib/|grep jar`;do
export HADOOP_CLASSPATH=${TEZ_HOME}/lib/$jar:${HADOOP_CLASSPATH}
done
1-7)、Set up the related JARs
The Tez JARs are needed at runtime, so copy all of the JAR files contained in tez-0.5.4.tar.gz into the lib directory of the extracted tez-0.5.4 installation.
1-8)、Add the LZO JAR in bin/hive-config.sh
The JAR is included in the download bundle above.
1-9)、Add the following to bin/hive-config.sh
export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/common/lib/hadoop-lzo-0.4.3.jar
Alternatively, create an auxlib directory under ${HIVE_HOME} and copy hadoop-lzo-0.4.20.jar into it. The LZO package must match the one used by Hadoop.
1-10)、Check the result
Connecting to Hive with Beeline
1-1)、Edit hive-site.xml
[root@hadoop1 conf]# cat hive-site.xml
Add the following on top of the earlier configuration:
Here root is the user name that runs the service; if it runs under the hadoop user instead, configure
hadoop.proxyuser.hadoop.hosts
and
hadoop.proxyuser.hadoop.groups
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
<description>proxy settings for the root user</description>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hive.server2.transport.mode</name>
<value>binary</value>
<description>
Expects one of [binary, http].
Transport mode of HiveServer2.
</description>
</property>
<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will listen on</description>
</property>
<property>
<name>hive.server2.webui.host</name>
<value>0.0.0.0</value>
<description>The host address the HiveServer2 WebUI will listen on</description>
</property>
1-2)、Start HiveServer2
Start the HiveServer2 service. By default it listens on port 10000; the number of worker threads is controlled by hive.server2.thrift.min.worker.threads and hive.server2.thrift.max.worker.threads.
[root@hadoop1 root]# hive --service hiveserver2 &
[1] 6332
[root@hadoop1 root]# SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/16 07:00:06 WARN conf.HiveConf: HiveConf of name hive.server2.webui.host does not exist
At this point the terminal appears to hang and produces no further output; that means HiveServer2 has started successfully. To leave it running in the background you can close this shell window.
1-3)、Check the listening port
[root@hadoop1 start_sh]# netstat -nltp|grep 10000
tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 6332/java
If the process shows up, the startup succeeded.
1-4)、Connect with the embedded Beeline client
[root@hadoop1 bin]# beeline
Beeline version 1.6.2 by Apache Hive
beeline> !connect jdbc:hive2://hadoop1:10000/default
Connecting to jdbc:hive2://hadoop1:10000/default
Enter username for jdbc:hive2://hadoop1:10000/default: root
Enter password for jdbc:hive2://hadoop1:10000/default: ******
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to: Apache Hive (version 1.2.1)
Driver: Spark Project Core (version 1.6.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop1:10000/default> show databases;
+----------------+--+
| database_name |
+----------------+--+
| allinfo |
| default |
+----------------+--+
3 rows selected (0.47 seconds)
1-5)、Connect remotely with Beeline
[root@hadoop1 bin]# beeline -u jdbc:hive2://hadoop1:10000 -n root -p 123456 -e "show databases"
DBeaver 4.0 as a Hive client
Official site: https://dbeaver.jkiss.org/download/
DBeaver can connect to most databases; here only the Hive connection is shown.
1-1)、Download
Link: http://pan.baidu.com/s/1geQ6KeB  password: w5ld. If the download does not work, contact the author.
1-2)、Connect to Hive
Select Apache Hive as the connection type.
Fill in the connection details; HiveServer2 must already be running, which can be verified with Beeline.
Select the driver.
The JAR files are under the lib folder of the installation directory; add them yourself.
1-3)、Browse the Hive metadata
Common Hive errors
1-1)、Character-set problems
hive> create table userName(id int,name string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)
Solution:
Run the following in MySQL:
set global binlog_format='MIXED';
alter database hive character set latin1;
Also replace the MySQL driver in Hive's lib directory with a 5.1.x version.
1-2)、Spark assembly JAR not found
The root cause: the hive launch script contains logic that, when Spark is present, puts the Spark assembly JAR on the classpath. Since Spark 2.0.0, the single large assembly JAR under lib has been replaced by many smaller JARs, so the spark-assembly JAR can no longer be found. That is where the problem comes from.
[root@hadoop1 bin]# hive
ls: cannot access /opt/spark2.0/lib/spark-assembly-*.jar: No such file or directory
**********
[root@hadoop1 bin]# vi hive
# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" ]]
then
sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi
Change sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar` to sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`
1-3)、Garbled Chinese characters when viewing a table definition
With show create table, Chinese text in column comments appears garbled, while desc shows it correctly.
Solution: https://www.iteblog.com/archives/1687.html
Upstream issue: https://issues.apache.org/jira/browse/HIVE-11837
Hive data export and import
1-1)、Hive to HBase
For the detailed procedure, see the summary "Several ways to load data into HBase".
1-2)、Hive to MySQL
1-1)、Partial export (selected columns)
sqoop export --connect jdbc:mysql://hadoop1:3306/dbname --username root --password 123456 --table sp_address_code --columns "ID,PLACE_TYPE,PLACE_CODE,PLACE_NAME,UP_PLACE_CODE" --hcatalog-database xiaoxu --hcatalog-table xiaoxu
1-2)、Full export
sqoop export --connect jdbc:mysql://hadoop1:3306/xuaoxu --username root --password 123456 --table sp_address_code --hcatalog-database xiaoxu --hcatalog-table xiaoxu
1-3)、Hive to Redis
Before running the script below, HiveServer2 must be running on port 10000 so that Beeline can connect. A similar approach works for MySQL.
[root@hadoop1 testData]# vi hive-redis.sh
source /etc/profile
hadoop fs -rmr /hive-redis
beeline -u jdbc:hive2://hadoop1:10000 -n root -p 123456 -e "
insert overwrite directory '/hive-redis' SELECT CONCAT_ws('\t',redis_cmd,redis_key,hkey,hval)
FROM (
select
'HSET' as redis_cmd,
'sp_address' AS redis_key,
CONCAT_ws('_',PLACE_CODE,PLACE_TYPE) as hkey,
CONCAT_ws(',',PLACE_CODE,PLACE_NAME,PLACE_TYPE,UP_PLACE_CODE) as hval
from xiaoxu.sp_address limit 1234
) as sp_address;"
hadoop fs -cat /hive-redis/* |redis-cli
Run it and redirect the log output:
[root@hadoop1 testData]# sh hive-redis.sh >> /opt/testData/hive-redis.log 2>&1 &
[root@hadoop1 testData]# hadoop fs -cat /hive-redis/000000_0|head
HSET sp_address 320205_03 320205,锡山区,03,320200
HSET sp_address 320204_03 320204,北塘区,03,320200
HSET sp_address 320203_03 320203,南长区,03,320200
HSET sp_address 320202_03 320202,崇安区,03,320200
HSET sp_address 320201_03 320201,市辖区,03,320200
HSET sp_address 320125_03 320125,高淳县,03,320100
HSET sp_address 320124_03 320124,溧水县,03,320100
HSET sp_address 320116_03 320116,六合区,03,320100
HSET sp_address 320115_03 320115,江宁区,03,320100
HSET sp_address 320114_03 320114,雨花台区,03,320100
cat: Unable to write to output stream.
127.0.0.1:6379> KEYS *
- "sp_address"
View all of the data:
127.0.0.1:6379> HGETALL "sp_address"
2464) "320700,\xe8\xbf\x9e\xe4\xba\x91\xe6\xb8\xaf\xe5\xb8\x82,02,320000"
2465) "340100_02"
2466) "340100,\xe5\x90\x88\xe8\x82\xa5\xe5\xb8\x82,02,340000"
2467) "331100_02"
2468) "331100,\xe4\xb8\xbd\xe6\xb0\xb4\xe5\xb8\x82,02,330000"
View part of the data:
127.0.0.1:6379> HGET "sp_address" "340100_02"
"340100,\xe5\x90\x88\xe8\x82\xa5\xe5\xb8\x82,02,340000"
For the detailed run log, see: http://blog.csdn.net/xfg0218/article/details/70493195
1-4)、Hive to Elasticsearch
Link: http://pan.baidu.com/s/1bp9vU2N  password: 1nbz
hive>
CREATE EXTERNAL TABLE sp_address_es (
ID STRING,
PLACE_TYPE STRING,
PLACE_NAME STRING,
UP_PLACE_CODE STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'sp_address/sp_address', 'es.nodes'='hadoop1','es.port'='9003','es.mapping.id' = 'ID');
hive> insert into table sp_address_es select ID,PLACE_TYPE,PLACE_NAME,UP_PLACE_CODE from xiaoxu.es_sp_address;