Part 1: How Hive Works
I. Hive CLI (query processor)
1. Query Compiler
- Parser (parses HQL into an AST, the abstract syntax tree)
- Semantic Analyzer (semantic analysis; produces a QB, the query block)
- Logical Plan Generator (turns the QB into a logical plan, an operator tree)
- Logical Optimizer (rewrites the operator tree)
- Physical Plan Generator (turns the logical plan into a physical plan of tasks)
- Physical Optimizer (optimizes the physical plan)
2. Execution Engine (selectable per session; see the sketch below)
- MR
- Spark
- Tez
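A minimal HiveQL sketch for switching the engine and inspecting the compiled plan (assumes Hive 2.x; record is the table created in Part 4):

-- choose the execution engine for the current session
set hive.execution.engine=tez;   -- or mr / spark
-- print the plan the query compiler produced
explain select count(*) from record;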
II. Metastore (metadata)
1. Stores table metadata
- basic information about databases and tables
- partition information
- column information
- storage format information
- miscellaneous table properties
- permission information
2. Persists the metadata in a relational database
3. Metastore Server
- a general-purpose catalog server
- accessible from multiple languages
- local/remote mode
- Configuration
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoopA:9083</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
</property>
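The metastore is what answers metadata queries; a quick hedged sketch for inspecting it from the Hive shell (table names are the ones created in Part 4):

show databases;
show tables;
describe formatted user_dimension;  -- columns, storage format, HDFS location, table properties
show partitions record;             -- partition metadata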
III. YARN (compute platform)
IV. HDFS (data storage)
Part 2: Ways to Access Hive
Hive CLI
- hive
- hive -h host -p port
Hive Beeline
- beeline -u jdbc:hive2://host:port -n user -p password
Programmatic access
- JDBC/ODBC
- Thrift
Part 3: Hive Query Optimization
1. Design tables properly
1) Choose column types
- primitive types
- complex types
2) Use partition columns
- driven by query patterns and business needs
- date partitions; dynamic partitions (see the sketch below)
- set hive.exec.dynamic.partition=true
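A minimal dynamic-partition sketch (record is the Part 4 table; the staging table record_staging is hypothetical):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- let every partition column be dynamic
insert overwrite table record partition (partition_date, hour)
select rid, uid, bid, transaction_date, price, source_province,
       target_province, site, express_number, express_company,
       partition_date, hour
from record_staging;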
3) Define bucket columns (see the sketch below)
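A hedged sketch of a bucketed table (name and bucket count are illustrative):

create table record_bucketed (
  rid STRING,
  uid STRING,
  price INT
)
clustered by (uid) into 32 buckets;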
4) Use columnar storage
- Parquet
- ORC
5) Use compressed storage (combined with ORC in the sketch below)
- Bzip2, Gzip
- LZO
- Snappy
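Columnar format and compression are usually combined in the DDL; a minimal sketch (the table name is hypothetical, 'orc.compress' is a standard ORC table property):

create table record_orc_snappy (
  rid STRING,
  uid STRING,
  price INT
)
stored as orc
tblproperties ('orc.compress'='SNAPPY');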
2. General parameter tuning
1) Enable data compression
- reduces storage and I/O
- compress Hive output and intermediate results
- hive.exec.compress.output=true
- hive.exec.compress.intermediate=true
- set the file format Hive uses for intermediate results
- hive.query.result.fileformat=SequenceFile
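The same settings in runnable per-session form, as a sketch:

set hive.exec.compress.output=true;        -- compress final query output
set hive.exec.compress.intermediate=true;  -- compress data passed between stages
set hive.query.result.fileformat=SequenceFile;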
2) Job execution tuning
Run independent jobs in parallel
- hive.exec.parallel=true (default false)
- hive.exec.parallel.thread.number=8 (default 8)
Local execution mode (see the sketch below)
- hive.exec.mode.local.auto=true
- hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
- hive.exec.mode.local.auto.tasks.max (4 by default)
- the job must need at most one reduce task (num_reduce_tasks <= 1)
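A hedged session sketch enabling both optimizations:

set hive.exec.parallel=true;               -- run independent stages concurrently
set hive.exec.parallel.thread.number=8;
set hive.exec.mode.local.auto=true;        -- let small jobs run locally instead of on YARN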
3) Choose a suitable engine
- MR
- Spark
- Tez
4) MapReduce parameter tuning
Map phase
- mapreduce.job.maps has no effect; the map count comes from input splits
- num_map_tasks is governed by the split-size parameters
- split algorithm
Reduce phase
- mapred.reduce.tasks sets the reducer count directly
- num_reduce_tasks is otherwise derived from data-size parameters
- split algorithm (see the sketch below)
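A minimal sketch of steering task counts from a Hive session (property names are the stock MapReduce ones for the Hadoop 2.7 setup used here):

set mapreduce.job.reduces=4;                        -- pin the reducer count
-- or let Hive derive it from data volume:
set hive.exec.reducers.bytes.per.reducer=268435456; -- ~256 MB of input per reducer
set mapreduce.input.fileinputformat.split.maxsize=134217728;  -- smaller splits => more map tasks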
Shuffle phase
- compress the intermediate map output
How to configure
- mapreduce.map.output.compress=true
- mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
  (or org.apache.hadoop.io.compress.LzoCodec)
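In runnable per-session form, as a sketch:

set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;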
3. Join optimization
1) Common join
- the join is performed on the reduce side
2) Map join
- the small table is loaded into memory and joined against each map split of the large table in the mapper, avoiding the reduce phase (see the sketch below)
- hive.auto.convert.join=true (default false)
- hive.mapjoin.smalltable.filesize=600000000 (bytes; default 25000000, about 25 MB)
- SELECT /*+ MAPJOIN(a) */ ... FROM a JOIN b ... forces a map join via a hint
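A hedged map-join sketch on the Part 4 tables (brand_dimension is the small side):

set hive.auto.convert.join=true;
select /*+ MAPJOIN(b) */ b.brand, sum(r.price) as totalPrice
from record r join brand_dimension b on r.bid = b.bid
group by b.brand;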
3) Bucket map join
- set hive.optimize.bucketmapjoin=true
- works together with map join
- every table in the join must be bucketed, and the larger table's bucket count must be an integer multiple of the smaller table's
- the bucketing column must be the join column (see the sketch below)
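A minimal sketch under those rules (table layouts are hypothetical; both are bucketed on the join key uid):

set hive.enforce.bucketing=true;      -- so inserts actually produce one file per bucket
set hive.optimize.bucketmapjoin=true;
create table big_orders (uid STRING, price INT) clustered by (uid) into 32 buckets;
create table small_users (uid STRING, name STRING) clustered by (uid) into 8 buckets;
select /*+ MAPJOIN(u) */ u.name, sum(o.price) as totalPrice
from big_orders o join small_users u on o.uid = u.uid
group by u.name;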
Part 4: Hands-on Project
1) Project overview
- A Hadoop cluster is already running, with HDFS, YARN, Hive, and the other components installed.
- Create three Hive tables: user_dimension, brand_dimension, and record. user_dimension and brand_dimension are external tables.
- Load data into the three tables.
- The source data is already prepared and matches the format of the three tables.
2) Create the user_dimension, brand_dimension, and record tables
[hadoop@hadoopa command]$ cat create_table_user.sql
create external table if not exists user_dimension (
uid STRING,
name STRING,
gender STRING,
birth DATE,
province STRING
)ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/user_dimension'
;
[hadoop@hadoopa command]$
[hadoop@hadoopa command]$ cat create_table_brand.sql
create external table if not exists brand_dimension (
bid STRING,
category STRING,
brand STRING
)ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
location 'hdfs://hadoopA:8020/warehouse/brand_dimension'
;
[hadoop@hadoopa command]$
[hadoop@hadoopa command]$ cat create_table_record.sql
create table if not exists record (
rid STRING,
uid STRING,
bid STRING,
transaction_date TIMESTAMP,
price INT,
source_province STRING,
target_province STRING,
site STRING,
express_number STRING,
express_company STRING
)
PARTITIONED BY (
partition_date STRING,
hour INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
;
[hadoop@hadoopa command]$ hive -f create_table_user.sql
[hadoop@hadoopa command]$ hive -f create_table_brand.sql
[hadoop@hadoopa command]$ hive -f create_table_record.sql
3) Load data into the three tables
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/brand.list' overwrite into table brand_dimension;
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/user.list' overwrite into table user_dimension;
hive> load data inpath 'hdfs://hadoopA:8020/flume/record/2017-03-10/2220/transaction_log.1489155600805' overwrite into table record partition(partition_date='2017-03-10',hour=22);
4) Verify the results with HQL queries
Total sales by brand
- HQL query
[hadoop@hadoopa command]$ cat brand_price_list.sql
select brand,sum(price) as totalPrice
from record join brand_dimension on record.bid=brand_dimension.bid
group by brand_dimension.brand
order by totalPrice desc;
[hadoop@hadoopa command]$
- Result
[hadoop@hadoopa command]$ hive -f brand_price_list.sql
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114243_fdc4d60e-75f7-426e-ba69-cf7198eaedc6
Total jobs = 2
2017-03-13 11:43:12 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:43:15 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:43:16 Uploaded 1 File to: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (34008 bytes)
2017-03-13 11:43:16 End of local task; Time Taken: 3.194 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0031, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0031/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:43:40,015 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:43:54,259 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2017-03-13 11:44:08,440 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1489117385526_0031
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0032, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0032/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0032
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:44:32,410 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:44:52,170 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.4 sec
2017-03-13 11:45:07,571 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.73 sec
MapReduce Total cumulative CPU time: 5 seconds 730 msec
Ended Job = job_1489117385526_0032
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.47 sec HDFS Read: 62768 HDFS Write: 1666 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.73 sec HDFS Read: 6942 HDFS Write: 1480 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 200 msec
OK
SKYWORTH 11992
SAMSUNG 10240
YILI 9872
TCL 6741
OLAY 6442
MEIZU 6345
ASUS 5705
PEAK 5431
APPLE 5213
MOUTAI 4772
SHARP 4721
PEACEBIRD 4680
MIZUNO 4599
DHC 4585
NIULANSHAN 4582
CAMEL 4569
NIKE 4358
SEPTWOLVES 4345
OPPO 4306
NB 4237
KAPPA 4092
ZARA 4068
GUANGMING 4054
HP 4043
HISENSE 3995
HLA 3963
HUAWEI 3927
KANS 3884
LANGJIU 3857
NIVEA 3579
LINING 3559
CLINIQUE 3552
LENOVO 3534
PUMA 3531
HTC 3405
GXG 3322
UNIQLO 3271
HAIER 3106
LOREAL 2948
WULIANGYE 2912
ADIDAS 2847
MOTOROLA 2819
VIVO 2809
DELL 2785
SANYUAN 2776
LANCOME 2714
SELECTED 2633
INNISFREE 2452
SONY 2353
ACER 2339
XIAOMI 2260
HONGXING 2113
ANTA 1990
MENGNIU 1776
IPHONE 1628
SEMIR 1589
PHILIPS 1205
361 718
MEIFUBAO 448
Time taken: 146.0 seconds, Fetched: 59 row(s)
Total spend by age group
- HQL query
[hadoop@hadoopa command]$ cat age_price_list.sql
select cast(DATEDIFF(CURRENT_DATE, birth)/365 as int) as age,
sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by cast(DATEDIFF(CURRENT_DATE, birth)/365 as int)
order by totalPrice desc;
- Result
[hadoop@hadoopa command]$ hive -f age_price_list.sql
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114910_f92cc413-d572-4574-a30e-2cfa9fdcded8
Total jobs = 2
2017-03-13 11:49:32 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:49:35 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:49:35 Uploaded 1 File to: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (30382 bytes)
2017-03-13 11:49:35 End of local task; Time Taken: 3.073 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0033, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0033/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0033
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:49:57,659 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:50:18,269 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.14 sec
2017-03-13 11:50:32,451 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 6.71 sec
MapReduce Total cumulative CPU time: 6 seconds 710 msec
Ended Job = job_1489117385526_0033
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0034, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0034/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0034
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:50:50,851 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:51:06,025 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.22 sec
2017-03-13 11:51:19,065 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1489117385526_0034
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 6.71 sec HDFS Read: 64575 HDFS Write: 1104 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 6338 HDFS Write: 1037 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 850 msec
OK
46 9653
44 9436
10 8068
2 7918
15 7470
9 7456
40 7391
14 7388
43 7109
37 6671
6 5980
11 5969
18 5858
30 5853
29 5841
22 5819
39 5737
20 5597
13 5564
41 5503
21 5306
12 4998
23 4991
28 4830
35 4829
33 4528
19 4347
25 4109
36 4017
32 3910
17 3698
1 3468
7 3459
16 3344
26 3328
31 3286
3 3188
4 3141
8 2639
42 2557
45 2473
24 2457
34 2454
47 2254
27 2137
38 1796
5 1534
0 1232
Time taken: 131.25 seconds, Fetched: 48 row(s)
Total spend by province
- HQL query
[hadoop@hadoopa command]$ cat province_price_list.sql
select province,sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by user_dimension.province
order by totalPrice desc;
- Result
[hadoop@hadoopa command]$ hive -f province_price_list.sql
Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313115315_d285100f-73a1-44b4-8ffd-9537cbea48e9
Total jobs = 2
2017-03-13 11:53:44 Starting to launch local task to process map join; maximum memory = 518979584
2017-03-13 11:53:48 Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:53:48 Uploaded 1 File to: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (35297 bytes)
2017-03-13 11:53:48 End of local task; Time Taken: 4.051 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0035, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0035/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0035
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:54:20,334 Stage-2 map = 0%, reduce = 0%
2017-03-13 11:54:45,944 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 5.22 sec
2017-03-13 11:55:06,202 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 9.18 sec
MapReduce Total cumulative CPU time: 9 seconds 180 msec
Ended Job = job_1489117385526_0035
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0036, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0036/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job -kill job_1489117385526_0036
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:55:31,708 Stage-3 map = 0%, reduce = 0%
2017-03-13 11:55:51,447 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.66 sec
2017-03-13 11:56:15,088 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1489117385526_0036
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 9.18 sec HDFS Read: 62835 HDFS Write: 1045 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 6.75 sec HDFS Read: 6333 HDFS Write: 938 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 930 msec
OK
ZheJiang 11501
ShanDong 11336
LiaoNing 10519
JiLin 10341
FuJian 9742
XiangGang 9371
QingHai 9329
ShanXi3 9283
GuiZhou 9148
HaiNan 9037
HuBei 8868
NeiMengGu 8245
BeiJing 8035
JiangXi 7706
AnHui 7382
HuNan 6666
ShangHai 6441
JiangSu 6025
TaiWan 5988
ShanXi1 5862
ChongQing 5854
XinJiang 5557
HeNan 5390
XiZang 5322
TianJin 4984
YunNan 4695
GuangXi 4646
HeiLongJiang 4639
GanSu 4288
GuangDong 3753
SiChuan 3627
Aomen 3233
NingXia 2444
HeBei 1334
Time taken: 182.306 seconds, Fetched: 34 row(s)
5) Create the user_dimension_orc, brand_dimension_orc, and record_orc tables, which store their data in ORC format
[hadoop@hadoopa command]$ cat create_table_user_orc.sql
create external table if not exists user_dimension_orc (
uid STRING,
name STRING,
gender STRING,
birth DATE,
province STRING
)ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/user_dimension_orc'
;
[hadoop@hadoopa command]$ cat create_table_brand_orc.sql
create external table if not exists brand_dimension_orc (
bid STRING,
category STRING,
brand STRING
)ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
location 'hdfs://hadoopA:8020/warehouse/brand_dimension_orc'
;
[hadoop@hadoopa command]$ cat create_table_record_orc.sql
create table if not exists record_orc (
rid STRING,
uid STRING,
bid STRING,
transaction_date TIMESTAMP,
price INT,
source_province STRING,
target_province STRING,
site STRING,
express_number STRING,
express_company STRING
)
PARTITIONED BY (
partition_date STRING,
hour INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
;
6) Load the data into the three ORC tables
hive> insert into table brand_dimension_orc select * from brand_dimension;
hive> insert into table user_dimension_orc select * from user_dimension;
hive> insert overwrite table record_orc partition (partition_date='2017-03-10',hour=22) select rid,uid,bid,transaction_date,price,source_province,target_province,site,express_number,express_company from record;
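To check that the ORC copies behave the same (and typically read fewer bytes from HDFS), the earlier queries can be re-pointed at them; a hedged sketch:

select brand, sum(price) as totalPrice
from record_orc join brand_dimension_orc on record_orc.bid = brand_dimension_orc.bid
group by brand_dimension_orc.brand
order by totalPrice desc;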