Hive Data Analysis System (Part 2)

Part 1: How Hive Works

I. Hive CLI (Query Processor)

1. Query Compiler

  • Parser (syntax parser; produces the AST, the abstract syntax tree)
  • Semantic Analyzer (semantic analyzer; produces the QB, the query block)
  • Logical Plan Generator (logical plan generator; produces the logical plan as an operator tree)
  • Logical Optimizer (logical optimizer; rewrites the operator tree)
  • Physical Plan Generator (physical plan generator; produces the physical plan, Phys Plan)
  • Physical Optimizer (physical optimizer; optimizes the physical plan)
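
A quick way to see what this pipeline produces for a concrete query is EXPLAIN, which prints the compiled stage plan and operator tree. The query below is only an illustrative sketch, reusing the record and brand_dimension tables from the project in Part 4:

EXPLAIN
select brand, sum(price) as totalPrice
from record join brand_dimension on record.bid = brand_dimension.bid
group by brand;

The output lists the stage dependencies and, for each stage, the operator tree (TableScan, Join, Group By, and so on) that the compiler generated.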

2. Execution Engine
- MR
- Spark
- Tez

II. MetaStore (Metadata)

1. Stores table metadata:

  • Basic database and table information
  • Partition information
  • Column information
  • Storage format information
  • Miscellaneous property information
  • Permission information

2. Uses a relational database (such as MySQL) for storage
3. MetaStore Server

  • A general-purpose catalog server
  • Accessible from multiple languages (via Thrift)
  • Local / remote mode
  • Configuration (hive-site.xml):
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoopA:9083</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoopA/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123</value>
    </property>

III. YARN (Compute Platform)
IV. HDFS (Data Storage)

Part 2: Ways to Access Hive

  1. Hive CLI

    • hive
    • hive -h host -p port
  2. Hive Beeline

    • beeline -u jdbc:hive2://host:port -n user -p password
  3. Programmatic access

    • JDBC/ODBC
    • Thrift

Part 3: Hive Query Optimization

1. Design tables properly

I. Choose appropriate column types

  • Primitive types
  • Complex types (ARRAY, MAP, STRUCT)

II. Use partition columns

  • Choose them from the query dimensions and business requirements
  • Date partitions; dynamic partitioning (see the sketch below)
  • set hive.exec.dynamic.partition=true
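
A minimal dynamic-partition insert into the record table from Part 4 might look like the sketch below; the staging table record_staging and the nonstrict-mode setting are illustrative assumptions, not part of the original notes:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;   -- assumption: let all partition columns be dynamic

-- record_staging is a hypothetical unpartitioned table with the same data columns as record
insert overwrite table record partition (partition_date, hour)
select rid, uid, bid, trancation_date, price, source_province, target_province,
       site, express_number, express_company,
       to_date(trancation_date) as partition_date,
       hour(trancation_date) as hour
from record_staging;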

III. Define bucketing columns
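
For example, a bucketed variant of the user table could be declared roughly as follows (the table name and the bucket count of 32 are illustrative assumptions):

create table if not exists user_dimension_bucketed (
 uid STRING,
 name STRING,
 gender STRING,
 birth DATE,
 province STRING
)
 CLUSTERED BY (uid) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 ;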
IV. Use columnar storage formats

  • Parquet
  • ORC

V. Use compressed storage

  • Bzip2, Gzip
  • LZO
  • Snappy

2. General parameter tuning

I. Enable data compression

  • Reduces storage and I/O
  • Compress Hive output and intermediate results (combined sketch below)
    • hive.exec.compress.output=true
    • hive.exec.compress.intermediate=true
  • Set the file format for Hive intermediate results
    • hive.query.result.fileformat=SequenceFile
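
Put together, the compression settings above can be applied per session or at the top of a script; the sketch below is one possible combination (the Snappy output codec is an illustrative choice, not from the original notes):

set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.query.result.fileformat=SequenceFile;
-- codec used for the compressed final output (illustrative)
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;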

II. Job execution optimization

  • Run independent jobs in parallel

    • hive.exec.parallel=true (default false)
    • hive.exec.parallel.thread.number=8 (default 8)
  • Local execution mode (see the sketch after this list)

    • hive.exec.mode.local.auto=true
    • hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
    • hive.exec.mode.local.auto.tasks.max (4 by default)
    • only applies when num_reduce_tasks <= 1
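
As a sketch, these job-execution settings can be enabled for a session like this:

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;   -- 128 MB threshold
set hive.exec.mode.local.auto.tasks.max=4;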

III. Choose an appropriate execution engine

  • MR
  • Spark
  • Tez
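
The engine is switched per session with hive.execution.engine; each value assumes the corresponding engine is installed and configured on the cluster:

set hive.execution.engine=mr;      -- default in this Hive 2.1 setup, but deprecated
set hive.execution.engine=tez;
set hive.execution.engine=spark;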

IV. MapReduce parameter tuning

  • Map-phase tuning

    • mapreduce.job.maps has no direct effect
    • num_map_tasks is determined by the split-size parameters
    • and by the input-split algorithm
  • Reduce-phase tuning

    • mapred.reduce.tasks sets the reducer count directly
    • otherwise num_reduce_tasks is derived from the data-size parameters
    • and the split algorithm
  • Shuffle-phase tuning (see the combined sketch after this list)

    • Compress intermediate (map output) data
    • Configuration:

      • mapreduce.map.output.compress=true
      • mapreduce.map.output.compress.codec, e.g.
        • org.apache.hadoop.io.compress.LzoCodec
        • org.apache.hadoop.io.compress.SnappyCodec
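
A sketch of how these knobs might be set for one session (the specific values are illustrative, not recommendations from the original notes):

-- map phase: the map count follows the input splits, so tune the split size instead
set mapreduce.input.fileinputformat.split.maxsize=268435456;   -- 256 MB per split (example)
-- reduce phase: fix the reducer count directly
set mapred.reduce.tasks=8;
-- shuffle phase: compress map output
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;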

3. Join optimization

I. Common join
The join is executed on the reduce side.
II. Map join

  • Load the small table into memory and join it against each map split of the large table during the map phase
  • hive.auto.convert.join=true (default false in old releases, true since Hive 0.11)
  • hive.mapjoin.smalltable.filesize (in bytes; default 25000000, about 25 MB), e.g. 600000000 for 600 MB
  • SELECT /*+ MAPJOIN(a) */ ... FROM a JOIN b ... forces a map join on table a (see the sketch below)
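
With the project tables from Part 4, a map join of record against the small brand_dimension table might look like the sketch below; note that recent Hive versions ignore the /*+ MAPJOIN() */ hint unless hive.ignore.mapjoin.hint is set to false, and with auto conversion enabled the rewrite normally happens automatically:

set hive.auto.convert.join=true;
select /*+ MAPJOIN(brand_dimension) */ brand, sum(price) as totalPrice
from record join brand_dimension on record.bid = brand_dimension.bid
group by brand;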

III. Bucket map join

  • set hive.optimize.bucketmapjoin=true
  • Works together with map join
  • All joined tables must be bucketed, and the bucket count of the larger table must be an integer multiple of the bucket count of the smaller table
  • The bucketed column must be the join column (see the sketch below)
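
A rough sketch of the required setup, using two hypothetical tables big_orders and small_users that are both bucketed on the join column uid (16 and 4 buckets, an exact multiple):

create table big_orders (uid STRING, price INT)
 CLUSTERED BY (uid) INTO 16 BUCKETS;
create table small_users (uid STRING, province STRING)
 CLUSTERED BY (uid) INTO 4 BUCKETS;

set hive.optimize.bucketmapjoin=true;
-- the MAPJOIN hint mirrors the map join section above; with auto conversion it may not be needed
select /*+ MAPJOIN(small_users) */ o.uid, o.price, u.province
from big_orders o join small_users u on o.uid = u.uid;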

Part 4: Hands-on Project

I. Project description

  • Set up a Hadoop cluster with HDFS, YARN, Hive, and the other required components installed.
  • Create three Hive tables: user_dimension, brand_dimension, and record. user_dimension and brand_dimension are external tables.
  • Load data into these three tables.
  • The source data is already prepared and matches the format of the three tables.

II. Create the user_dimension, brand_dimension, and record tables

[hadoop@hadoopa command]$ cat create_table_user.sql
create external table if not exists user_dimension (
 uid STRING,
 name STRING,
 gender STRING,
 birth DATE,
 province STRING
)ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 location 'hdfs://hadoopA:8020/warehouse/user_dimension'
 ;
[hadoop@hadoopa command]$

[hadoop@hadoopa command]$ cat create_table_brand.sql
create external table if not exists brand_dimension (
 bid STRING,
 category STRING,
 brand STRING
)ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 location 'hdfs://hadoopA:8020/warehouse/brand_dimension'
 ;
[hadoop@hadoopa command]$
[hadoop@hadoopa command]$ cat create_table_record.sql
create table if not exists record (
 rid STRING,
 uid STRING,
 bid STRING,
 trancation_date TIMESTAMP,
 price INT,
 source_province STRING,
 target_province STRING,
 site STRING,
 express_number STRING,
 express_company STRING
)
 PARTITIONED BY (
 partition_date STRING,
 hour INT
 )
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 ;
[hadoop@hadoopa command]$ hive -f create_table_user.sql
[hadoop@hadoopa command]$ hive -f create_table_brand.sql
[hadoop@hadoopa command]$ hive -f create_table_record.sql

III. Load data into the three tables


hive> load data local inpath '/home/hadoop/hadooptraining/datasource/brand.list' overwrite into table brand_dimension;
hive> load data local inpath '/home/hadoop/hadooptraining/datasource/user.list' overwrite into table user_dimension;
hive> load data inpath 'hdfs://hadoopA:8020/flume/record/2017-03-10/2220/transaction_log.1489155600805' overwrite into table record partition(partition_date='2017-03-10',hour=22);
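
To confirm the loads, the partitions of record can be listed and the row counts spot-checked; the statements below are a hypothetical verification step, not part of the original session:

hive> show partitions record;
hive> select count(*) from user_dimension;
hive> select count(*) from record where partition_date='2017-03-10' and hour=22;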

IV. Verify the results with HQL queries

  • Total sales by brand

    1. HQL statement

[hadoop@hadoopa command]$ cat brand_price_list.sql
select brand,sum(price) as totalPrice
from record join brand_dimension on record.bid=brand_dimension.bid
group by brand_dimension.brand
order by totalPrice desc;
[hadoop@hadoopa command]$
2. Execution output


[hadoop@hadoopa command]$ hive -f brand_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114243_fdc4d60e-75f7-426e-ba69-cf7198eaedc6
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:43:12     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:43:15     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:43:16     Uploaded 1 File to: file:/tmp/hadoop/5558bbca-d4b6-4112-8932-6e8a662d02ac/hive_2017-03-13_11-42-43_902_6704644829066910043-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (34008 bytes)
2017-03-13 11:43:16     End of local task; Time Taken: 3.194 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0031, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0031/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:43:40,015 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:43:54,259 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.86 sec
2017-03-13 11:44:08,440 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1489117385526_0031
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0032, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0032/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0032
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:44:32,410 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:44:52,170 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
2017-03-13 11:45:07,571 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.73 sec
MapReduce Total cumulative CPU time: 5 seconds 730 msec
Ended Job = job_1489117385526_0032
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 5.47 sec   HDFS Read: 62768 HDFS Write: 1666 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.73 sec   HDFS Read: 6942 HDFS Write: 1480 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 200 msec
OK
SKYWORTH        11992
SAMSUNG 10240
YILI    9872
TCL     6741
OLAY    6442
MEIZU   6345
ASUS    5705
PEAK    5431
APPLE   5213
MOUTAI  4772
SHARP   4721
PEACEBIRD       4680
MIZUNO  4599
DHC     4585
NIULANSHAN      4582
CAMEL   4569
NIKE    4358
SEPTWOLVES      4345
OPPO    4306
NB      4237
KAPPA   4092
ZARA    4068
GUANGMING       4054
HP      4043
HISENSE 3995
HLA     3963
HUAWEI  3927
KANS    3884
LANGJIU 3857
NIVEA   3579
LINING  3559
CLINIQUE        3552
LENOVO  3534
PUMA    3531
HTC     3405
GXG     3322
UNIQLO  3271
HAIER   3106
LOREAL  2948
WULIANGYE       2912
ADIDAS  2847
MOTOROLA        2819
VIVO    2809
DELL    2785
SANYUAN 2776
LANCOME 2714
SELECTED        2633
INNISFREE       2452
SONY    2353
ACER    2339
XIAOMI  2260
HONGXING        2113
ANTA    1990
MENGNIU 1776
IPHONE  1628
SEMIR   1589
PHILIPS 1205
361     718
MEIFUBAO        448
Time taken: 146.0 seconds, Fetched: 59 row(s)

  • Total spend by age group

    1. HQL statement

[hadoop@hadoopa command]$ cat age_price_list.sql
select cast(DATEDIFF(CURRENT_DATE, birth)/365 as int) as age,
sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by cast(DATEDIFF(CURRENT_DATE, birth)/365 as int)
order by totalPrice desc;
2. Execution output

[hadoop@hadoopa command]$ hive -f age_price_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313114910_f92cc413-d572-4574-a30e-2cfa9fdcded8
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:49:32     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:49:35     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:49:35     Uploaded 1 File to: file:/tmp/hadoop/e1bbe409-d63f-4ef7-b610-bdd0a4caf873/hive_2017-03-13_11-49-10_204_7283272927191169557-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (30382 bytes)
2017-03-13 11:49:35     End of local task; Time Taken: 3.073 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0033, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0033/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0033
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:49:57,659 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:50:18,269 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 4.14 sec
2017-03-13 11:50:32,451 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 6.71 sec
MapReduce Total cumulative CPU time: 6 seconds 710 msec
Ended Job = job_1489117385526_0033
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0034, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0034/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0034
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:50:50,851 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:51:06,025 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.22 sec
2017-03-13 11:51:19,065 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.14 sec
MapReduce Total cumulative CPU time: 5 seconds 140 msec
Ended Job = job_1489117385526_0034
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 6.71 sec   HDFS Read: 64575 HDFS Write: 1104 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.14 sec   HDFS Read: 6338 HDFS Write: 1037 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 850 msec
OK
46      9653
44      9436
10      8068
2       7918
15      7470
9       7456
40      7391
14      7388
43      7109
37      6671
6       5980
11      5969
18      5858
30      5853
29      5841
22      5819
39      5737
20      5597
13      5564
41      5503
21      5306
12      4998
23      4991
28      4830
35      4829
33      4528
19      4347
25      4109
36      4017
32      3910
17      3698
1       3468
7       3459
16      3344
26      3328
31      3286
3       3188
4       3141
8       2639
42      2557
45      2473
24      2457
34      2454
47      2254
27      2137
38      1796
5       1534
0       1232
Time taken: 131.25 seconds, Fetched: 48 row(s)




  • Total spend by province

    1. HQL statement
[hadoop@hadoopa command]$ cat province_prince_list.sql
select province,sum(price) as totalPrice
from record join user_dimension on record.uid=user_dimension.uid
group by user_dimension.province
order by totalPrice desc;
2. Execution output

[hadoop@hadoopa command]$ hive -f province_prince_list.sql
which: no hbase in (/usr/lib64/qt-3.3/bin:.:/home/hadoop/apache-hive-2.1.0-bin/bin:/home/hadoop/maven3.3.9/bin:/home/hadoop/hadoop-2.7.3/bin:/home/hadoop/hadoop-2.7.3/sbin:/home/hadoop/apache-flume-1.7.0-bin/bin:/home/hadoop/jdk1.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/hadoop/apache-hive-2.1.0-bin/conf/hive-log4j2.properties Async: true
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20170313115315_d285100f-73a1-44b4-8ffd-9537cbea48e9
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-03-13 11:53:44     Starting to launch local task to process map join;     maximum memory = 518979584
2017-03-13 11:53:48     Dump the side-table for tag: 1 with group count: 1000 into file: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2017-03-13 11:53:48     Uploaded 1 File to: file:/tmp/hadoop/a03a87ad-de38-4374-b1b9-6e0042dd455d/hive_2017-03-13_11-53-15_150_9191727649602197768-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (35297 bytes)
2017-03-13 11:53:48     End of local task; Time Taken: 4.051 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0035, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0035/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0035
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-13 11:54:20,334 Stage-2 map = 0%,  reduce = 0%
2017-03-13 11:54:45,944 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 5.22 sec
2017-03-13 11:55:06,202 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 9.18 sec
MapReduce Total cumulative CPU time: 9 seconds 180 msec
Ended Job = job_1489117385526_0035
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489117385526_0036, Tracking URL = http://hadoopA:8088/proxy/application_1489117385526_0036/
Kill Command = /home/hadoop/hadoop-2.7.3/bin/hadoop job  -kill job_1489117385526_0036
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-03-13 11:55:31,708 Stage-3 map = 0%,  reduce = 0%
2017-03-13 11:55:51,447 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.66 sec
2017-03-13 11:56:15,088 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.75 sec
MapReduce Total cumulative CPU time: 6 seconds 750 msec
Ended Job = job_1489117385526_0036
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 9.18 sec   HDFS Read: 62835 HDFS Write: 1045 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.75 sec   HDFS Read: 6333 HDFS Write: 938 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 930 msec
OK
ZheJiang        11501
ShanDong        11336
LiaoNing        10519
JiLin   10341
FuJian  9742
XiangGang       9371
QingHai 9329
ShanXi3 9283
GuiZhou 9148
HaiNan  9037
HuBei   8868
NeiMengGu       8245
BeiJing 8035
JiangXi 7706
AnHui   7382
HuNan   6666
ShangHai        6441
JiangSu 6025
TaiWan  5988
ShanXi1 5862
ChongQing       5854
XinJiang        5557
HeNan   5390
XiZang  5322
TianJin 4984
YunNan  4695
GuangXi 4646
HeiLongJiang    4639
GanSu   4288
GuangDong       3753
SiChuan 3627
Aomen   3233
NingXia 2444
HeBei   1334
Time taken: 182.306 seconds, Fetched: 34 row(s)

V. Create the user_dimension_orc, brand_dimension_orc, and record_orc tables; these three tables store their data in ORC format


[hadoop@hadoopa command]$ cat create_table_user_orc.sql
create external table if not exists user_dimension_orc (
 uid STRING,
 name STRING,
 gender STRING,
 birth DATE,
 province STRING
)ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 STORED AS ORC
 location 'hdfs://hadoopA:8020/warehouse/user_dimension'
 ;

[hadoop@hadoopa command]$ cat create_table_brand_orc.sql
create external table if not exists brand_dimension_orc (
 bid STRING,
 category STRING,
 brand STRING
)ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 STORED AS ORC
 location 'hdfs://hadoopA:8020/warehouse/brand_dimension'
 ;

[hadoop@hadoopa command]$ cat create_table_record_orc.sql
create table if not exists record_orc (
 rid STRING,
 uid STRING,
 bid STRING,
 trancation_date TIMESTAMP,
 price INT,
 source_province STRING,
 target_province STRING,
 site STRING,
 express_number STRING,
 express_company STRING
)
 PARTITIONED BY (
 partition_date STRING,
 hour INT
 )
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 STORED AS ORC
;

VI. Load data into the three ORC tables

hive> insert into table brand_dimension_orc select * from brand_dimension;
hive> insert into table user_dimension_orc select * from user_dimension;
hive> insert overwrite table record_orc partition (partition_date='2017-03-10',hour=22) select rid,uid,bid,trancation_date,price,source_province,target_province,site,express_number,express_company from record;
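
As a hypothetical sanity check (not part of the original session), the row counts of the text and ORC tables can be compared after the inserts:

hive> select count(*) from record where partition_date='2017-03-10' and hour=22;
hive> select count(*) from record_orc where partition_date='2017-03-10' and hour=22;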