Testing Hive

1. Preparing the test data

 

Data-generation script: gendata.sh

#!/bin/bash
# gendata.sh: generate test data
# usage: gendata.sh <output file> <string embedded in every record>
file=$1
s=$2

touch "$file"

# write 10,000,000 records of the form "<i>,<string><i>"
for ((i = 0; i < 10000000; i++))
do
    str=','$s
    name=${i}${str}${i}
    #echo $name
    echo "$name" >> "$file"
done

echo 'show testdata'
head "$file"

 

Generate the data:

First generate ten small files, 10 million records each:

bash gendata.sh  name.txt name ; bash gendata.sh  zhuzhi.txt zhuzhi ; bash gendata.sh minzu.txt minzu ; bash gendata.sh  jg.txt jg ;bash gendata.sh gj.txt gj ; bash gendata.sh dz.txt dz ; bash gendata.sh abcd.txt abcd ; bash gendata.sh efgh.txt efgh ; bash gendata.sh  xyz.txt xyz ;bash gendata.sh  opq.txt opq

 

total 1.8G

-rw-r--r-- 1 root root 189M Feb  9 10:35 abcd.txt

-rw-r--r-- 1 root root 170M Feb  9 10:32 dz.txt

-rw-r--r-- 1 root root 189M Feb  9 10:38 efgh.txt

-rw-r--r-- 1 root root 170M Feb  9 10:28 gj.txt

-rw-r--r-- 1 root root 170M Feb  9 10:25 jg.txt

-rw-r--r-- 1 root root 199M Feb  9 10:22 minzu.txt

-rw-r--r-- 1 root root 189M Feb  9 10:08 name.txt

-rw-r--r-- 1 root root 180M Feb  9 10:49 opq.txt

-rw-r--r-- 1 root root 180M Feb  9 10:41 xyz.txt

-rw-r--r-- 1 root root 208M Feb  9 10:19 zhuzhi.txt

 

One large file, 100 million records:

 bash gendata.sh  name1000.txt name

 

-rw-r--r--  1 root root 2.1G Feb  9 10:50 name1000.txt

 

2. Test: 10 small files, 10 million records per file (~180 MB each), 100 million records / 1.8 GB in total, with no optimization applied

 

Create the table in Hive:

 

create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

Manually create the partition directory:

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003

 

Upload the data and change ownership:

 

 hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/

 hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par/

 

Repair the partition metadata:

 

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+------------+--+

| partition  |

+------------+--+

+------------+--+

No rows selected (0.125 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;

No rows affected (0.551 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+--------------------------------+--+

|           partition            |

+--------------------------------+--+

| sys_sj=20170209/sys_type=2003  |

+--------------------------------+--+

1 row selected (0.123 seconds)
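As an aside, msck repair is not the only way to register a directory created by hand: the partition could also be added explicitly. A minimal sketch, with the location matching the directory created above:

alter table hyl_test_par add if not exists partition (sys_sj='20170209', sys_type='2003')
location '/apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003';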

 

 

Test:

 

select count(*) from hyl_test_par where name <> ' ' ;

 

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (189.116 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.107 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.551 seconds)

 

Fourth run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.44 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (113.291 seconds)

 

====================================== An inexplicable divider =======================================================

Re-ran in the afternoon; performance was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.274 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.525 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.11 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.722 seconds)

 

 

3. Test: 10 small files, 10 million records per file (~180 MB each), 100 million records / 1.8 GB in total, on a table whose partition statistics were computed with analyze

 

In the same way, create the table for the analyze test and load the data:

 

create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003

hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/

 

show partitions hyl_test_par_ana;

msck repair table hyl_test_par_ana;

show partitions hyl_test_par_ana;

 

Compute statistics on the partition:

0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics ;

INFO  : Session is already open

INFO  : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)

.

.

.

INFO  : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (126.81 seconds)
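The gathered statistics can be checked on the partition afterwards; a quick look (standard HiveQL, not captured from the original session) should show numFiles, numRows, totalSize and rawDataSize under Partition Parameters:

describe formatted hyl_test_par_ana partition (sys_sj='20170209', sys_type='2003');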

 

Test:

select count(*) from hyl_test_par_ana;

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_ana;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.086 seconds)
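The unfiltered count(*) returns almost instantly because, once the partition has statistics, Hive can answer it straight from the metastore without launching a job, provided stats-based answering is switched on. This is a standard Hive setting, shown here as a reminder rather than a value read from this cluster:

set hive.compute.query.using.stats=true;
-- with this enabled, a count(*) with no predicate is served from the numRows statistic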

 

Switch to a different SQL statement:

select count(*) from hyl_test_par_ana where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.239 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.687 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.319 seconds)

 

====================================== An inexplicable divider =======================================================

Re-ran in the afternoon; performance was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (10.923 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.058 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.45 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.218 seconds)

 

 

4. Test: a single large file, 2.0 GB, 100 million records in total, on a table with no optimization

Create a table with the same structure backed by a single large file, and upload the data:

 

 

create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003

hadoop  fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/

 

[hdfs@cluster13 tmp]$ hadoop  fs -du -h /apps/hive/warehouse/hyl_test_par_big/

2.0 G  /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209

 

show partitions hyl_test_par_big;

msck repair table hyl_test_par_big;

show partitions hyl_test_par_big;

 

Test:

select count(*) from hyl_test_par_big where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.356 seconds)

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.3 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.861 seconds)

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.675 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

====================================== An inexplicable divider =======================================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.933 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.435 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.868 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.403 seconds)

 

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

5. Test: 10 small files, ~180 MB each, 100 million records in total, on a table populated via insert into

 

Load data from the first table into a new table via insert into (Hive gathers statistics automatically during this process); distribute by rand(123) spreads the rows evenly across the reducers, which is why the resulting files come out roughly equal in size. Then test:

 

create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);

 

0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as  select * from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)

INFO  : Tez session was closed. Reopening...

INFO  : Session re-established.

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: -/-    Reducer 2: 0/10   

INFO  : Map 1: 0/119    Reducer 2: 0/10   

.

.

.

INFO  : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001

INFO  : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]
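The stats line in the log above is produced because Hive gathers basic statistics automatically while writing the data. This behaviour is controlled by hive.stats.autogather, which defaults to true; shown here only as a reminder, not taken from the original session:

set hive.stats.autogather=true;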

 

[hdfs@cluster13 tmp]$ hadoop  fs -ls -h  hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003

Found 10 items

-rwxrwxrwx   2 hive hdfs    183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0

-rwxrwxrwx   2 hive hdfs    183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0

-rwxrwxrwx   2 hive hdfs    184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0

-rwxrwxrwx   2 hive hdfs    183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0

-rwxrwxrwx   2 hive hdfs    184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0

-rwxrwxrwx   2 hive hdfs    183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (14.653 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (13.989 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (9.236 seconds)

 

 

drop table hyl_test_par_auto;

 

create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: 0/119    Reducer 2: 0/10  

.

.

INFO  : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000

INFO  : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (135.325 seconds)

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.303 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.56 seconds)

 

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.446 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.643 seconds)

 

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_auto ;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.098 seconds)

 

 

====================================== An inexplicable divider =======================================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.282 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.414 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.452 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.24 seconds)

 

 

6. Summary of results:

 

Time (seconds) \ Setup          | 10*180MB, none | 10*180MB, analyze | 1*2GB, none  | 10*180MB, insert into
Run 1                           | 11.274         | 10.923            | 11.933       | 11.282
Run 2                           | 11.525         | 6.058             | 4.435        | 3.414
Run 3                           | 11.11          | 6.45              | 5.868        | 6.452
Run 4                           | 11.722         | 6.218             | 3.403        | 6.24
Average                         | 11.40775       | 7.41225           | 6.40975      | 6.847
Warm average (excluding run 1)  | 11.45233333    | 6.242             | 4.568666667  | 5.368666667

 

 

7. Looking for further optimizations:

 

A single-file table populated via insert into (the reducer count is set to 1 so the insert writes one output file):

 

set mapred.reduce.tasks=1;

create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Tez session hasn't been created yet. Opening session

INFO  : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)

 

INFO  : Map 1: -/-    Reducer 2: 0/1

INFO  : Map 1: 0/119    Reducer 2: 0/1

.

.

.

INFO  : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000

INFO  : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (104.637 seconds)

 

Test:

select count(*) from hyl_test_par_big_auto where name <> ' ';

Runs:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.744 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.188 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.041 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.198 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.788 seconds)

 

 

Time (seconds) \ Setup          | 10*180MB, none | 10*180MB, analyze | 1*2GB, none  | 10*180MB, insert into | 1*2GB, insert into
Run 1                           | 11.274         | 10.923            | 11.933       | 11.282                | 11.744
Run 2                           | 11.525         | 6.058             | 4.435        | 3.414                 | 4.188
Run 3                           | 11.11          | 6.45              | 5.868        | 6.452                 | 4.041
Run 4                           | 11.722         | 6.218             | 3.403        | 6.24                  | 5.198
Average                         | 11.40775       | 7.41225           | 6.40975      | 6.847                 | 6.29275
Warm average (excluding run 1)  | 11.45233333    | 6.242             | 4.568666667  | 5.368666667           | 4.475666667

 

Optimization: column-level statistics

 

analyze table hyl_test_par_ana compute statistics for columns;
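Column statistics mainly benefit the cost-based optimizer rather than this simple count; for them to be picked up at planning time, the usual switches are the ones below (standard Hive settings, listed as a sketch rather than values read from this cluster):

set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;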

 

select count(*) from hyl_test_par_ana where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.519 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

 

1 row selected (7.688 seconds)

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.456 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.651 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.413 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.62 seconds)

 

 

 

 

Vectorized query execution: data is processed in batches of 1,024 rows at a time instead of row by row:

 

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;
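One caveat: in this generation of Hive, vectorized execution only takes effect on tables stored as ORC, so flipping these switches on a textfile table is unlikely to change anything by itself, which is consistent with the mixed numbers below. To actually exercise vectorization, the data would first have to be copied into an ORC table; a hypothetical sketch (hyl_test_par_orc is not part of the original test):

create table hyl_test_par_orc stored as orc as select * from hyl_test_par;
select count(*) from hyl_test_par_orc where name <> ' ';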

 

select count(*) from hyl_test_par where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.54 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.859 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.134 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.844 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.089 seconds)

Reposted from: https://my.oschina.net/u/3115904/blog/839486
