Testing Hive

1. Preparing the test data

 

Data-generation script: gendata.sh

#!/bin/bash
# gendata.sh: generate test data
# usage: gendata.sh <output file> <string embedded in every record>
file=$1
s=$2

touch "$file"

# write 10,000,000 records of the form "<i>,<string><i>"
for ((i = 0; i < 10000000; i++))
do
    str=','$s
    name=${i}${str}${i}
    #echo $name
    echo "$name" >> "$file"
done

echo 'show testdata'
head "$file"

 

Generate the data:

First generate ten small files, 10 million records each:

bash gendata.sh  name.txt name ; bash gendata.sh  zhuzhi.txt zhuzhi ; bash gendata.sh minzu.txt minzu ; bash gendata.sh  jg.txt jg ;bash gendata.sh gj.txt gj ; bash gendata.sh dz.txt dz ; bash gendata.sh abcd.txt abcd ; bash gendata.sh efgh.txt efgh ; bash gendata.sh  xyz.txt xyz ;bash gendata.sh  opq.txt opq

 

total 1.8G

-rw-r--r-- 1 root root 189M Feb  9 10:35 abcd.txt

-rw-r--r-- 1 root root 170M Feb  9 10:32 dz.txt

-rw-r--r-- 1 root root 189M Feb  9 10:38 efgh.txt

-rw-r--r-- 1 root root 170M Feb  9 10:28 gj.txt

-rw-r--r-- 1 root root 170M Feb  9 10:25 jg.txt

-rw-r--r-- 1 root root 199M Feb  9 10:22 minzu.txt

-rw-r--r-- 1 root root 189M Feb  9 10:08 name.txt

-rw-r--r-- 1 root root 180M Feb  9 10:49 opq.txt

-rw-r--r-- 1 root root 180M Feb  9 10:41 xyz.txt

-rw-r--r-- 1 root root 208M Feb  9 10:19 zhuzhi.txt

 

One large file, 100 million records:

 bash gendata.sh  name1000.txt name

 

-rw-r--r--  1 root root 2.1G Feb  9 10:50 name1000.txt

 

2. Test: 10 small files, 10 million records per file (~180 MB each), 100 million records / 1.8 GB in total, with no optimization applied

 

Create the table in Hive:

 

create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

Manually create the partition directory:

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003

 

Upload the data and change ownership:

 

 hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/

 hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par/

 

Repair the partition metadata:

 

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+------------+--+

| partition  |

+------------+--+

+------------+--+

No rows selected (0.125 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;

No rows affected (0.551 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+--------------------------------+--+

|           partition            |

+--------------------------------+--+

| sys_sj=20170209/sys_type=2003  |

+--------------------------------+--+

1 row selected (0.123 seconds)
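As an aside, msck repair is not the only way to register a directory created by hand: the partition could also be added explicitly. A minimal sketch, with the location matching the directory created above:

alter table hyl_test_par add if not exists partition (sys_sj='20170209', sys_type='2003')
location '/apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003';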

 

 

Test:

 

select count(*) from hyl_test_par where name <> ' ' ;

 

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (189.116 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.107 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.551 seconds)

 

Fourth run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.44 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (113.291 seconds)

 

====================================== An inexplicable divider =======================================================

Re-ran in the afternoon; performance was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.274 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.525 seconds)

 

 

Third run:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.11 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.722 seconds)

 

 

3. Test: 10 small files, 10 million records per file (~180 MB each), 100 million records / 1.8 GB in total, on a table whose partition statistics were computed with analyze

 

In the same way, create the table for the analyze test and load the data:

 

create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003

hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/

 

show partitions hyl_test_par_ana;

msck repair table hyl_test_par_ana;

show partitions hyl_test_par_ana;

 

Compute statistics on the partition:

0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics ;

INFO  : Session is already open

INFO  : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)

.

.

.

INFO  : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (126.81 seconds)
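The gathered statistics can be checked on the partition afterwards; a quick look (standard HiveQL, not captured from the original session) should show numFiles, numRows, totalSize and rawDataSize under Partition Parameters:

describe formatted hyl_test_par_ana partition (sys_sj='20170209', sys_type='2003');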

 

Test:

select count(*) from hyl_test_par_ana;

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_ana;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.086 seconds)
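The unfiltered count(*) returns almost instantly because, once the partition has statistics, Hive can answer it straight from the metastore without launching a job, provided stats-based answering is switched on. This is a standard Hive setting, shown here as a reminder rather than a value read from this cluster:

set hive.compute.query.using.stats=true;
-- with this enabled, a count(*) with no predicate is served from the numRows statistic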

 

Switch to a different SQL statement:

select count(*) from hyl_test_par_ana where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.239 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.687 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.319 seconds)

 

====================================== An inexplicable divider =======================================================

Re-ran in the afternoon; performance was more than 10x faster!!!

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (10.923 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.058 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.45 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.218 seconds)

 

 

4. Test: a single large file, 2.0 GB, 100 million records in total, on a table with no optimization

Create a table with the same structure backed by a single large file, and upload the data:

 

 

create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003

hadoop  fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/

 

[hdfs@cluster13 tmp]$ hadoop  fs -du -h /apps/hive/warehouse/hyl_test_par_big/

2.0 G  /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209

 

show partitions hyl_test_par_big;

msck repair table hyl_test_par_big;

show partitions hyl_test_par_big;

 

Test:

select count(*) from hyl_test_par_big where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.356 seconds)

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.3 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.861 seconds)

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.675 seconds)

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

====================================== An inexplicable divider =======================================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.933 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.435 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.868 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.403 seconds)

 

 

Fifth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

5. Test: 10 small files, ~180 MB each, 100 million records in total, on a table populated via insert into

 

Load data from the first table into a new table via insert into (Hive gathers statistics automatically during this process); distribute by rand(123) spreads the rows evenly across the reducers, which is why the resulting files come out roughly equal in size. Then test:

 

create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);

 

0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as  select * from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)

INFO  : Tez session was closed. Reopening...

INFO  : Session re-established.

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: -/-    Reducer 2: 0/10   

INFO  : Map 1: 0/119    Reducer 2: 0/10   

.

.

.

INFO  : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001

INFO  : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]
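The stats line in the log above is produced because Hive gathers basic statistics automatically while writing the data. This behaviour is controlled by hive.stats.autogather, which defaults to true; shown here only as a reminder, not taken from the original session:

set hive.stats.autogather=true;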

 

[hdfs@cluster13 tmp]$ hadoop  fs -ls -h  hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003

Found 10 items

-rwxrwxrwx   2 hive hdfs    183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0

-rwxrwxrwx   2 hive hdfs    183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0

-rwxrwxrwx   2 hive hdfs    184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0

-rwxrwxrwx   2 hive hdfs    183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0

-rwxrwxrwx   2 hive hdfs    184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0

-rwxrwxrwx   2 hive hdfs    183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (14.653 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (13.989 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (9.236 seconds)

 

 

drop table hyl_test_par_auto;

 

create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: 0/119    Reducer 2: 0/10  

.

.

INFO  : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000

INFO  : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (135.325 seconds)

 

 

Test:

select count(*) from hyl_test_par_auto where name <> ' ';

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.303 seconds)

 

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.56 seconds)

 

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.446 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.643 seconds)

 

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_auto ;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.098 seconds)

 

 

====================================== An inexplicable divider =======================================================

Afternoon test:

First run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.282 seconds)

 

Second run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.414 seconds)

 

Third run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.452 seconds)

 

Fourth run:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.24 seconds)

 

 

6. Summary of results:

 

Time (seconds) \ Setup          | 10*180MB, none | 10*180MB, analyze | 1*2GB, none  | 10*180MB, insert into
Run 1                           | 11.274         | 10.923            | 11.933       | 11.282
Run 2                           | 11.525         | 6.058             | 4.435        | 3.414
Run 3                           | 11.11          | 6.45              | 5.868        | 6.452
Run 4                           | 11.722         | 6.218             | 3.403        | 6.24
Average                         | 11.40775       | 7.41225           | 6.40975      | 6.847
Warm average (excluding run 1)  | 11.45233333    | 6.242             | 4.568666667  | 5.368666667

 

 

7. Looking for further optimizations:

 

A single-file table populated via insert into (the reducer count is set to 1 so the insert writes one output file):

 

set mapred.reduce.tasks=1;

create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Tez session hasn't been created yet. Opening session

INFO  : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)

 

INFO  : Map 1: -/-    Reducer 2: 0/1

INFO  : Map 1: 0/119    Reducer 2: 0/1

.

.

.

INFO  : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000

INFO  : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (104.637 seconds)

 

Test:

select count(*) from hyl_test_par_big_auto where name <> ' ';

Runs:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.744 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.188 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.041 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.198 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.788 seconds)

 

 

Time (seconds) \ Setup          | 10*180MB, none | 10*180MB, analyze | 1*2GB, none  | 10*180MB, insert into | 1*2GB, insert into
Run 1                           | 11.274         | 10.923            | 11.933       | 11.282                | 11.744
Run 2                           | 11.525         | 6.058             | 4.435        | 3.414                 | 4.188
Run 3                           | 11.11          | 6.45              | 5.868        | 6.452                 | 4.041
Run 4                           | 11.722         | 6.218             | 3.403        | 6.24                  | 5.198
Average                         | 11.40775       | 7.41225           | 6.40975      | 6.847                 | 6.29275
Warm average (excluding run 1)  | 11.45233333    | 6.242             | 4.568666667  | 5.368666667           | 4.475666667

 

Optimization: column-level statistics

 

analyze table hyl_test_par_ana compute statistics for columns;
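Column statistics mainly benefit the cost-based optimizer rather than this simple count; for them to be picked up at planning time, the usual switches are the ones below (standard Hive settings, listed as a sketch rather than values read from this cluster):

set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;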

 

select count(*) from hyl_test_par_ana where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.519 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

 

1 row selected (7.688 seconds)

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.456 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.651 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.413 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.62 seconds)

 

 

 

 

Vectorized query execution: data is processed in batches of 1,024 rows at a time instead of row by row:

 

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;
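One caveat: in this generation of Hive, vectorized execution only takes effect on tables stored as ORC, so flipping these switches on a textfile table is unlikely to change anything by itself, which is consistent with the mixed numbers below. To actually exercise vectorization, the data would first have to be copied into an ORC table; a hypothetical sketch (hyl_test_par_orc is not part of the original test):

create table hyl_test_par_orc stored as orc as select * from hyl_test_par;
select count(*) from hyl_test_par_orc where name <> ' ';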

 

select count(*) from hyl_test_par where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.54 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.859 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.134 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.844 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.089 seconds)

Reposted from: https://my.oschina.net/u/3115904/blog/839486
