Testing Common Hadoop Storage Formats

This post stores the same 100,000-row page_views table as SequenceFile, RCFile, ORC, and Parquet on a CDH 5.7 (Hadoop 2.6.0) cluster and compares the resulting file sizes on HDFS. The session's default file format is TextFile:

hive> set hive.default.fileformat;
hive.default.fileformat=TextFile

First, disable output compression so the comparison reflects the storage format alone:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true

hive> set hive.exec.compress.output=false;

hive> set hive.exec.compress.output;
hive.exec.compress.output=false
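For reference, the source table page_views was created beforehand; its DDL is not shown in this session. Below is a minimal sketch of what it could look like, with the column names taken from the row dump in the failed job near the end of this post (the '\t' delimiter and the load path are assumptions):

-- Hypothetical reconstruction of the source table (TextFile is the default format).
create table page_views (
  track_times string,
  url         string,
  session_id  string,
  referer     string,
  ip          string,
  end_user_id string,
  city_id     string
)
row format delimited fields terminated by '\t';

-- The local path below is a placeholder.
load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views;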

SEQUENCEFILE 

hive> create table page_views_seq
    > stored as SEQUENCEFILE
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0005, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:11:26,067 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:11:33,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1555643336639_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-11-19_748_1458653148881308947-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_seq
Table default.page_views_seq stats: [numFiles=1, numRows=100000, totalSize=20501449, rawDataSize=18914993]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.85 sec   HDFS Read: 19018400 HDFS Write: 20501537 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
Time taken: 14.846 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_seq
Found 1 items
-rwxr-xr-x   1 hadoop supergroup   20501449 2019-04-19 15:11 /user/hive/warehouse/page_views_seq/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_seq/*
19.6 M  19.6 M  /user/hive/warehouse/page_views_seq/000000_0
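Compression was deliberately left off above. Here is a sketch of how a compressed SequenceFile copy could be produced instead, using the classic mapred output-compression properties (the table name page_views_seq_bz is hypothetical):

-- Enable output compression just for this test, then turn it back off.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

create table page_views_seq_bz
stored as SEQUENCEFILE
as select * from page_views;

SET hive.exec.compress.output=false;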

RCFILE 

hive> create table page_views_rcfile
    > stored as RCFILE
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0006, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:14:00,203 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:14:06,565 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1555643336639_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-13-53_788_8414345647588210415-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_rcfile
Table default.page_views_rcfile stats: [numFiles=1, numRows=100000, totalSize=18799578, rawDataSize=18314993]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.68 sec   HDFS Read: 19018443 HDFS Write: 18799669 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
Time taken: 15.03 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_rcfile
Found 1 items
-rwxr-xr-x   1 hadoop supergroup   18799578 2019-04-19 15:14 /user/hive/warehouse/page_views_rcfile/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_rcfile/*
17.9 M  17.9 M  /user/hive/warehouse/page_views_rcfile/000000_0
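To confirm which format a table actually uses, desc formatted shows the input/output format classes; for this table it should report org.apache.hadoop.hive.ql.io.RCFileInputFormat:

desc formatted page_views_rcfile;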

ORC

ORC's default compression codec is ZLIB:

hive> set hive.exec.orc.default.compress;
hive.exec.orc.default.compress=ZLIB

hive> create table page_views_orc
    > stored as ORC
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0007, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:21:21,912 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:21:29,295 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.27 sec
MapReduce Total cumulative CPU time: 4 seconds 270 msec
Ended Job = job_1555643336639_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-21-16_165_6566607993878562298-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_orc
Table default.page_views_orc stats: [numFiles=1, numRows=100000, totalSize=2914012, rawDataSize=76900000]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.27 sec   HDFS Read: 19018431 HDFS Write: 2914100 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 270 msec
OK
Time taken: 15.427 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_orc
Found 1 items
-rwxr-xr-x   1 hadoop supergroup    2914012 2019-04-19 15:21 /user/hive/warehouse/page_views_orc/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_orc/*
2.8 M  2.8 M  /user/hive/warehouse/page_views_orc/000000_0
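ZLIB is not the only option: orc.compress also accepts SNAPPY (and NONE, shown next). A sketch of a SNAPPY variant (table name hypothetical), whose size should land between the ZLIB and NONE results:

create table page_views_orc_snappy
stored as ORC tblproperties ("orc.compress"="SNAPPY")
as select * from page_views;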

Next, create an ORC table with compression disabled:

hive> create table page_views_orc_none
    > stored as ORC tblproperties ("orc.compress"="NONE")
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0008, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:27:25,281 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:27:32,558 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.28 sec
MapReduce Total cumulative CPU time: 4 seconds 280 msec
Ended Job = job_1555643336639_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-27-19_598_2293440779180455372-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_orc_none
Table default.page_views_orc_none stats: [numFiles=1, numRows=100000, totalSize=8101548, rawDataSize=76900000]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.28 sec   HDFS Read: 19018456 HDFS Write: 8101641 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 280 msec
OK
Time taken: 14.205 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_orc_none
Found 1 items
-rwxr-xr-x   1 hadoop supergroup    8101548 2019-04-19 15:27 /user/hive/warehouse/page_views_orc_none/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_orc_none/*
7.7 M  7.7 M  /user/hive/warehouse/page_views_orc_none/000000_0
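The codec actually written into an ORC file can be verified with the orcfiledump service, assuming it is available in this Hive build; its output includes a Compression: line:

hive --service orcfiledump /user/hive/warehouse/page_views_orc/000000_0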

PARQUET

Parquet compression is controlled by the parquet.compression property, which is unset by default in this environment:

hive> set parquet.compression;
parquet.compression is undefined

hive> create table page_views_parquet
    > stored as PARQUET
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0009, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:37:15,140 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:37:23,556 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.34 sec
MapReduce Total cumulative CPU time: 5 seconds 340 msec
Ended Job = job_1555643336639_0009
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-37-08_429_4827863228315749856-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_parquet
Table default.page_views_parquet stats: [numFiles=1, numRows=100000, totalSize=4050771, rawDataSize=700000]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.34 sec   HDFS Read: 19018481 HDFS Write: 4050861 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 340 msec
OK
Time taken: 16.359 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_parquet
Found 1 items
-rwxr-xr-x   1 hadoop supergroup    4050771 2019-04-19 15:37 /user/hive/warehouse/page_views_parquet/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_parquet/*
3.9 M  3.9 M  /user/hive/warehouse/page_views_parquet/000000_0
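If the parquet-tools CLI is installed, the codec recorded in the Parquet file footer can be checked directly (the tool's location and whether it accepts HDFS URIs vary by installation; this invocation is a sketch):

parquet-tools meta hdfs://hadoop004:9000/user/hive/warehouse/page_views_parquet/000000_0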

Now set parquet.compression to gzip and repeat the test:

hive> set parquet.compression=gzip;
hive> set parquet.compression;
parquet.compression=gzip
hive> create table page_views_parquet_gzip
    > stored as PARQUET
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0010, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:41:16,722 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:41:25,119 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.18 sec
MapReduce Total cumulative CPU time: 5 seconds 180 msec
Ended Job = job_1555643336639_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-19_15-41-09_976_9091496854159646261-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_parquet_gzip
Table default.page_views_parquet_gzip stats: [numFiles=1, numRows=100000, totalSize=4050771, rawDataSize=700000]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.18 sec   HDFS Read: 19018486 HDFS Write: 4050866 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 180 msec
OK
Time taken: 16.374 seconds
[hadoop@hadoop004 hadoop]$ hdfs dfs -ls /user/hive/warehouse/page_views_parquet_gzip
Found 1 items
-rwxr-xr-x   1 hadoop supergroup    4050771 2019-04-19 15:41 /user/hive/warehouse/page_views_parquet_gzip/000000_0

[hadoop@hadoop004 hadoop]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_parquet_gzip/*
3.9 M  3.9 M  /user/hive/warehouse/page_views_parquet_gzip/000000_0
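Note that the gzip run produced a file of exactly the same size (4050771 bytes) as the uncompressed run, which suggests the session-level parquet.compression setting may not have taken effect for this CTAS. In Hive versions that support it, setting the codec as a table property is a more reliable alternative; a sketch (table name hypothetical):

create table page_views_parquet_gzip2
stored as PARQUET tblproperties ("parquet.compression"="GZIP")
as select * from page_views;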

Finally, try bzip2:

hive> set parquet.compression=bzip2;
hive> set parquet.compression;
parquet.compression=bzip2
hive> create table page_views_parquet_bzip2
    > stored as PARQUET
    > as select * from page_views;
Query ID = hadoop_20190419143939_53d67293-92de-4697-a791-f9a1afe7be01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555643336639_0012, Tracking URL = http://hadoop004:8088/proxy/application_1555643336639_0012/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1555643336639_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-19 15:44:59,446 Stage-1 map = 0%,  reduce = 0%
2019-04-19 15:45:20,283 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1555643336639_0012 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1555643336639_0012_m_000000 (and more) from job job_1555643336639_0012

Task with the most failures(4):
-----
Task ID:
  task_1555643336639_0012_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1555643336639_0012&tipid=task_1555643336639_0012_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"track_times":"2013-05-19 13:00:00","url":"http://www.taobao.com/17/?tracker_u=1624169&type=1","session_id":"B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1","referer":"http://hao.360.cn/","ip":"1.196.34.243","end_user_id":"NULL","city_id":"-1"}
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"track_times":"2013-05-19 13:00:00","url":"http://www.taobao.com/17/?tracker_u=1624169&type=1","session_id":"B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1","referer":"http://hao.360.cn/","ip":"1.196.34.243","end_user_id":"NULL","city_id":"-1"}
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
	... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.BZIP2
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:525)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:623)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
	... 9 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.BZIP2
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:570)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:514)
	... 16 more
Caused by: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.BZIP2
	at java.lang.Enum.valueOf(Enum.java:236)
	at parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:24)
	at parquet.hadoop.metadata.CompressionCodecName.fromConf(CompressionCodecName.java:34)
	at parquet.hadoop.codec.CodecConfig.getParquetCompressionCodec(CodecConfig.java:81)
	at parquet.hadoop.codec.CodecConfig.getCodec(CodecConfig.java:88)
	at parquet.hadoop.ParquetOutputFormat.getCodec(ParquetOutputFormat.java:233)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.<init>(ParquetRecordWriterWrapper.java:65)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:125)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:114)
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:260)
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:245)
	... 18 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
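The root cause is in the stack trace: parquet's CompressionCodecName enum in this version defines only UNCOMPRESSED, SNAPPY, GZIP, and LZO, so BZIP2 is rejected before any data is written. Switching back to a supported codec lets the CTAS go through; a sketch (table name hypothetical):

set parquet.compression=SNAPPY;
create table page_views_parquet_snappy
stored as PARQUET
as select * from page_views;

Taking stock of the sizes observed above for the same 100,000 rows: SequenceFile 19.6 M, RCFile 17.9 M, ORC with the default ZLIB 2.8 M, ORC uncompressed 7.7 M, and Parquet 3.9 M. The columnar formats come out dramatically smaller than the row-oriented ones on this dataset, with ORC+ZLIB the most compact.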

 
