Hadoop压缩配置
MR支持的压缩编码
压缩格式 | 工具 | 算法 | 文件扩展名 | 是否可切分 |
DEFLATE | 无 | DEFLATE | .deflate | 否 |
Gzip | gzip | DEFLATE | .gz | 否 |
bzip2 | bzip2 | bzip2 | .bz2 | 是 |
LZO | lzop | LZO | .lzo | 是 |
Snappy | 无 | Snappy | .snappy | 否 |
为了支持多种压缩/解压缩算法,Hadoop引入了编码/解码器,如下表所示:
压缩格式 | 对应的编码/解码器 |
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
压缩性能的比较:
压缩算法 | 原始文件大小 | 压缩文件大小 | 压缩速度 | 解压速度 |
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
压缩参数配置
要在Hadoop中启用压缩,可以配置如下参数(mapred-site.xml文件中):
参数 | 默认值 | 阶段 | 建议 |
io.compression.codecs (在core-site.xml中配置) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | 输入压缩 | Hadoop使用文件扩展名判断是否支持某种编解码器 |
mapreduce.map.output.compress | false | mapper输出 | 这个参数设为true启用压缩 |
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | mapper输出 | 使用LZO、LZ4或snappy编解码器在此阶段压缩数据 |
mapreduce.output.fileoutputformat.compress | false | reducer输出 | 这个参数设为true启用压缩 |
mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress. DefaultCodec | reducer输出 | 使用标准工具或者编解码器,如gzip和bzip2 |
mapreduce.output.fileoutputformat.compress.type | RECORD | reducer输出 | SequenceFile输出使用的压缩类型:NONE和BLOCK |
开启Map输出阶段压缩
开启map输出阶段压缩可以减少job中map和Reduce task间数据传输量。具体配置如下:
案例实操:
1.开启hive中间传输数据压缩功能
hive (default)>set hive.exec.compress.intermediate=true;
2.开启mapreduce中map输出压缩功能
hive (default)>set mapreduce.map.output.compress=true;
3.设置mapreduce中map输出数据的压缩方式
hive (default)>set mapreduce.map.output.compress.codec=
org.apache.hadoop.io.compress.SnappyCodec;
4.执行查询语句
hive (default)> select count(ename) name from emp;
开启Reduce输出阶段压缩
当Hive将输出写入到表中时,输出内容同样可以进行压缩。属性hive.exec.compress.output控制着这个功能。用户可能需要保持默认设置文件中的默认值false,这样默认的输出就是非压缩的纯文本文件了。用户可以通过在查询语句或执行脚本中设置这个值为true,来开启输出结果压缩功能。
案例实操:
1.开启hive最终输出数据压缩功能
hive (default)>set hive.exec.compress.output=true;
2.开启mapreduce最终输出数据压缩
hive (default)>set mapreduce.output.fileoutputformat.compress=true;
3.设置mapreduce最终数据输出压缩方式
hive (default)> set mapreduce.output.fileoutputformat.compress.codec =
org.apache.hadoop.io.compress.SnappyCodec;
4.设置mapreduce最终数据输出压缩为块压缩
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5.测试一下输出结果是否是压缩文件
hive (default)> insert overwrite local directory
'/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
文件存储格式
Hive支持的存储数据的格式主要有:TEXTFILE 、SEQUENCEFILE、ORC、PARQUET。
1 列式存储和行式存储
列式存储和行式存储
左边为逻辑表,右边第一个为行式存储,第二个为列式存储。
1.行存储的特点
查询满足条件的一整行数据的时候,列存储则需要去每个聚集的字段找到对应的每个列的值,行存储只需要找到其中一个值,其余的值都在相邻地方,所以此时行存储查询的速度更快。
2.列存储的特点
因为每个字段的数据聚集存储,在查询只需要少数几个字段的时候,能大大减少读取的数据量;每个字段的数据类型一定是相同的,列式存储可以针对性的设计更好的设计压缩算法。
TEXTFILE和SEQUENCEFILE的存储格式都是基于行存储的;
ORC和PARQUET是基于列式存储的。
2 TextFile格式
默认格式,数据不做压缩,磁盘开销大,数据解析开销大。可结合Gzip、Bzip2使用,但使用Gzip这种方式,hive不会对数据进行切分,从而无法对数据进行并行操作。
3 Orc格式
Orc (Optimized Row Columnar)是Hive 0.11版里引入的新的存储格式。
如图所示可以看到每个Orc文件由1个或多个stripe组成,每个stripe一般为HDFS的块大小,每一个stripe包含多条记录,这些记录按照列进行独立存储,对应到Parquet中的row group的概念。每个Stripe里有三部分组成,分别是Index Data,Row Data,Stripe Footer:
Orc格式
1)Index Data:一个轻量级的index,默认是每隔1W行做一个索引。这里做的索引应该只是记录某行的各字段在Row Data中的offset。
2)Row Data:存的是具体的数据,先取部分行,然后对这些行按列进行存储。对每个列进行了编码,分成多个Stream来存储。
3)Stripe Footer:存的是各个Stream的类型,长度等信息。
每个文件有一个File Footer,这里面存的是每个Stripe的行数,每个Column的数据类型信息等;每个文件的尾部是一个PostScript,这里面记录了整个文件的压缩类型以及FileFooter的长度信息等。在读取文件时,会seek到文件尾部读PostScript,从里面解析到File Footer长度,再读FileFooter,从里面解析到各个Stripe信息,再读各个Stripe,即从后往前读。
4 Parquet格式
Parquet文件是以二进制方式存储的,所以是不可以直接读取的,文件中包括该文件的数据和元数据,因此Parquet格式文件是自解析的。
- 行组(Row Group):每一个行组包含一定的行数,在一个HDFS文件中至少存储一个行组,类似于orc的stripe的概念。
- 列块(Column Chunk):在一个行组中每一列保存在一个列块中,行组中的所有列连续的存储在这个行组文件中。一个列块中的值都是相同类型的,不同的列块可能使用不同的算法进行压缩。
- 页(Page):每一个列块划分为多个页,一个页是最小的编码的单位,在同一个列块的不同页可能使用不同的编码方式。
通常情况下,在存储Parquet数据的时候会按照Block大小设置行组的大小,由于一般情况下每一个Mapper任务处理数据的最小单位是一个Block,这样可以把每一个行组由一个Mapper任务处理,增大任务执行并行度。Parquet文件的格式如图6-12所示。
Parquet格式
上图展示了一个Parquet文件的内容,一个文件中可以存储多个行组,文件的首位都是该文件的Magic Code,用于校验它是否是一个Parquet文件,Footer length记录了文件元数据的大小,通过该值和文件长度可以计算出元数据的偏移量,文件的元数据中包括每一个行组的元数据信息和该文件存储数据的Schema信息。除了文件中每一个行组的元数据,每一页的开始都会存储该页的元数据,在Parquet中,有三种类型的页:数据页、字典页和索引页。数据页用于存储当前行组中该列的值,字典页存储该列值的编码字典,每一个列块中最多包含一个字典页,索引页用来存储当前行组下该列的索引,目前Parquet中还不支持索引页。
8.5.5 主流文件存储格式对比实验
从存储文件的压缩比和查询速度两个角度对比。
存储文件的压缩比测试:
- 测试数据
2017-08-10 13:00:00 http://www.taobao.com/17/?tracker_u=1624169&type=1 B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1 http://hao.360.cn/ 1.196.34.243 NULL -1
2017-08-10 13:00:00 http://www.taobao.com/item/962967_14?ref=1_1_52_search.ctg_1 T82C9WBFB1N8EW14YF2E2GY8AC9K5M5P http://www.yihaodian.com/ctg/s2/c24566-%E5%B1%B1%E6%A5%82%E5%88%B6%E5%93%81?ref=pms_15_78_258 222.78.246.228 134939954 156
2017-08-10 13:00:00 http://www.taobao.com/1/?tracker_u=1013304189&uid=2687512&type=3 W17C89RU8DZ6NMN7JD2ZCBDMX1CQVZ1W http://www.yihaodian.com/1/?tracker_u=1013304189&uid=2687512&type=3 118.205.0.18 NULL -20
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_5146 f55598cafba346eb217ff3fbd0de2930 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5135 10.4.6.53 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_24728 f55598cafba346eb217ff3fbd0de2930 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5146 10.4.4.109 NULL -1000
2017-08-10 13:00:00 http://union.taobao.com/link_make/viewPicInfo.do?imgSize=660x70&truckerU=101542127 4PBTT18JEEJHM91DNGKNUSZTA29W8WP3 http://www.jintoutiao.com/yule.html 125.38.159.84 NULL -30
2017-08-10 13:00:00 http://www.taobao.com/item/9680587_1?smt_b=C0B0A09BECBF110A470F00C 615MMAA5RFVRRMHJVP5VCHTQGEGDW988 211.167.237.134 NULL -20
2017-08-10 13:00:00 http://union.taobao.com/link_make/viewPicInfo.do?imgSize=660x70&truckerU=101542127 QCU8A6VCNX28YGGJVQAABG4BJ6F8QT1B http://jintoutiao.com/yule.html 117.22.144.64 NULL 327
2017-08-10 13:00:00 http://www.taobao.com/ctg/s2/c8325-%E7%99%BE%E6%B4%81%E5%B8%83-%E9%92%A2%E4%B8%9D%E7%90%83/ ZYVF5W5WDFGAQP1S45RTFR7SPVM8GY5Q http://www.yihaodian.com/item/user/continueShopping.do?lc=\u6c34\u4e0a\u7528\u5177&csps=1984887_3_1_950411&pageType=B 61.128.247.131 5305018 -40
2017-08-10 13:00:00 http://m.taobao.com/getProductDetail_1_1001417 A90A3460D49D437E96BC3B4DC138B628 http://m.yihaodian.com/searchProduct_1_银杏洗面奶_2 222.128.241.1 NULL -20
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/k%25E6%25B4%2597%25E5%258F%2591%25E6%25B0%25B4/ WGSMFJD42BQEYB57EUHPUHHZMZHJTGX2 http://www.yihaodian.com/3/?type=3&tracker_u=1013241403 60.30.93.90 NULL -30
2017-08-10 13:00:00 http://m.taobao.com/ RJSDGJVTMNBHXE9HX53ZTZTV11NCYTED 58.39.59.80 NULL -10
2017-08-10 13:00:00 http://www.taobao.com/cart/cart.do?action=view CB4CUVX8MHE1NF3F8GKSJCZMNUYVQB9N 116.25.135.202 126027330 238
2017-08-10 13:00:00 http://m.taobao.com 169003c92a02da66a335f422acc69242 http://m.yihaodian.com/getCategoryByRootCategoryId_1_0 10.4.4.109 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/searchProduct ede5b4097d7990b15d9c3c566df7e04d 10.4.6.53 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/getCurrentGrouponList_3_-1 D14A18209FB9FDB9A971293D9AF0330C http://m.yihaodian.com/getCurrentGrouponList_3_-1 61.149.71.175 NULL -20
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_5404 f55598cafba346eb217ff3fbd0de2930 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5138 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://www.taobao.com/ctg/s2/c5464-%E8%82%89%E5%B9%B2-%E8%82%89%E6%9D%BE/ AEHC5FQSA4CYAZQMGP7113B25QTNCS2C http://www.yihaodian.com/19/?tracker_u=10535027967&type=1 222.240.184.147 NULL 222
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/b907857/a40045-s1-v0-p1-price-d0-f0-m1-rt0-pid-khtc/ XQQMRX1HSE3AN9F33HCSV4QSDHU22N7Z http://search.yihaodian.com/s2/c0-0/b907857/ 202.204.48.147 NULL -20
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/k%25E5%25B0%258F%25E8%2584%259A%25E4%25B8%25AB%25E5%259E%258B%25E5%258F%258C%25E9%259D%25A2%25E8%25B4%25B4%25E7%2589%25A9%25E5%2599%25A8%2520/ HA8NMG2CNF1SEPZCEFAU3DTJVK1NQCXZ 183.153.86.246 134989533 52
2017-08-10 13:00:00 http://www.taobao.com/item/1065245_6?ref=1_1_52_search.ctg_1 CKVXZVP6ENDJVM9T6JYA6U8H3UMG66PU http://www.yihaodian.com/ctg/s2/c6631-%E5%8D%B8%E5%A6%86/ 119.135.177.40 7065561 252
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/k%25E5%258A%259E%25E5%2585%25AC/ 6844YVHVX7X3MHCT4ZSK6B1RU2ADAGCW 61.164.209.63 NULL 52
2017-08-10 13:00:00 http://www.taobao.com/checkoutV3/index.do BY2AP4Y7MA1QFS3Z1VXRGBAGUNX3KYR9 http://www.yihaodian.com/cart/cart.do?action=view 119.139.235.172 126380183 238
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_5154 e9edbb994bf76bf683aa0ff633306f29 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5143 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://www.taobao.com/2/?tracker_u=10627255 X2Q4KJ36X73Y388Y2J24ZR2874921V35 27.203.254.52 NULL 29
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/k%E6%8A%98%E5%8F%A0%E5%BA%8A/ GQSZPP1T51NVVDHFJ79DERUYZ9BY4YPV http://www.yihaodian.com/ctg/s2/c21055-%E6%B2%99%E5%8F%91/ 182.139.161.242 NULL -1
2017-08-10 13:00:00 http://www.taobao.com/ctg/s2/c6504-%E9%9D%A2%E5%8C%85%E7%94%9C%E5%93%81%E5%88%B8/b5011/ 8QBDUSW4T7DVYNX2V3REBNUYMZMJKS3Z http://www.yihaodian.com/1/?type=2 101.229.209.34 7102251 -10
2017-08-10 13:00:00 http://www.taobao.com/item/9680587_1?smt_b=C0B0A09BECBF1106C70F00C MJAGGMWCC73BEXY88PHK7GKY185ZD4Z7 http://vas.funshion.com/attachment/editor/minisite/track/track.php?r=1&t=b&fs-c-url=http://pub.funshion.com/interface/click?uid=&mac=00306717F4C0&fck=2943F792A9E2075AA0636EEA86125D35&ap=w_da_so_02&ad=3912&mid=&re=http%3A%2F%2Fmc.funshion.com%2Finterface%2Fcc%3Fmcid%3D297%26source%3Dopt-w_da_so_02&reqId=f1798890-c040-11e2-a047-d5df87697805 58.62.185.212 NULL 237
2017-08-10 13:00:00 http://m.taobao.com/productList_1_5135_2 34E89C489B1E708883DD40AD20DB722E http://m.yihaodian.com/getCategoryByRootCategoryId_5135 106.3.103.145 NULL -20
2017-08-10 13:00:00 http://www.taobao.com/item/9680587_1?smt_b=C0B0A09BECBF110A470F00C AMUSZGN1AHC2V5JYU48QEH3QHZJWQVEB 222.35.85.77 NULL -20
2017-08-10 13:00:00 http://www.taobao.com/16/?tracker_u=7520169&type=1 DXWSKTYMMJZ4YZ6V4TWBV1GD2TFF4TMF http://www.yihaodian.com/1/?tracker_u=7520169&type=1 123.150.218.141 NULL -30
2017-08-10 13:00:00 http://m.taobao.com/searchProduct bd5776b0cc527cfdc877c22a03361364 10.4.3.83 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/searchProduct 27ca0c24fa42cf415baa1277f907860f 10.4.4.109 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/searchProduct 069e71e0bf74a381cbbb731cd23c7997 10.4.6.28 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/getProductDetail_6_9475 ae34b1030bcf048b2a7c3970ecd273c0 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5404 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://www.taobao.com/item/1610446_1 9CBDTSZYG3Q4TXW63GJ5DQXA1JTN8P59 http://www.yihaodian.com/S-theme/41352/ 222.71.218.101 NULL -10
2017-08-10 13:00:00 http://m.taobao.com/searchProductsOnly 36d34d07-9723-4a35-8e23-13c7f2d50b1f 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://search.taobao.com/s2/c0-0/k%25E8%25AF%25AD%25E6%2596%2587%25E5%258F%258A%25E8%25A7%25A3%25E9%25A2%2598%25E6%258C%2587%25E5%25AF%25BC/ 4AU8GBUFURH4QTRB9275BUEZ2WDKHPNV 124.65.148.196 132780751 -20
2017-08-10 13:00:00 http://my.taobao.com/order/myOrder.do?chooseType=1 7PUMXW534N7ZXXF1FVVCZD62Y89AYC3E http://www.yihaodian.com/6/?tracker_u=1013304189&uid=2722204&type=3 61.174.53.86 134781570 52
2017-08-10 13:00:00 http://m.taobao.com/getMyYihaodianSessionUser bccf7b99af3af9f644fc54d5b43d5f36 10.4.4.109 126152325 -1000
2017-08-10 13:00:00 http://www.taobao.com/checkoutV3/index.do H7DEMF2ATV4ESF8B664G4V91HYYCT5RF http://www.yihaodian.com/cart/cart.do?action=view 221.234.44.141 133355124 205
2017-08-10 13:00:00 http://www.taobao.com/1/?type=3 U36VWSX1SZ9UN85U4215J3HZ8519Z7CW http://my.yihaodian.com/order/myOrder.do 116.237.203.19 134147646 -10
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_24729 f55598cafba346eb217ff3fbd0de2930 http://m.yihaodian.com/getCategoryByRootCategoryId_1_5148 10.4.6.28 NULL -1000
2017-08-10 13:00:00 http://my.taobao.com/order/myOrder.do Q1MX7VWK4Q9FGR9NC1WUJW9TZRTZSBGX http://my.yihaodian.com/account/displayAccount.do 115.193.195.218 7166668 50
2017-08-10 13:00:00 http://m.taobao.com/viewcart F8Q958Q5V38CWSSJX3JTD19GYEGJWRPZ http://m.yihaodian.com/viewcart 120.84.202.25 NULL 253
2017-08-10 13:00:00 http://m.taobao.com/getCurrentGrouponList_32_-1 CDXP2P2DMK7N1TRXXF1W4QEYMX6UNFMG http://m.yihaodian.com 111.77.135.139 NULL 169
2017-08-10 13:00:00 http://m.taobao.com/getCurrentGrouponList_1_-1 00000000-11e5-8f0c-3863-1db750ef14e8 http://m.yihaodian.com/getCurrentGrouponList_1_-1 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://www.taobao.com/14/?tracker_u=1624169&type=3 WDY57TSE32CFRGMC75QAFWT4MRAEFR5R http://hao.360.cn/?src=360c 120.40.40.27 132915468 150
2017-08-10 13:00:00 http://union.taobao.com/link_make/viewPicInfo.do?imgSize=660x70&truckerU=101542127 PPNNPK35UMQZ2NXC41E9S9Z5D1NWZRR1 http://www.jintoutiao.com/yule.html 222.177.236.226 NULL -40
2017-08-10 13:00:00 http://www.taobao.com/checkoutV3/index.do FEY7QPPAYRH66XEJRVQ1AAX79FMET5RA http://www.yihaodian.com/cart/cart.do?action=view 124.127.121.97 5412541 -20
2017-08-10 13:00:00 http://m.taobao.com/productList_3_28982_2 VT21J473AQ5ADHFXNBKA22J747P9UPGG http://m.yihaodian.com/productList_3_28982_1 125.39.238.103 NULL -30
2017-08-10 13:00:00 http://m.taobao.com/getCurrentGrouponList_1_-1 5722F99769C50485920C3E2DDF23DA1D http://m.yihaodian.com/getCurrentGrouponList_1_-1 61.174.53.74 NULL 52
2017-08-10 13:00:00 http://m.taobao.com/getCurrentGrouponList_33_2 0a39f1504cdb1066224b2f997d3de6cf http://m.yihaodian.com/getCurrentGrouponList_33_32 10.4.6.28 NULL -1000
2017-08-10 13:00:00 http://m.taobao.com/searchProduct 00000000-489a-749a-ffff-ffffb2b4e611 10.4.6.47 NULL -1000
2017-08-10 13:00:00 http://search.taobao.com/s2/c21587-%E7%A7%BB%E5%8A%A8%E7%A1%AC%E7%9B%98/1/?ref=ad.8937_1120623_6 3C4X4HCMTNVSKFRBNGJHUK1T8NEY48YQ http://channel.yihaodian.com/dianqi/1/ 113.106.19.190 114837386 238
2017-08-10 13:00:00 http://m.taobao.com/getCategoryByRootCategoryId_1_22586 ccd8c3515c366e32c0a06fcc19bdccd2 http://m.yihaodian.com/getCategoryByRootCategoryId_1_22466 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getCategoryByRootCategoryId_1_21384 ccd8c3515c366e32c0a06fcc19bdccd2 http://m.yihaodian.com/getCategoryByRootCategoryId_1_21375 10.4.3.83 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/ctg/s2/c5320-%E5%B0%8F%E9%A3%9F/ GH6PNRVGFPX2JCF951VUJ9R2Q3JM3SWJ http://www.yihaodian.com/1/?type=3 58.40.93.96 7225016 -10
2017-08-10 13:00:01 http://m.taobao.com/searchProduct c2006171a85357f5b59bab8d1ce0b7a0 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://union.taobao.com/resourceCenter/viewShoppingWindow.do?shoppingWindowVo.trackerU=105233621&shoppingWindowVo.type=1&shoppingWindowVo.width=280&shoppingWindowVo.height=300&shoppingWindowVo.yhdSearchBrowseHistory=&shoppingWindowVo.ymSearchBrowseHistory=&shoppingWindowVo.provinceId= JPGFCP7HD8K1XBHVR9PXCRXCET8ZK5JR http://union.yihaodian.com/resourceCenter/getUserCookies.do?shoppingWindowVo.trackerU=105233621&shoppingWindowVo.type=1&shoppingWindowVo.width=280&shoppingWindowVo.height=300 119.184.78.246 NULL 180
2017-08-10 13:00:01 http://search.taobao.com/s2/c0-0/k%25E4%25BC%2591%25E9%2597%25B2%25E9%25A3%259F%25E5%2593%2581/ S24WN1JD64PZKRFC19V4DBZCVGPHKDGZ http://search.yihaodian.com/s2/c0-0/k%25E9%2587%2591%25E5%2586%25A0/ 183.241.190.11 NULL -20
2017-08-10 13:00:01 http://m.taobao.com/getCurrentGrouponList?areaId=33&categoryId=2 0a39f1504cdb1066224b2f997d3de6cf 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/mw/search/1/2_%E5%BA%8A%E4%B8%8A%E7%94%A8%E5%93%81 UVCGYRFY9S5C8ZP37B3MY22GQJBCBUXH 114.132.245.203 NULL -20
2017-08-10 13:00:01 http://m.taobao.com/searchProduct f9613f7fdec1b0f4763c7f29614de74f 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/ctg/s2/c21285-%E7%94%B5%E9%A3%8E%E6%89%87-%E7%A9%BA%E8%B0%83%E6%89%87/b930527/a-s1-v0-p1-price-d0-f0-m1-rt0-pid-k/ GMNJ7A9PS7WKKHBGC1G9KJ99CX86AS25 http://www.yihaodian.com/ctg/s2/c21285-%E7%94%B5%E9%A3%8E%E6%89%87-%E7%A9%BA%E8%B0%83%E6%89%87/b930527/a13123-s1-v0-p1-price-d0-f0-m1-rt0-pid-k/ 112.64.21.58 114357970 -10
2017-08-10 13:00:01 http://www.taobao.com/1/?type=2 RUC8AG4BZB7FEZKFUYX5J8YV72UGYNM9 http://www.baidu.com/index.php?tn=maxthon2&ch=3 223.65.141.172 115617510 -1
2017-08-10 13:00:01 http://www.taobao.com/ctg/s2/c28112-%E5%90%8A%E5%B8%A6-%E8%83%8C%E5%BF%83/b/a42793-s1-v0-p1-price-d0-f0-m1-rt0-pid-k/ F6SVZUFMYHMZ979EACPVCJXBJST84WDF http://www.yihaodian.com/ctg/s2/c28112-%E5%90%8A%E5%B8%A6-%E8%83%8C%E5%BF%83/ 113.83.126.43 133468484 247
2017-08-10 13:00:01 http://m.taobao.com/searchProduct 59fa2f771e902c53f350c80fe28d79d1 10.4.4.109 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/cmsPage/show.do?pageId=36902&tracker_u=10568936992&source_id=mz_k1005892_p3yUMZ0_aE2ORU0i7eAlm&provinceId=1 GRB4V77654D4CTN3CPPWTDV5PH9EHH4A http://sohu.ad-plus.cn/event.ng/Type=click&FlightID=201304&TargetID=sohu&Values=ed4a8fd2 183.60.121.141 NULL 238
2017-08-10 13:00:01 http://union.taobao.com/link_make/viewPicInfo.do?imgSize=660x70&truckerU=101542127 1XBU61QQJ1PSXQSZZUSH3JBZKKMHY6HM http://www.jintoutiao.com/shishang.html 114.236.214.67 NULL 45
2017-08-10 13:00:01 http://union.taobao.com/link_make/viewPicInfo.do?imgSize=728x90&truckerU=1026205671 E2SSZT7F7CCGQ67F5YB845ZD42PA9C4H http://www.61diy.com/huanbao/4093.html 58.254.172.54 NULL 257
2017-08-10 13:00:01 http://www.taobao.com/item/106243_5?ref=1_1_51_search.keyword_1 PSNTVTTH8BX2QNA2U23D94JY15YW2WZ1 114.62.144.152 NULL -10
2017-08-10 13:00:01 http://my.taobao.com/giftcard/manageGiftCard.do?type=2 APKJ1C4MGS4QMD2JAB33G77DS6QDX3HN http://my.yihaodian.com/giftcard/manageGiftCard.do?type=1 123.150.217.58 130814660 -30
2017-08-10 13:00:01 http://m.taobao.com/getProductDetail_1_1403199 1aa98d28-057e-4d2a-832d-ff17c25bf858 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getCurrentGrouponList_3_10 1352b800c8eceaeade41cdf8442410a9 http://m.yihaodian.com/getCurrentGrouponList_3_10 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/ctg/s2/c23538-%E6%89%8B%E9%93%BE-%E6%89%8B%E9%95%AF/b/a58157-s1-v0-p1-price-d0-f0-m1-rt0-pid-k/ XTTWXD771NPARQF98RJC59WR91QMNKAE http://www.yihaodian.com/ctg/s2/c23538-%E6%89%8B%E9%93%BE-%E6%89%8B%E9%95%AF/b/a58163-s1-v0-p1-price-d0-f0-m1-rt0-pid-k/ 117.83.162.168 NULL 41
2017-08-10 13:00:01 http://www.taobao.com/review/2432767_1-1-1-1-1.html EFKCD35WPXMHX3J9DVX99BHBFAJQBT1J http://www.yihaodian.com/promotion/detail.do?promotionId=35434&promotionLevelId=33548&merchantId=9 125.82.163.230 5625808 -40
2017-08-10 13:00:01 http://www.taobao.com/1/?type=3&tracker_u=1013241403 5A4YQBN1H1B7JMFWZUZ7K8YJ3XB6F86Y 112.239.37.164 NULL 173
2017-08-10 13:00:01 http://m.taobao.com/getProductDetail_1_1111234 ce7808c6a9989279f915c2de0542d56f http://m.yihaodian.com/getProductDetail_1_2224414 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getSessionCart bf37387fce164e708fe619343131c999 10.4.4.109 125183945 -1000
2017-08-10 13:00:01 http://m.taobao.com/searchProduct f0ba370e879e4cbba7aca8d311aa571b 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/ctg/s2/c6638-%E6%B4%97%E8%A1%A3%E6%B6%B2/b1459/ AQHM919NJ2R7P5AUFEN9RWMUV8VM5YNG http://www.yihaodian.com/ctg/s2/c6638-%E6%B4%97%E8%A1%A3%E6%B6%B2?ref=1_1_5_pms.search_bottom_relcate 117.9.234.219 NULL -30
2017-08-10 13:00:01 http://search.taobao.com/s2/c0-0/b/a-s1-v0-p1-price-d0-f06-m1-rt0-pid-k%E6%B4%97%E6%B4%81%E7%B2%BE/ PD69ZGHGBGDMW9WH486S6CKJZMKNH53E http://search.yihaodian.com/s2/c0-0/k%25E6%25B4%2597%25E6%25B4%2581%25E7%25B2%25BE/ 58.33.36.9 50213119 -10
2017-08-10 13:00:01 http://m.taobao.com/getCategoryByRootCategoryId_1_27939 00000000-63da-09a0-0033-c5870033c587 http://m.yihaodian.com/getCategoryByRootCategoryId_1_8704 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/3/?tracker_u=10535027967&type=1 NU34SKACKS4WSBRSCZKH3YZDKHZRK9Z1 http://123.sogou.com/ 111.164.53.248 NULL -30
2017-08-10 13:00:01 http://m.taobao.com/getCurrentGrouponList_1_-1 00000000-335f-225f-5663-c869159fee61 http://m.yihaodian.com/getCurrentGrouponList_1_-1 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getCategoryByRootCategoryId_1_22860 47f1736a82eb4c3b93253c2c9a149bab http://m.yihaodian.com/getCategoryByRootCategoryId_1_22887 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getCurrentGrouponList_33_2 A396E80ED32E2BA18B013C1C87313596 http://m.yihaodian.com/getCurrentGrouponList_33_32 123.130.144.145 NULL 175
2017-08-10 13:00:01 http://union.taobao.com/resourceCenter/viewShoppingWindow.do?shoppingWindowVo.trackerU=105233621&shoppingWindowVo.type=2&shoppingWindowVo.width=960&shoppingWindowVo.height=90&shoppingWindowVo.yhdSearchBrowseHistory=&shoppingWindowVo.ymSearchBrowseHistory=&shoppingWindowVo.provinceId= M6MPYNVRZQ8EDKVP8E97TPKXENNS77KR http://union.yihaodian.com/resourceCenter/getUserCookies.do?shoppingWindowVo.trackerU=105233621&shoppingWindowVo.type=2&shoppingWindowVo.width=960&shoppingWindowVo.height=90 221.233.194.76 NULL 216
2017-08-10 13:00:01 http://m.taobao.com/getRockResultV3 4cc64318a80ceb8f69fb7879df9ae5a7 10.4.6.28 121673579 -1000
2017-08-10 13:00:01 http://m.taobao.com/getCurrentGrouponList_1_-1 4de8f4cf9139de301e0c49881d046861 http://m.yihaodian.com/getCurrentGrouponList_1_-1 10.4.6.47 NULL -1000
2017-08-10 13:00:01 http://www.taobao.com/item/69958_1?ref=1_1_52_search.ctg_1 MNAVNEJZTKX9YNUGH256SFAKYTV2GJAB http://www.yihaodian.com/ctg/s2/c28661-%E8%A1%A3%E7%89%A9%E6%8A%A4%E7%90%86/1/ 58.246.161.119 NULL -10
2017-08-10 13:00:01 http://www.taobao.com/cart/cart.do?action=view 3TBBTUBB22TTPQ4Z8SM646A5HM9JEDG1 http://www.yihaodian.com/item/2084456_3 60.26.140.39 117932573 -30
2017-08-10 13:00:01 http://m.taobao.com/productList_5_22836_3 B266EBA152507025CE84CE762DED3665 http://m.yihaodian.com/getCategoryByRootCategoryId_22836 114.217.234.10 NULL 41
2017-08-10 13:00:01 http://m.taobao.com/unionLogin e370ef8f-4dcf-40d3-81ba-7032115b3363 10.4.4.109 127841899 -1000
2017-08-10 13:00:01 http://search.taobao.com/s2/c19932-%E4%BC%91%E9%97%B2%E9%9E%8B/k%E9%9E%8B%E5%AD%90%E7%94%B7%E9%9E%8B/1/ FBA31K25MUWKEJK5TUQSJS2WM5VNWFSH http://search.yihaodian.com/s2/c19933-%E5%95%86%E5%8A%A1%E9%9E%8B/k%E9%9E%8B%E5%AD%90%E7%94%B7%E9%9E%8B/1/ 114.80.243.201 134990430 -10
2017-08-10 13:00:01 http://m.taobao.com e11b731e97934a54677bd303a6e72af9 http://m.yihaodian.com/getCategoryByRootCategoryId_1_0 10.4.3.83 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getHomeHotPointListNew_1 9d2f791d8bd002cc64c12b9e094f38c9 http://m.yihaodian.com 10.4.6.28 NULL -1000
2017-08-10 13:00:01 http://m.taobao.com/getSessionCart d721ec9aa6ea2287f45017225877adef 10.4.3.83 133461980 -1000
2017-08-10 13:00:01 http://m.taobao.com/searchProductsOnly 00000000-6ed9-f842-95c5-95f17a34d8a1 10.4.6.47 NULL -1000
2.TextFile
(1)创建表,存储数据格式为TEXTFILE
create table log_text ( track_time string, url string, session_id string, referer string, ip string, end_user_id string, city_id string ) row format delimited fields terminated by '\t' stored as textfile ; |
(2)向表中加载数据
hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text ; |
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_text; |
18.1 M /user/hive/warehouse/log_text/log.data
3.ORC
(1)创建表,存储数据格式为ORC
create table log_orc( track_time string, url string, session_id string, referer string, ip string, end_user_id string, city_id string ) row format delimited fields terminated by '\t' stored as orc ; |
(2)向表中加载数据
hive (default)> insert into table log_orc select * from log_text ; |
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_orc/ ; |
2.8 M /user/hive/warehouse/log_orc/000000_0
4.Parquet
(1)创建表,存储数据格式为parquet
create table log_parquet( track_time string, url string, session_id string, referer string, ip string, end_user_id string, city_id string ) row format delimited fields terminated by '\t' stored as parquet ; |
(2)向表中加载数据
hive (default)> insert into table log_parquet select * from log_text ; |
(3)查看表中数据大小
hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/ ; |
13.1 M /user/hive/warehouse/log_parquet/000000_0
存储文件的压缩比总结:
ORC > Parquet > textFile
存储文件的查询速度测试:
1.TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
Time taken: 21.08 seconds, Fetched: 1 row(s)
Time taken: 19.298 seconds, Fetched: 1 row(s)
2.ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
Time taken: 22.667 seconds, Fetched: 1 row(s)
Time taken: 18.36 seconds, Fetched: 1 row(s)
3.Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)
Time taken: 21.074 seconds, Fetched: 1 row(s)
Time taken: 18.384 seconds, Fetched: 1 row(s)
存储文件的查询速度总结:查询速度相近。
存储和压缩结合
1 修改Hadoop集群具有Snappy压缩方式
1.查看hadoop checknative命令使用
[atguigu@hadoop104 hadoop-2.7.2]$ hadoop
checknative [-a|-h] check native hadoop and compression libraries availability
2.查看hadoop支持的压缩方式
[atguigu@hadoop104 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: false
lz4: true revision:99
bzip2: false
3.将编译好的支持Snappy压缩的hadoop-2.7.2.tar.gz包导入到hadoop102的/opt/software中
4.解压hadoop-2.7.2.tar.gz到当前路径
[atguigu@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz
5.进入到/opt/software/hadoop-2.7.2/lib/native路径可以看到支持Snappy压缩的动态链接库
(此部分为Snappy源码编译后的hadoop-2.7.2中查找,并使用)
[atguigu@hadoop102 native]$ pwd
/opt/software/hadoop-2.7.2/lib/native
[atguigu@hadoop102 native]$ ll
-rw-r--r--. 1 atguigu atguigu 472950 9月 1 10:19 libsnappy.a
-rwxr-xr-x. 1 atguigu atguigu 955 9月 1 10:19 libsnappy.la
lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x. 1 atguigu atguigu 228177 9月 1 10:19 libsnappy.so.1.3.0
6.拷贝/opt/software/hadoop-2.7.2/lib/native里面的所有内容到开发集群的/opt/module/hadoop-2.7.2/lib/native路径上
[atguigu@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/
7.分发集群
[atguigu@hadoop102 lib]$ xsync native/
以下是xsync脚本
#!/bin/bash
#1 获取输入参数个数,如果没有参数,直接退出
pcount=$#
if ((pcount==0)); then
echo no args;
exit;
fi
#2 获取文件名称
p1=$1
fname=`basename $p1`
echo fname=$fname
#3 获取上级目录到绝对路径
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
#4 获取当前用户名称
user=`whoami`
#5 循环
for((host=103; host<105; host++)); do
echo ------------------- hadoop$host --------------
rsync -av $pdir/$fname $user@hadoop$host:$pdir
done
8.再次查看hadoop支持的压缩类型
[atguigu@hadoop102 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: false
9.重新启动hadoop集群和hive
测试存储和压缩
官网:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
ORC存储方式的压缩:
Key | Default | Notes |
orc.compress | ZLIB | high level compression (one of NONE, ZLIB, SNAPPY) |
orc.compress.size | 262,144 | number of bytes in each compression chunk |
orc.stripe.size | 268,435,456 | number of bytes in each stripe |
orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1000) |
orc.create.index | true | whether to create row indexes |
orc.bloom.filter.columns | "" | comma separated list of column names for which bloom filter should be created |
orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must >0.0 and <1.0) |
注意:所有关于ORCFile的参数都是在HQL语句的TBLPROPERTIES字段里面出现
1.创建一个非压缩的的ORC存储方式
(1)建表语句
create table log_orc_none( track_time string, url string, session_id string, referer string, ip string, end_user_id string, city_id string ) row format delimited fields terminated by '\t' stored as orc tblproperties ("orc.compress"="NONE"); |
(2)插入数据
hive (default)> insert into table log_orc_none select * from log_text ; |
(3)查看插入后数据
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/ ; |
7.7 M /user/hive/warehouse/log_orc_none/000000_0
2.创建一个SNAPPY压缩的ORC存储方式
(1)建表语句
create table log_orc_snappy( track_time string, url string, session_id string, referer string, ip string, end_user_id string, city_id string ) row format delimited fields terminated by '\t' stored as orc tblproperties ("orc.compress"="SNAPPY"); |
(2)插入数据
hive (default)> insert into table log_orc_snappy select * from log_text ; |
(3)查看插入后数据
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/ ; |
3.8 M /user/hive/warehouse/log_orc_snappy/000000_0
3.上一节中默认创建的ORC存储方式,导入数据后的大小为
2.8 M /user/hive/warehouse/log_orc/000000_0
比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩,ZLIB采用的是deflate压缩算法。比snappy压缩的小。
4.存储方式和压缩总结
在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy,lzo。