Hive 数据抽样

最新推荐文章于 2024-01-19 13:06:40 发布

周二Tyr

最新推荐文章于 2024-01-19 13:06:40 发布

阅读量1.9k

点赞数

分类专栏：大数据/Hive 文章标签： hive 抽样 sample rand 随机抽样

本文链接：https://blog.csdn.net/zhoubangtao/article/details/53517576

版权

大数据/Hive 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

Hive 抽样语法

Hive使用tablesample语法对表进行抽样，包含两种抽样方式桶抽样和块抽样

桶抽样

桶抽样语法：tablesample (bucket x out of y [on colname])

tablesample可以用在任何表上，不单单是分区表
colname用于对表中的行抽样， colname可以是任何非分区字段或者用rand()来表示用整行而非一个单独字段

整个表会以colname打散到y个桶里（编号从1到y），属于x桶的行会被抽出来
例如：

SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

source表会按照整行分成32个桶，第三个桶的数据会抽样出来，s是表的别。

块抽样

Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to pick up at least n% data size (notice it doesn’t necessarily mean number of rows) as inputs. Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB, you get 256MB of data.
In the following example the input size 0.1% or more will be used for the query.

随机抽样

分类随机抽样

周二Tyr

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive 数据抽样

Hive 抽样语法桶抽样块抽样随机抽样分类随机抽样快捷键Markdown及扩展表格定义列表代码块脚注目录数学公式UML 图离线写博客浏览器兼容Hive 抽样语法Hive使用TABLESAMPLE语法对表进行抽样桶抽样桶抽样语法：tablesample (bucket x out of y [on colname]) tablesample可以用在任何表上，不单
复制链接

扫一扫