Hive 抽样语法
Hive使用tablesample语法对表进行抽样,包含两种抽样方式桶抽样和块抽样
桶抽样
桶抽样语法:tablesample (bucket x out of y [on colname])
- tablesample可以用在任何表上,不单单是分区表
- colname用于对表中的行抽样, colname可以是任何非分区字段或者用rand()来表示用整行而非一个单独字段
整个表会以colname打散到y个桶里(编号从1到y),属于x桶的行会被抽出来
例如:
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
source表会按照整行分成32个桶,第三个桶的数据会抽样出来,s是表的别。
块抽样
Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to pick up at least n% data size (notice it doesn’t necessarily mean number of rows) as inputs. Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB, you get 256MB of data.
In the following example the input size 0.1% or more will be used for the query.