[18] Common Hive Built-in Functions: Aggregate Functions

Aggregate Functions

Return Type

Name(Signature)

Description

BIGINT

count(*), count(expr), count(DISTINCT expr[, expr...])

count(*) - Returns the total number of retrieved rows, including rows containing NULL values.

count(expr) - Returns the number of rows for which the supplied expression is non-NULL.

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite.
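The three forms of count differ only in how they treat NULLs and duplicates. The following is a minimal Python sketch of those semantics, not Hive code: the sample rows and the helper names (count_star, count_expr, count_distinct) are made up for illustration, and None stands in for SQL NULL.

```python
# Each row is a tuple of column values; None stands in for SQL NULL.
rows = [("a", 1), ("b", None), ("a", 1), (None, 2), ("c", 3)]

def count_star(rows):
    # count(*): every retrieved row counts, including rows containing NULLs.
    return len(rows)

def count_expr(rows, col):
    # count(expr): only rows where the expression is non-NULL.
    return sum(1 for r in rows if r[col] is not None)

def count_distinct(rows, *cols):
    # count(DISTINCT e1[, e2...]): unique combinations in which
    # every supplied expression is non-NULL.
    seen = {
        tuple(r[c] for c in cols)
        for r in rows
        if all(r[c] is not None for c in cols)
    }
    return len(seen)

print(count_star(rows))            # 5
print(count_expr(rows, 1))         # 4
print(count_distinct(rows, 0))     # 3
print(count_distinct(rows, 0, 1))  # 2
```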

DOUBLE

sum(col), sum(DISTINCT col)

Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.

DOUBLE

avg(col), avg(DISTINCT col)

Returns the average of the elements in the group or the average of the distinct values of the column in the group.
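To see how the DISTINCT variants of sum and avg differ from the plain ones, here is a small Python sketch of the arithmetic. The sample values are invented, None stands in for SQL NULL, and skipping NULLs is the usual SQL aggregate behavior rather than something stated in the table above.

```python
values = [10, 20, 20, None, 40]  # None stands in for SQL NULL

# Aggregates skip NULL inputs (assumed here, per usual SQL semantics).
non_null = [v for v in values if v is not None]
distinct = set(non_null)

sum_all = sum(non_null)                       # sum(col)
sum_distinct = sum(distinct)                  # sum(DISTINCT col)
avg_all = sum(non_null) / len(non_null)       # avg(col)
avg_distinct = sum(distinct) / len(distinct)  # avg(DISTINCT col)

print(sum_all, sum_distinct)  # 90 70
print(avg_all)                # 22.5
print(avg_distinct)
```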

DOUBLE

min(col)

Returns the minimum of the column in the group.

DOUBLE

max(col)

Returns the maximum value of the column in the group.

DOUBLE

variance(col), var_pop(col)

Returns the population variance of a numeric column in the group.

DOUBLE

var_samp(col)

Returns the unbiased sample variance of a numeric column in the group.

DOUBLE

stddev_pop(col)

Returns the population standard deviation of a numeric column in the group.

DOUBLE

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group.
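The population and sample variants differ only in the divisor: n for the population forms, n - 1 for the sample forms. As a rough illustration, Python's standard statistics module offers the same four quantities. The mapping to the Hive function names in the comments is an analogy, and the data is invented.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(data)   # like variance(col) / var_pop(col): divide by n
var_samp = statistics.variance(data)   # like var_samp(col): divide by n - 1
stddev_pop = statistics.pstdev(data)   # like stddev_pop(col)
stddev_samp = statistics.stdev(data)   # like stddev_samp(col)

print(var_pop)     # 4.0
print(stddev_pop)  # 2.0
print(var_samp, stddev_samp)
```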

DOUBLE

covar_pop(col1, col2)

Returns the population covariance of a pair of numeric columns in the group.

DOUBLE

covar_samp(col1, col2)

Returns the sample covariance of a pair of numeric columns in the group.

DOUBLE

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
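The relationship between the two covariances and the Pearson coefficient can be sketched in a few lines of Python. This only illustrates the textbook formulas and is not Hive's implementation. The function names mirror the Hive names for readability, and the sample data is invented.

```python
import math

def covar_pop(xs, ys):
    # Population covariance: mean of the products of deviations (divide by n).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    # Sample covariance: same sum, divided by n - 1.
    n = len(xs)
    return covar_pop(xs, ys) * n / (n - 1)

def corr(xs, ys):
    # Pearson coefficient: covariance scaled by both standard deviations.
    return covar_pop(xs, ys) / math.sqrt(covar_pop(xs, xs) * covar_pop(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in xs

print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))
print(corr(xs, ys))        # 1.0
```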

DOUBLE

percentile(BIGINT col, p)

Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

array<double>

percentile(BIGINT col, array(p1 [, p2]...))

Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). Each pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
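As a rough picture of what an exact percentile over integer values looks like, the Python sketch below sorts the values and interpolates linearly between the two neighboring ranks. The interpolation rule (position = p * (n - 1)) is an assumption made for this illustration and is not taken from the description above. The helper names are hypothetical.

```python
import math

def percentile(values, p):
    # Exact p-th percentile with linear interpolation between ranks
    # (the interpolation rule is an assumption of this sketch).
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:
        return float(ordered[lo])
    return (hi - pos) * ordered[lo] + (pos - lo) * ordered[hi]

def percentiles(values, ps):
    # Array form: one result per requested p, in the same order.
    return [percentile(values, p) for p in ps]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(percentile(data, 0.5))                # 5.5
print(percentiles(data, [0.0, 0.25, 1.0]))  # [1.0, 3.25, 10.0]
```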

DOUBLE

percentile_approx(DOUBLE col, p [, B])

Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.

array<double>

percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])

Same as above, but accepts and returns an array of percentile values instead of a single one.

double

regr_avgx(independent, dependent)

Equivalent to avg(dependent). As of Hive 2.2.0.

double

regr_avgy(independent, dependent)

Equivalent to avg(independent). As of Hive 2.2.0.

double

regr_count(independent, dependent)

Returns the number of non-null pairs used to fit the linear regression line. As of Hive 2.2.0.

double

regr_intercept(independent, dependent)

Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b. As of Hive 2.2.0.

double

regr_r2(independent, dependent)

Returns the coefficient of determination for the regression. As of Hive 2.2.0.

double

regr_slope(independent, dependent)

Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b. As of Hive 2.2.0.
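The slope a and intercept b in dependent = a * independent + b are the ordinary least-squares estimates. The Python sketch below computes them from (independent, dependent) pairs, dropping any pair that contains a NULL, in the spirit of regr_count. The fit_line helper and the data are hypothetical, and the sketch does not try to model the argument order of the Hive functions.

```python
def fit_line(pairs):
    # pairs: iterable of (independent, dependent); None stands in for SQL NULL.
    clean = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(clean)  # number of non-null pairs used for the fit
    mx = sum(x for x, _ in clean) / n
    my = sum(y for _, y in clean) / n
    sxx = sum((x - mx) ** 2 for x, _ in clean)
    sxy = sum((x - mx) * (y - my) for x, y in clean)
    a = sxy / sxx    # slope
    b = my - a * mx  # y-intercept
    return a, b, n

# Points on the line dependent = 2 * independent + 1, plus one incomplete pair.
pairs = [(1.0, 3.0), (2.0, 5.0), (None, 6.0), (3.0, 7.0), (4.0, 9.0)]

a, b, n = fit_line(pairs)
print(a, b, n)  # 2.0 1.0 4
```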

double

regr_sxx(independent, dependent)

Equivalent to regr_count(independent, dependent) * var_pop(dependent). As of Hive 2.2.0.

double

regr_sxy(independent, dependent)

Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). As of Hive 2.2.0.

double

regr_syy(independent, dependent)

Equivalent to regr_count(independent, dependent) * var_pop(independent). As of Hive 2.2.0.

array<struct {'x','y'}>

histogram_numeric(col, b)

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.

array

collect_set(col)

Returns a set of objects with duplicate elements eliminated.

array

collect_list(col)

Returns a list of objects with duplicates. (As of Hive 0.13.0.)
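The difference between collect_set and collect_list is simply whether duplicates survive. A minimal Python analogy with invented values follows. The element order shown is a property of this sketch only and should not be relied on in Hive.

```python
values = ["a", "b", "a", "c", "b"]

# collect_list(col): every value is kept, duplicates included.
collected_list = list(values)

# collect_set(col): duplicates eliminated (first-seen order is kept here
# only because of how this sketch is written).
collected_set = list(dict.fromkeys(values))

print(collected_list)  # ['a', 'b', 'a', 'c', 'b']
print(collected_set)   # ['a', 'b', 'c']
```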

INTEGER

ntile(INTEGER x)

Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.)
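A small Python sketch can make the bucket assignment concrete. It assumes the common SQL convention that, when the rows do not divide evenly, the leftover rows go to the lowest-numbered buckets. That tie-breaking rule is an assumption here and is not stated in the description above. The ntile_buckets helper is hypothetical.

```python
def ntile_buckets(num_rows, x):
    # Bucket number (1..x) for each row of an already ordered partition.
    # Assumption: the first (num_rows % x) buckets receive one extra row.
    base, extra = divmod(num_rows, x)
    buckets = []
    for bucket in range(1, x + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets

# 10 ordered rows split into quartiles.
print(ntile_buckets(10, 4))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```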
