官网地址
https://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html
When the APPX_COUNT_DISTINCT
query option is set to TRUE
, Impala implicitly converts COUNT(DISTINCT)
operations to the NDV()
function calls. The resulting count is approximate rather than precise. Enable the query option when a tolerable amount of error is acceptable in order to obtain faster query results than with a COUNT (DISTINCT)
queries.
Type: Boolean; recognized values are 1 and 0, or true
and false
; any other value interpreted as false
Default: false
(shown as 0 in output of SET
statement)
简单的来说就是设置APPX_COUNT_DISTINCT=1后 也就是true 会在select count(distinct column)的时候查询的更快但是查询结果不一定准确
实战
select count(distinct company_id) from ia_fdw_b_profile_product_info--4.97s 27664540
set APPX_COUNT_DISTINCT=1
select count(distinct company_id) from ia_fdw_b_profile_product_info--1.38s 29165564
这里可以看到设置之后查询速度快了几倍,但是误差也不小,所以慎用。。。
NDV()
这里提到了一个ndv函数,但是我在show functions的时候没看到,后面凑巧在官网看到了。
也学习一番
https://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv
简介
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col)
, the "number of distinct values". It is much faster than the combination of COUNT
and DISTINCT
, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
一种聚合函数,返回的值与COUNT(DISTINCT col)差不多
,对值去重。它比count(distinct )快的多,并使用固定的内存量,因此对于具有高基数的列,内存占用较少。
语法:
NDV([DISTINCT | ALL] expression [,scale])
说实话这个表达式我还没看懂。。 举例使用就是 select ndv(colum) from table =select count(distinct column) from table。这个scale是一个精度,但是根本写了就报错。
Note: The optional argument scale
must be an integer and can be in the range from 1 to 10 and maps to a precision used by the HyperLogLog (HLL) algorithm with the following mapping formula:
precision = scale + 8
Therefore a scale of 1 is mapped to a precision of 9 and a scale of 10 is mapped to a precision of 18.
Without the optional argument, the precision which determines the total number of different estimators in the HLL algorithm will be still 10.
A large precision value generally produces a better estimation with less error than a small precision value. This is due to the extra number of estimators involved. The expense is at the need of extra memory. For a given precision p, the amount of memory used by the HLL algorithm is in the order of 2^p bytes.
When provided a scale of 10 against a total of 22 distinct data sets loaded into external Impala tables, the error will be computed as abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>
The scale of 10, mapped to the precision of 18, yielded the worst estimation error at 0.42% (for one set of 10 million integers), and average error no more than 0.17%. This was at the cost of 256Kb of memory for the internal data structure per evaluation of the HLL algorithm.
官方案例
Examples:
The following example queries a billion-row table to illustrate the relative performance of COUNT(DISTINCT)
and NDV()
. It shows how COUNT(DISTINCT)
gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient. The NDV()
function gives an approximate result but is much faster.
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select cast(ndv(col1) as bigint) as col1 from sample_data;
+----------+
| col1 |
+----------+
| 139017 |
+----------+
Fetched 1 row(s) in 8.91s
The following example shows how you can code multiple NDV()
calls in a single query, to easily learn which columns have substantially more or fewer distinct values. This technique is faster than running a sequence of queries with COUNT(DISTINCT)
calls.
select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
from sample_data;
+----------+-----------+------------+-----------+
| col1 | col2 | col3 | col4 |
+----------+-----------+------------+-----------+
| 139017 | 282 | 46 | 145636240 |
+----------+-----------+------------+-----------+
Fetched 1 row(s) in 34.97s
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select count(distinct col2) from sample_data;
+----------------------+
| count(distinct col2) |
+----------------------+
| 278 |
+----------------------+
Fetched 1 row(s) in 20.09s
select count(distinct col3) from sample_data;
+-----------------------+
| count(distinct col3) |
+-----------------------+
| 46 |
+-----------------------+
Fetched 1 row(s) in 19.12s
select count(distinct col4) from sample_data;
+----------------------+
| count(distinct col4) |
+----------------------+
| 147135880 |
+----------------------+
Fetched 1 row(s) in 266.95s
其实我到这里也没看太懂这个函数有什么用?唯一的用处就是算去重值的时候会快一点。。。