Tidb 跨分区使用count distinct if 统计不准（有bug）

最新推荐文章于 2024-07-31 19:04:13 发布

lvqinglou

最新推荐文章于 2024-07-31 19:04:13 发布

阅读量708

点赞数

文章标签： tidb bug mysql

本文链接：https://blog.csdn.net/lvqinglou/article/details/126878850

版权

本文讲述了在使用 TiDB 数据库时遇到的一个关于分区表统计不准确的问题。当尝试通过用户ID和事件类型进行去重计数时，原始SQL在全表查询时返回错误结果，但在单分区查询时正确。最终通过修改SQL，用0替换NULL并做差值计算得到了准确结果。问题涉及到分布式数据库的统计特性和SQL优化。

摘要由CSDN通过智能技术生成

我们公司使用了tidb，然后我就发现了这东西有点不靠谱，用新技术果然要付出代价。不要问我为什么用，反正很蛋疼

正文

如标题所示，假设一张表结构如下，用日期进行分区

CREATE TABLE `table_xiucai` (
  `user_id` bigint(20) NOT NULL DEFAULT '0' COMMENT '用户id',
  `event_type` varchar(100) COLLATE utf8_general_ci NOT NULL COMMENT '事件类型',
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
  `updated_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',
  `par` date NOT NULL COMMENT '分区键'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci 
PARTITION BY RANGE ( TO_DAYS(`par`) ) (
  PARTITION `20211215` VALUES LESS THAN (738504)
)

表中用户id默认值为0（实际表中会存在为0的记录）。

业务需求

需要统计不同事件类型的uv数，也就是根据user_id进行去重，按照我们最简单的mysql写法如下

select count(distinct(if(event_type = 'xx1' and user_id > 0, user_id, null))) as count1,
	count(distinct(if(event_type = 'xx2' and user_id > 0, user_id, null))) as count2
from table_xiucai 
where par >= 20210101 and par <= 20220101

然后把以上sql拿到tidb去运行，结果返回的结果是不准的，非常的坑，但是我自测的时候用的是单个分区，如下

select count(distinct(if(event_type = 'xx1' and user_id > 0, user_id, null))) as count1,
	count(distinct(if(event_type = 'xx2' and user_id > 0, user_id, null))) as count2
from table_xiucai 
where par = 20210101

这样统计出来的结果是准确的，一万个操蛋。最终没有办法改成了以下sql才得到正确结果

select count(distinct(if(event_type = 'xx1' and user_id > 0, user_id, 0)))  - count(distinct if(event_type = 'xx1' and user_id = 0, 1, null)) as count1,
	count(distinct(if(event_type = 'xx2' and user_id > 0, user_id, 0)))  - count(distinct if(event_type = 'xx2' and user_id = 0, 1, null)) as count2  
from table_xiucai 
where par = 20210101