Data Cleaning with Parallel Multithreading

花阴偷移

Published 2024-03-21 18:09:00


Article link: https://blog.csdn.net/weixin_43394129/article/details/137277807


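I. Overview

While building a data-cleaning job, I had an Elasticsearch (ES) dataset of about 6 million records, each with several dozen child objects, and I needed the deduplicated set of those child objects. The ES data was pulled in batches, 535 batches in total. I started with a List; the key code is: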

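            var specListAll = new List<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract the child collection of every record into list
                // Then deduplicate and add to the new collection
                foreach (var specDesc in list)
                {
                    // Count(...) scans all of specListAll for every item
                    if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                        specListAll.Add(specDesc);
                }
            }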

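Measured with a timer, the first batch took 3 minutes 29 seconds and left 15,542 deduplicated items in specListAll. Extrapolated to all 535 batches, the estimated total run time is about 31.2 hours.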
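Part of the cost here is that Count(...) walks the whole of specListAll for every incoming item, so each check gets slower as the list grows. A minimal sketch of a keyed alternative follows, assuming NameJoinValue alone identifies a duplicate. The SpecInfo class below is a hypothetical stand-in (only NameJoinValue is known from the article), and HashSetDedup.AddDistinct is an illustrative helper, not the author's code.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the article's SpecInfo; only NameJoinValue comes from the source.
public sealed class SpecInfo
{
    public string NameJoinValue { get; set; } = "";
}

public static class HashSetDedup
{
    // Appends to result every item of batch whose key has not been seen yet.
    // HashSet<string>.Add returns false when the key is already present,
    // and the lookup is O(1) on average instead of a full scan of the list.
    public static void AddDistinct(
        IEnumerable<SpecInfo> batch,
        HashSet<string> seenKeys,
        List<SpecInfo> result)
    {
        foreach (var spec in batch)
        {
            if (seenKeys.Add(spec.NameJoinValue))
                result.Add(spec);
        }
    }
}
```

In the batch loop, seenKeys and result would be created once before the loop and passed to AddDistinct for each batch's list, so keys seen in earlier batches are still remembered.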

Next, I used Parallel multithreading to deduplicate and add to the new collection. The key code is:

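            var specListAll = new ConcurrentBag<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract the child collection of every record into list
                // Then deduplicate and add to the new collection
                Parallel.ForEach(list, specDesc =>
                {
                    // Note: the Count check and the Add are two separate steps,
                    // so together they are not atomic across threads
                    if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                        specListAll.Add(specDesc);
                });
            }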

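Measured with a timer, the first batch took 2 minutes 19 seconds and left 15,542 deduplicated items in specListAll. Extrapolated to all 535 batches, the estimated total run time is about 20.8 hours.

Finally, looking at the CPU: with Parallel multithreading, CPU usage was about 30% higher.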
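In the Parallel version, the check with Count(...) and the following Add are separate steps, so two threads handling the same NameJoinValue could in principle both see a count of zero and both add. One way to make "add only if absent" a single atomic step is ConcurrentDictionary.TryAdd. The sketch below assumes NameJoinValue is the deduplication key; SpecInfo is a hypothetical stand-in and ConcurrentDedup is an illustrative name, not part of the original code.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the article's SpecInfo; only NameJoinValue comes from the source.
public sealed class SpecInfo
{
    public string NameJoinValue { get; set; } = "";
}

public static class ConcurrentDedup
{
    // Adds each item of batch to distinct, keyed by NameJoinValue.
    // TryAdd is atomic: for a given key exactly one call succeeds,
    // and later calls with the same key return false and change nothing.
    public static void AddDistinct(
        IEnumerable<SpecInfo> batch,
        ConcurrentDictionary<string, SpecInfo> distinct)
    {
        Parallel.ForEach(batch, spec =>
        {
            distinct.TryAdd(spec.NameJoinValue, spec);
        });
    }
}
```

After the last batch, distinct.Values holds one SpecInfo per key. Which of several duplicates is kept depends on thread timing, which only matters if duplicates differ in fields other than NameJoinValue.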

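II. Improvement

During the cleaning run, I found that deduplicating with specListAll.Count was very time-consuming. After the change below, the whole cleaning finished in a little over 2 hours. The code is: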

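            var specListAll = new ConcurrentBag<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract the child collection of every record into list
                // Add everything without checking for duplicates
                Parallel.ForEach(list, specDesc =>
                {
                    //if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                        specListAll.Add(specDesc);
                });
            }
            // Deduplicate once at the end.
            // Note: Distinct() relies on SpecInfo's Equals/GetHashCode, so it dedupes
            // by NameJoinValue only if SpecInfo defines equality that way
            var specListAll2 = specListAll.Distinct().ToList();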
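Distinct() without arguments uses the element type's default equality, which for a class that does not override Equals and GetHashCode means reference equality. The article does not show how SpecInfo defines equality, so the following is a hedged sketch of making the key explicit by passing an IEqualityComparer to Distinct. SpecInfo here is a hypothetical stand-in, and NameJoinValueComparer and SpecDistinct are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's SpecInfo; it does NOT override Equals/GetHashCode.
public sealed class SpecInfo
{
    public string NameJoinValue { get; set; } = "";
}

// Treats two SpecInfo objects as equal when their NameJoinValue strings match (ordinal).
public sealed class NameJoinValueComparer : IEqualityComparer<SpecInfo>
{
    public bool Equals(SpecInfo x, SpecInfo y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(x.NameJoinValue, y.NameJoinValue, StringComparison.Ordinal);
    }

    public int GetHashCode(SpecInfo obj)
    {
        return StringComparer.Ordinal.GetHashCode(obj.NameJoinValue ?? "");
    }
}

public static class SpecDistinct
{
    // Returns one item per distinct NameJoinValue.
    public static List<SpecInfo> ByNameJoinValue(IEnumerable<SpecInfo> items)
    {
        return items.Distinct(new NameJoinValueComparer()).ToList();
    }
}
```

With this, the final step would read SpecDistinct.ByNameJoinValue(specListAll). On .NET 6 or later, DistinctBy(s => s.NameJoinValue) is a shorter way to express the same idea.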
