Deduplicating data in txt files with millions of lines

Latest recommended article published on 2025-04-17 11:53:41

sophiemantela

Category: c# study notes. Tags: c#

Article link: https://blog.csdn.net/sophiemantela/article/details/105075864

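Requirement 1: compare two files, A and B, where A is a subset of B, and find the set difference between them (the lines of B that do not appear in A).
Read the data of each file into its own HashSet, then take the set difference. The main code is as follows: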

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace CompareFile
{
    class Program
    {

        static void Main(string[] args)
        {
   

            // Unique lines read from full.txt
            HashSet<string> fullLst = new HashSet<string>();

            using (StreamReader sr = new StreamReader("full.txt"))
            {

                string line = "";
                string last_data = "";
                int idx = 0;
                while ((line = sr.ReadLine()) != null)
                {
                    idx++;
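                    // Skip a line identical to the previous one. HashSet.Add would
                    // ignore the duplicate anyway, so this check is only a shortcut.
                    if (line != last_data)
                        fullLst.Add(line);
                    last_data = line;

                    // Alternatively:
                    //if (!fullLst.Contains(line))
                    //    fullLst.Add(line);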

                    if (idx % 10000 == 0)
                        Console.WriteLine(idx);
                }

            }
            Console.WriteLine("Finished reading full.txt");
            HashSet<string> compareLst = new HashSet<string>();
            using (StreamReader sr = new StreamReader("compare.txt"))
            {
                string line = "";
                int lineID = 0;
                while ((line = sr.ReadLine()) != null)
                {
                    lineID++;
                    compareLst.Add(line);
                    if (lineID % 10000 == 0)
                        Console.WriteLine(lineID);
                }
            }
            Console.WriteLine("Finished reading compare.txt");
            fullLst.ExceptWith(compareLst);

            Console.WriteLine("Difference between full.txt and compare.txt, count: " + fullLst.Count);

            // Export the difference to a file
            using (StreamWriter sw = new StreamWriter("diff.txt"))
            {
                foreach (string item in fullLst)
                    sw.WriteLine(item);
            }
            Console.WriteLine("Comparison result exported");

            Console.ReadLine();

        }
    }
}
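
The core of the program above is two HashSet<string> operations: adding values, which stores each distinct value only once, and ExceptWith, which removes every element that also occurs in the other collection. A minimal sketch on in-memory data may make this easier to see. The helper name SetDiffSketch.Difference and the sample values are assumptions made for illustration and are not part of the original demo.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original demo.
static class SetDiffSketch
{
    // Returns the distinct items of 'full' that do not occur in 'part'.
    public static HashSet<string> Difference(IEnumerable<string> full, IEnumerable<string> part)
    {
        // The constructor de-duplicates: repeated values are stored once.
        HashSet<string> result = new HashSet<string>(full);
        // Remove, in place, everything that also occurs in 'part'.
        result.ExceptWith(part);
        return result;
    }
}

class SetDiffDemo
{
    static void Main()
    {
        string[] full = { "a", "b", "b", "c", "d" };
        string[] part = { "b", "d" };

        HashSet<string> diff = SetDiffSketch.Difference(full, part);
        Console.WriteLine(diff.Count);          // 2
        Console.WriteLine(diff.Contains("a"));  // True
        Console.WriteLine(diff.Contains("c"));  // True
    }
}
```

Note that a HashSet does not promise any particular enumeration order, so the lines written to diff.txt are not guaranteed to follow the order of full.txt.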

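PS:
At first I considered putting the lines into a List and de-duplicating with List.Contains(): if the list already contains the line, skip it, otherwise add it to the list.
After compiling the code, I found that it ran very slowly.
Since the logic is fairly simple, I switched to a string variable, last, that holds the previously read line. When the next line is read, it is added only if it differs from last, and last is then updated to the current line.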

After this change, the compiled program ran quickly.
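
One caveat is worth illustrating: comparing each line only with the previous one removes adjacent duplicates, so on its own it is a complete de-duplication only when equal lines always sit next to each other, for example in sorted input. Whether the input files are sorted is not stated, so treat that as an assumption. The sketch below, with hypothetical helper names, contrasts the two approaches on a small list:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers for illustration only.
static class DedupSketch
{
    // Keeps a line unless it equals the line immediately before it.
    public static List<string> DropAdjacentDuplicates(IEnumerable<string> lines)
    {
        List<string> kept = new List<string>();
        string last = null;
        foreach (string line in lines)
        {
            if (line != last)
                kept.Add(line);
            last = line;
        }
        return kept;
    }

    // Keeps the first occurrence of every line, wherever the repeats appear.
    public static List<string> DropAllDuplicates(IEnumerable<string> lines)
    {
        List<string> kept = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        foreach (string line in lines)
        {
            // Add returns false when the value is already in the set.
            if (seen.Add(line))
                kept.Add(line);
        }
        return kept;
    }
}

class DedupDemo
{
    static void Main()
    {
        string[] lines = { "a", "a", "b", "a" };
        Console.WriteLine(string.Join(",", DedupSketch.DropAdjacentDuplicates(lines))); // a,b,a
        Console.WriteLine(string.Join(",", DedupSketch.DropAllDuplicates(lines)));      // a,b
    }
}
```

In the program above the lines go into a HashSet in any case, so non-adjacent repeats are still dropped there; the previous-line check only skips a redundant Add call.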

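Conclusion: with a large amount of data, using List.Contains() to check whether a value is already present is very slow. Each call scans the list linearly, so de-duplicating n lines this way costs on the order of n² comparisons, whereas a HashSet lookup takes roughly constant time on average.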
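
To see the difference on your own machine, a rough timing sketch along the following lines can be used. The class name, the generated test strings, and the size of 20,000 items are arbitrary assumptions chosen so that the slow version still finishes in reasonable time; the actual timings depend on hardware and runtime, so none are quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical timing harness for illustration only.
class ContainsTiming
{
    // Builds n distinct strings to use as test input.
    public static List<string> MakeData(int n)
    {
        List<string> data = new List<string>(n);
        for (int i = 0; i < n; i++)
            data.Add("line-" + i);
        return data;
    }

    // De-duplicates with List.Contains: each check scans the list built so far.
    public static int DedupWithList(List<string> data)
    {
        List<string> unique = new List<string>();
        foreach (string s in data)
            if (!unique.Contains(s))
                unique.Add(s);
        return unique.Count;
    }

    // De-duplicates with HashSet.Add: each check is a hash lookup.
    public static int DedupWithHashSet(List<string> data)
    {
        HashSet<string> unique = new HashSet<string>();
        foreach (string s in data)
            unique.Add(s);
        return unique.Count;
    }

    static void Main()
    {
        List<string> data = MakeData(20000);

        Stopwatch sw = Stopwatch.StartNew();
        int viaList = DedupWithList(data);
        sw.Stop();
        Console.WriteLine("List.Contains: " + viaList + " unique, " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        int viaSet = DedupWithHashSet(data);
        sw.Stop();
        Console.WriteLine("HashSet.Add:   " + viaSet + " unique, " + sw.ElapsedMilliseconds + " ms");
    }
}
```

Both methods return the same count of unique values; only the time taken differs, and the gap widens quickly as the input grows.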