Avoiding garbled Chinese text when reading, splitting, or paging through a UTF-8 file

画夜

Article link: https://blog.csdn.net/zsr_251/article/details/40822383

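The idea is to check whether the last byte read is part of a Chinese character, that is, whether the read stopped in the middle of a multi-byte sequence.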

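Commonly used Chinese characters (those in the Basic Multilingual Plane) take 3 bytes in UTF-8, encoded as 1110xxxx 10xxxxxx 10xxxxxx. Rarer characters outside that plane take 4 bytes (11110xxx followed by three 10xxxxxx bytes); this article covers only the 3-byte case.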
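To make the bit patterns concrete, here is a minimal sketch that classifies a byte by its leading bits and prints the three bytes of the character U+4E2D. The class `Utf8ByteKind` and the method `Describe` are names made up for this illustration; the check looks only at the bit pattern and does not validate the sequence.

```csharp
using System;
using System.Text;

static class Utf8ByteKind
{
    // Hypothetical helper: describes the role of one byte based only on its leading bits.
    public static string Describe(byte b)
    {
        if (b <= 0x7F) return "ASCII";                  // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";  // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead of 2";     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead of 3";     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead of 4";     // 11110xxx
        return "invalid";
    }

    static void Main()
    {
        // U+4E2D is encoded as E4 B8 AD in UTF-8.
        foreach (byte b in Encoding.UTF8.GetBytes("\u4E2D"))
        {
            Console.WriteLine("{0} {1}", Convert.ToString(b, 2).PadLeft(8, '0'), Describe(b));
        }
        // prints:
        // 11100100 lead of 3
        // 10111000 continuation
        // 10101101 continuation
    }
}
```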

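        /// <summary>
        /// Returns how many bytes a Chinese character at the end of the buffer still lacks.
        /// A Chinese character is 3 bytes in UTF-8, encoded as 1110xxxx 10xxxxxx 10xxxxxx.
        /// </summary>
        /// <param name="last">The last byte read</param>
        /// <param name="before">The second-to-last byte read</param>
        /// <returns>The number of bytes still missing from the character</returns>
        private static int MissNum(int last, int before)
        {
            if (last >= 0 && last <= 127)
            {
                // One of the 128 ASCII characters,
                // not part of a Chinese character
                return 0;
            }
            if (last >= 224 && last <= 239)
            {
                // First byte of a Chinese character (1110xxxx):
                // two bytes remain
                return 2;
            }
            if (last >= 128 && last <= 191 && before >= 224 && before <= 239)
            {
                // Second byte of a Chinese character (10xxxxxx after 1110xxxx):
                // one byte remains
                return 1;
            }
            // Third byte of a complete character, or a byte this method does not handle
            return 0;
        }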
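`MissNum` covers the 3-byte case only. As a hedged sketch of how the same idea could extend to 2-byte and 4-byte sequences, the version below walks back over the trailing continuation bytes to find the lead byte, then compares the length that lead byte announces with the number of bytes actually present. `Utf8Tail` and `MissingBytes` are hypothetical names, not part of the original code, and malformed input is simply reported as lacking 0 bytes.

```csharp
using System;

static class Utf8Tail
{
    // Hypothetical generalisation of MissNum:
    // returns how many bytes the last character in buf[0..count) still lacks.
    public static int MissingBytes(byte[] buf, int count)
    {
        // Walk back over at most 3 continuation bytes (10xxxxxx) to reach the lead byte.
        int i = count - 1;
        int back = 0;
        while (i >= 0 && back < 3 && (buf[i] & 0xC0) == 0x80)
        {
            i--;
            back++;
        }
        if (i < 0) return 0; // no lead byte in the buffer, nothing to complete

        byte lead = buf[i];
        int expected;
        if (lead <= 0x7F) expected = 1;                // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) expected = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) expected = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) expected = 4;  // 11110xxx
        else return 0;                                 // malformed input

        int present = count - i; // lead byte plus the continuation bytes after it
        return present < expected ? expected - present : 0;
    }

    static void Main()
    {
        byte[] zhong = { 0xE4, 0xB8, 0xAD }; // U+4E2D
        Console.WriteLine(MissingBytes(zhong, 1)); // prints 2
        Console.WriteLine(MissingBytes(zhong, 2)); // prints 1
        Console.WriteLine(MissingBytes(zhong, 3)); // prints 0
    }
}
```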

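Call this method after each read, then read the missing bytes so that the last character is complete, as in the following code: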

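            // Requires: using System; using System.IO;
            // Byte offset at which reading starts
            long cur = 0;
            using (FileStream fs = new FileStream("E:/Log/abc.txt", FileMode.Open, FileAccess.Read))
            {
                byte[] buf = new byte[2048];
                fs.Position = cur;
                // Number of bytes read successfully
                int n;
                // Read at most 2046 bytes, keeping two bytes spare for a split character.
                // Read returns 0 at the end of the file, which ends the loop.
                while ((n = fs.Read(buf, 0, 2046)) > 0)
                {
                    // How many bytes the trailing Chinese character still lacks
                    int miss = MissNum(buf[n - 1], n >= 2 ? buf[n - 2] : 0);
                    for (int i = 0; i < miss; i++)
                    {
                        // Stop if the file ends in the middle of a character
                        if (fs.Read(buf, n, 1) == 0) break;
                        n++;
                    }
                    string s = System.Text.Encoding.UTF8.GetString(buf, 0, n);
                    Console.WriteLine(s);
                    // Wait for a key press before showing the next page
                    Console.ReadKey();
                }
            }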
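As an alternative to counting the missing bytes by hand, .NET's `System.Text.Decoder` keeps an incomplete trailing sequence between calls and completes it with the next chunk. This is a different technique from the one in this article, shown here only as a sketch; `ChunkedUtf8Reader` and `ReadAll` are hypothetical names, and the tiny chunk size in `Main` is chosen just to force characters to be split.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedUtf8Reader
{
    // Decodes the stream chunk by chunk. The Decoder remembers bytes of a character
    // that was cut off at the end of one chunk and joins them with the next chunk.
    public static string ReadAll(Stream input, int chunkSize)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        byte[] buf = new byte[chunkSize];
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = input.Read(buf, 0, buf.Length)) > 0)
        {
            int charCount = decoder.GetChars(buf, 0, n, chars, 0);
            sb.Append(chars, 0, charCount);
        }
        // Flush: bytes of a character left incomplete at the end of the stream
        // are turned into the replacement character instead of being dropped.
        int tail = decoder.GetChars(buf, 0, 0, chars, 0, true);
        sb.Append(chars, 0, tail);
        return sb.ToString();
    }

    static void Main()
    {
        string text = "a\u4E2D\u6587b";
        byte[] data = Encoding.UTF8.GetBytes(text);
        using (MemoryStream ms = new MemoryStream(data))
        {
            // Chunks of 2 bytes split both Chinese characters across reads.
            Console.WriteLine(ReadAll(ms, 2) == text); // prints True
        }
    }
}
```

With this approach each chunk yields only complete characters, so a page shown to the user may end slightly before the chunk's byte boundary; the remainder appears at the start of the next page.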

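Not implemented in the code above: if reading does not start at the first byte of the file, the starting offset may fall in the middle of a character, and the first character may come out garbled. The same reasoning applies: check whether the leading bytes are continuation bytes (10xxxxxx) and adjust the starting position accordingly.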
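One possible way to handle that unimplemented case, sketched under the assumption that a partial character at the start of a page can simply be dropped: skip any leading continuation bytes before decoding. `Utf8Start` and `SkipPartialChar` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

static class Utf8Start
{
    // Returns the index of the first byte in buf[0..count) that is not a
    // continuation byte (10xxxxxx), looking at no more than the first 3 bytes.
    public static int SkipPartialChar(byte[] buf, int count)
    {
        int i = 0;
        while (i < count && i < 3 && (buf[i] & 0xC0) == 0x80)
        {
            i++;
        }
        return i;
    }

    static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("\u4E2D\u6587"); // E4 B8 AD E6 96 87
        // Pretend the read started at offset 1, in the middle of the first character.
        byte[] page = new byte[data.Length - 1];
        Array.Copy(data, 1, page, 0, page.Length);

        int start = SkipPartialChar(page, page.Length);
        string s = Encoding.UTF8.GetString(page, start, page.Length - start);
        Console.WriteLine(start);         // prints 2
        Console.WriteLine(s == "\u6587"); // prints True
    }
}
```

Skipping loses the partial character. If that character matters, the alternative is to move the file position backwards until a non-continuation byte is found and start reading there.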