Counting Chinese Characters, English Letters, Digits, and Other Characters in a Large File

NBA_2011

Article link: https://blog.csdn.net/NBA_2011/article/details/7283162

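A large file contains several kinds of characters, and the goal is to count the Chinese characters, English letters, digits, and other characters in it. The main idea is to split the file: divide the large file into several smaller pieces and process each piece separately.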

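The approach to splitting the file is:

1. Use Java's RandomAccessFile class for random access.

2. Cutting at an arbitrary byte offset can easily split a Chinese character in two, so the file is read line by line. After each line is read, the number of bytes read so far is computed and checked against the piece boundaries. Once a boundary has been reached, the byte range is recorded so that it can be processed later.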
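To make the problem in step 2 concrete, here is a minimal sketch of what happens when a byte array is cut in the middle of a character. It assumes GBK encoding purely for illustration (two bytes per Chinese character); the class name `SplitCharDemo` and the helper `decodePrefix` are hypothetical and do not come from the original code.

```java
import java.nio.charset.Charset;

public class SplitCharDemo {
    // Assumption: GBK is used only as an example of a two-byte encoding.
    static final Charset GBK = Charset.forName("GBK");

    /** Encodes text as GBK and decodes only its first cut bytes. */
    static String decodePrefix(String text, int cut) {
        byte[] bytes = text.getBytes(GBK);
        return new String(bytes, 0, cut, GBK);
    }

    public static void main(String[] args) {
        // "a" + U+4E2D + U+6587: 1 + 2 + 2 = 5 bytes in GBK
        String text = "a\u4e2d\u6587";
        // Cut on a character boundary: the first two characters survive intact
        System.out.println(decodePrefix(text, 3));
        // Cut after the lead byte of U+4E2D: that character can no longer be decoded
        System.out.println(decodePrefix(text, 2));
    }
}
```

Cutting only at line ends avoids this, because a line terminator never falls inside a multi-byte character in GBK or GB2312.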

The method for splitting the file is as follows:

scopeMap = new LinkedHashMap<Long, Long>();
// Read-only access is enough here; try-with-resources closes the file.
try (RandomAccessFile raf = new RandomAccessFile(new File(fileName), "r")) {
	long totalLen = raf.length();

	// Target size of one piece in bytes (the whole file if it is tiny)
	long fileLen;
	if (totalLen > piecesNum) {
		fileLen = totalLen / piecesNum;
	} else {
		fileLen = totalLen;
	}

	long startSize = 0; // byte offset where the current piece starts
	int nextPiece = 1;  // index of the next piece boundary to cross
	while (nextPiece < piecesNum && raf.readLine() != null) {
		// getFilePointer() is the exact byte offset just past the line
		// terminator, so the terminator length ("\n" is 1 byte, "\r\n" is
		// 2 bytes) does not have to be guessed or hard-coded.
		long endSize = raf.getFilePointer();

		// Close the current piece at the first line end that reaches
		// or passes its boundary
		if (endSize >= nextPiece * fileLen) {
			scopeMap.put(startSize, endSize);
			startSize = endSize;
			nextPiece = (int) (endSize / fileLen) + 1;
		}
	}

	// Whatever remains in the file becomes the last piece
	if (startSize < totalLen) {
		scopeMap.put(startSize, totalLen);
	}

} catch (FileNotFoundException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}

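To process the data, take each key-value pair (start offset and end offset) out of scopeMap, read the bytes in that range into a byte array, and then process the array.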

for (Entry<Long, Long> entry : ReadFileUtil.scopeMap.entrySet()) {
	long start = entry.getKey();
	long end = entry.getValue();
	// Each piece must be small enough to fit in a single byte array
	byte[] bytes = new byte[(int) (end - start)];
	raf.seek(start);
	// readFully() keeps reading until the array is full; read() may return early
	raf.readFully(bytes);

	// Decode the piece; the file is expected to be GB2312-encoded
	String str = new String(bytes, 0, bytes.length, "GB2312");
	char[] chars = str.toCharArray();

	// Classify every character. isChinese() must be tested first because
	// Character.isLetter() is also true for Chinese characters.
	for (int i = 0; i < chars.length; i++) {
		if (ReadFileUtil.isChinese(chars[i])) {
			ReadFileUtil.chineseCount++;
		} else if (Character.isDigit(chars[i])) {
			ReadFileUtil.numericCount++;
		} else if (Character.isLetter(chars[i])) {
			ReadFileUtil.englishCount++;
		} else {
			ReadFileUtil.otherCount++;
		}
	}
}
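The fragment above relies on fields defined elsewhere (`raf`, `scopeMap`, the counters), so it cannot be run on its own. As a rough self-contained sketch of the same step, the class below reads one byte range, decodes it, and classifies the characters. The names `RangeCounter`, `classify`, and `countRange` are hypothetical, GBK is an assumed encoding, and `Character.isIdeographic` is only a stand-in for the article's `isChinese`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class RangeCounter {
    /** Returns counts as {chinese, digits, letters, other}. */
    static long[] classify(String text) {
        long[] counts = new long[4];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isIdeographic(c)) {        // stand-in for isChinese()
                counts[0]++;
            } else if (Character.isDigit(c)) {
                counts[1]++;
            } else if (Character.isLetter(c)) {
                counts[2]++;
            } else {
                counts[3]++;
            }
        }
        return counts;
    }

    /** Reads bytes [start, end) of the file and classifies the decoded text. */
    static long[] countRange(Path file, long start, long end, Charset cs) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            // toIntExact fails loudly if a piece is too large for one array
            byte[] bytes = new byte[Math.toIntExact(end - start)];
            raf.seek(start);
            raf.readFully(bytes);
            return classify(new String(bytes, cs));
        }
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK"); // assumed file encoding
        Path tmp = Files.createTempFile("count", ".txt");
        try {
            // "ab12" + two Chinese characters + "!" + newline
            Files.write(tmp, "ab12\u4e2d\u6587!\n".getBytes(gbk));
            long[] c = countRange(tmp, 0, Files.size(tmp), gbk);
            System.out.printf("chinese=%d digits=%d letters=%d other=%d%n",
                    c[0], c[1], c[2], c[3]);
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Returning the counts instead of updating static fields keeps each range independent, which makes it straightforward to sum the results of all pieces afterwards.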

As for deciding whether a character is Chinese, many methods can be found online:

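// Method 1 (method body; ch is the character being tested).
// Encodes ch with the platform default charset and checks the sign of the
// first byte. This is true for any character whose encoding starts with a
// non-ASCII byte, so it also matches fullwidth punctuation and other
// non-Chinese characters.
byte[] bytes = ("" + ch).getBytes();
return bytes[0] < 0;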

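// Method 2 (method body): check whether the character encodes to a
// two-byte sequence inside the GBK double-byte range.
// Requires: import java.nio.charset.Charset;
char ch = '中';
String str = "" + ch;
char[] chars = str.toCharArray();
boolean isGB2312 = false;
for (int i = 0; i < chars.length; i++) {
	// Encode explicitly as GBK so the result does not depend on the platform charset
	byte[] bytes = ("" + chars[i]).getBytes(Charset.forName("GBK"));
	// A Chinese character occupies two bytes in GBK
	if (bytes.length == 2) {
		int[] ints = new int[2];
		ints[0] = bytes[0] & 0xff;
		ints[1] = bytes[1] & 0xff;
		// Lead byte in 0x81-0xFE, trail byte in 0x40-0xFE
		if (ints[0] >= 0x81 && ints[0] <= 0xFE && ints[1] >= 0x40
				&& ints[1] <= 0xFE) {
			isGB2312 = true;
			break;
		}
	}
}
return isGB2312;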
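Both methods above depend on how the character is encoded into bytes. Once the bytes have been decoded into a Java String, another option is to ask the JDK which Unicode script a character belongs to. The sketch below is not taken from the original article; the class name `HanCheck` and the method `isHan` are hypothetical. Note that it counts only Han ideographs, so Chinese punctuation such as the fullwidth comma would fall into the "other" category.

```java
public class HanCheck {
    /** True when the code point belongs to the Unicode Han script. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(isHan('\u4e2d')); // a Chinese character (U+4E2D)
        System.out.println(isHan('A'));      // a Latin letter
        System.out.println(isHan('\uff0c')); // fullwidth comma (U+FF0C)
    }
}
```

Because the check works on Unicode code points, its result does not change with the platform default charset.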
