A Summary of Parsing Large XML Files


BAStriver


Article link: https://blog.csdn.net/BAStriver/article/details/104500711


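1. Large files often cannot be parsed by simply reading the whole file into memory. This article summarizes an approach to parsing large XML files, together with the code.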

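2. The main idea is to wrap the file-splitting logic in a utility class: read only part of the file at a time (for example 10 MB), locate the XML tags in what has been read so far, and then match the complete elements.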

1) Suppose we have the following test.xml (1.42 KB):

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Header>
    <ContentDate>2020-02-03T01:00:00Z</ContentDate>
    <FileContent>BAS</FileContent>
    <DeltaStart>2020-02-02T17:00:00Z</DeltaStart>
  </Header>
  <Records>
<Record>
	   <BAS>123</BAS>
	   <Entity>
		  <LegalName>1 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="1 en" type="TRADING_OR_OPERATING_NAME">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>456</BAS>
	   <Entity>
		  <LegalName>2 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="2 en" type="TRADING_OR_OPERATING_NAME">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>
	  </Record>
	  <Record>
	   <BAS>789</BAS>
	   <Entity>
		  <LegalName>3 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="3 en" type="TRADING_OR_OPERATING_NAME">3 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>1022</BAS>
	   <Entity>
		  <LegalName>4 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="4 en" type="TRADING_OR_OPERATING_NAME">4 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
  </Records>
</Data>

2) Reading the large file.

public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
	this.fileIn = new FileInputStream(fileName);
	FileChannel fileChannel = fileIn.getChannel();
	this.fileLength = fileChannel.size();
	this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
	this.mappedBufArray = new MappedByteBuffer[number];// memory File Mapping Array
	long preLength = 0;
	long regionSize = (long) Integer.MAX_VALUE;// size of mapping region
	for (int i = 0; i < number; i++) {
		// mapping contiguous areas of files to memory file mapping arrays
		if (fileLength - preLength < (long) Integer.MAX_VALUE) {
			regionSize = fileLength - preLength;// the size of the last area
		}
		mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
		preLength += regionSize;// the beginning of the next area
	}
	this.arraySize = arraySize;
}

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	while (reader.read() != -1) {
		System.out.println("===========================");
		System.out.println(new String(reader.getArray()));
	}
	reader.close();
}

Note: in the output, each line of equals signs is followed by one 1 KB chunk of the file.
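The article shows only the constructor of MappedBiggerFileReader, so the read(), getArray() and close() methods used in the test are not visible here. As a rough sketch of how such a chunked reader could work, the hypothetical ChunkedMappedReader below maps one chunk of the file per read() call. This is a simplification assumed for illustration, not the author's implementation, which maps regions of up to Integer.MAX_VALUE bytes up front.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunked reader: maps one chunk of the file per read() call.
class ChunkedMappedReader implements AutoCloseable {
    private final FileChannel channel;
    private final long fileLength;
    private final int chunkSize;
    private long position = 0;
    private byte[] array = new byte[0];

    ChunkedMappedReader(Path file, int chunkSize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.fileLength = channel.size();
        this.chunkSize = chunkSize;
    }

    /** Reads the next chunk; returns its length in bytes, or -1 at end of file. */
    int read() throws IOException {
        if (position >= fileLength) {
            return -1;
        }
        int len = (int) Math.min(chunkSize, fileLength - position);
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, position, len);
        array = new byte[len]; // sized to the chunk, so the last chunk has no stale bytes
        buf.get(array);
        position += len;
        return len;
    }

    /** Returns the bytes of the most recently read chunk. */
    byte[] getArray() {
        return array;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    /** Writes a 2500-byte temp file and returns the chunk sizes seen with 1 KB chunks. */
    static List<Integer> demo() {
        try {
            Path file = Files.createTempFile("chunk-demo", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[2500]);
            List<Integer> sizes = new ArrayList<>();
            try (ChunkedMappedReader reader = new ChunkedMappedReader(file, 1024)) {
                int n;
                while ((n = reader.read()) != -1) {
                    sizes.add(n);
                }
            }
            return sizes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1024, 1024, 452]
    }
}
```

One caveat applies to any byte-based chunking: new String(bytes) decodes with the platform default charset, and a multi-byte UTF-8 character that straddles a chunk boundary will be decoded incorrectly. The sample file is pure ASCII, so the problem does not show up there.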

3) Locating the XML tags.

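As test.xml shows, the content we want is the <Record> elements, so we now need to extract everything between <Record> and </Record> from the 1 KB read in the previous step. The method below returns the index of the first opening tag and the index of the last closing tag.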

public static Range getRangeForTags(StringBuffer sfb, String tag) {
	Range range = new Range();
	range.setFrom(sfb.indexOf("<" + tag));
	range.setTo(sfb.lastIndexOf("</" + tag + ">"));
	return range;
}

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	StringBuffer cache = new StringBuffer();
	while (reader.read() != -1) {
		cache.append(new String(reader.getArray()));
		Range range = Strkit.getRangeForTags(cache, "Record");
		System.out.println(range);
	}
	reader.close();
}
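One detail of getRangeForTags is worth spelling out: indexOf("<" + tag) is a prefix match, so with tag "Record" it also hits <Records>. The sketch below shows one possible stricter lookup that accepts an opening tag only when the tag name is followed by '>', '/' or whitespace. TagLocator, TagRange and locate are hypothetical names introduced for illustration and are not part of the article's Strkit class.

```java
// Hypothetical helper: locates a tag by exact name rather than by prefix.
class TagLocator {
    /** from = start of the first opening tag, to = start of the last closing tag; -1 if absent. */
    static final class TagRange {
        final int from;
        final int to;

        TagRange(int from, int to) {
            this.from = from;
            this.to = to;
        }
    }

    static TagRange locate(CharSequence text, String tag) {
        String s = text.toString();
        String open = "<" + tag;
        int from = s.indexOf(open);
        // Skip prefix hits such as "<Records" when looking for "<Record".
        while (from >= 0 && !endsTagName(s, from + open.length())) {
            from = s.indexOf(open, from + 1);
        }
        int to = s.lastIndexOf("</" + tag + ">");
        return new TagRange(from, to);
    }

    /** True when the character at pos terminates a tag name. */
    private static boolean endsTagName(String s, int pos) {
        if (pos >= s.length()) {
            return false; // the name may continue in the next chunk
        }
        char c = s.charAt(pos);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        TagRange r = locate("<Records><Record><BAS>1</BAS></Record><Record>", "Record");
        System.out.println(r.from + " " + r.to); // prints 9 29
    }
}
```

A caller should still check that to is greater than from before cutting, since a chunk can contain an opening tag whose closing tag has not been read yet.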

4) Matching the XML tags.

The previous step gave us the index of the tag we want; the next step is to match each <Record></Record> element inside that range.

public static List<String> getSubUtil(String soap,String rgex){
	List<String> list = new ArrayList<String>();
	Pattern pattern = Pattern.compile(rgex);
	Matcher m = pattern.matcher(soap);
	while (m.find()) {
		int i = 1;
		list.add(m.group(i));
	}
	return list;
}

Test:

public static void main(String[] args) throws IOException {
	StringBuffer cache = new StringBuffer();
	cache.append("<Records>"
			+ "<Record>\r\n" + 
			"	   <BAS>123</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>1 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"1 en\" type=\"TRADING_OR_OPERATING_NAME\">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>	   \r\n" + 
			"	  </Record><Record>\r\n" + 
			"	   <BAS>456</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>2 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"2 en\" type=\"TRADING_OR_OPERATING_NAME\">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>\r\n" + 
			"	  </Record>"
			+ "</Records>");

	StringBuffer texts = new StringBuffer();
	String node = "Record";
	String rgex = "<" + node + "" + "([\\s\\S]*?)</" + node + ">";
	List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
	elemTexts.forEach(str -> {
		str = str.replace("s><" + node,"");
		String t = "<" + node + str + "</" + node + ">";
		texts.append(t);
	});
	System.out.println(texts);
}

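Note: the enclosing <Records> and </Records> tags are removed. Because the pattern <Record also matches the start of <Records>, the first captured group begins with s><Record; the replace call strips that prefix before the element is rebuilt.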
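As an alternative to stripping the s><Record prefix afterwards, the pattern itself can be made to match the tag name exactly. The following is a minimal sketch under that assumption; ElementExtractor and extract are hypothetical names, and the pattern does not handle self-closing elements, nested elements with the same name, or a '>' character inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns every complete <tag>...</tag> element found in the text.
class ElementExtractor {
    static List<String> extract(String xml, String tag) {
        String name = Pattern.quote(tag);
        // After the tag name, require either '>' or whitespace followed by attributes,
        // so that "<Records>" is not treated as "<Record".
        Pattern p = Pattern.compile("<" + name + "(?:\\s[^>]*)?>[\\s\\S]*?</" + name + ">");
        Matcher m = p.matcher(xml);
        List<String> elements = new ArrayList<>();
        while (m.find()) {
            elements.add(m.group()); // the whole element, tags included
        }
        return elements;
    }

    public static void main(String[] args) {
        String xml = "<Records><Record>\r\n<BAS>123</BAS></Record>"
                + "<Record type=\"x\"><BAS>456</BAS></Record></Records>";
        for (String element : extract(xml, "Record")) {
            System.out.println(element);
        }
    }
}
```

Since m.group() already contains the opening and closing tags, the elements can be appended to the output buffer without being rebuilt.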

5) The integrated parsing utility class.

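The core idea of the XML splitting is to use the index obtained in step 3 to delete the already processed content from cache after each round.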

public void parse() {
	MappedBiggerFileReader reader = null;
	try {
		reader = new MappedBiggerFileReader(filePath, size * 1024);
		StringBuffer cache = new StringBuffer();
		while (reader.read() != -1) {
			cache.append(new String(reader.getArray()));
			Range range = Strkit.getRangeForTags(cache, node);

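			// Proceed only when a complete element is cached: getTo() stays -1 until
			// a closing tag has been read, and delete(0, -1) would throw.
			if (range.getFrom() >= 0 && range.getTo() > range.getFrom()) {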
				StringBuffer texts = new StringBuffer();

				String rgex = "<" + node + "" + "([\\s\\S]*?)</" + node + ">";
				List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
				elemTexts.forEach(str -> {
					str = str.replace("s>\n<" + node + "", "");
					String t = "<" + node + str + "</" + node + ">";
					texts.append(t);
				});
				texts.insert(0, "<Root>");
				texts.append("</Root>");

				System.out.println("===============");
				System.out.println(texts);
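				// Drop everything processed so far, including the last closing tag ("</" + node + ">").
				cache.delete(0, range.getTo() + node.length() + 3);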
			}
		}
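	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		// Release the file handle even when parsing fails.
		if (reader != null) {
			try {
				reader.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}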
}

Test:

public static void main(String[] args) {
	new XmlParseUtil("D:\\test\\test.xml", "Record", 1).parse();
}

Note: each batch of extracted Record elements is wrapped in <Root></Root>, which makes it convenient to parse the result afterwards with XstreamUtil.
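The same read, match and delete loop can also be built on a character Reader instead of raw byte chunks, which avoids cutting a multi-byte character in half at a chunk boundary. The sketch below is an illustration under stated assumptions, not the author's XmlParseUtil: the class XmlElementSplitter, its methods, and the exact-name regular expression are all hypothetical. It keeps unprocessed text in a cache, emits every complete element, and deletes what has been consumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical splitter: streams text in small buffers and collects complete elements.
class XmlElementSplitter {
    static List<String> split(Reader in, String tag, int bufChars) throws IOException {
        String name = Pattern.quote(tag);
        Pattern p = Pattern.compile("<" + name + "(?:\\s[^>]*)?>[\\s\\S]*?</" + name + ">");
        List<String> elements = new ArrayList<>();
        StringBuilder cache = new StringBuilder();
        char[] buf = new char[bufChars];
        int n;
        while ((n = in.read(buf)) != -1) {
            cache.append(buf, 0, n);
            Matcher m = p.matcher(cache);
            int consumed = 0;
            while (m.find()) {
                elements.add(m.group());
                consumed = m.end(); // end of the last complete element
            }
            // Keep only the tail that may hold the beginning of the next element.
            cache.delete(0, consumed);
        }
        return elements;
    }

    /** Convenience overload for in-memory text; wraps the checked exception. */
    static List<String> splitString(String xml, String tag, int bufChars) {
        try (Reader in = new StringReader(xml)) {
            return split(in, tag, bufChars);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Data><Records><Record><BAS>123</BAS></Record>\n"
                + "<Record><BAS>456</BAS></Record></Records></Data>";
        List<String> elements = splitString(xml, "Record", 7);
        // Wrap the batch in a single root so it is a well-formed document again.
        System.out.println("<Root>" + String.join("", elements) + "</Root>");
    }
}
```

For a real file, the Reader could be obtained with Files.newBufferedReader(path, StandardCharsets.UTF_8). Note that the cache is rescanned from its start after every read, so a single very large element makes this loop slow; it is meant for many small records such as the Record elements in test.xml.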

3. The source code is attached for download. If anything is unclear or looks wrong, feel free to leave a comment to discuss.

Notes:

1. If you encounter: Xstream NumberFormatException: Zero length string...

A likely cause is that some data nodes in the XML file under test have an attribute whose value is empty.
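The message text matches what the JDK's Integer.decode and Long.decode report for an empty string, so a numeric field mapped from an empty attribute is a plausible trigger. How XstreamUtil converts attributes is not shown in the article, so the helper below is purely hypothetical: it only illustrates guarding a numeric conversion against an empty value.

```java
// Hypothetical guard for numeric attribute values that may be empty.
class AttributeNumbers {
    /** Returns null for a missing or blank value instead of throwing. */
    static Integer decodeOrNull(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return Integer.decode(raw.trim());
    }

    /** Shows that decoding an empty string fails with NumberFormatException. */
    static boolean emptyStringThrows() {
        try {
            Integer.decode("");
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(emptyStringThrows());  // prints true
        System.out.println(decodeOrNull(""));     // prints null
        System.out.println(decodeOrNull(" 42 ")); // prints 42
    }
}
```

Whether to map an empty attribute to null, to a default value, or to reject the record is a data-quality decision that depends on the file being parsed.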
