Parsing Large XML Files: A Summary

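1. Large files often cannot be parsed by simply reading the whole file into memory. This post summarizes an approach to parsing large XML files, together with the code.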

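2. The main idea is to wrap the file splitting in a utility class: read the file one piece at a time (for example 10 MB per read), locate the XML tags of interest in what has been read so far, and then match the complete elements.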

1) Suppose we have the following test.xml (1.42 KB):

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Header>
    <ContentDate>2020-02-03T01:00:00Z</ContentDate>
    <FileContent>BAS</FileContent>
    <DeltaStart>2020-02-02T17:00:00Z</DeltaStart>
  </Header>
  <Records>
<Record>
	   <BAS>123</BAS>
	   <Entity>
		  <LegalName>1 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="1 en" type="TRADING_OR_OPERATING_NAME">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>456</BAS>
	   <Entity>
		  <LegalName>2 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="2 en" type="TRADING_OR_OPERATING_NAME">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>
	  </Record>
	  <Record>
	   <BAS>789</BAS>
	   <Entity>
		  <LegalName>3 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="3 en" type="TRADING_OR_OPERATING_NAME">3 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>1022</BAS>
	   <Entity>
		  <LegalName>4 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="4 en" type="TRADING_OR_OPERATING_NAME">4 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
  </Records>
</Data>

2) Reading the large file

public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
	this.fileIn = new FileInputStream(fileName);
	FileChannel fileChannel = fileIn.getChannel();
	this.fileLength = fileChannel.size();
	this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
	this.mappedBufArray = new MappedByteBuffer[number];// memory File Mapping Array
	long preLength = 0;
	long regionSize = (long) Integer.MAX_VALUE;// size of mapping region
	for (int i = 0; i < number; i++) {
		// mapping contiguous areas of files to memory file mapping arrays
		if (fileLength - preLength < (long) Integer.MAX_VALUE) {
			regionSize = fileLength - preLength;// the size of the last area
		}
		mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
		preLength += regionSize;// the beginning of the next area
	}
	this.arraySize = arraySize;
}
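
Only the constructor is shown above, while the tests below call read(), getArray() and close(). Here is a minimal sketch of how those methods could look. It assumes read() copies at most arraySize bytes into a fresh array and returns the number of bytes read, or -1 at end of file. The class name ChunkedMappedReader and its fields are hypothetical, not the author's code.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the parts of MappedBiggerFileReader that the post does not show.
class ChunkedMappedReader implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer[] regions; // one mapping per region of at most Integer.MAX_VALUE bytes
    private final int arraySize;              // maximum number of bytes handed out per read()
    private int current = 0;                  // index of the region currently being read
    private byte[] array = new byte[0];       // bytes produced by the last read()

    ChunkedMappedReader(Path file, int arraySize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.arraySize = arraySize;
        long length = channel.size();
        int count = (int) ((length + Integer.MAX_VALUE - 1) / Integer.MAX_VALUE);
        this.regions = new MappedByteBuffer[count];
        long offset = 0;
        for (int i = 0; i < count; i++) {
            long size = Math.min(Integer.MAX_VALUE, length - offset);
            regions[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            offset += size;
        }
    }

    /** Copies the next chunk into a fresh array; returns its length, or -1 at end of file. */
    int read() {
        while (current < regions.length && !regions[current].hasRemaining()) {
            current++; // this region is exhausted, move on to the next one
        }
        if (current >= regions.length) {
            return -1;
        }
        array = new byte[Math.min(arraySize, regions[current].remaining())];
        regions[current].get(array);
        return array.length;
    }

    byte[] getArray() {
        return array;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunk-demo", ".xml");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "<Data><Record>1</Record></Data>".getBytes(StandardCharsets.UTF_8));
        try (ChunkedMappedReader reader = new ChunkedMappedReader(tmp, 8)) {
            while (reader.read() != -1) {
                System.out.println(new String(reader.getArray(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

The last chunk of a region is simply shorter than arraySize, so callers should use the length of getArray() rather than assume a fixed size.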

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	while (reader.read() != -1) {
		System.out.println("===========================");
		System.out.println(new String(reader.getArray()));
	}
	reader.close();
}

Note: in the output, each row of equals signs marks one chunk of up to 1 KB that was read.
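
One thing to watch in the test loop: new String(byte[]) decodes each chunk on its own, using the platform default charset. A multi-byte UTF-8 character that straddles a chunk boundary would therefore come out garbled. The sample file is pure ASCII, so it is not affected. Below is a sketch of one possible way around this, using the JDK's CharsetDecoder; the class name Utf8ChunkDecoder is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8 chunk by chunk and carries incomplete byte sequences over.
class Utf8ChunkDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private ByteBuffer leftover = ByteBuffer.allocate(0); // undecoded tail of the previous chunk

    /** Decodes the chunk plus any bytes left over from the previous call. */
    String decode(byte[] chunk) {
        ByteBuffer in = ByteBuffer.allocate(leftover.remaining() + chunk.length);
        in.put(leftover);
        in.put(chunk);
        in.flip();
        // UTF-8 never yields more chars than it has bytes, so this capacity is sufficient
        CharBuffer out = CharBuffer.allocate(in.remaining());
        // endOfInput = false: an incomplete trailing sequence is left in "in" instead of being reported
        decoder.decode(in, out, false);
        leftover = ByteBuffer.allocate(in.remaining());
        leftover.put(in);
        leftover.flip();
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u20acb".getBytes(StandardCharsets.UTF_8); // the euro sign takes 3 bytes
        Utf8ChunkDecoder d = new Utf8ChunkDecoder();
        String first = d.decode(java.util.Arrays.copyOfRange(bytes, 0, 2));  // ends inside the euro sign
        String second = d.decode(java.util.Arrays.copyOfRange(bytes, 2, bytes.length));
        System.out.println(first + second);
    }
}
```

With such a helper, the cache would be fed with decoder.decode(reader.getArray()) instead of new String(reader.getArray()).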

 

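3) Locating the XML tags

As test.xml shows, the content we want is the <Record> elements, so the next step is to locate everything between the first <Record> and the last </Record> in the text read in the previous step.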

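public static Range getRangeForTags(StringBuffer sfb, String tag) {
	Range range = new Range();
	// Accept "<tag" only when followed by whitespace or '>', so "Record" does not match "<Records>"
	Matcher m = Pattern.compile("<" + Pattern.quote(tag) + "[\\s>]").matcher(sfb);
	range.setFrom(m.find() ? m.start() : -1);
	// Start index of the last closing tag, or -1 if no element has been closed yet
	range.setTo(sfb.lastIndexOf("</" + tag + ">"));
	return range;
}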

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	StringBuffer cache = new StringBuffer();
	while (reader.read() != -1) {
		cache.append(new String(reader.getArray()));
		Range range = Strkit.getRangeForTags(cache, "Record");
		System.out.println(range);
	}
	reader.close();
}
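
Range is not shown in the post. Judging from the calls above, a plain holder with two int indexes would be enough. The sketch below makes that assumption, with -1 meaning "not found" as returned by indexOf. It also illustrates a pitfall when locating a tag by prefix: a plain indexOf("<Record") already stops at <Records>. The class name TagLocator and its method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the Range class used above: two indexes, -1 meaning "not found".
class Range {
    private int from = -1;
    private int to = -1;

    int getFrom() { return from; }
    void setFrom(int from) { this.from = from; }
    int getTo() { return to; }
    void setTo(int to) { this.to = to; }

    @Override
    public String toString() {
        return "Range[from=" + from + ", to=" + to + "]";
    }
}

class TagLocator {
    /** Index of the first "<tag" that is followed by whitespace, '/' or '>', or -1 if none. */
    static int indexOfOpeningTag(CharSequence text, String tag) {
        Matcher m = Pattern.compile("<" + Pattern.quote(tag) + "(?=[\\s/>])").matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "<Records>\n<Record><BAS>123</BAS></Record>\n</Records>";
        System.out.println(text.indexOf("<Record"));           // 0: the prefix also matches <Records>
        System.out.println(indexOfOpeningTag(text, "Record")); // 10: the first real <Record>
    }
}
```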

 

4) Matching the XML elements

The previous step gave us the indexes of the tags we want; now we match each <Record>...</Record> element inside that range.

public static List<String> getSubUtil(String soap, String rgex) {
	List<String> list = new ArrayList<String>();
	Pattern pattern = Pattern.compile(rgex);
	Matcher m = pattern.matcher(soap);
	while (m.find()) {
		// group(1) is the text captured by the first pair of parentheses in rgex
		list.add(m.group(1));
	}
	return list;
}

Test:

public static void main(String[] args) throws IOException {
	StringBuffer cache = new StringBuffer();
	cache.append("<Records>"
			+ "<Record>\r\n" + 
			"	   <BAS>123</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>1 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"1 en\" type=\"TRADING_OR_OPERATING_NAME\">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>	   \r\n" + 
			"	  </Record><Record>\r\n" + 
			"	   <BAS>456</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>2 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"2 en\" type=\"TRADING_OR_OPERATING_NAME\">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>\r\n" + 
			"	  </Record>"
			+ "</Records>");

	StringBuffer texts = new StringBuffer();
	String node = "Record";
	// \b keeps "<Record" from also matching "<Records>"; [\s\S]*? lazily spans line breaks
	String rgex = "<" + node + "\\b([\\s\\S]*?)</" + node + ">";
	List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
	elemTexts.forEach(str -> {
		// Rebuild the full element around the captured text
		String t = "<" + node + str + "</" + node + ">";
		texts.append(t);
	});
	System.out.println(texts);
}

Note: the leading and trailing <Records> tags have been removed.
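
A variant of the matching step can be tried in isolation. The sketch below follows the same idea as getSubUtil, but returns whole elements via group() rather than rebuilding them from a captured group. The class and method names are hypothetical. The lazy [\s\S]*? keeps each match to a single element even across line breaks, and \b keeps <Record from matching <Records>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex-based element matching.
class ElementMatcher {
    /** Returns every complete <tag ...>...</tag> element found in text, in document order. */
    static List<String> findElements(String text, String tag) {
        String quoted = Pattern.quote(tag);
        Pattern pattern = Pattern.compile("<" + quoted + "\\b[\\s\\S]*?</" + quoted + ">");
        List<String> elements = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            elements.add(m.group()); // the whole match, opening and closing tag included
        }
        return elements;
    }

    public static void main(String[] args) {
        String text = "<Records><Record>\r\n<BAS>123</BAS>\r\n</Record>"
                + "<Record>\r\n<BAS>456</BAS>\r\n</Record></Records>";
        findElements(text, "Record").forEach(System.out::println);
    }
}
```

Like the original, this approach assumes that elements of the chosen tag are never nested inside each other and are never self-closing; a regex cannot handle those cases reliably.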

 

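5) The combined parsing utility class

The core idea of the XML splitting is to use the indexes obtained in step 3 to delete the already processed content from the cache.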

public void parse() {
	MappedBiggerFileReader reader = null;
	try {
		reader = new MappedBiggerFileReader(filePath, size * 1024);
		StringBuffer cache = new StringBuffer();
		String endTag = "</" + node + ">";
		// \b keeps "<Record" from also matching "<Records>"
		String rgex = "<" + node + "\\b([\\s\\S]*?)" + endTag;
		while (reader.read() != -1) {
			cache.append(new String(reader.getArray()));
			Range range = Strkit.getRangeForTags(cache, node);

			// Proceed only once the cache holds at least one complete element
			if (range.getFrom() >= 0 && range.getTo() > range.getFrom()) {
				StringBuffer texts = new StringBuffer();

				List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
				elemTexts.forEach(str -> texts.append("<" + node + str + endTag));
				texts.insert(0, "<Root>");
				texts.append("</Root>");

				System.out.println("===============");
				System.out.println(texts);
				// Drop everything up to and including the last closing tag; keep the unfinished tail
				cache.delete(0, range.getTo() + endTag.length());
			}
		}
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		if (reader != null) {
			try {
				reader.close();
			} catch (IOException ignored) {
				// nothing useful to do if closing fails
			}
		}
	}
}

Test:

public static void main(String[] args) {
	new XmlParseUtil("D:\\test\\test.xml", "Record", 1).parse();
}

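Note: each batch of Record elements extracted from the cache is wrapped in <Root></Root>, so that it forms a well-formed document that can then be parsed with XstreamUtil.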
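
XstreamUtil is not shown in the post. To illustrate why the <Root> wrapper helps, the sketch below parses one wrapped batch with the JDK's built-in DOM parser instead. This is a swapped-in technique, not the author's XstreamUtil, and the class name RootBatchReader is hypothetical. Without a single root element the parser would reject the text, because an XML document may have only one top-level element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical consumer of the "<Root>...</Root>" batches printed by parse().
class RootBatchReader {
    /** Parses "<Root><Record>...</Record>...</Root>" and returns "BAS: LegalName" for each Record. */
    static List<String> summarize(String wrappedXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrappedXml)));
        NodeList records = doc.getDocumentElement().getElementsByTagName("Record");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            String bas = record.getElementsByTagName("BAS").item(0).getTextContent();
            String name = record.getElementsByTagName("LegalName").item(0).getTextContent();
            result.add(bas + ": " + name);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String batch = "<Root>"
                + "<Record><BAS>123</BAS><Entity><LegalName>1 AVSUPER FUND</LegalName></Entity></Record>"
                + "<Record><BAS>456</BAS><Entity><LegalName>2 AVSUPER FUND</LegalName></Entity></Record>"
                + "</Root>";
        summarize(batch).forEach(System.out::println);
    }
}
```

Since each batch holds only the records from one chunk, the DOM built for it stays small no matter how large the source file is.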

 

3. The source code is attached for download. If anything is unclear or looks wrong, feel free to leave a comment.

 

Notes:

1. If you run into: Xstream NumberFormatException: Zero length string...

    A likely cause is that some data nodes in the XML file you are testing with have an attribute whose value is empty.
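
For reference, "Zero length string" is the message that the JDK's Integer.decode and Long.decode report for an empty string, which fits an empty attribute value being converted to a number. The sketch below reproduces the message with the JDK alone and shows one possible guard. It does not use XStream, and the helper name decodeOrNull is hypothetical.

```java
// Hypothetical demo: reproduces the error message without XStream and shows a simple guard.
class EmptyAttributeDemo {
    /** Returns null for a missing or blank value instead of letting decode() throw. */
    static Integer decodeOrNull(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        return Integer.decode(value.trim());
    }

    public static void main(String[] args) {
        try {
            Integer.decode(""); // roughly what a numeric conversion of attr="" amounts to
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(decodeOrNull(""));
        System.out.println(decodeOrNull("123"));
    }
}
```

In practice the fix is either to clean the empty attributes out of the data or to map such attributes to a String field and convert them with a guard like this.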
