遇到的问题
本人需要解析Stack Overflow的dump文件(xml格式),将其中的数据存入数据库,其中关于Stack Overflow帖子(Posts)的xml文件超过了60G。
那么如何解析那么大的xml文件呢(Stack Overflow上有解决方案-链接)?
解决方案
或许你已经想到了分块读取,然后解析。那么如何分块解析呢?Java中处理xml文件有两种处理方案(DOM和Event Driven)。DOM需要将文件全部读取,在内存中构建DOM树,显然这种方案不行;只能选择基于事件的方式。本人选择了StAX,或许用SAX的人比较多,那么SAX和StAX的区别是什么呢?这里有说明:SAX vs StAX。
StAX的学习教程可以在:http://tutorials.jenkov.com/java-xml/index.html 和 https://docs.oracle.com/javase/tutorial/jaxp/ 中寻找。
你需要了解的是:什么是事件驱动。(懂得这个,也就懂得了StAX为什么是分块处理了)
代码实现
Posts.xml预览(使用IntelliJ IDEA可以预览超大xml文件)
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="573" ViewCount="37080" Body="" OwnerUserId="8" LastEditorUserId="6786713" LastEditorDisplayName="Rich B" LastEditDate="2018-07-02T17:55:27.247" LastActivityDate="2018-07-02T17:55:27.247" Title="Convert Decimal to Double?" Tags="<c#><floating-point><type-conversion><double><decimal>" AnswerCount="13" CommentCount="1" FavoriteCount="41" CommunityOwnedDate="2012-10-31T16:42:47.213" />
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="256" ViewCount="16306" Body="" OwnerUserId="9" LastEditorUserId="63550" LastEditorDisplayName="Rich B" LastEditDate="2016-03-19T06:05:48.487" LastActivityDate="2016-03-19T06:10:52.170" Title="" Tags="<html><css><css3><internet-explorer-7>" AnswerCount="5" CommentCount="0" FavoriteCount="10" />
<row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="401" Body="" OwnerUserId="9" LastEditorUserId="4020527" LastEditDate="2017-12-16T05:06:57.613" LastActivityDate="2017-12-16T05:06:57.613" CommentCount="0" />
<row Id="9" PostTypeId="1" AcceptedAnswerId="1404" CreationDate="2008-07-31T23:40:59.743" Score="1743" ViewCount="480476" Body="" OwnerUserId="1" LastEditorUserId="3956566" LastEditorDisplayName="Rich B" LastEditDate="2018-04-21T17:48:14.477" LastActivityDate="2018-07-25T11:57:14.110" Title="How do I calculate someone's age in C#?" Tags="<c#><.net><datetime>" AnswerCount="64" CommentCount="7" FavoriteCount="399" CommunityOwnedDate="2011-08-16T19:40:43.080" />
<row Id="11" PostTypeId="1" AcceptedAnswerId="1248" CreationDate="2008-07-31T23:55:37.967" Score="1348" ViewCount="136033" Body="" OwnerUserId="1" LastEditorUserId="6479704" LastEditorDisplayName="user2370523" LastEditDate="2017-06-04T15:51:19.780" LastActivityDate="2018-07-05T04:00:56.633" Title="Calculate relative time in C#" Tags="<c#><datetime><time><datediff><relative-time-span>" AnswerCount="35" CommentCount="3" FavoriteCount="529" CommunityOwnedDate="2009-09-04T13:15:59.820" />
<row Id="12" PostTypeId="2" ParentId="11" CreationDate="2008-07-31T23:56:41.303" Score="320" Body="" OwnerUserId="1" LastEditorUserId="1271898" LastEditorDisplayName="GateKiller" LastEditDate="2018-01-12T16:10:22.637" LastActivityDate="2018-01-12T16:10:22.637" CommentCount="11" CommunityOwnedDate="2009-09-04T13:15:59.820" />
<row Id="13" PostTypeId="1" CreationDate="2008-08-01T00:42:38.903" Score="539" ViewCount="157009" Body="" OwnerUserId="9" LastEditorUserId="5321363" LastEditorDisplayName="Rich B" LastEditDate="2018-05-30T15:55:48.913" LastActivityDate="2018-05-30T15:56:46.080" Title="Determine a User's Timezone" Tags="<javascript><html><browser><timezone><timezoneoffset>" AnswerCount="25" CommentCount="6" FavoriteCount="137" />
</posts>
数据库设计
帖子(Post)可以分为问题和回答两种(通过属性PostTypeId的值确定),因为数据量很大(4200万行左右),所以我设计了两张表。
-- Questions: rows whose PostTypeId = 1.
-- Columns kept from the dump: Id CreationDate Score ViewCount OwnerUserId Tags AnswerCount FavoriteCount
CREATE TABLE `questions` (
  `Id` int,
  `CreationDate` datetime,
  `Score` int,
  `ViewCount` int,
  `OwnerUserId` int,
  `Tags` varchar(250),
  `AnswerCount` int,
  `FavoriteCount` int,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Answers: rows whose PostTypeId = 2.
-- Columns kept from the dump: Id ParentId CreationDate Score OwnerUserId CommentCount
CREATE TABLE `answers` (
  `Id` int,
  `ParentId` int,
  `CreationDate` datetime,
  `Score` int,
  `OwnerUserId` int,
  `CommentCount` int,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Java代码
import com.mysql.jdbc.PreparedStatement;

import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Date;
/**
* Desc: Parse & Import(to mysql db) Stack Overflow dump xml file
* Created by Myth on 10/12/2018
*/
public class XmlProcessor {

    /** Shared JDBC connection; opened by {@link #openMysqlConnection()}. */
    private Connection connection = null;

    /**
     * Opens the MySQL connection used by the import methods.
     *
     * @throws ClassNotFoundException if the MySQL driver is not on the classpath
     * @throws SQLException           if the connection cannot be established
     */
    public void openMysqlConnection() throws ClassNotFoundException, SQLException {
        String driver = "com.mysql.jdbc.Driver";
        String url = "jdbc:mysql://localhost:3306/stackoverflow";
        String username = "root";
        String password = "123456";
        Class.forName(driver);
        this.connection = DriverManager.getConnection(url, username, password);
    }

    /**
     * Closes the connection opened by {@link #openMysqlConnection()}.
     *
     * @throws SQLException if closing fails
     */
    public void closeConnection() throws SQLException {
        this.connection.close();
    }

    /**
     * Renders an XML attribute value as a MySQL literal.
     * A missing attribute (null) becomes SQL {@code NULL} — the original code
     * concatenated the string {@code "null"} into quoted columns instead.
     *
     * @param value  raw attribute value, may be null
     * @param quoted true for string columns: the value is quoted, with
     *               backslashes and double quotes escaped so content such as
     *               Tags cannot break the generated statement
     * @return a literal safe to embed in a multi-row INSERT
     */
    static String sqlLiteral(String value, boolean quoted) {
        if (value == null) {
            return "NULL";
        }
        if (!quoted) {
            return value;
        }
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Builds one VALUES tuple for the questions table:
     * (Id, CreationDate, Score, ViewCount, OwnerUserId, Tags, AnswerCount, FavoriteCount).
     */
    static String questionTuple(String id, String creationDate, String score, String viewCount,
                                String ownerUserId, String tags, String answerCount,
                                String favoriteCount) {
        return "(" + sqlLiteral(id, false) + "," + sqlLiteral(creationDate, true) + ","
                + sqlLiteral(score, false) + "," + sqlLiteral(viewCount, false) + ","
                + sqlLiteral(ownerUserId, false) + "," + sqlLiteral(tags, true) + ","
                + sqlLiteral(answerCount, false) + "," + sqlLiteral(favoriteCount, false) + ")";
    }

    /**
     * Builds one VALUES tuple for the answers table:
     * (Id, ParentId, CreationDate, Score, OwnerUserId, CommentCount).
     */
    static String answerTuple(String id, String parentId, String creationDate, String score,
                              String ownerUserId, String commentCount) {
        return "(" + sqlLiteral(id, false) + "," + sqlLiteral(parentId, false) + ","
                + sqlLiteral(creationDate, true) + "," + sqlLiteral(score, false) + ","
                + sqlLiteral(ownerUserId, false) + "," + sqlLiteral(commentCount, false) + ")";
    }

    /**
     * Flushes the accumulated tuples as one multi-row INSERT and commits, then
     * clears the buffer. Does nothing for an empty buffer — this fixes a
     * StringIndexOutOfBoundsException in the original code, which called
     * substring(0, length - 1) even when a commit point fell while one of the
     * two buffers held no rows.
     */
    private void flush(Statement statement, String insertPrefix, StringBuilder tuples)
            throws SQLException {
        if (tuples.length() == 0) {
            return;
        }
        // Drop the trailing ',' appended after the last tuple.
        statement.executeUpdate(insertPrefix + tuples.substring(0, tuples.length() - 1));
        this.connection.commit();
        tuples.setLength(0);
    }

    /**
     * Streams a Stack Overflow Posts.xml dump with StAX (so the 60G+ file is
     * never fully loaded) and bulk-inserts each row into the questions or
     * answers table depending on PostTypeId, committing every
     * {@code commitCount} rows.
     *
     * @param filePath    path of the Posts.xml dump file
     * @param commitCount number of rows per transaction commit
     * @throws SQLException          on any database error
     * @throws FileNotFoundException if {@code filePath} does not exist
     * @throws XMLStreamException    on malformed XML or when a JAXP limit is hit
     */
    public void parsePosts(String filePath, int commitCount)
            throws SQLException, FileNotFoundException, XMLStreamException {
        long begin = System.currentTimeMillis();
        String prefixQuestions = "INSERT INTO questions VALUES ";
        String prefixAnswers = "INSERT INTO answers VALUES ";
        StringBuilder suffixQuestions = new StringBuilder();
        StringBuilder suffixAnswers = new StringBuilder();
        // Commit manually so each batch is one transaction.
        this.connection.setAutoCommit(false);

        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty("http://www.oracle.com/xml/jaxp/properties/getEntityCountInfo", "yes");
        // Raise the JDK's accumulated-entity-size limit; with the default, parsing
        // a dump this large aborts with error JAXP00010004.
        inputFactory.setProperty("http://www.oracle.com/xml/jaxp/properties/totalEntitySizeLimit", Integer.MAX_VALUE);

        InputStream input = new FileInputStream(new File(filePath));
        XMLStreamReader reader = inputFactory.createXMLStreamReader(input);
        Statement statement = null;
        int countRow = 0;
        try {
            statement = this.connection.createStatement();
            while (reader.hasNext()) {
                reader.next();
                if (reader.getEventType() != XMLStreamReader.START_ELEMENT
                        || !"row".equals(reader.getLocalName())) {
                    continue;
                }
                // PostTypeId: 1 = question, anything else = answer.
                String postTypeId = reader.getAttributeValue(null, "PostTypeId");
                if ("1".equals(postTypeId)) {
                    suffixQuestions.append(questionTuple(
                            reader.getAttributeValue(null, "Id"),
                            reader.getAttributeValue(null, "CreationDate"),
                            reader.getAttributeValue(null, "Score"),
                            reader.getAttributeValue(null, "ViewCount"),
                            reader.getAttributeValue(null, "OwnerUserId"),
                            reader.getAttributeValue(null, "Tags"),
                            reader.getAttributeValue(null, "AnswerCount"),
                            reader.getAttributeValue(null, "FavoriteCount"))).append(',');
                } else {
                    suffixAnswers.append(answerTuple(
                            reader.getAttributeValue(null, "Id"),
                            reader.getAttributeValue(null, "ParentId"),
                            reader.getAttributeValue(null, "CreationDate"),
                            reader.getAttributeValue(null, "Score"),
                            reader.getAttributeValue(null, "OwnerUserId"),
                            reader.getAttributeValue(null, "CommentCount"))).append(',');
                }
                countRow++;
                if (countRow % commitCount == 0) {
                    flush(statement, prefixQuestions, suffixQuestions);
                    flush(statement, prefixAnswers, suffixAnswers);
                    System.out.println("Committed: " + countRow + " √");
                }
            }
            // Flush whatever remains after the last full batch.
            flush(statement, prefixQuestions, suffixQuestions);
            flush(statement, prefixAnswers, suffixAnswers);
            System.out.println("Committed All: " + countRow + " √");
        } finally {
            // Close everything even on failure; the original leaked all three.
            try {
                reader.close();
            } catch (XMLStreamException ignored) {
                // best effort — the import itself already succeeded or failed above
            }
            try {
                input.close();
            } catch (IOException ignored) {
                // best effort
            }
            if (statement != null) {
                statement.close();
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Cost: " + (end - begin) / 1000 + " s");
    }
}
大约需要10多分钟就可以将全部数据(4200多万行)导入。
总结
Bug
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1077158,4084]
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".
原因:Java的xml解析器自带limit(官方链接),用来控制读入的大小、控制内存等。因为要处理的xml文件过大,超过了默认的limit,所以在Java代码中(inputFactory.setProperty设置totalEntitySizeLimit处)调大了limit。但即使设置为最大值,也只能勉强处理4000多万条数据;如果超过这个数量,这个方法就行不通了,可以先将超大的xml切分成几个xml文件,然后按上述方式分别处理。
其他问题
使用batch 插入,可以提高效率
使用 XMLStreamReader 而不用 XMLEventReader(区别见官方文档、教程)提高效率