java xml排序_《代码之美》（五）－－正确、优美、迅速（按重要性排序）：从设计XML验证器中学到的经验...

最新推荐文章于 2022-04-27 21:30:29 发布

小爱酱的万水千山

最新推荐文章于 2022-04-27 21:30:29 发布

阅读量254

点赞数

文章标签： java xml排序

本文链接：https://blog.csdn.net/weixin_32968007/article/details/114461050

版权

Elliotte Rusty Harold

Elliotte参与开发了java上的JDOM库和XOM库。它们都用于处理XML，且是完全独立的，没有共享任何代码，但后者在开发中借鉴了前者的设计思想。

在解析XML时，需要检查它的结构正确性，所以我们要对其进行输入验证和输出验证。本章专门研究对XML1.0的元素名字的验证。

XML名字的BNF语法(有删节)如下：

BaseChar ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

Name ::= (Letter | '_' | ':') (NameChar)*

Letter ::= BaseChar | Ideographic

Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

Digit ::= [#x0030-#x0039] | [#x0660-#x3029] | [#x06F0-#x06F9]

| [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F]

| [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF]

| [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F]

| [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]

Extender ::= #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | 0x0E46 | #x0EC6

| #x3005 | [#x3031-#x3035] | [#x309D-#309E] | [#x30FC-#x30FE]

| [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131]

| [#x0180-#x01C3] ...

CombiningChar ::= [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486]

| [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF

| [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670

| [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4]

| [#x06E7-#x06E8] | [#x06EA-#x06ED] ...

完整的设置规则远不止这些，因为需要考虑90000余个Unicode字符，所以示例特地删减了针对BaseChar和CombiningChar的规则。

下面给出名字验证的第一个版本：

private static String checkName(String name) {

// 不能是空或者null

if (name == null || name.length() == 0 || name.trim().equals(""))

return "XML names cannot be null or empty";

// 开头不能是数字

char first = name.charAt(0);

if (Character.isDigit(first))

return "XML names cannot begin with a number."

// 开头不能是$

if (first == '$')

return "XML names cannot begin with a dollar sign ($)."

// 开头不能是_

if (first == '_')

return "XML names cannot begin with a hyphen (_)."

// 确保有效的内容

for (int i = 0, len = name.length(); i < len; i++) {

char c = name.charAt(i);

if (!Character.isLetterOrDigit(c) && c != '-' && c != '$' && c != '_')

return c + " is not allowed in XML names.";

}

上面的代码使用了Java的Character类来实现验证，但这并不合适，因为它容易造成错误。下面的代码实现了第二个版本，模拟BNF语法：

private static String checkName(String name) {

// 不能是空或者null

if (name == null || name.length() == 0 || name.trim().equals(""))

return "XML names cannot be null or empty";

// 开头不能是数字

char first = name.charAt(0);

if (!isXMLNameStartCharacter(first))

return "XML names cannot begin with the character \"" + first + \".";

// 确保有效的内容

for (int i = 0, len = name.length(); i < len; i++) {

char c = name.charAt(i);

if (!isXMLNameCharacter(c))

return "XML names cannot contain the character \"" + c + "\".";

}

public static boolean isXMLNameCharacter(char c) {

return isXMLLetter(c) || isXMLDigit(c) || c == '.' || c == '-' || c == '_' || c == ':' || isXMLCombiningChar(c) || isXMLExtender(c);

}

public static boolean isXMLNameStartCharacter(char c) {

return isXMLLetter(c) || c == '_' || c == ':';

}

下面仅考虑函数isXMLDigit的实现：

public static boolean isXMLDigit(char c) {

if (c >= 0x0030 && c <= 0x0039) return true;

if (c >= 0x0660 && c <= 0x3029) return true;

if (c >= 0x06F0 && c <= 0x06F9) return true;

if (c >= 0x0966 && c <= 0x096F) return true;

if (c >= 0x09E6 && c <= 0x09EF) return true;

if (c >= 0x0A66 && c <= 0x0A6F) return true;

if (c >= 0x0AE6 && c <= 0x0AEF) return true;

if (c >= 0x0B66 && c <= 0x0B6F) return true;

if (c >= 0x0BE7 && c <= 0x0BEF) return true;

if (c >= 0x0C66 && c <= 0x0C6F) return true;

if (c >= 0x0CE6 && c <= 0x0CEF) return true;

if (c >= 0x0D66 && c <= 0x0D6F) return true;

if (c >= 0x0E50 && c <= 0x0E59) return true;

if (c >= 0x0ED0 && c <= 0x0ED9) return true;

if (c >= 0x0F20 && c <= 0x0F29) return true;

}

上面的代码非常清晰，但是性能却不算太好，为此Jason Hunter进行了优化：

public static boolean isXMLDigit(char c) {

if (c < 0x0030) return false; if (c <= 0x0039) return true;

if (c < 0x0660) reutrn false; if (c <= 0x3029) return true;

if (c < 0x06F0) reutrn false; if (c <= 0x06F9) return true;

if (c < 0x0966) reutrn false; if (c <= 0x096F) return true;

if (c < 0x09E6) reutrn false; if (c <= 0x09EF) return true;

if (c < 0x0A66) reutrn false; if (c <= 0x0A6F) return true;

if (c < 0x0AE6) reutrn false; if (c <= 0x0AEF) return true;

if (c < 0x0B66) reutrn false; if (c <= 0x0B6F) return true;

if (c < 0x0BE7) reutrn false; if (c <= 0x0BEF) return true;

if (c < 0x0C66) reutrn false; if (c <= 0x0C6F) return true;

if (c < 0x0CE6) reutrn false; if (c <= 0x0CEF) return true;

if (c < 0x0D66) reutrn false; if (c <= 0x0D6F) return true;

if (c < 0x0E50) reutrn false; if (c <= 0x0E59) return true;

if (c < 0x0ED0) reutrn false; if (c <= 0x0ED9) return true;

if (c < 0x0F20) reutrn false; if (c <= 0x0F29) return true;

}

这种优化可以避免很多不必要的比较。

在JDOM中，构造函数总是会检测正确性，而对于一些解析器已经读取并验证过的字符串，重复的检测显得多余。为此，可以提供两种构造函数：一种给解析器使

用，只在需要时才进行验证；另一种给其他调用者使用，必须进行验证。JDOM开发者考虑过这种优化，但是JDOM有一个很大的缺陷：用于解析的代码和

XML核心对象被分割到了org.jdom.input和org.jdom两个包中，这样就无法提供一个只对解析器可见的构造函数。

为此，Elliote

提出：

Java代码中，把包分割得过细是一个常见的违反模式的设计。这个设计要么把不应该开放的方法设置为公有的，要么只提供少量的功能，而这两种选择都是不好的。

Elliote在XOM的开发中，吸取了JDOM中的经验与教训，把处理输入的类和处理核心节点的类放在一个包中，并为每个节点类都提供一个私有的不含验证的构造函数。

对于前面判断是否为数字的代码，我们可以做进一步优化。我们可以利用空间来换取时间，最直接的方法就是使用java语言中的switch语句生成一张数值表，使得查询操作的代价为O(1)：

private static boolean isHexDigit(char c) {

case '0' : return true;

case '1' : return true;

...

case ':' : return false;

case ';' : return false;

...

case 'A' : return true;

case 'B' : return true;

...

}

上面的switch语句可能牵涉到几万个case语句，但java规定一个方法的字节数量最大不能超过64KB，所以我们需要另外一种解决方案。

我们可以把这张巨大的表存为二进制的格式，在基本多语言平面(Basic Multilingual

Plane，BMP)的65536个Unicode中，每一个编码点都对应了一个字节。

该字节的8个比特位用来确定最重要的字符属性。例如，当字符是合法的PCDATA内容时，比特位1被置位，否则复位；当字符可以用于XML名字中时，比特

位2被置位，否则复位；当字符可以作为XML名字的首字符时，比特位3被置位，否则复位。

在程序中可以把这个表载入到内存中，这样便可改写上面的验证函数(flags为载入的表)：

private final static byte XML_CHARACTER = 1;

private final static byte NAME_CHARACTER = 2;

private final static byte NAME_START_CHARACTER = 4;

private final static byte NCNAME_CHARACTER = 8;

static void checkNCName(String name) {

if (name == null)

throwIllegalNameException(name, "NCNames cannot be null");

int length = name.length();

if (length == 0)

thrwoIllegalNameException(name, "NCNames cannot be empty");

char first = name.charAt(0);

if ((flags[first] & NAME_START_CHARACTER) == 0)

throwIllegalNameException(name, "NCNames cannot start with the character " + Integer.toHexString(first));

for (int i = 1; i < length; i++) {

char c = name.charAt(i);

if ((flags[c] & NCNAME_CHARACTER) == 0) {

if (c == ':')

throwIllegalNameException(name, "NCNames cannot contain colons");

else

throwIllegalNameException(name, "0x" + Integer.toHexString(c) + " is not a legal NCName character");

}

如果无法让验证器运行的更快一点，那么就尽量少做些验证工作。Wolfgang

Hoscheck注意到XML文档中同样的名字重复出现，比如在XHTML文件中p, table, div, span,

strong等元素反复出现，这样我们可以在核实某个名字是合法之后，便把它存放在某处，下次便不再重复检查该名字。但是使用这种缓存机制时要特别小心：

在一个集合中寻找某些名字(尤其是短名字)或许会比重新验证它们占用更多的时间。此外，如果集合需要在线程之间共享，那么线程竞争也可能成为一个严重的问

题。

为此，Elliotte仅对用于命名空间的URI缓存，因为与元素名字相比，命名空间的URI验证消耗了更昂贵的开销，并且重复出现的频率更高：

private final static class URICache {

private final static int LOAD = 6;

private String[] cache = new String[LOAD];

private int position = 0;

synchronized boolean contains(String s) {

for (int i = 0; i < LOAD; i++) {

if (s == cache[i])

return true;

}

return false;

}

synchronized void put(String s) {

cache[position] = s;

position++;

if (position == LOAD) position = 0;

}

上面的代码实现了一个固定大小的URI缓存。它没有使用哈希映射或哈希表，而是使用固定大小的数组以及线性搜索。对于小的列表来说，哈希查找要比简单的数

组遍历更慢。此外，多线程同时解析一个文件，固定大小的数组可能会过小，一种可能的解决方案是将缓存作为每个线程的局部变量。

Elliotte对以上经验的总结：

不要让性能的考虑妨碍你做正确的事情。使优美但速度较慢的代码变得更快，总比使快速但难看的代码变得更优美要更加容易。

Elliotte Rusty Harold：Java开源库XOM的创始者，并著有多本关于Java和XML的著作。代表作有：《Effective XML》和《Java Secrets》等。

小爱酱的万水千山

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫