Removing Illegal Characters from XML

Ever get a System.Xml.XmlException that says:

“Hexadecimal value 0x[whatever] is an invalid character”

…when trying to load an XML document using one of the .NET XML API objects like XmlReader, XmlDocument, or XDocument? Was “0x[whatever]” by chance one of these characters?

0x00
0x01
0x02
0x03
0x04
0x05
0x06
0x07
0x08
0x0B
0x0C
0x0E
0x0F
0x10
0x11
0x12
0x13
0x14
0x15
0x16
0x17
0x18
0x19
0x1A
0x1B
0x1C
0x1D
0x1E
0x1F
0x7F

These XmlExceptions occur because the data being read or loaded contains characters that are illegal according to the XML specifications. Almost always, these characters are in the ASCII control character range (think wacky characters like null, bell, backspace, etc). They have no business being in XML data and should be removed. They usually find their way in through file format conversions, like when someone tries to create an XML file from Excel data, or exports data to XML from a format that may be stored as binary.

The ASCII control characters occupy the decimal range 0 – 31, plus 127; in hex, 0x00 – 0x1F, plus 0x7F. (The control character 0x7F is not disallowed, but its use is “discouraged” to avoid compatibility issues.) If the string or stream holding the XML data contains any of the disallowed control characters, an XmlException will be thrown by whatever System.Xml or System.Xml.Linq class (e.g. XmlReader, XmlDocument, XDocument) is trying to load the data. And if the data contains the bell character ‘\a’ (0x07) and gets echoed to a console along the way, your machine may even beep before the XmlException is thrown.
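To see the behavior described above in isolation, here is a minimal sketch that wraps some text in a root element and reports whether XDocument.Parse rejects it. The class and method names (InvalidCharDemo, ThrowsOnContent) are made up for illustration, and the helper assumes the content contains no markup characters such as ‘<’ or ‘&’.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class InvalidCharDemo
{
    // Returns true if parsing <root>content</root> throws an XmlException,
    // false if the document loads cleanly.
    // Assumes 'content' holds no markup characters such as '<' or '&'.
    public static bool ThrowsOnContent(string content)
    {
        try
        {
            XDocument.Parse("<root>" + content + "</root>");
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }
}
```

Calling InvalidCharDemo.ThrowsOnContent("a\u0001b") should return true, while a tab ("a\tb") or the discouraged-but-legal 0x7F ("a\u007Fb") should load without an exception.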

There are a few exceptions though: the formatting characters ‘\n’, ‘\r’, and ‘\t’ are legal in XML, per the 1.0 and 1.1 specifications, and therefore do not cause this XmlException. Thus, if you’re encountering XML data that causes an XmlException because the data “contains invalid characters”, the feeds you’re processing need to be sanitized: any character that is illegal per the XML 1.0 specification (which is what System.Xml conforms to, not XML 1.1) should be removed. The methods below will accomplish this:

// Requires: using System; using System.Text;

/// <summary>
/// Remove illegal XML characters from a string.
/// </summary>
public string SanitizeXmlString(string xml)
{
    if (xml == null)
    {
        throw new ArgumentNullException("xml");
    }

    StringBuilder buffer = new StringBuilder(xml.Length);

    foreach (char c in xml)
    {
        if (IsLegalXmlChar(c))
        {
            buffer.Append(c);
        }
    }

    return buffer.ToString();
}

/// <summary>
/// Whether a given character is allowed by XML 1.0.
/// </summary>
public bool IsLegalXmlChar(int character)
{
    return
    (
         character == 0x9 /* == '\t' == 9   */          ||
         character == 0xA /* == '\n' == 10  */          ||
         character == 0xD /* == '\r' == 13  */          ||
        (character >= 0x20    && character <= 0xD7FF  ) ||
        (character >= 0xE000  && character <= 0xFFFD  ) ||
        (character >= 0x10000 && character <= 0x10FFFF)
    );
}
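One caveat with the loop above: a C# char is a single UTF-16 code unit, so the last range in IsLegalXmlChar() (0x10000 – 0x10FFFF) can never match when characters are fed in one at a time, and the two halves of a surrogate pair (0xD800 – 0xDFFF) fall into the gap between the ranges. Characters outside the Basic Multilingual Plane, such as emoji, would therefore be stripped along with the control characters. If that matters for your data, a surrogate-aware variant might look like the sketch below. The names XmlSanitizer, Sanitize and IsLegalXmlCodePoint are made up for this example; the ranges are the same ones used above.

```csharp
using System;
using System.Text;

// Hypothetical surrogate-aware variant, for illustration only.
public static class XmlSanitizer
{
    // Same XML 1.0 ranges as IsLegalXmlChar, applied to a full code point.
    public static bool IsLegalXmlCodePoint(int codePoint)
    {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static string Sanitize(string xml)
    {
        if (xml == null)
        {
            throw new ArgumentNullException("xml");
        }

        StringBuilder buffer = new StringBuilder(xml.Length);

        for (int i = 0; i < xml.Length; i++)
        {
            if (char.IsSurrogatePair(xml, i))
            {
                // A well-formed high/low pair always encodes a code point
                // in 0x10000 - 0x10FFFF, which XML 1.0 allows: keep both halves.
                buffer.Append(xml[i]).Append(xml[i + 1]);
                i++;
            }
            else if (IsLegalXmlCodePoint(xml[i]))
            {
                buffer.Append(xml[i]);
            }
            // Otherwise: a control character or a lone surrogate, which is dropped.
        }

        return buffer.ToString();
    }
}
```

With this variant, XmlSanitizer.Sanitize("a\u0001b") yields "ab", while a string containing a complete surrogate pair passes through unchanged.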

Useful as these methods are, don’t go pasting them into your code just anywhere. Create a class instead. Here’s why: let’s say you use the routine to sanitize a string in one section of code, and another section of code later uses that same string. How does the other section know for certain that the string no longer contains any control characters, without checking? It doesn’t. Who knows where that string has been, or whether it has been sanitized, before it reaches a different routine further down the processing pipeline. Program defensively and agnostically. If the sanitized string isn’t a plain string but a distinct type that represents sanitized strings, the type itself guarantees that its contents contain no illegal characters.
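As a rough sketch of that idea, a wrapper type could do the sanitizing in its constructor, so that holding an instance is proof the text has been filtered. The name SanitizedXmlString and its members are assumptions made for this example, not something the article prescribes; the filter is the same per-char check shown earlier.

```csharp
using System;
using System.Text;

// Hypothetical wrapper type; the name and members are illustrative assumptions.
public sealed class SanitizedXmlString
{
    private readonly string sanitized;

    // The constructor is the only way to obtain an instance,
    // so every instance has passed through the filter below.
    public SanitizedXmlString(string xml)
    {
        if (xml == null)
        {
            throw new ArgumentNullException("xml");
        }

        StringBuilder buffer = new StringBuilder(xml.Length);

        foreach (char c in xml)
        {
            if (IsLegalXmlChar(c))
            {
                buffer.Append(c);
            }
        }

        sanitized = buffer.ToString();
    }

    // The sanitized text, safe to hand to the System.Xml loaders.
    public string Value
    {
        get { return sanitized; }
    }

    public override string ToString()
    {
        return sanitized;
    }

    // Whether a given character is allowed by XML 1.0.
    private static bool IsLegalXmlChar(int character)
    {
        return character == 0x9
            || character == 0xA
            || character == 0xD
            || (character >= 0x20 && character <= 0xD7FF)
            || (character >= 0xE000 && character <= 0xFFFD)
            || (character >= 0x10000 && character <= 0x10FFFF);
    }
}
```

A routine further down the pipeline can then take a SanitizedXmlString parameter instead of a string, and the compiler enforces that callers sanitized their input first.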

Now, if the strings that need to be sanitized are being retrieved from a Stream, via a TextReader, for example, we can create a custom StreamReader class that will skip over illegal characters. Let’s say that you’re retrieving XML like so:

string xml;

using (WebClient downloader = new WebClient())
{
    using (TextReader reader =
        new StreamReader(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}

// Do something with xml...

You could use the sanitizing methods above like this:

string xml;

using (WebClient downloader = new WebClient())
{
    using (TextReader reader =
        new StreamReader(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}

// Sanitize the XML

xml = SanitizeXmlString(xml);

// Do something with xml...

But it is much more efficient to create a class that inherits from StreamReader and avoid the costly string-building operation performed by SanitizeXmlString(). The class will need to override a few methods before it’s finished, but once it is, a Stream can be consumed and sanitized like this instead:

string xml;

using (WebClient downloader = new WebClient())
{
    using (XmlSanitizingStream reader =
        new XmlSanitizingStream(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}

// xml contains no illegal characters

The declaration for this XmlSanitizingStream, with IsLegalXmlChar() that we’ll need, looks like:

public class XmlSanitizingStream : StreamReader
{
    // Pass 'true' to automatically detect encoding using BOMs.

    public XmlSanitizingStream(Stream streamToSanitize)
        : base(streamToSanitize, true)
    { }

    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    public static bool IsLegalXmlChar(int character)
    {
        return
        (
             character == 0x9 /* == '\t' == 9   */          ||
             character == 0xA /* == '\n' == 10  */          ||
             character == 0xD /* == '\r' == 13  */          ||
            (character >= 0x20    && character <= 0xD7FF  ) ||
            (character >= 0xE000  && character <= 0xFFFD  ) ||
            (character >= 0x10000 && character <= 0x10FFFF)
        );
    }

    // ...

To get this XmlSanitizingStream working correctly, we’ll first need to override two methods integral to the StreamReader: Peek(), and Read(). The Read method should only return legal XML characters, and Peek() should skip past a character if it’s not legal.

private const int EOF = -1;

public override int Read()
{
    // Read each char, skipping ones XML has prohibited
    int nextCharacter;
    do
    {
        // Read a character
        if ((nextCharacter = base.Read()) == EOF)
        {
            // If the char denotes end of file, stop
            break;
        }
    }
    // Skip char if it's illegal, and try the next
    while (!XmlSanitizingStream.
            IsLegalXmlChar(nextCharacter));
    return nextCharacter;
}

public override int Peek()
{
    // Return next legal XML char w/o reading it
    int nextCharacter;
    do
    {
        // See what the next character is
        nextCharacter = base.Peek();
    }
    while
    (
        // If it's illegal, skip over
        // and try the next.
        !XmlSanitizingStream
        .IsLegalXmlChar(nextCharacter) &&
        (nextCharacter = base.Read()) != EOF
    );
    return nextCharacter;
}

Next, we’ll need to override the other Read* methods (Read(char[], int, int), ReadToEnd, ReadLine, ReadBlock). If they are not overridden, calling them on XmlSanitizingStream will invoke the base StreamReader’s implementations, which pull characters straight from the reader’s internal buffer rather than going through our overridden Peek() and Read(), resulting in unsanitized characters making their way through. The TextReader versions of these methods, by contrast, derive their results from Peek() and Read().

To make life easy and avoid writing these other Read* methods from scratch, we can disassemble the TextReader class using Reflector, and copy its versions of the other Read* methods, without having to change more than a few lines of code related to ArgumentExceptions.
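For a sense of what those overrides amount to, here is a condensed sketch in which the bulk methods are written in terms of the sanitizing Read() and Peek(). It is not the downloadable class, nor a copy of the framework source: the name SanitizingReaderSketch is made up, the argument checks are simplified, and the asynchronous and Span-based overloads found in newer .NET versions are not covered.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical, condensed sketch; not the article's downloadable class.
public class SanitizingReaderSketch : StreamReader
{
    private const int EOF = -1;

    public SanitizingReaderSketch(Stream stream) : base(stream, true) { }

    private static bool IsLegalXmlChar(int c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Return the next legal character, or EOF.
    public override int Read()
    {
        int next;
        do
        {
            next = base.Read();
        }
        while (next != EOF && !IsLegalXmlChar(next));
        return next;
    }

    // Consume illegal characters until a legal one (or EOF) is next.
    public override int Peek()
    {
        int next;
        while ((next = base.Peek()) != EOF && !IsLegalXmlChar(next))
        {
            base.Read();
        }
        return next;
    }

    // Fill the buffer one sanitized character at a time.
    public override int Read(char[] buffer, int index, int count)
    {
        if (buffer == null) throw new ArgumentNullException("buffer");
        if (index < 0) throw new ArgumentOutOfRangeException("index");
        if (count < 0) throw new ArgumentOutOfRangeException("count");
        if (buffer.Length - index < count) throw new ArgumentException("Buffer too small.");

        int written = 0;
        while (written < count)
        {
            int next = Read();
            if (next == EOF) break;
            buffer[index + written++] = (char)next;
        }
        return written;
    }

    public override int ReadBlock(char[] buffer, int index, int count)
    {
        return Read(buffer, index, count);
    }

    public override string ReadToEnd()
    {
        StringBuilder text = new StringBuilder();
        int next;
        while ((next = Read()) != EOF)
        {
            text.Append((char)next);
        }
        return text.ToString();
    }

    public override string ReadLine()
    {
        StringBuilder line = new StringBuilder();
        int next;
        while ((next = Read()) != EOF)
        {
            if (next == '\r' || next == '\n')
            {
                // Treat "\r\n" as a single line terminator.
                if (next == '\r' && Peek() == '\n') Read();
                return line.ToString();
            }
            line.Append((char)next);
        }
        return line.Length > 0 ? line.ToString() : null;
    }
}
```

Reading character by character is slower than StreamReader’s buffered paths, so treat this as a way to see the shape of the overrides rather than a tuned implementation.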

The complete version of XmlSanitizingStream can be downloaded here. Rename the file extension to “.cs” from “.doc” after downloading.
