Java and XML, 3rd Edition (translation): Chapter 1. Introduction

Article link: https://blog.csdn.net/f5love1989/article/details/6514706

In the next two chapters, I'm going to give you a crash course in XML and constraints. Since there is so much material available on XML and related specifications, I'd rather cruise through this material quickly and get on to Java. For those of you who are completely new to XML, you might want to have a few of the following books around as reference:

XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means

Learning XML, by Erik Ray

Learning XSLT, by Michael Fitzgerald

XSLT, by Doug Tidwell

These are all O'Reilly books, and I have them scattered about my own workspace. With that said, let's dive in.

1.1. XML 1.0

It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 1-1 shows an XML document that conforms to this specification. I'll use it to illustrate several important concepts.

Example 1-1. A typical XML document is long and verbose

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <!--Generated by Blogger v5.0-->
  <channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">
    <title>Neil Gaiman's Journal</title>
    <link>http://www.neilgaiman.com/journal/journal.asp</link>
    <description>Neil Gaiman's Journal</description>
    <dc:date>2005-04-30T01:57:38Z</dc:date>
    <dc:language>en-US</dc:language>
    <admin:generatorAgent rdf:resource="http://www.blogger.com/" />
    <admin:errorReportsTo rdf:resource="mailto:rss-errors@blogger.com" />
    <items>
      <rdf:Seq>
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp" />
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/jetlag-morning.asp" />
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/demon-days.asp" />
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/more-from-mailbag.asp" />
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/two-days.asp" />
        <rdf:li
            rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
      </rdf:Seq>
    </items>
  </channel>

  <!-- and so on... -->
</rdf:RDF>

For those of you who are curious, this is the RSS feed for Neil Gaiman's blog (http://www.neilgaiman.com). It uses a lot of RSS syntax, which I'll cover in Chapter 12 in detail.

A lot of this specification describes what is mostly intuitive. If you've done any HTML or SGML authoring, you're already familiar with the concept of elements (such as items and channel in Example 1-1) and attributes (such as resource and about). XML defines how to use these constructs and how a document must be structured. XML spends more time defining tricky issues like whitespace than introducing any concepts that you're not at least somewhat familiar with. One exception may be that some of the elements in Example 1-1 are in the form:

[prefix]:[element name]

Such as rdf:li. These are elements in an XML namespace, something I'll explain in detail shortly.

An XML document can be broken into two basic pieces: the header, which gives an XML parser and XML applications information about how to handle the document, and the content, which is the XML data itself. Although this is a fairly loose division, it helps us differentiate the instructions to applications within an XML document from the XML content itself, and is an important distinction to understand. The header is simply the XML declaration, in this format:

<?xml version="1.0" encoding="UTF-8"?>

This header includes an encoding, and can also indicate whether the document is a standalone document or requires other documents to be referenced for a complete understanding of its meaning:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
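
To see how a parser exposes the declaration, here is a minimal sketch using the standard JAXP DOM API. The class name XmlDeclInfo, the describe method, and the sample note document are made up for this illustration; the sketch assumes the document arrives as bytes, so the declared encoding is visible to the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: reports what the XML declaration of a document said.
final class XmlDeclInfo {
    // Sample input invented for this sketch.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><note/>";

    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // DOM Level 3 accessors for the three pseudo-attributes of the declaration.
            return "version=" + doc.getXmlVersion()
                    + " encoding=" + doc.getXmlEncoding()
                    + " standalone=" + doc.getXmlStandalone();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE));
    }
}
```

When the standalone pseudo-attribute is omitted, getXmlStandalone() reports false, which matches the standalone="no" form shown above.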

The rest of the header is made up of items like the DOCTYPE declaration (not included in the example):

<!DOCTYPE RDF SYSTEM "DTDs/RDF-gaiman.dtd">

In this case, the declaration refers to a file on the local system, in the directory DTDs/ called RDF-gaiman.dtd. Any time you use a relative or absolute file path or a URL, you want to use the SYSTEM keyword. The other option is using the PUBLIC keyword, and following it with a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. As an example, take the DTD statement for XHTML 1.0:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Here, a public identifier is supplied (the funny little string starting with -//), followed by a system identifier (the URL). If the public identifier cannot be resolved, the system identifier is used instead.
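
As a rough illustration of the two identifiers, the following sketch reads them back through the DOM. The class DoctypeInfo and its method are hypothetical names. The sketch also assumes you do not want the parser to fetch the real DTD, so it installs an EntityResolver that hands back an empty one; a real application would more likely resolve the public identifier to a local copy.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the DOCTYPE name, public ID, and system ID.
final class DoctypeInfo {
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"><html/>";

    static String[] identifiers(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption for this sketch: substitute an empty DTD so nothing is
            // loaded from disk or the network while parsing.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            DocumentType doctype =
                builder.parse(new InputSource(new StringReader(xml))).getDoctype();
            return new String[] {
                doctype.getName(), doctype.getPublicId(), doctype.getSystemId()
            };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        for (String part : identifiers(SAMPLE)) {
            System.out.println(part);
        }
    }
}
```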

You may also see processing instructions at the top of a file, and they are generally considered part of a document's header, rather than its content. They look like this:

<?xml-stylesheet href="XSL/JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL/JavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>

Each is considered to have a target (the first word, like xml-stylesheet or cocoon-process) and data (the rest). Often, the data is in the form of name-value pairs, which can really help readability. This is only a good practice, though, and not required, so don't depend on it.
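
The target/data split is exactly what a SAX parser reports. The sketch below is only an illustration: PiLister and its sample document are invented names, and the handler simply records each processing instruction it is told about. Note that the data arrives as one raw string; any name-value structure is up to you to parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects "target -> data" for every processing instruction.
final class PiLister {
    static final String SAMPLE =
        "<?xml-stylesheet href=\"XSL/JavaXML.html.xsl\" type=\"text/xsl\"?>"
        + "<?cocoon-process type=\"xslt\"?>"
        + "<doc/>";

    static List<String> list(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                // The parser hands over the first word and the remainder separately.
                found.add(target + " -> " + data);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return found;
    }

    public static void main(String[] args) {
        list(SAMPLE).forEach(System.out::println);
    }
}
```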

Other than that, the bulk of your XML document should be content; in other words, elements, attributes, and data that you have put into it.

1.1.1. The Root Element

The root element is the highest-level element in the XML document, and must be the first opening tag and the last closing tag within the document. It provides a reference point that enables an XML parser or XML-aware application to recognize a beginning and end to an XML document. In Example 1-1, the root element is RDF (written with its namespace prefix as rdf:RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:dc="http://purl.org/dc/elements/1.1/" 
         xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" 
         xmlns:l="http://purl.org/rss/1.0/modules/link/" 
         xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <!-- Document content -->
</rdf:RDF>

This tag and its matching closing tag surround all other data content within the XML document. XML specifies that there may be only one root element in a document. In other words, the root element must enclose all other elements within the document. Aside from this requirement, a root element does not differ from any other XML element. It's important to understand this, because XML documents can reference and include other XML documents. In these cases, the root element of the referenced document becomes an enclosed element in the referring document and must be handled normally by an XML parser. Defining root elements as standard XML elements without special properties or behavior allows document inclusion to work seamlessly.
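
From Java, the root element is what the DOM calls the document element. The following is a small sketch, with RootElementDemo and its shortened sample document being names made up here; it turns on namespace awareness so the prefix and local name can be read separately.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the root (document) element of an XML string.
final class RootElementDemo {
    // A trimmed-down stand-in for Example 1-1, invented for this sketch.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" "
        + "xmlns=\"http://purl.org/rss/1.0/\"><channel/></rdf:RDF>";

    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getLocalName() and getPrefix()
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Element root = root(SAMPLE);
        System.out.println(root.getNodeName());  // qualified name, prefix included
        System.out.println(root.getLocalName()); // name without the prefix
    }
}
```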

1.1.2. Elements

So far, I have glossed over defining an actual element. Let's take an in-depth look at elements, which are represented by arbitrary names and must be enclosed in angle brackets. There are several different variations of elements in the sample document, as shown here:

  <!-- Standard element opening tag -->
  <items>

  <!-- Standard element with attribute -->
  <rdf:li 
    rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp">

  <!-- Element with textual data -->
  <dc:creator>Neil Gaiman</dc:creator>

  <!-- Empty element -->
  <l:permalink l:type="text/html" 
      rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp"
  />

  <!-- Standard element closing tag -->
  </items>

This isn't actual XML; it's just a collection of examples. Trying to parse something like this would fail, as there are opening tags without corresponding closing tags.

The first rule in creating elements is that their names must start with a letter or underscore, and then may contain any number of letters, numbers, underscores, hyphens, or periods. They may not contain embedded spaces:

<!-- Embedded spaces are not allowed -->
<my element name>

XML element names are also case-sensitive. Generally, using the same rules that govern Java variable naming will result in sound XML element naming. Using an element named tcbo to represent Telecommunications Business Object is not a good idea because it is cryptic, while an overly verbose tag name like beginningOfNewChapter just clutters up a document. Keep in mind that your XML documents will probably be seen by other developers and content authors, so clear documentation through good naming is essential.
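
The naming rule described above can be approximated with a regular expression. This is a deliberately simplified sketch: the class ElementNames is a made-up name, and the pattern covers only ASCII letters and digits, whereas the XML 1.0 specification also permits many non-ASCII characters and treats the colon as a namespace separator.

```java
import java.util.regex.Pattern;

// Hypothetical helper: ASCII-only approximation of the element naming rule.
final class ElementNames {
    // First character: letter or underscore.
    // Remaining characters: letters, digits, underscores, periods, or hyphens.
    private static final Pattern SIMPLE_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_.-]*");

    static boolean looksValid(String name) {
        return SIMPLE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksValid("items"));           // true
        System.out.println(looksValid("my element name")); // false: embedded spaces
        System.out.println(looksValid("2ndItem"));         // false: starts with a digit
    }
}
```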

Every opened element must in turn be closed. Unlike many other markup languages, such as HTML, XML allows no exceptions to this rule. An ending element tag consists of a forward slash and then the element name: </items>. Between an opening and closing tag, there can be any number of additional elements or textual data. However, you cannot mix the order of nested tags; the first opened element must always be the last closed element.

If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. A well-formed document is one in which all XML syntax rules are followed, and all elements and attributes are correctly positioned. However, a well-formed document is not necessarily valid; a valid document is one that also follows the constraints set upon it by its DTD or schema. There is a significant difference between a well-formed document and a valid one; the rules I discuss in this section ensure that your document is well-formed, while the rules discussed in Chapter 2 ensure that your document is valid.

As an example of a document that is not well-formed, consider this XML fragment:

<tag1>
 <tag2>
</tag1>

The order of nesting of tags is incorrect, as the opened <tag2> is not followed by a closing </tag2> within the surrounding tag1 element. However, even if these syntax errors are corrected, there is still no guarantee that the document will be valid.

While this example of a document that is not well-formed may seem trivial, remember that this would be acceptable HTML, and commonly occurs in large tables within an HTML document. In other words, HTML and many other markup languages do not require documents to be well-formed. XML's strict adherence to ordering and nesting rules allows data to be parsed and handled much more quickly than when using markup languages without these constraints.
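
A quick way to check well-formedness from Java is simply to parse the text and see whether the parser complains. The sketch below assumes a non-validating SAX parser from the standard JAXP API; the class name WellFormedCheck is invented for this illustration. It says nothing about validity, which needs a DTD or schema.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: true if the text parses as well-formed XML.
final class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content but rethrows fatal errors,
            // which is how a well-formedness violation surfaces.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<tag1><tag2></tag1>"));        // false
        System.out.println(isWellFormed("<tag1><tag2></tag2></tag1>")); // true
    }
}
```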

The last rule I'll look at is the case of empty elements. I already said that XML tags must always be paired; an opening tag and a closing tag constitute a complete XML element. There are cases where an element is used purely by itself, like a flag stating a chapter is incomplete, or where an element has attributes but no textual data, like an image declaration in HTML. These would have to be represented as:

<admin:generatorAgent rdf:resource="http://www.blogger.com/"></admin:generatorAgent>
<img src="/images/xml.gif"></img>

This is obviously a bit silly, and adds clutter to what can often be very large XML documents. The XML specification provides a means to signify both an opening and closing element tag within one element:

<admin:generatorAgent rdf:resource="http://www.blogger.com/" />
<img src="/images/xml.gif" />

What's with the Space Before the End Slash?

Well, let me tell you. I've had the unfortunate pleasure of working with Java and XML since late 1998, when things were rough at best. And some web browsers at that time (and some today, to be honest) would only accept XHTML (HTML that is well-formed) in very specific formats. Most notably, tags like <br> that are never closed in HTML must be closed in XHTML, resulting in <br/>. Some of these browsers would completely ignore a tag like this; however, oddly enough, they would happily process <br /> (note the space before the end slash). I got used to making my XML not only well-formed, but consumable by these browsers. I've never had a good reason to change these habits, so you get to see them in action here.

This nicely solves the problem of unnecessary clutter, and still follows the rule that every XML element must have a matching end tag; it simply consolidates both start and end tag into a single tag.
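
To a parser, the paired form and the consolidated form of an empty element are the same thing. Here is a minimal sketch that compares the two with the DOM's isEqualNode; the class EmptyElementDemo and its method name are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: checks whether two XML strings have equal root elements.
final class EmptyElementDemo {
    private static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    static boolean sameElement(String first, String second) {
        try {
            // isEqualNode compares names, attributes, and children, not object identity.
            return parseRoot(first).isEqualNode(parseRoot(second));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameElement(
            "<img src=\"/images/xml.gif\"></img>",
            "<img src=\"/images/xml.gif\" />")); // true
        // A space between the tags is character data, so this pair differs.
        System.out.println(sameElement("<flag></flag>", "<flag> </flag>")); // false
    }
}
```

The second comparison is a reminder that an element is only empty if nothing at all, not even whitespace, sits between its opening and closing tags.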