What you need to know about... Java and XML
by Aaron Elkiss
Introduction
XML stands for eXtensible Markup Language. It is a structured document format that lets us, among other things, represent semistructured data in a portable fashion. That means that (at least in theory) we can use a single toolkit (an XML parser) to extract structure and content from a document instead of having to adapt to many different ad hoc conventions for text and binary data. XML alone doesn't give us any clue as to the meaning of a document. There are many ways of encoding various layers of meaning using XML, ranging from XML Schema to RDF and even Cascading Style Sheets, but they aren't covered here.
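To make the "single toolkit" idea concrete, here is a minimal sketch that uses the DOM parser bundled with the JDK (`javax.xml.parsers`) to pull the text out of every `title` element in a small document. The class name `ParseSketch`, the `titles` helper, and the sample `library`/`book`/`title` document are made up for illustration; they are assumptions, not part of any real data set.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class ParseSketch {
    // Returns the text content of every <title> element, in document order.
    static List<String> titles(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Parse from an in-memory string; a file or stream works the same way.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document.
        String xml = "<library>"
                + "<book><title>First</title></book>"
                + "<book><title>Second</title></book>"
                + "</library>";
        System.out.println(titles(xml)); // prints [First, Second]
    }
}
```

The parser does all the work of finding element boundaries. The code only needs to know which element names it cares about.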
XML files can have associated Document Type Definitions (DTDs), which explicitly specify the allowed structure. DTDs are mainly used by validating parsers, which check that a document conforms to its DTD. In this tutorial, we'll assume that there is no DTD for our XML documents but that we know the structure of the documents we're working with ahead of time.
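For readers curious about what a validating parser actually does, the following sketch parses the same document twice with the JDK's DOM parser, once with validation switched off and once with it on, and collects any validity errors. The class name `ValidationSketch`, the `parseAndCollectErrors` helper, and the tiny `note` DTD are hypothetical and exist only for this illustration. The rest of the tutorial does not rely on DTDs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class ValidationSketch {
    // Parses the document and returns the messages of all recoverable
    // errors reported by the parser (validity errors are reported this way).
    static List<String> parseAndCollectErrors(String xml, boolean validating)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // DTD validation on or off
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored here */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // The made-up DTD says <note> must contain exactly one <to>,
        // but the document contains an undeclared <from> instead.
        String xml = "<!DOCTYPE note ["
                + "<!ELEMENT note (to)>"
                + "<!ELEMENT to (#PCDATA)>"
                + "]>"
                + "<note><from>someone</from></note>";
        System.out.println("non-validating: " + parseAndCollectErrors(xml, false));
        System.out.println("validating:     " + parseAndCollectErrors(xml, true));
    }
}
```

The document is well-formed, so the non-validating parse reports nothing. Only the validating parse notices that the content does not match the DTD.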
Java is one of many modern languages that support the manipulation of XML. One crucial component of that support is Java's handling of Unicode. In this article you will learn how to work effectively with XML and Unicode in Java. The focus is on short real-world examples that will let you get started working with Java and XML right away. The presentation is geared towards those who need to work with XML and Unicode but don't want to spend a lot of time learning all the intricacies.
In addition, I include several suggestions on how to handle malformed data. This is a frequent scenario in many domains, including my field, natural language processing. Data can be corrupted in transit, and it's not always possible to get the supplier to fix it. I'll provide you with practical tools and ideas to solve this problem.
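As a small taste of what dealing with malformed input can look like, here is a minimal sketch that asks the JDK's DOM parser whether a document is well-formed and, if not, reports where the first fatal error was found. The class name `MalformedSketch` and the `firstProblem` helper are hypothetical names chosen for this example. This only detects the problem; deciding how to repair the data is a separate step.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class MalformedSketch {
    // Returns null if the document is well-formed, otherwise a short
    // description of the first fatal error the parser ran into.
    static String firstProblem(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // The exception carries the position at which parsing failed.
            return "line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstProblem("<a><b/></a>")); // well-formed: prints null
        System.out.println(firstProblem("<a><b></a>"));  // <b> is never closed
    }
}
```

The line and column numbers are often enough to locate the damaged region of a large file before attempting any repair.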
If you're not familiar with Unicode, you may want to skim the section on character sets and encodings or check out Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" before reading the material on XML. Otherwise, you should be able to jump right in. This tutorial assumes a basic level of familiarity with Java, i.e., that you're fairly comfortable with the material in chapters 1-7 of Dive Into Java.