XML and Java technologies: Data binding, Part 2: Performance
After kicking the tires in Part 1, take data binding frameworks out for a test drive
Summary: Enterprise Java expert Dennis Sosnoski checks out the speed and memory usage of several frameworks for XML data binding in Java. These include all the code generation approaches discussed in Part 1, the Castor mapped binding approach discussed in an earlier article, and a surprise new entry in the race. If you're working with XML in your Java applications you'll want to learn how these data binding approaches stack up!
Part 1 provides background on why you'd want to use data binding for XML, along with an overview of the available Java frameworks for data binding. If you haven't already read Part 1, you'll probably want to at least glance over it now. In this part I'm going straight to the issue of performance without further discussion of the whys and hows!
For performance tests of the data binding frameworks, I generated documents containing mock airline flight timetable information. These are based on the same structure I defined in the earlier article on mapped data binding with Castor (see Resources). Here's a sample of that structure, referred to here as the compact format because it uses mainly attributes for data:
Listing 1. Compact document format
<?xml version="1.0"?>
<timetable>
  <carrier ident="AR" rating="9">
    <URL>http://www.arcticairlines.com</URL>
    <name>Arctic Airlines</name>
  </carrier>
  <carrier ident="CA" rating="7">
    <URL>http://www.combinedlines.com</URL>
    <name>Combined Airlines</name>
  </carrier>
  <airport ident="SEA">
    <location>Seattle, WA</location>
    <name>Seattle-Tacoma International Airport</name>
  </airport>
  <airport ident="LAX">
    <location>Los Angeles, CA</location>
    <name>Los Angeles International Airport</name>
  </airport>
  <route from="SEA" to="LAX">
    <flight carrier="AR" depart="6:23a" arrive="8:42a" number="426"/>
    <flight carrier="CA" depart="8:10a" arrive="10:52a" number="833"/>
    <flight carrier="AR" depart="9:00a" arrive="11:36a" number="433"/>
  </route>
  <route from="LAX" to="SEA">
    <flight carrier="CA" depart="7:45a" arrive="10:20a" number="311"/>
    <flight carrier="AR" depart="9:27a" arrive="12:04p" number="593"/>
    <flight carrier="AR" depart="12:30p" arrive="3:07p" number="102"/>
  </route>
</timetable>
Note that the airport name information in Listing 1 is normally a single line of code. To accommodate column size, some lines of code are split and appear on two lines.
In addition to the compact format, I also tried a variation with more use of child elements for data values (only staying with attributes for IDs and IDREFs). Here's the same data presented in that format, which I refer to here as the full format:
Listing 2. Full document format
<?xml version="1.0"?>
<timetable>
  <carrier ident="AR">
    <rating>9</rating>
    <URL>http://www.arcticairlines.com</URL>
    <name>Arctic Airlines</name>
  </carrier>
  <carrier ident="CA">
    <rating>7</rating>
    <URL>http://www.combinedlines.com</URL>
    <name>Combined Airlines</name>
  </carrier>
  <airport ident="SEA">
    <location>Seattle, WA</location>
    <name>Seattle-Tacoma International Airport</name>
  </airport>
  <airport ident="LAX">
    <location>Los Angeles, CA</location>
    <name>Los Angeles International Airport</name>
  </airport>
  <route from="SEA" to="LAX">
    <flight carrier="AR">
      <number>426</number>
      <depart>6:23a</depart>
      <arrive>8:42a</arrive>
    </flight>
    <flight carrier="CA">
      <number>833</number>
      <depart>8:10a</depart>
      <arrive>10:52a</arrive>
    </flight>
    <flight carrier="AR">
      <number>433</number>
      <depart>9:00a</depart>
      <arrive>11:36a</arrive>
    </flight>
  </route>
  <route from="LAX" to="SEA">
    <flight carrier="CA">
      <number>311</number>
      <depart>7:45a</depart>
      <arrive>10:20a</arrive>
    </flight>
    <flight carrier="AR">
      <number>593</number>
      <depart>9:27a</depart>
      <arrive>12:04p</arrive>
    </flight>
    <flight carrier="AR">
      <number>102</number>
      <depart>12:30p</depart>
      <arrive>3:07p</arrive>
    </flight>
  </route>
</timetable>
Often, the relative performance of XML frameworks differs greatly depending on the size of documents being used, so I included both large and small documents in these performance tests. The large documents (time-comp.xml and time-full.xml) use identical data values in the two different formats shown above. Because of this, the sizes are considerably different (106 KB for the compact format versus 211 KB for the full format). The small documents are in collections, each containing 34 documents ranging in size from 1.4-3.3 KB for the compact format (ttcomp) and 2.2-5.8 KB for the full format (ttfull). As with the large documents, corresponding documents in the small document collections contain the same data values. The full set of documents used in the tests is available from the Downloads page (see Resources).
I would prefer to test with more document variations than just the two formats used for these results. However, the amount of effort involved in adding more documents for a data binding test is substantial because of the need to provide W3C XML Schema (Schema) and Document Type Definition (DTD) descriptions for code generation, along with mapping files and base classes for the mapped versions. The two formats used here, with both large and small document variations, should at least give a fairly representative picture of how the data binding alternatives perform for typical business documents. They probably allow the mapped binding approaches to show better memory usage than would be typical of general documents, though, because most of the data values in these documents can be converted to primitive types. This results in a very compact internal representation. With documents where most of the data values need to be kept as Strings, the memory advantage of the mapped binding approaches would be diminished.
All test results were obtained using a 1.4GHz Athlon system with 256MB of DDR RAM, running RedHat Linux 7.2. I used Sun's JDK 1.4.1 for Linux in all tests. The specific versions of each data binding framework tested are as follows: JAXB Beta 1, Castor 0.9.4.1, JBind 1.0 Beta 12/07, Quick 4.3.1, and Zeus Beta 3.5 (JiBX is a special case -- see So what's JiBX? following the test results for details). All tests except JBind and JiBX used the Piccolo SAX2 parser, version 1.0.3. This is the fastest SAX2 parser I'm aware of, and generally meets or beats the speed of the XMLPull parser used for the JiBX tests (XPP3 version 1.1.2). JBind was unable to work with the Piccolo parser, so for testing JBind I used Xerces Java 2, version 2.2.0.
To provide a performance comparison between data binding and other alternative approaches I also ran a timing test of the same files using just the SAX2 parser, and timing and memory tests using the dom4j document model (a performance leader among the document models, and one that allows different SAX2 parsers to be used for parsing input documents). For these tests, I used dom4j version 1.3.
I used the same basic framework for these timing and memory usage tests as in my earlier tests with document models (see the author's document model performance article in Resources). This benchmark framework first reads all documents into internal memory buffers, then times multiple passes of input and output operations on the documents. The test results shown in Input timings and Output timings are the best times over several passes. This should be representative of long-term performance in a server-type environment where the same code is executed repeatedly.
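The best-of-several-passes approach described above can be sketched roughly as follows. This is a minimal illustration written for a current JDK, not the benchmark code behind these results: the class name `BestTimeBenchmark` and its methods are hypothetical, and the baseline task uses whatever SAX2 parser the JDK supplies through JAXP rather than Piccolo.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness for illustration; not the benchmark code behind the charts.
class BestTimeBenchmark {

    /** Runs the task over every buffered document on each pass; returns the fastest pass in nanoseconds. */
    static long bestPassNanos(byte[][] documents, int passes, Consumer<byte[]> task) {
        long best = Long.MAX_VALUE;
        for (int pass = 0; pass < passes; pass++) {
            long start = System.nanoTime();
            for (byte[] document : documents) {
                task.accept(document);
            }
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Baseline task: a bare SAX2 parse that discards every event. */
    static void parseOnly(byte[] document) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(document), new DefaultHandler());
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Documents are held in memory so that file I/O stays out of the timings.
        byte[][] documents = {
            "<timetable><route from=\"SEA\" to=\"LAX\"/></timetable>"
                    .getBytes(StandardCharsets.UTF_8)
        };
        long best = bestPassNanos(documents, 10, BestTimeBenchmark::parseOnly);
        System.out.println("best pass: " + best + " ns");
    }
}
```

Taking the minimum rather than the average discards the early passes, which are dominated by classloading and JIT compilation, so the figure approximates steady-state behavior.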
Figures 1 and 2 give the timing results for reading an XML document (unmarshalling it, in data binding terms) and constructing an in-memory representation using the dom4j document model and various data binding approaches. In these charts you can regard the first timing value, for SAX2, as a base time for parsing the documents. The document models and data binding implementations use the parse results to build their representations in memory, so they're never going to be faster than the parser itself. The two data binding tests based on mappings, rather than code generation, are noted in the captions.
Figure 1. Reading large documents to memory
Figure 2. Reading small documents to memory
dom4j is able to construct its in-memory representation of the documents in less than twice the amount of time taken by the parser alone. The only data binding framework that beats this performance is JiBX. JAXB, Quick, and Zeus all turn in respectable performance figures compared to dom4j, but take nearly twice as long as JiBX overall. Castor is very slow by comparison, both with mapped bindings and with generated code.
JBind performs a full order of magnitude slower than most of the binding frameworks in these tests. A small part of this poor performance is due to the slower parser used for the JBind tests (because it failed to work with the parser used for the other tests). A larger part is probably due to JBind forcing document validation against the Schema on input, which can add considerable overhead. Most of the poor performance is probably attributable to the JBind framework itself, though, which uses a very indirect approach to binding (building on top of a DOM document model, in the current implementation).
All the tests except for JBind were run without full validation. Most of the data binding frameworks include a certain inherent level of validation (assuring, for instance, that the content model of elements is matched) just by their design. Most can also use validating parsers (such as Xerces Java 2) for full checking of documents on input, and some (including JAXB) can perform full validation of bound data in memory. Since the main concern in these tests was performance, I disabled optional validation wherever possible (including using both property file and unmarshaller/marshaller settings in Castor).
Figures 3 and 4 give the timing results for generating the XML text serialization (marshalling it, in data binding terms) of an in-memory representation using dom4j and various data binding approaches. These charts use the same vertical scale as the previous pair to simplify comparisons, but differ in that there's no equivalent to the SAX2 parser figure.
Figure 3. Writing large documents from memory
Figure 4. Writing small documents from memory
dom4j offers better performance than any of the data binding approaches in this area, beating JiBX by a smidgen and Zeus by not much more. The other data binding frameworks take about twice as long, with Quick the slowest of all (no pun intended, of course). There's not nearly as much variation here as in the input tests, though the fact that dom4j does better than any of the data binding frameworks suggests that they all still have room for improvement.
Figures 5 and 6 show the other part of the performance story, looking at memory usage. Running out of memory can be a problem when using very large documents (generally in the 5+ MB range) with document models. How do the data binding approaches compare in the amount of memory used for the document representation?
Figure 5. Large document memory usage
Figure 6. Small document memory usage
The differences here are much larger than in the time performance comparisons, and show a very different pattern. While dom4j performed well in the time measurements, in terms of memory usage it's much worse than any of the data binding frameworks (except for JBind, which builds on an internal document model equivalent to dom4j's representation). Compared to the best performers in this area, dom4j takes more than 10 times the memory to represent the same data.
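The article does not spell out how the memory figures were collected. One common way to get a rough estimate on the JVM is to compare heap usage before and after building a representation, as in the sketch below. The class `HeapUsageProbe` and its methods are hypothetical, and the numbers this produces are only approximate because `System.gc()` is a request that the JVM is free to ignore.

```java
import java.util.function.Supplier;

// Hypothetical probe for illustration; heap deltas on a live JVM are only estimates.
class HeapUsageProbe {

    /** Returns the heap currently in use, after asking (not forcing) the JVM to collect garbage. */
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        System.gc();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    /** Estimates the heap retained by whatever the builder constructs. */
    static long estimateRetainedBytes(Supplier<?> builder) {
        long before = usedHeapBytes();
        Object representation = builder.get();
        long after = usedHeapBytes();
        // The reference is used after the second reading so the object is still reachable when measured.
        if (representation == null) {
            throw new IllegalArgumentException("builder returned null");
        }
        return after - before;
    }

    public static void main(String[] args) {
        long bytes = estimateRetainedBytes(() -> new int[250_000]);
        System.out.println("estimated retained bytes: " + bytes);
    }
}
```

Passing a supplier that unmarshals a document with a given framework would yield a figure comparable in spirit to the charts, though repeated runs are needed before trusting any single reading.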
The two mapped binding approaches use the same internal structure for the bound data, so they show identical memory usage. This gives them a tie for first place in the memory efficiency arena, turning in a performance several times better than the data binding approaches using generated code. This is partially because the mapped binding uses a very compact representation for data values. The mapped binding converts most of them to int values in these tests (a String with even one or two characters will take up 20 bytes or more in most Java Virtual Machines (JVM), versus only 4 bytes for an int). The overhead of this conversion adds to read and write times, but it does have other benefits beyond just the memory size reduction. For actually working with the data, ints are far more convenient and efficient than Strings.
Besides the more extensive use of primitive values in the mapped bindings, another reason for the greater memory efficiency of this approach is that generated code approaches usually add control information to the actual data present in each bound object. This control information pads the size of the objects, reducing one of the main benefits of data binding.
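To make the int conversion concrete, here is a minimal sketch of the kind of custom conversion a mapped binding might apply to the flight times in the sample documents. The choice of minutes past midnight as the stored value, and the names `FlightTimeConverter`, `toMinutes`, and `toText`, are assumptions made for illustration; the source does not say which representation the test code used.

```java
// Hypothetical converter for illustration: stores times such as "6:23a" as minutes past midnight.
class FlightTimeConverter {

    /** Converts text such as "6:23a" or "12:04p" to minutes past midnight. */
    static int toMinutes(String text) {
        int colon = text.indexOf(':');
        char suffix = text.charAt(text.length() - 1);
        if (colon < 0 || (suffix != 'a' && suffix != 'p')) {
            throw new IllegalArgumentException("not a time value: " + text);
        }
        int hour = Integer.parseInt(text.substring(0, colon)) % 12; // 12:xx is hour 0 of its half day
        int minute = Integer.parseInt(text.substring(colon + 1, text.length() - 1));
        if (suffix == 'p') {
            hour += 12;
        }
        return hour * 60 + minute;
    }

    /** Converts minutes past midnight back to the document's text form. */
    static String toText(int minutes) {
        int hour = minutes / 60;
        int minute = minutes % 60;
        char suffix = hour < 12 ? 'a' : 'p';
        int display = hour % 12 == 0 ? 12 : hour % 12;
        return display + ":" + (minute < 10 ? "0" : "") + minute + suffix;
    }

    public static void main(String[] args) {
        int depart = toMinutes("6:23a");
        int arrive = toMinutes("8:42a");
        // With int values, the flight duration is a simple subtraction.
        System.out.println("duration in minutes: " + (arrive - depart));
    }
}
```

The conversion costs a little time on every read and write, which matches the tradeoff noted above, but application code can then compare and subtract times directly instead of parsing text each time.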
The data binding frameworks using generated code consume at least several times the memory of the mapped bindings in these tests, but (with the exception of JBind) are still much smaller than dom4j's document model representation. This is no surprise -- a document model such as dom4j needs to construct objects to represent every component of the document (including the actual data text, along with structure components such as elements and attributes), while the data bindings only need to hold the actual data. Much of that data is still stored as Strings with the generated code bindings, but some values can be converted to ints and others to object references.
Zeus is the only data binding approach considered here that directly stores all data as Strings, which contributes to giving it the largest memory usage of the general data bindings. JBind's memory usage is still larger, by far. This is partially due to its internal use of a document model, but the amount of memory used by JBind is several times larger than that needed by a document model (such as dom4j) alone. Judging from this memory usage, it looks like JBind creates many additional objects to link between the binding facade and the actual data in the document model.
Figures 1 through 6 illustrate how the data binding frameworks perform in extended test runs that are representative of server environments. I thought it would also be interesting to see how these frameworks compare when used in a single-execution environment, such as where an application just uses the data binding code to read or write a configuration file. Figure 7 shows the results.
Figure 7. Startup time
Figure 7 shows the amount of time -- from when the benchmark program starts executing until after the round-trip operation returns (unmarshalling to objects, then marshalling the objects back out to a document) -- on a single short document. The difference from the previous timing figures is that here most of the time is spent in classloading and native code generation by the JVM for the data binding framework code. By comparing these results with the earlier timing charts, you can see that this startup time is generally several times larger than the actual processing time for even a fairly large document. If you're only working with a few documents per execution of your program, this startup time is going to be a more significant factor than the best case times shown earlier.
The size of the jar files used by the data binding framework is one major influence on this startup time. JiBX is the smallest, with a total size of less than 60KB for the runtime and parser. JAXB, Castor, and JBind are the largest, weighing in at roughly 1MB each. The time is also affected by the initialization required for each framework. In the case of Castor with a mapped binding this includes processing the mapping definition file, and for JBind it includes processing the Schema definition for the document.
Now that I've shown the performance results, I should probably say something about the framework that came in at the head of the pack in almost every test. Well, the fact is that it's a ringer -- JiBX is a data binding framework designed for performance, so if it's meeting its design requirements it should be the top performer in these tests.
JiBX actually originated from this series of articles. When I began looking at the available data binding frameworks I was surprised to see that they didn't perform all that well compared with document models such as dom4j. This was contrary to my expectations, since the data binding approach actually reduces the amount of document information kept in memory -- a document model holds on to everything, while a data binding only needs the actual data. I thought that an approach that works with less data should generally be faster than one that works with more.
In looking at how the existing data binding frameworks operate, I saw two aspects that didn't look good from a performance standpoint. The first was extensive use of reflection in many of the frameworks. Reflection is a way of accessing information about a Java language class at runtime. It can be used to access fields and methods in instances of a class, giving a way of dynamically hooking together classes at runtime without the need for any source code links between the classes. Reflection is a very powerful Java Technology feature, but suffers a performance disadvantage when compared to calling a method or accessing a field directly in compiled code.
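The difference between the two access styles can be illustrated with a small sketch. The `Carrier` class and both setter helpers here are hypothetical and are not taken from any of the frameworks tested; they only show what a reflective property write involves compared with a compiled call.

```java
import java.lang.reflect.Method;

// Hypothetical example contrasting a compiled call with a reflective one.
class ReflectionVsDirect {

    public static class Carrier {
        private int rating;
        public int getRating() { return rating; }
        public void setRating(int rating) { this.rating = rating; }
    }

    /** Direct access: the call target is fixed when the code is compiled. */
    static void setDirect(Carrier carrier, int rating) {
        carrier.setRating(rating);
    }

    /** Reflective access: the setter is looked up by name at runtime, then invoked. */
    static void setReflective(Object target, String property, int value) {
        String name = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            Method setter = target.getClass().getMethod(name, int.class);
            setter.invoke(target, value); // the int argument is boxed for the call
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot set " + property, e);
        }
    }

    public static void main(String[] args) {
        Carrier carrier = new Carrier();
        setDirect(carrier, 9);
        setReflective(carrier, "rating", 7);
        System.out.println("rating is now " + carrier.getRating());
    }
}
```

The reflective version needs no compile-time knowledge of `Carrier`, which is what makes it attractive to a framework, but each call pays for the lookup, the argument boxing, and the access checks that the direct call avoids.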
The second aspect I questioned was the use of a SAX2 parser for unmarshalling documents. SAX2 is a very useful standard for parsing XML, but its event driven approach is not well suited to data binding and similar applications. The problem here is that the code processing the SAX2 events needs to maintain state information for everything it processes, and this adds both complexity and overhead.
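A short sketch shows the kind of state tracking this involves. The handler below reads only the carrier elements of the compact format from Listing 1, using the standard JAXP and SAX2 APIs from the JDK; the class names `SaxCarrierReader`, `CarrierHandler`, and `Carrier` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX2-based reader for the <carrier> elements of the compact format.
class SaxCarrierReader {

    static class Carrier {
        String ident;
        int rating;
        String url;
        String name;
    }

    static class CarrierHandler extends DefaultHandler {
        final List<Carrier> carriers = new ArrayList<>();
        private Carrier current;                                 // state: carrier being built, if any
        private final StringBuilder text = new StringBuilder();  // state: text of the open element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if ("carrier".equals(qName)) {
                current = new Carrier();
                current.ident = atts.getValue("ident");
                current.rating = Integer.parseInt(atts.getValue("rating"));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // a <name> inside <airport> must not be mistaken for a carrier name
            }
            switch (qName) {
                case "URL": current.url = text.toString().trim(); break;
                case "name": current.name = text.toString().trim(); break;
                case "carrier": carriers.add(current); current = null; break;
                default: break;
            }
        }
    }

    static List<Carrier> read(String xml) {
        try {
            CarrierHandler handler = new CarrierHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.carriers;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Even for this one element type, the handler has to remember which object is being built and buffer text across callbacks, and the bookkeeping grows with every additional element the binding needs to recognize.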
I created the code that grew into JiBX to test some ways around these problematic aspects of the other data binding frameworks, and to experiment with extending the mapped binding approach beyond what's supported by Castor. Instead of reflection, JiBX uses byte code enhancement to add hooks into application code at project build time. Instead of SAX2, JiBX is based on a pull parser architecture (currently XMLPull). Rather than generating code from a DTD or Schema, JiBX works with a binding definition that associates user-supplied classes with XML structure.
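For contrast with event-driven parsing, here is a sketch of the pull style applied to the carrier elements of Listing 1. It uses StAX (`javax.xml.stream`), which ships with current JDKs, as a stand-in for XMLPull; it is not JiBX code and does not use the XMLPull API, and the class names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical pull-style reader; StAX stands in for the XMLPull API used by JiBX.
class PullCarrierReader {

    static class Carrier {
        String ident;
        int rating;
        String url;
        String name;
    }

    /** Expects the reader to be positioned on a carrier start tag; leaves it on the matching end tag. */
    static Carrier readCarrier(XMLStreamReader reader) throws XMLStreamException {
        Carrier carrier = new Carrier();
        carrier.ident = reader.getAttributeValue(null, "ident");
        carrier.rating = Integer.parseInt(reader.getAttributeValue(null, "rating"));
        // The code asks for each child in turn, so the parse position itself carries the state.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String child = reader.getLocalName();
            String value = reader.getElementText(); // advances to the child's end tag
            if ("URL".equals(child)) {
                carrier.url = value;
            } else if ("name".equals(child)) {
                carrier.name = value;
            }
        }
        return carrier;
    }

    static List<Carrier> read(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            List<Carrier> carriers = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "carrier".equals(reader.getLocalName())) {
                    carriers.add(readCarrier(reader));
                }
            }
            return carriers;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Because the unmarshalling method drives the parser, the structure of the code mirrors the structure of the document, and no fields are needed to remember what was seen in earlier callbacks.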
These techniques are not unique to JiBX. Byte code enhancement is used by many JDO (Java Data Objects) implementations for basically the same purpose as in JiBX (to add access hooks to existing compiled code). The original JAXB code (since discarded) was based on a pull parser architecture similar to XMLPull. The mapped approach to data binding is supported (although with some limitations) by both Castor and Quick. Even though the individual techniques aren't new, the combination of them still makes for a very interesting alternative to the other data binding frameworks.
I'll give a full rundown on JiBX in Part 3 of this article. JiBX is still at an early development stage. For the performance tests, I hand wrote the code that would normally be added through byte code enhancement and ran it using the then-current version of the JiBX runtime. As of this article going to publication, I'm still wrapping up the enhancement code, and there are a number of other features I'd love to see added. If you can't wait until Part 3 to find out more about JiBX, check Resources for a link to the JiBX site. You can even start contributing to the future development of JiBX, as well as making use of JiBX in your own applications.
This look at data binding performance shows some interesting results, but doesn't fundamentally change the recommendations from Part 1. Castor provides the best current support for data binding using code generation from W3C XML Schema definitions. Its unmarshalling performance is weak compared to other alternatives, but it does give good memory utilization and a fairly fast startup time. The Castor developers say that they plan to focus on performance issues prior to their 1.0 release, so you may also see some improvement in the unmarshalling performance by then.
JAXB still looks like a good choice for the code generation approach in the future (the beta license only allows evaluation use). The current reference implementation beta is both bulky in terms of jar size and somewhat inefficient in terms of memory usage, but here again you may see better performance in the future. As of this writing, the current version is still a beta, and even after it's released commercial or open source projects may improve performance over the reference implementation. Since it will be a standard part of the J2EE platform, JAXB is definitely going to play an important role in working with XML and Java technologies.
The performance results also confirm the use of JBind, Quick, and Zeus as most appropriate for applications with special requirements rather than for general usage. JBind's XML Code approach can provide a great basis for an application built around processing of an XML document, but the performance of the current implementation is liable to be a problem. Quick and Zeus offer code generation from DTDs, but as I mentioned in Part 1, it's generally pretty easy to convert DTDs to Schemas. On the downside, Quick seems overly complex to use and Zeus supports only Strings for bound data values (no primitives or object references using ID-IDREF or an equivalent).
For mapped approaches to data binding, Castor has the advantage of a fairly stable implementation and substantial real-world usage. Quick can be used for this type of binding as well, but again seems complex to set up. JiBX is new and not yet in full usage, but offers excellent performance along with a high degree of flexibility.
If you haven't read Part 1, you may want to refer back to it to learn more about the features of these data binding frameworks. Part 1 also discusses the tradeoffs between code generation and mapped approaches to data binding. In Part 3, I'll present the new JiBX framework in more depth. This includes how JiBX maps Java objects to XML, along with the byte code enhancement process JiBX uses at build time to minimize runtime overhead. Check back for full details on this exciting approach to pumping up framework performance!
- Part 1 of this series on data binding provides background on why you'd want to use data binding for XML, along with an overview of the available Java frameworks for data binding (developerWorks, January 2003).
- Download the full set of documents used in the tests for this article.
- If you need background on XML, try the developerWorks "Introduction to XML" tutorial (August 2002).
- Review the author's previous developerWorks articles covering performance (September 2001) and usage (February 2002) comparisons for Java XML document models.
- Read Brett McLaughlin's overview of Quick in "Converting between Java objects and XML with Quick," which shows you how to use this framework to quickly and painlessly turn your Java data into XML documents, without the class generation semantics required by other data binding frameworks (developerWorks, August 2002).
- For an introduction to the basics of object-relational data binding (similar in intent to the JDO standard, but not compatible), read "Getting started with Castor JDO," by Bruce Snyder (developerWorks, August 2002).
- Get the details on the Java Data Objects (JDO) API for persistence of Java language objects.
- Find out more about the Java Architecture for XML Binding (JAXB), the evolving standard for Java Platform data binding.
- Take a closer look at the Castor framework, which supports both mapped and generated bindings.
- Get to know JBind, a framework that focuses less on allowing Java language applications to easily work with XML, and more on building application code frameworks around XML.
- The Quick framework is based on a series of development efforts that predate both the Java Platform and XML. It provides an extremely flexible framework for working with XML on the Java Platform.
- Explore the details of Zeus, which (like Quick) generates code based on DTD descriptions of XML documents but is simpler to use -- and more limited -- than Quick.
- Learn more about the new JiBX framework for mapped bindings.
- Read more about the interplay of Java Technology and XML.
- Reference JSR 31 - the XML Data Binding Specification.
- Find more information on the technologies covered in this article at the developerWorks XML and Java technology zones.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.