JAST: Java Abstract Syntax Trees

Natural Java idioms for processing XML data

Department of Computer Science

Processing XML Document Object Models

The JAST 2.5 Toolkit provides standard readers and writers for converting between XML text files and DOM-trees stored in memory. The XML data can be read from any kind of text input stream, whether a file or a URL. The main components to use are XMLReader and XMLWriter, both found in the top-level package: uk.ac.sheffield.jast. The DOM-tree is constructed out of Java nodes that exactly mimic the structure of the XML document, with type names such as: Document, Declaration, Instruction, Doctype, Element, Attribute, Data, Text and Comment. These are all found in the package: uk.ac.sheffield.jast.xml.

This part of the user guide describes how to use the standard readers and writers, and how to manipulate the DOM-tree nodes to insert and retrieve data programmatically. The DOM-tree may also be built directly within a Java program and then serialised as XML. Once you have understood the basic DOM-processing concepts presented in this quick-start introduction, please refer to the JAST 2.5 package APIs for more detailed information. Note also that the document validation and XPath search engines act upon this DOM-tree model.

Designing an XML Data Model

The first thing you will need to do is decide what kind of data you wish to model. Having done this, you will develop an XML markup scheme, using a mixture of XML elements and attributes to describe and encode the data. For example, a document that stores information about people in a family might look like this:

	<?xml version="1.0" encoding="UTF-8"?>
	<Family>
	  <!-- The Smith family -->
	  <Person role="father" age="45">
	    John Smith
	  </Person>
	  <Person role="mother" age="41">
	    Mary Smith
	  </Person>
	  <Person role="son" age="16">
	    Ben Smith
	  </Person>
	  <Person role="daughter" age="14">
	    Alice Smith
	  </Person>
	</Family>

So, the main XML element nodes used for markup are called Family and Person; there is an XML declaration; and there is a comment. The Person element also has attributes called role and age. We assume that information like this is stored in a text file.

Reading XML into a Document Object Model

The main API class to use is XMLReader. This can be used to read an XML File, using the default, or a chosen, character set; and can be directed to preserve or discard extra formatting whitespace. It returns the DOM-tree as an instance of the class Document. By default, XMLReader reads data from a file input stream using the UTF-8 character set:

        File file = new File("my/xml/input.xml");  // Or whatever file
        XMLReader reader = new XMLReader(file);    // Uses UTF-8
        Document document = reader.readDocument();
        reader.close();

By default, XMLReader reads data from a URL input stream using the Latin-1 (ISO-8859-1) character set. This is recommended when reading from a URL input stream, since HTTP expects the Latin-1 encoding by default:

        URL url = new URL("https://www.my.site/input.xml");  // Any URL
        XMLReader reader = new XMLReader(url);              // Uses ISO-8859-1
        Document document = reader.readDocument();
        reader.close();

Both of the above one-argument constructors set the character encoding based on whether a File or URL argument is supplied. It is also possible to supply the character encoding explicitly as the second argument, if a non-default encoding is used:

        File file = new File("my/xml/input.xml");  // Or whatever file
        XMLReader reader = new XMLReader(file, "ISO-8859-1");
        Document document = reader.readDocument();
        reader.close();

It is mandatory to supply the encoding if the XMLReader is constructed with InputStream or Reader arguments, since in these cases, it is not possible to infer the encoding. Note that the character encoding declared in the XML file must match the character encoding used by the underlying input stream.
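
To see why the declared and actual encodings must agree, the following sketch uses only standard JDK classes. It does not call the JAST API, and the class and method names are invented for illustration. It encodes a name as UTF-8 and then decodes the same bytes as Latin-1, which is roughly what happens when a UTF-8 stream is read with the wrong encoding:

```java
import java.nio.charset.StandardCharsets;

/** JDK-only illustration of an encoding mismatch; not part of the JAST API. */
final class EncodingMismatchDemo {

    /** Encodes the text as UTF-8 bytes, then decodes those bytes with the wrong charset. */
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, so the text survives intact. */
    static String roundTripUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Zo\u00eb Smith";                // e with diaeresis, outside ASCII
        System.out.println(roundTripUtf8(name));       // unchanged
        System.out.println(decodeUtf8AsLatin1(name));  // the accented letter becomes two characters
    }
}
```

Pure ASCII text is unaffected by the mismatch, which is why this kind of fault often goes unnoticed until a document contains accented or non-Latin characters.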

By default, XMLReader creates a compact DOM-tree in memory. It discards all extra formatting whitespace surrounding the XML markup text (to avoid wasting memory by storing extra whitespace nodes in the DOM-tree), but preserves any whitespace included within textual content. Alternatively, XMLReader may be instructed to preserve all layout text, before reading the document:

        File file = new File("my/xml/input.xml");
        XMLReader reader = new XMLReader(file);
        reader.preserveLayout(true);              // Keep all layout text
        Document document = reader.readDocument();
        reader.close();

This preserves all formatting whitespace as extra Text nodes in the DOM-tree. The method preserveLayout() may be called with the argument true to preserve layout whitespace, or false (the default) to discard it. When layout is preserved, an XML document may be re-written with exactly the same layout as when it was read.
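
The cost of layout whitespace can be illustrated with the JDK's built-in W3C DOM parser, which keeps whitespace-only text nodes by default. This is a sketch using javax.xml.parsers rather than JAST, and LayoutNodesDemo is a hypothetical name; it counts the children of the root element with and without the layout nodes:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** JDK-only illustration of layout whitespace nodes; not part of the JAST API. */
final class LayoutNodesDemo {

    static final String XML =
            "<Family>\n  <Person>John Smith</Person>\n  <Person>Mary Smith</Person>\n</Family>";

    /** Counts the root element's child nodes, optionally skipping whitespace-only text. */
    static int countRootChildren(String xml, boolean keepLayout) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList children = doc.getDocumentElement().getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean layoutOnly = child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty();
            if (keepLayout || !layoutOnly) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countRootChildren(XML, true));   // elements plus layout text nodes
        System.out.println(countRootChildren(XML, false));  // elements only
    }
}
```

For this small document, the two Person elements are accompanied by three whitespace-only text nodes, so a compact tree holds fewer than half as many child nodes at this level.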

By default, XMLReader will check that a document is well-formed XML, and will raise an exception if the XML is ill-formed (missing tags, mis-matched tags, missing quotation-marks around values, etc.). It is possible to direct XMLReader to validate the document as it is read, using either a Document Type Definition (DTD), or an XML Schema Definition (XSD). This is described in the XML Validation Guide.
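
As a rough illustration of what a well-formedness check rejects, the sketch below uses the JDK's own parser (not JAST, whose SyntaxError reporting is described later in this guide). The class name WellFormedCheckDemo and its method are assumptions made for this example only:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** JDK-only illustration of well-formedness checking; not part of the JAST API. */
final class WellFormedCheckDemo {

    /** Returns true if the JDK parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors quietly instead of printing them to the console.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // ill-formed XML
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Family><Person age=\"45\">John</Person></Family>"));
        System.out.println(isWellFormed("<Family><Person>John</Family>"));   // mis-matched tags
        System.out.println(isWellFormed("<Person age=45>John</Person>"));    // unquoted value
    }
}
```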

Writing a Document Object Model to XML

The main API class to use is XMLWriter. This can be used to write a Document as an XML File, using the default, or a chosen, character set; and may be directed to preserve the existing format, or pretty-print the document afresh. By default, XMLWriter writes to a file output stream using the UTF-8 character set:

        Document document = ... ;                   // Created previously
        File file = new File("my/xml/output.xml");  // Or whatever file
        XMLWriter writer = new XMLWriter(file);     // Uses UTF-8
        writer.writeDocument(document);
        writer.close();

By default, XMLWriter writes to a general Writer output stream using the Latin-1 (ISO-8859-1) character set, since this is the recommended character set for the HTTP protocol; and most web service applications use this character set by default:

        Document document = ... ;                   // Created previously
        Writer stream = ... ;                       // Created previously
        XMLWriter writer = new XMLWriter(stream);   // Uses ISO-8859-1
        writer.writeDocument(document);
        writer.close();

This assumes that the stream uses the same Latin-1 character encoding. When writing to any kind of file or output stream, it is possible to specify that the XML document should be written using a different, non-default character set, by adding this explicitly as the second construction argument:

        Document document = ... ;                   // Created previously
        File xmlFile = new File("my/xml/output.xml");
        XMLWriter writer = new XMLWriter(xmlFile, "ISO-8859-1");  // Latin-1
        writer.writeDocument(document);
        writer.close();

In this case, the XMLWriter will verify that the Document declares the same character encoding before proceeding; otherwise it will raise an exception. The two-argument constructor is also useful in web applications, for example, when accessing the PrintWriter from a Java servlet's HttpServletResponse response-object:

        Document document = ... ;                   // Created previously
        HttpServletResponse response = ... ;        // Created by a Servlet
        XMLWriter writer = new XMLWriter(response.getWriter(), 
                response.getCharacterEncoding());
        writer.writeDocument(document);
        writer.close();

This style of constructing the XMLWriter ensures that the writer uses the same character encoding as the PrintWriter used by the servlet's response. You should also ensure that the XML declaration at the head of the Document uses the same character set.
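
If you want to check this consistency yourself before writing, one possible approach is sketched below using only the JDK. It is not JAST's own verification logic; the class DeclaredEncodingCheck, its regular expression and its method names are assumptions for illustration. It extracts the encoding named in an XML declaration and compares it with the name of the output character set, treating aliases such as latin1 and ISO-8859-1 as equal:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** JDK-only sketch of a declared-encoding check; not part of the JAST API. */
final class DeclaredEncodingCheck {

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern ENCODING =
            Pattern.compile("^<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the declared encoding, or "UTF-8" if the text declares none. */
    static String declaredEncoding(String xmlText) {
        Matcher matcher = ENCODING.matcher(xmlText);
        return matcher.find() ? matcher.group(1) : "UTF-8";
    }

    /** Returns true if the declared encoding names the same charset as charsetName. */
    static boolean matches(String xmlText, String charsetName) {
        Charset declared = Charset.forName(declaredEncoding(xmlText));
        return declared.equals(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><Family/>";
        System.out.println(declaredEncoding(xml));
        System.out.println(matches(xml, "latin1"));
        System.out.println(matches(xml, "UTF-8"));
    }
}
```

In a servlet, the second argument to matches() would typically be the value returned by response.getCharacterEncoding().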

By default, XMLWriter pretty-prints the output XML file using a standard tree-structured layout with newlines and two-character indentation for nested XML structures. Alternatively, XMLWriter may be instructed to preserve the original layout of the DOM-tree in memory, before writing the document:

        File xmlFile = new File("my/xml/output.xml");
        XMLWriter writer = new XMLWriter(xmlFile);
        writer.preserveLayout(true);                // Write native layout
        writer.writeDocument(document);
        writer.close();

The method preserveLayout() may be called with the argument true to preserve the existing layout (disabling pretty-printing), or false (the default) to pretty-print. If layout is preserved both when reading and writing the document, the output XML document will have exactly the same layout as when it was read.

Accessing the Contents of the DOM-Tree

The main classes of interest to programmers are Document, Element, Attribute and Text; although there are other types of node that represent the XML declaration, a document type definition, or a special stylesheet instruction, or escaped character data. The DOM-tree nodes are designed according to the Composite Design Pattern, that is, everything in the DOM-tree is some kind of Content and respects a common API. The more specific kinds of node extend this API in different ways. Please see the full API descriptions for each of these node types, which may be found in the package: uk.ac.sheffield.jast.xml.

The following is just an example of how the nodes of the XML memory-tree can be accessed within a Java program:

        Declaration header = document.getDeclaration();  // XML declaration.
        Doctype doctype = document.getDoctype();         // Optional doctype.
        Comment comment = document.getComment();         // Optional comment.
        Element root = document.getRootElement();        // Root element.
        List<Content> contents = document.getContents(); // All subnodes.
        Content node = document.getContent(2);           // Third sub-node.

        String name = root.getName();                    // Element name.
        int contentType = root.getType();                // Bitmask type.
        List<Element> allChildren = root.getChildren();
        List<Element> someChildren = root.getChildren("Person");
        Element child = root.getChild("Person");         // First so-named.
        Content parent = child.getParent();              // Same object as root.
        String text = child.getText();                   // Textual content.

        List<Attribute> properties = child.getAttributes();
        Attribute property = child.getAttribute("age");
        String ageStr = property.getValue();             // If property != null
        int age = property.intValue();                   // If property != null
        String value = child.getValue("age");            // Access value directly

In addition, there is provision to iterate over all nodes in a memory-tree. The iteration may include the starting node, or just all of its descendants. For explicit access to different kinds of Content node, such as Text, Data and Comment nodes, you must use Filter and its subclasses to filter the contents of a given node. Please see the package uk.ac.sheffield.jast.filter.
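
The general idea of filtered iteration can be sketched with the JDK's W3C DOM classes and a java.util.function.Predicate standing in for a filter. This is not the JAST Filter API, whose classes are documented in the package named above; FilteredWalkDemo and its methods are hypothetical names. The walk is depth-first and may include or exclude the starting node:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** JDK-only sketch of filtered tree iteration; not part of the JAST API. */
final class FilteredWalkDemo {

    static final String XML = "<Family><!-- The Smith family -->"
            + "<Person role=\"father\" age=\"45\">John Smith</Person>"
            + "<Person role=\"mother\" age=\"41\">Mary Smith</Person></Family>";

    /** Parses the text and returns its root element. */
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Collects, in depth-first order, every visited node that the filter accepts. */
    static List<Node> collect(Node start, boolean includeStart, Predicate<Node> filter) {
        List<Node> found = new ArrayList<>();
        if (includeStart && filter.test(start)) {
            found.add(start);
        }
        for (Node child = start.getFirstChild(); child != null; child = child.getNextSibling()) {
            found.addAll(collect(child, true, filter));
        }
        return found;
    }

    /** Convenience wrapper: counts nodes of one W3C node type beneath (or including) the root. */
    static int countOfType(String xml, short nodeType, boolean includeStart) {
        return collect(parseRoot(xml), includeStart, n -> n.getNodeType() == nodeType).size();
    }

    public static void main(String[] args) {
        System.out.println(countOfType(XML, Node.COMMENT_NODE, false));
        System.out.println(countOfType(XML, Node.ELEMENT_NODE, true));
        System.out.println(countOfType(XML, Node.TEXT_NODE, false));
    }
}
```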

Constructing and Manipulating the DOM-Tree

The DOM-tree can be constructed directly in Java, using API calls of the relevant DOM-tree nodes. The main classes of interest are Document, Element, Attribute and Text. All construction-methods are designed to nest, so that the Java code looks somewhat like the structure of the XML file being created. Please see the full API descriptions for how to construct each of these node types.

The following is just an example of how the nodes of the XML DOM-tree can be created within a Java program, using the return value of the previous setter as the target of the next setter (suitably nested):

	Document document = new Document();           // Default encoding.
	Element root = new Element("Family")
		.addContent(new Comment("The Smith family"))
		.addContent(new Element("Person")
			.setText("John Smith")        // Sets all text.
			.setValue("role", "father")   // Sets attribute.
			.setValue("age", "45"))       // End of add John
		.addContent(new Element("Person")
				// Another way to add text content.
			.addContent(new Text("Mary Smith"))
			.setValue("role", "mother")
			.setValue("age", "41"))       // End of add Mary
		.addContent(new Element("Person")
			.setText("Ben Smith")
				// Another way to set an attribute.
			.setAttribute(new Attribute("role", "son"))
			.setValue("age", "16"))       // End of add Ben
		.addContent(new Element("Person")
				// Another way to add text incrementally.
			.addContent(new Text("Alice"))
			.addContent(new Text(" Smith"))
			.setValue("role", "daughter")
			.setValue("age", "14")));     // End of Family
	document.setRootElement(root);

The node APIs contain many further manipulation-methods that remove specific nodes, or all nodes of a given type, or the node at a given index. All Content nodes may have at most one parent node. If you wish to reuse part of an XML DOM-tree, you must detach the subtree from the source document before adding it to the destination document. Alternatively, you may clone() part of the source tree and add this copied subtree to the destination. Attaching any node to more than one parent node will raise an exception.
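
The single-parent rule can be pictured with a deliberately tiny tree class. SingleParentNode below is a hypothetical stand-in written for this guide, not one of the JAST node classes, and its detach() and copy() methods only mimic the behaviour described above (JAST reports violations through its own error types, whereas this sketch uses IllegalStateException):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal tree node that enforces the single-parent rule; not a JAST class. */
final class SingleParentNode {

    private final String name;
    private SingleParentNode parent;
    private final List<SingleParentNode> children = new ArrayList<>();

    SingleParentNode(String name) {
        this.name = name;
    }

    /** Adds a child and returns this node, so that calls can be nested. */
    SingleParentNode addContent(SingleParentNode child) {
        if (child.parent != null) {
            throw new IllegalStateException(child.name + " already has a parent");
        }
        child.parent = this;
        children.add(child);
        return this;
    }

    /** Removes this node from its parent, if any, and returns this node. */
    SingleParentNode detach() {
        if (parent != null) {
            parent.children.remove(this);
            parent = null;
        }
        return this;
    }

    /** Returns a deep copy of this subtree that has no parent. */
    SingleParentNode copy() {
        SingleParentNode twin = new SingleParentNode(name);
        for (SingleParentNode child : children) {
            twin.addContent(child.copy());
        }
        return twin;
    }

    SingleParentNode getParent() {
        return parent;
    }

    int childCount() {
        return children.size();
    }

    public static void main(String[] args) {
        SingleParentNode person = new SingleParentNode("Person");
        SingleParentNode source = new SingleParentNode("Family").addContent(person);
        SingleParentNode target = new SingleParentNode("Archive");
        target.addContent(person.copy());    // a copy leaves the source tree intact
        target.addContent(person.detach());  // a detached node can be moved
        System.out.println(source.childCount() + " " + target.childCount());
    }
}
```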

Notification of Exceptions

Both XMLReader and XMLWriter may raise various kinds of IOException, if a problem occurs with the underlying file system. Ill-formed XML syntax is reported through SyntaxError, whereas attempting to construct an illegal memory-tree is reported through SemanticError. In general, faulty user code may raise the following:

  • FileNotFoundException - raised if the specified file cannot be found (wrong pathname given)
  • UnsupportedEncodingException - raised if the character set encodings are inconsistent
  • IOException - raised if a fault in the filesystem occurs while reading an XML input file
  • SyntaxError - raised if a syntax error is detected while parsing an XML input file
  • SemanticError - raised if any construction method violates XML DOM-tree rules

The last two are styled as errors, rather than exceptions, since the W3C standard requires malformed XML to be rejected outright, and not handled by exception-tolerant software.
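
The three java.io exceptions in the list above are related by inheritance: FileNotFoundException and UnsupportedEncodingException are both subclasses of IOException, so their catch clauses must come before the general one. The following sketch shows this ordering using plain JDK streams rather than XMLReader; OpenFailureDemo, tryOpen() and sampleFile() are names invented for this example:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** JDK-only sketch of catch-clause ordering for I/O failures; not part of the JAST API. */
final class OpenFailureDemo {

    /** Tries to open and read from the file, reporting which kind of failure occurred. */
    static String tryOpen(File file, String charsetName) {
        try (FileInputStream in = new FileInputStream(file);
             Reader reader = new InputStreamReader(in, charsetName)) {
            reader.read();                       // Touch the stream to prove it is readable
            return "opened";
        } catch (FileNotFoundException e) {      // Most specific cases first
            return "file not found";
        } catch (UnsupportedEncodingException e) {
            return "unsupported encoding";
        } catch (IOException e) {                // Any other I/O fault
            return "I/O failure";
        }
    }

    /** Creates a small temporary XML file for the demonstration. */
    static File sampleFile() {
        try {
            File file = File.createTempFile("family", ".xml");
            file.deleteOnExit();
            Files.write(file.toPath(), "<Family/>".getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOpen(new File("no/such/dir/input.xml"), "UTF-8"));
        System.out.println(tryOpen(sampleFile(), "NO-SUCH-CHARSET"));
        System.out.println(tryOpen(sampleFile(), "UTF-8"));
    }
}
```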

Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom