JAST: Java Abstract Syntax Trees

Natural Java idioms for processing XML data

Department of Computer Science

Binding XML to Custom Java Classes

The JAST 2.5 Toolkit provides custom readers and writers for converting between XML text files and arbitrary Abstract Syntax Trees built from your own Java classes. This facility is also known as XML-to-Java binding, and is useful because it uses Java types that are meaningful in your application domain. JAST adopts a convention-over-configuration approach: if you define your Java classes in a fairly standard way, the readers and writers will automatically detect how to marshal Java to XML, and unmarshal XML back to Java. The main components to use are ASTReader and ASTWriter, both found in the top-level package uk.ac.sheffield.jast. You have to provide your own Java classes for the AST nodes, but some examples are given in the package uk.ac.sheffield.jast.ast.

This part of the user guide describes the conventions for designing Java AST classes, how to use the custom readers and writers to write Java models built from your AST classes as serialised XML files and how to read a serial XML file to restore an exact copy of your original Java model. The custom readers and writers are able to handle simple Java object trees, or circular and re-entrant Java object graphs. Once you have understood the basic AST-processing concepts presented in this quick-start introduction, please refer to the JAST 2.5 package APIs for more detailed information.

Designing an XML Data Model

The first thing you will need to do is decide what kind of data you wish to model. Having done this, you will develop an XML markup scheme, using a mixture of XML elements and attributes to describe and encode the data. For example, a catalogue that stores information about films and TV shows might look like this:

	<?xml version="1.0" encoding="UTF-8"?>
	<?java-binding xmlns="org.mydomain.catalogue"?>
	<Catalogue xmlns="https://mydomain.org/catalogue">
	  <Film year="1977" rating="PG">
	    <Title>Star Wars</Title>
	    <Director>George Lucas</Director>
	  </Film>
	  <TVShow year="1965">
	    <Title>Thunderbirds</Title>
	    <Director>Gerry Anderson</Director>
	  </TVShow>
	  <Film year="2007" rating="15">
	    <Title>Transformers</Title>
	    <Director>Michael Bay</Director>
	  </Film>
	</Catalogue>

The main XML nodes are called Catalogue, Film, TVShow, Title and Director, and the attributes year and rating are used in some nodes. Nodes like Title and Director are also known as leaf-nodes, because they are terminal nodes containing no further descendants, but only textual data (and possibly attributes). Other nodes are known as branch-nodes and contain descendants; in particular, one branch-node, Catalogue, is the root-node for the whole tree.

A further thing to note is that a default XML namespace URI is declared in the root node (for a Java project called catalogue owned by a company called mydomain.org). Finally, note how an XML processing instruction appears after the XML declaration, declaring that the Java binding will map elements from this namespace to classes in the Java package org.mydomain.catalogue. These extra declarations are optional but, if present, will tell JAST how to unmarshal the XML file to Java.

Designing Java AST Node Classes

Once you have a stable XML model, you can consider developing the Java AST model. The basic notion is that, for each differently-named XML element, you will provide a Java class with the same name that stores the information held by this element. By default, an XML element named Film will be mapped to a Java class of the same name. The first thing that your AST class must do is provide a public default constructor (with no arguments). This is needed so that the unmarshalling ASTReader can create a fresh instance of this node, every time it encounters an XML element of the same name.

public class Film {
	// default constructor
    public Film() {
    }
    ...
}

Since XML permits more liberal identifiers than Java, some XML names must be normalised. For example, a namespace-prefixed XML identifier car:ford-focus would be mapped to a Java class FordFocus (see below about the use of namespaces). The normalising algorithm removes namespace prefixes and all internal punctuation, capitalising the letter following each removed punctuation mark, on the assumption that this occurred at a word boundary (a Java style known as CapitalCase). Similarly, XML attribute names that do not conform to Java syntax are normalised (to the Java style known as camelCase).
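The normalisation rule described above can be sketched in a few lines of plain Java. This is only an illustration of the stated rule, not JAST's actual code: the class NameNormaliser and its method names are hypothetical, and the sketch treats every character that is not a letter or digit (including the underscore) as punctuation.

```java
// Illustrative sketch only: NameNormaliser is a hypothetical helper,
// not part of the JAST API.
final class NameNormaliser {

    // Maps an XML element name to a CapitalCase Java class name.
    static String toClassName(String xmlName) {
        // Drop any namespace prefix, e.g. "car:" in "car:ford-focus".
        int colon = xmlName.indexOf(':');
        String local = (colon >= 0) ? xmlName.substring(colon + 1) : xmlName;

        StringBuilder out = new StringBuilder();
        boolean upperNext = true;              // capitalise the first letter
        for (char c : local.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                upperNext = true;              // assume a word boundary
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Maps an XML attribute name to a camelCase Java field name.
    static String toFieldName(String xmlName) {
        String name = toClassName(xmlName);
        if (name.isEmpty()) {
            return name;
        }
        return Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toClassName("car:ford-focus"));  // prints FordFocus
        System.out.println(toFieldName("release-year"));    // prints releaseYear
    }
}
```

The sketch does not check that the result is a legal Java identifier (for example, a name beginning with a digit), which a real implementation would have to handle.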

All the Java classes that make up your AST will be provided by you, in a Java package. The unmarshalling reader ASTReader must be told about which package to use, when mapping XML elements to Java classes. Similarly, the marshalling writer ASTWriter must be told about this package. Each Java AST class in this package will correspond to one named XML element. It may declare a number of Java fields to store attribute data from the XML element. It may also declare a number of Java fields to store dependent AST nodes, which will be mapped to Java from other XML elements. It may declare one field called content to store simple textual or numeric content. The marshalling and unmarshalling tools work out how to set attributes, set textual content, or attach dependent nodes by using Java reflection, a means of interrogating an object to find out what methods its class defines for setting and getting values.

Mapping XML Attributes to Java

In the XML data model above, the XML element called Film declares the attributes year and rating (representing the year of the film's release and its censorship rating). In Java, you will provide a corresponding class called Film, with private fields corresponding to the attributes:

public class Film {
	// fields storing XML attributes
    private int year;
    private String rating;
    ...
}

Note how these fields can be of any simple type - not just the String type. The tools will convert between XML text data and any of the simple Java types. In order for the unmarshalling ASTReader to recognise these Java fields as the targets for mapping XML attributes, you should provide the class with conventionally-named getter- and setter-methods for these fields:

public class Film {
	// fields storing XML attributes
    private int year;
    private String rating;
	// methods for accessing XML attributes
    public int getYear() {		// getter for year-attribute
        return year;
    }
    public Film setYear(int year) {	// setter for year-attribute
        this.year = year;
        return this;
    }
    public String getRating() {
        return rating;
    }
    public Film setRating(String rating) {
        this.rating = rating;
        return this;
    }
    ...
}

That is, for every attribute-field named X and having the type T, you must provide two methods, T getX() and setX(T val), which respectively return, and accept, a value with the same type as the field. This style should be familiar if you have ever created Java classes in the style of JavaBeans, a widely used component convention. Any Java field which has these two methods will be marshalled as an XML attribute by ASTWriter (unless the field is called content - see below). If you wish your class to have a secret, internal field, then you may prefix this field with the Java keyword transient, which will prevent it from being serialised. The setter-methods conventionally return this, the object being modified by the setter (but could return void).
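To give a feel for how such a convention can be detected by reflection, the following sketch lists every property of a class that has a matching getX()/setX(T) pair. It is an assumption about the general technique, not a description of JAST's internals; AttributeFinder and DemoFilm are hypothetical names introduced for this example.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: not part of the JAST API.
final class AttributeFinder {

    // Hypothetical AST node following the getter/setter convention.
    static class DemoFilm {
        private int year;
        private String rating;
        public int getYear() { return year; }
        public DemoFilm setYear(int year) { this.year = year; return this; }
        public String getRating() { return rating; }
        public DemoFilm setRating(String rating) { this.rating = rating; return this; }
    }

    // Returns property name -> type for each public getX()/setX(T) pair,
    // skipping the reserved name "content".
    static Map<String, Class<?>> findAttributes(Class<?> nodeClass) {
        Map<String, Class<?>> found = new TreeMap<>();
        for (Method getter : nodeClass.getMethods()) {
            String name = getter.getName();
            if (!name.startsWith("get") || name.length() == 3
                    || getter.getParameterCount() != 0) {
                continue;                       // not a getX() method
            }
            String suffix = name.substring(3);
            Class<?> type = getter.getReturnType();
            try {
                // The setter must accept exactly the getter's return type.
                nodeClass.getMethod("set" + suffix, type);
            } catch (NoSuchMethodException e) {
                continue;                       // no matching setter
            }
            String property =
                Character.toLowerCase(suffix.charAt(0)) + suffix.substring(1);
            if (!property.equals("content")) {
                found.put(property, type);
            }
        }
        return found;
    }
}
```

Calling AttributeFinder.findAttributes(AttributeFinder.DemoFilm.class) yields the two properties rating and year with the types String and int. The inherited method getClass() is ignored because no setClass() method exists.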

Mapping Dependent XML Elements to Java

In the XML data model above, the XML element called Film also has some dependent children elements, called Title and Director. These elements will be mapped recursively to Java objects, instances of other AST classes. In the Film class, you will provide private Java fields to attach these objects to the Film object:

public class Film {
    ...
	// fields storing XML dependent elements
    private Title title;
    private Director director;
    ...
}

Note how each of these fields is strongly-typed with the respective type of the node to be stored there. The fields happen to be declared in the order title, and then director. This order will be the order in which the marshalling ASTWriter serialises the XML element children. If you wish the dependent XML children to be marshalled in a different order, simply change the order in which the Java fields are declared. In order that the unmarshalling ASTReader may recognise these Java fields as the targets for attaching dependent XML elements, you should provide the Film class with conventionally-named adder- and getter-methods:

public class Film {
    ...
	// fields storing XML dependent elements
    private Title title;
    private Director director;
    ...
	// methods for accessing XML dependent elements
    public Title getTitle() {		  // getter for Title-child
        return title;
    }
    public Film addTitle(Title title) {	  // adder for the Title-child
        this.title = title;
        return this;
    }
    public Director getDirector() {	
        return director;
    }
    public Film addDirector(Director director) {
        this.director = director;
        return this;
    }
}

That is, for every dependent AST node with the type T that is stored in a field named X, you should provide an adder-method with the name addT(T obj) accepting an object of this type, and a getter-method named T getX() that returns a value of this type (getters of the form Collection<T> getX() are also possible, as shown below). Note that the adder-method is named after the type you are adding, whereas the getter-method is named after the field in which you stored it. This is deliberately asymmetrical, to distinguish dependent elements from attributes.

Any Java field which has these two methods will be marshalled as a dependent XML child-element by ASTWriter. If you wish for your class to have a secret field storing an internal reference to another object, then you may prefix this field with the Java keyword transient, which will prevent it from being serialised. The getter-methods shown here return a single object. Below, we show how they could also return any collection of objects of the type that was added. The adder-methods conventionally return this, the object being modified by the adder (although they could return void).

Mapping XML Text Content to Java

In the XML data model above, the XML element called Title only has textual content, the title of the film. Similarly, the XML element called Director only stores textual content, the name of the director. These leaf-nodes in the XML tree store simple content, which could be text (as here) or some other simple integer or real value. Any such class which has content must declare a private Java field called content for storing this information, with a pair of setter- and getter-methods to access the content. The conventions are exactly the same as for storing XML attributes, except that this field must always have the reserved name content:

public class Director {
    ...
	// field storing simple content
    private String content;
	// methods to access simple content
    public String getContent() {
        return content;
    }
    public Director setContent(String content) {
        this.content = content;
        return this;
    }
}

That is, if a class stores XML content of the basic type T, it must provide a field named T content and two methods, T getContent() and setContent(T val). The name of the field is what distinguishes this content-field from other fields used to store XML attributes. In this example, the content is naturally of the type String. If you wish to store strongly-typed numeric content, then the field-type and the types returned and accepted by the access-methods may be of the appropriate numeric type (similar to attributes; see above).

The marshalling and unmarshalling tools can convert between XML text content and any of the Java basic types. They use the declared types of the fields to work out how to convert the text data, raising an exception if the text cannot be converted to this type. Furthermore, if you wish to store an arbitrary object as content (or as an attribute), you may do so, provided that the object's class has a constructor-from-String and a standard String conversion method toString().
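As an illustration of the constructor-from-String convention, the sketch below defines a small value class and converts text to it by reflection. The class Minutes and the helper fromText() are hypothetical, and the choice of IllegalArgumentException is an assumption made for the example; JAST reports its own error types.

```java
import java.lang.reflect.Constructor;

// Illustrative sketch only: not part of the JAST API.
final class ContentConversionSketch {

    // Hypothetical value type, e.g. a running time in minutes.
    static final class Minutes {
        private final int value;

        // Constructor-from-String: parses the XML text.
        public Minutes(String text) {
            this.value = Integer.parseInt(text.trim());
        }

        // Standard conversion back to XML text.
        @Override
        public String toString() {
            return Integer.toString(value);
        }
    }

    // Converts XML text to an instance of the given type, using the
    // type's public constructor that takes a single String.
    static <T> T fromText(Class<T> type, String text) {
        try {
            Constructor<T> ctor = type.getConstructor(String.class);
            return ctor.newInstance(text);
        } catch (ReflectiveOperationException e) {
            // Covers a missing constructor and a constructor that throws.
            throw new IllegalArgumentException(
                "Cannot convert '" + text + "' to " + type.getName(), e);
        }
    }
}
```

With this in place, fromText(Minutes.class, "121") gives an object whose toString() returns "121", whereas fromText(Minutes.class, "abc") fails with an exception, mirroring the behaviour described above for unconvertible text.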

Factoring Common Behaviour in AST Nodes

Sometimes, different AST classes may end up looking quite similar, and it would be a chore to have to repeat similar coding for several classes. For example, the Film and TVShow classes overlap considerably, in terms of their dependent- and attribute-fields, and their associated getter- and setter-methods. Fortunately, you may arrange your AST classes in a hierarchy, according to their similarities, just as you would expect in Java. The following Show class is intended as the abstract superclass of both Film and TVShow:

public abstract class Show {
    private int year;
    private String rating;
    private Title title;
    private Director director;
	// public default constructor
    public Show() { ... }
	// methods to add XML dependents
    public Show addTitle(Title title) { ... }
    public Show addDirector(Director director) { ... }
	// methods to access XML dependents
    public Title getTitle() { ... }
    public Director getDirector() { ... }
	// methods to set XML attributes
    public Show setYear(int year) { ... }
    public Show setRating(String rating) { ... }
	// methods to access XML attributes
    public int getYear() { ... }
    public String getRating() { ... }
}

All the common fields and methods needed are defined in one place. Now, it is very easy to define the AST classes for Film and TVShow as subclasses of Show, using Java inheritance, and obtain all the expected fields and construction methods from the superclass:

public class Film extends Show {
    public Film() {}      // only needs a default constructor
}

public class TVShow extends Show {
    public TVShow() {}    // only needs a default constructor
}

The JAST toolkit makes it very easy for programmers to factor out common behaviour in classes, as you would expect. Other XML Java-binding tools cannot do this as easily (for example, JAXB will generate duplicated APIs from an XML Schema).

Handling Heterogeneous Collections of AST Nodes

Furthermore, the JAST toolkit makes it easy to manipulate polymorphic lists of AST nodes having heterogeneous types. For example, let us assume that, in the root node Catalogue, we do not care to distinguish between the action of adding a Film and that of adding a TVShow. Instead, we are only interested in adding polymorphic Show objects. Accordingly, we can design the construction API for Catalogue in the following way:

import java.util.ArrayList;
import java.util.List;

public class Catalogue {
	// field to store heterogeneous dependents
    private List<Show> shows;
	// default constructor creates the list field
    public Catalogue() {
        shows = new ArrayList<Show>();
    } 
	// methods required to add/access dependents
    public Catalogue addShow(Show show) {
        shows.add(show);
        return this;
    }
    public List<Show> getShows() {
        return shows;
    }
    ...  // possibly other methods, as desired
}

Two things have happened here. Firstly, rather than providing Catalogue with separate add-methods addFilm(Film) and addTVShow(TVShow), we have decided that a Catalogue need not distinguish the two, and have simply provided addShow(Show) that accepts a polymorphic Show argument. The JAST reflection tools will automatically discover this more general method, if you don't supply the more specific methods (which would take priority).
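One plausible way to find the most specific adder-method is to try the node's own class first and then walk up its superclasses. The sketch below shows this lookup; it is an assumption about the technique rather than JAST's actual algorithm, and the class AdderLookupSketch with its nested demo classes is hypothetical.

```java
import java.lang.reflect.Method;

// Illustrative sketch only: not part of the JAST API.
final class AdderLookupSketch {

    // Hypothetical AST classes mirroring the catalogue example.
    static abstract class Show { }
    static class Film extends Show { }
    static class TVShow extends Show { }

    // Provides only the general adder.
    static class Catalogue {
        public Catalogue addShow(Show show) { return this; }
    }

    // Provides both a specific and a general adder.
    static class FilmArchive {
        public FilmArchive addFilm(Film film) { return this; }
        public FilmArchive addShow(Show show) { return this; }
    }

    // Finds addT(T) on the parent, trying the child's own class first
    // and then each superclass in turn; returns null if none exists.
    static Method findAdder(Class<?> parentClass, Class<?> childClass) {
        for (Class<?> type = childClass;
                type != null && type != Object.class;
                type = type.getSuperclass()) {
            try {
                return parentClass.getMethod("add" + type.getSimpleName(), type);
            } catch (NoSuchMethodException e) {
                // fall through to the next, more general, superclass
            }
        }
        return null;
    }
}
```

For a Film child, the lookup returns addShow on Catalogue but addFilm on FilmArchive, so the specific method takes priority whenever it is supplied.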

Secondly, the get-method getShows() will now be used to access the heterogeneous list of films and TV shows. This method will be detected automatically, by reflecting the name of the field. Notice how, in contrast to earlier examples, this dependent-field's get-method returns a list of objects. These objects will be marshalled in the same order that they were added to the list, as XML elements of mixed kinds (the example XML file above illustrates the mixed children of the Catalogue root node).

Although storing dependent nodes in a Java List is the most common case, it is also possible to store them in a Set or a Map. In the case of a Map, the dependent node should be stored as a value in the Map, indexed against some key (typically an identifying attribute of the stored node). The JAST unmarshaller will seek to discover a suitable adder-method for the type of node stored in any collection-typed field, and from this will also determine that the field can be serialised as a collection of dependent XML elements. Note that if unordered Set or Map Java implementations are chosen, the order of saved nodes may not be stable.
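The following sketch shows a Map-based variant of the catalogue, keyed by title. It rests on several assumptions: the Show class is reduced to a plain title string for brevity, the getter is assumed to return the Map itself (consult the JAST 2.5 package APIs for the exact form expected), and LinkedHashMap is chosen because it is a standard Java map that preserves insertion order, which avoids the instability mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a hypothetical Map-based catalogue.
final class MapCatalogueSketch {

    // Simplified stand-in for the Show AST class (assumption: the
    // title is held as a plain String rather than a Title node).
    static class Show {
        private final String title;
        Show(String title) { this.title = title; }
        String getTitle() { return title; }
    }

    // LinkedHashMap keeps entries in the order they were added.
    private final Map<String, Show> shows = new LinkedHashMap<>();

    // Adder named after the node type; indexes each show by its title.
    public MapCatalogueSketch addShow(Show show) {
        shows.put(show.getTitle(), show);
        return this;
    }

    // Getter named after the field.
    public Map<String, Show> getShows() {
        return shows;
    }
}
```

Replacing LinkedHashMap with HashMap would still store the same nodes, but the iteration order, and hence the order of any serialised elements, would no longer be predictable.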

Finally, note how this capability leverages the asymmetric adder- and getter-methods. The adder-methods are sensitive to the type of AST node being added, whereas the getter-methods are sensitive to the type of the field being read. This is necessary in order to support all of the Java collection-types in an intuitively natural way.

Unmarshalling from an XML File to a Java AST

The main class to use for unmarshalling an XML file into an in-memory AST is ASTReader, found in the top-level package uk.ac.sheffield.jast. This can be used to read XML from a file or other input stream, using either the default, or a specified, character set, and always discards extra formatting whitespace. If the XML input has no information about Java binding, use the following style:

    File file = new File("my/xml/input.xml");  // Or whatever file
    ASTReader reader = new ASTReader(file);
    reader.useDomain("mydomain.org");             // Or whatever domain
    reader.usePackage("org.mydomain.catalogue");  // Or whatever package
    Catalogue root = (Catalogue) reader.readDocument();
    reader.close();

In this example, the useDomain() instruction tells the reader about your company domain. The usePackage() instruction tells the reader the name of the Java package, owned by this domain, which defines the Java AST classes that you wish to use. This name must resolve to a Java package in the usual way. The result returned by the reader is always an instance of your own root class, here an instance of Catalogue. However, since the reader only knows that the result has the most general Java type Object, you must downcast the result to your chosen AST class-type (in this example, we downcast to Catalogue).

If the XML input declares a default XML namespace and a Java-binding processing instruction mapping this namespace to the desired package, then the useDomain() and usePackage() instructions may be omitted, giving the following shorter style:

    File file = new File("my/xml/input.xml");  // Or whatever file
    ASTReader reader = new ASTReader(file);
    Catalogue root = (Catalogue) reader.readDocument();  // Uses UTF-8
    reader.close();

We will assume, in the following examples, that the Java-binding information is given in the XML files. If not, then invoke the two additional methods on the reader, to set up the desired Java-binding.

By default, ASTReader reads input from a file stream using the UTF-8 character set. However, when reading input from a URL stream, it uses the Latin-1 (ISO-8859-1) character set by default. This is recommended when reading from a URL input stream, since the HTTP protocol expects the Latin-1 encoding by default:

    URL url = new URL("https://www.my.site/input.xml");  // Any URL
    ASTReader reader = new ASTReader(url);
    Catalogue root = (Catalogue) reader.readDocument();	// Uses ISO-8859-1
    reader.close();

This allows you to unmarshal XML files over the Internet. You may also specify a non-default character set explicitly (so long as the XML document declares that it uses the same character encoding). If there is a conflict between the declared and actual character encoding, this will raise an UnsupportedEncodingException. The following shows how to read a file using the Latin-1 character set (overriding the default UTF-8):

    File file = new File("my/xml/input.xml");  
    ASTReader reader = new ASTReader(file, "ISO-8859-1");  // Latin-1
    Catalogue root = (Catalogue) reader.readDocument();
    reader.close();

Note how the reader is instructed, either by a Java-binding instruction in the XML file, or by usePackage(), to map XML elements to Java classes from the named package, before reading the input file. The reader will understand that your classes have fully qualified names, like org.mydomain.catalogue.Film. The Java runtime will attempt to find classes in this package in the usual way, either seeking them in a JAR library you included in your project, or by searching the package directories under your working directory. If you do not specify a package, the reader will expect to find these classes in the default Java package (the working directory) instead.

When unmarshalling a serial XML file into an arbitrarily-connected object graph, circular or re-entrant structures may be restored, if the XML file observes certain conventions on the use of id and ref attributes. Whenever ASTReader encounters an XML element with a new serial id value, it creates a brand-new instance of the corresponding Java class. If it encounters a reference XML element, with a ref attribute, then instead of creating a new object, it restores the in-memory object reference to point to the earlier object, whose id value matches the ref value. (If the numbers get out of sequence, then the XML file is corrupted and an exception is raised.)
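The bookkeeping behind this id/ref convention can be pictured as a simple registry of objects indexed by id. The sketch below is a minimal illustration under that assumption; RefResolverSketch is a hypothetical class, and IllegalStateException stands in for whatever exception JAST actually raises.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: not part of the JAST API.
final class RefResolverSketch {

    // Objects created so far, indexed by their serial id value.
    private final Map<String, Object> byId = new HashMap<>();

    // Called for an element carrying an id attribute: remembers the
    // freshly created object so that later references can find it.
    <T> T register(String id, T freshObject) {
        if (byId.putIfAbsent(id, freshObject) != null) {
            throw new IllegalStateException("Duplicate id: " + id);
        }
        return freshObject;
    }

    // Called for an element carrying a ref attribute: returns the
    // earlier object instead of creating a new one.
    Object resolve(String ref) {
        Object target = byId.get(ref);
        if (target == null) {
            throw new IllegalStateException("Unknown ref: " + ref);
        }
        return target;
    }
}
```

Because resolve() returns the very same object that was registered, two parents that refer to one child end up sharing a single in-memory instance, which is what allows re-entrant and circular graphs to be rebuilt.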

Marshalling from a Java AST to an XML File

The main class to use for marshalling an in-memory AST to a serialised XML file is ASTWriter, found in the top-level package uk.ac.sheffield.jast. This can be used to write the AST to an XML file using either the default, or a specified, character set. The mapping from Java identifiers to XML identifiers can be restored using the mapping discovered during reading (see below). Marshalling will introduce two extra attributes called id and ref, which are reserved names for the JAST toolkit. They help flatten circular, or re-entrant object-graph structures during marshalling, such that these may be restored during unmarshalling.

If the Java AST has never been marshalled to XML before, then the writer will need to know what Java-binding to use. In this case, use the following style:

    Catalogue root = ... ;                      // Created previously
    File file = new File("my/xml/output.xml");  // Or whatever file
    ASTWriter writer = new ASTWriter(file);
    writer.useDomain("mydomain.org");             // Or whatever domain
    writer.usePackage("org.mydomain.catalogue");  // Or whatever package
    writer.writeDocument(root);			// Uses UTF-8
    writer.close();

In this example, the useDomain() instruction tells the writer about your company domain. The usePackage() instruction tells the writer the name of the Java package, owned by this domain, which contains the Java AST classes that you wish to marshal to XML. All of the classes should come from this domain (if not, see below). The writer will pretty-print the XML file according to a standard layout, with newlines and two-character indentation for nested XML structures.

By default the ASTWriter marshals an in-memory Java AST to a serialised XML file using the UTF-8 character set. However, when writing to a general Writer output stream, it uses the Latin-1 (ISO-8859-1) character set by default, since this is the recommended character set for the HTTP protocol; and most web service applications use this character set by default:

    Catalogue root = ... ;                      // Created previously
    HttpServletResponse response = ... ;	// Created by a servlet
    ASTWriter writer = new ASTWriter(response.getWriter());
    writer.useDomain("mydomain.org");
    writer.usePackage("org.mydomain.catalogue");
    writer.writeDocument(root);			// Uses ISO-8859-1
    writer.close();

In this web-service example, we access the PrintWriter from a Java HttpServletResponse object, which uses Latin-1 by default. The two-argument constructor may also be used to specify a character set explicitly, as the second argument. A safer way of generating output to send via HTTP might be the following:

    Catalogue root = ... ;                      // Created previously
    HttpServletResponse response = ... ;	// Created by a servlet
    ASTWriter writer = new ASTWriter(
        response.getWriter(), 
        response.getCharacterEncoding());
    writer.useDomain("mydomain.org");
    writer.usePackage("org.mydomain.catalogue");
    writer.writeDocument(root);                // Uses explicit encoding
    writer.close();

In this case, the character encoding declared in the serialised XML document will be the same as that used by the PrintWriter output stream which wrote the document.

When serialising an arbitrary graph of Java objects as XML elements, every new object encountered will be written out as a named XML element. When ASTWriter encounters an object for the first time, it will add an id attribute, whose value is the next serial identifier in sequence, starting from 1. If the object is encountered a second time, it is not written out in full, but a reference XML element is written, with the same name and a ref attribute, whose value is the same as the object identifier.
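The numbering scheme just described can be sketched with an identity-based map from objects to serial identifiers. The following is a simplified illustration, not JAST's implementation: IdRefSketch and its generic Node class are hypothetical, the output omits indentation and attributes other than id and ref, and the exact layout that ASTWriter produces may differ.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not part of the JAST API.
final class IdRefSketch {

    // Hypothetical generic node: a name plus dependent nodes.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Identity (not equals) decides whether an object was seen before.
    private final Map<Node, Integer> ids = new IdentityHashMap<>();
    private final StringBuilder out = new StringBuilder();

    String write(Node root) {
        visit(root);
        return out.toString();
    }

    private void visit(Node node) {
        Integer seen = ids.get(node);
        if (seen != null) {
            // Second encounter: write a reference element only.
            out.append('<').append(node.name)
               .append(" ref=\"").append(seen).append("\"/>\n");
            return;
        }
        int id = ids.size() + 1;           // next serial id, starting at 1
        ids.put(node, id);
        out.append('<').append(node.name)
           .append(" id=\"").append(id).append("\">\n");
        for (Node child : node.children) {
            visit(child);
        }
        out.append("</").append(node.name).append(">\n");
    }

    public static void main(String[] args) {
        Node director = new Node("Director");
        Node root = new Node("Catalogue")
            .add(new Node("Film").add(director))
            .add(new Node("Film").add(director));   // shared node
        System.out.print(new IdRefSketch().write(root));
    }
}
```

In the usage example, the shared Director node is written in full under the first Film with id="3", and as a reference element with ref="3" under the second. Registering each node before visiting its children is what lets circular structures terminate.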

When marshalling a graph of Java objects to XML, ASTWriter will add an XML processing instruction declaring the Java-binding from the default XML namespace to the Java package that you specified in the usePackage() method. It will also add a default namespace URI declaration to the root XML element, using the domain name that you specified in the useDomain() method to determine how to convert the Java package name into a unique namespace URI, using some of the package name as a domain, and the rest as a project identifier.
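One way to picture the derivation of the namespace URI is to reverse the domain name, strip it from the front of the package name, and use the remainder as a path. The sketch below follows that reading and reproduces the URI shown in the example XML file; the class NamespaceUriSketch, the https scheme, and the handling of multi-part remainders are assumptions, and JAST's own rule may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: not part of the JAST API.
final class NamespaceUriSketch {

    // Builds a namespace URI from a domain such as "mydomain.org"
    // and a package such as "org.mydomain.catalogue".
    static String toNamespaceUri(String domain, String packageName) {
        // Reverse the domain parts: "mydomain.org" -> "org.mydomain".
        List<String> parts = new ArrayList<>(Arrays.asList(domain.split("\\.")));
        Collections.reverse(parts);
        String reversed = String.join(".", parts);

        // Whatever follows the reversed domain is the project identifier.
        String rest = packageName;
        if (packageName.equals(reversed)) {
            rest = "";
        } else if (packageName.startsWith(reversed + ".")) {
            rest = packageName.substring(reversed.length() + 1);
        }

        String path = rest.replace('.', '/');
        return "https://" + domain + (path.isEmpty() ? "" : "/" + path);
    }

    public static void main(String[] args) {
        // prints https://mydomain.org/catalogue
        System.out.println(toNamespaceUri("mydomain.org", "org.mydomain.catalogue"));
    }
}
```

If the package name does not begin with the reversed domain, this sketch simply uses the whole package name as the path; how JAST treats that case is not covered here.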

Controlling the Java-to-XML Mapping

Whenever an ASTReader unmarshals an XML file, it creates an internal data structure recording everything about the XML-to-Java mapping. This may include information about how certain XML names have been normalised to ensure that they conform to legal Java names, or information about which XML namespaces were mapped to different Java packages used for different libraries of user-defined AST classes. All of this information is stored in a single object of the type Metadata.

Whenever an ASTWriter marshals an in-memory AST back to serial XML, the easiest way of restoring the same Java-to-XML mappings is to use the same Metadata instance that was constructed during unmarshalling. This can remove the need, when writing, to declare XML namespaces or mapped packages explicitly. The following illustrates this:

    File in = new File("my/xml/input.xml");      // Declares a Java-binding
    ASTReader reader = new ASTReader(in);
    Catalogue root = (Catalogue) reader.readDocument();
    Metadata metadata = reader.getMetadata();    // Extract the metadata
    reader.close();

    // Program does something with the AST

    File out = new File("my/xml/output.xml");	 // Whatever output file
    ASTWriter writer = new ASTWriter(out);
    writer.setMetadata(metadata);		 // Restore old metadata
    writer.writeDocument(root);
    writer.close();

That is, the Metadata object is extracted from the reader using getMetadata(); and the writer is instructed to use the same metadata using setMetadata(metadata). For example, this will ensure that when an XML element with a namespace prefix and non-standard Java name cat:TV-show is mapped to a Java class org.mydomain.catalogue.TVShow, then this class will be mapped back to cat:TV-show when it is written. It will also ensure that the output XML file will have the same XML version and encoding as used in the input file; and will declare the same XML namespaces (mapping these to the same Java packages) as the input file. If the metadata is not transferred, then the writer will use default settings and the element will be serialised as TVShow, expected to be a Java class in the default Java package.

XML namespace prefixes can be used to identify XML elements that should be mapped to Java classes from different user-defined packages. This is a common requirement in some model-driven engineering applications, where models are serialised as XML. Below, we imagine a transport-related Java model, in which the AST classes are split over a core package and a separate transport package. We specify that XML elements from different namespaces xmlns:core and xmlns:tran should be mapped to different packages in the following way:

    File file = new File("my/xml/input.xml");	// Whatever input file
    ASTReader reader = new ASTReader(file);
    reader.useDomain("mydomain.org");
    reader.usePackage("org.mydomain.model.core", "xmlns:core");
    reader.usePackage("org.mydomain.model.transport", "xmlns:tran");
    Catalogue root = (Catalogue) reader.readDocument();
    reader.close();

This tells the reader to use classes in the core model package org.mydomain.model.core when mapping XML elements that begin with the prefix core, for example, core:Container, and tells the reader to use classes in the transport package org.mydomain.model.transport when mapping XML elements that begin with the prefix tran, for example, tran:Vehicle. Exactly the same information may be given to an ASTWriter, to ensure that classes from particular packages are mapped to XML elements from different namespaces.

XML namespaces work exactly like Java packages, in that they provide a scope for XML elements that might otherwise have the same name. We leverage XML namespaces, in order to map XML elements from different namespaces to Java classes from different packages. Notice how this style of usePackage() has two arguments: the first argument is the Java package name, and the second is an XML namespace declaration, introducing the special prefix. In the earlier one-argument usage of usePackage(), all XML elements were assumed to come from the default XML namespace xmlns.

Apart from this, it is possible to access Metadata directly, using its own API. This allows you to set XML file properties and declare explicit mappings between XML identifiers and Java class names. Metadata properties map XML namespace attributes to their corresponding URIs. Metadata bindings map XML namespaces to Java packages. If Java-binding is declared, but no URI was declared for a namespace, JAST will try to synthesise a URI from the domain and the Java package name. Please see the JAST 2.5 package APIs for more details.

Notification of Exceptions

ASTReader and ASTWriter may raise various kinds of IOException, if a problem occurs with the underlying file system. Ill-formed XML syntax is reported through SyntaxError, whereas the inability to construct or manipulate an AST node class is reported through SemanticError. This covers a variety of errors, including missing constructors, missing methods, or failing methods. In summary, faulty user code may raise the following:

  • FileNotFoundException - raised if the specified file cannot be found (wrong pathname given)
  • UnsupportedEncodingException - raised if the character set encodings are inconsistent
  • IOException - raised if a fault in the filesystem occurs while reading an XML input file
  • SyntaxError - raised if a syntax error is detected while parsing an XML input file
  • SemanticError - raised if any required construction or access method is not found, or fails

The last two are styled as errors, rather than exceptions, since the W3C standard requires malformed XML to be rejected outright, and not handled by exception-tolerant software.

Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom