Binding XML to Custom Java Classes
The JAST 2.5 Toolkit provides custom readers and
writers for converting between XML text files and arbitrary Abstract Syntax Trees
(ASTs) built from your own Java classes. This facility is also known as XML-to-Java
binding, and is useful because it lets you work with Java types that are meaningful
in your application domain. JAST adopts a convention-over-configuration approach:
if you define your Java classes in a fairly standard way, the readers and writers
will automatically detect how to marshal Java to XML, and unmarshal XML back to Java.
The main components to use are ASTReader and ASTWriter,
both found in the top-level package uk.ac.sheffield.jast. You have
to provide your own Java classes for the AST nodes, but some examples are given
in the package uk.ac.sheffield.jast.ast.
This part of the user guide describes the conventions for designing Java AST
classes, how to use the custom readers and writers to write Java models built
from your AST classes as serialised XML files, and how to read a serialised XML file
to restore an exact copy of your original Java model. The custom readers and
writers are able to handle simple Java object trees, or circular and re-entrant
Java object graphs. Once you have understood the basic AST-processing concepts
presented in this quick-start introduction, please refer to the
JAST 2.5 package APIs for more detailed
information.
Designing an XML Data Model
The first thing you will need to do is decide what kind of data you wish
to model. Having done this, you will develop an XML markup scheme, using a
mixture of XML elements and attributes to describe and encode the data. For
example, a catalogue that stores information about films and TV shows might
look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?java-binding xmlns="org.mydomain.catalogue"?>
<Catalogue xmlns="https://mydomain.org/catalogue">
<Film year="1977" rating="PG">
<Title>Star Wars</Title>
<Director>George Lucas</Director>
</Film>
<TVShow year="1965">
<Title>Thunderbirds</Title>
<Director>Gerry Anderson</Director>
</TVShow>
<Film year="2007" rating="15">
<Title>Transformers</Title>
<Director>Michael Bay</Director>
</Film>
</Catalogue>
So, the main XML nodes are called Catalogue, Film,
TVShow, Title and Director; and the
attributes year and rating are used in some nodes.
Nodes like Title and Director are also known as
leaf-nodes, because they are terminal nodes containing no further
descendants, but only textual data (and possibly attributes). Other nodes
are known as branch-nodes and contain descendants; in particular,
one branch-node Catalogue is the root-node for the
whole tree.
A further thing to note is that a default XML namespace URI is declared in
the root node (for a Java project called catalogue owned by a company
called mydomain.org). Finally, note how an XML processing instruction
appears after the XML declaration, declaring that the Java binding will map
elements from this namespace to classes in the Java package
org.mydomain.catalogue. These extra declarations are optional but,
if present, they tell JAST how to unmarshal the XML file to Java.
Designing Java AST Node Classes
Once you have a stable XML model, you can consider developing the Java AST
model. The basic notion is that, for each differently-named XML element, you
will provide a Java class with the same name that stores the information held
by this element. By default, an XML element named Film will be
mapped to a Java class of the same name. The first thing that your AST class
must do is provide a public default constructor (with no arguments). This is
needed so that the unmarshalling ASTReader can create a fresh
instance of this node, every time it encounters an XML element of the
same name.
public class Film {
// default constructor
public Film() {
}
...
}
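The need for a public default constructor can be illustrated with a small reflection sketch. This is not JAST's implementation: the names NodeFactory and newNode are hypothetical, and standard JDK classes stand in for your own AST classes so that the example runs on its own.

```java
// Hypothetical sketch: how a reader might create a node from an element name.
final class NodeFactory {

    // Builds "package.ElementName", loads that class and calls its public
    // no-argument constructor; fails if the class or constructor is missing.
    static Object newNode(String javaPackage, String elementName) {
        String className = javaPackage.isEmpty()
                ? elementName
                : javaPackage + "." + elementName;
        try {
            return Class.forName(className).getConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Cannot create a node for <" + elementName + ">", e);
        }
    }

    public static void main(String[] args) {
        // java.util.ArrayList has a public default constructor, so this works.
        System.out.println(newNode("java.util", "ArrayList").getClass());
        // java.lang.Integer has no default constructor, so creation fails.
        try {
            newNode("java.lang", "Integer");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A class without a public no-argument constructor cannot be instantiated this way, which is why every AST class must supply one.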
Since XML permits more liberal identifiers than Java, some XML names must
be normalised. For example, a namespace-prefixed XML identifier
car:ford-focus would be mapped to a Java class
FordFocus (see below about the use of namespaces).
The normalising algorithm removes namespace prefixes and all internal
punctuation, capitalising the letter following each removed punctuation mark,
on the assumption that this occurred at a word boundary (a Java style known as
CapitalCase). Similarly, XML attribute names that do not conform to
Java syntax are normalised (to the Java style known as camelCase).
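The normalising rule just described can be sketched as follows. This is one plausible reading of the rule, not JAST's actual code: the class NameNormaliser and its two methods are hypothetical, and the sketch assumes that every character that is not a letter or digit counts as punctuation. It does not handle names that begin with a digit.

```java
// Hypothetical sketch of XML-to-Java name normalisation.
final class NameNormaliser {

    // Strips any namespace prefix, drops punctuation and capitalises the
    // letter that follows each dropped character, giving CapitalCase.
    static String toClassName(String xmlName) {
        String local = xmlName.substring(xmlName.indexOf(':') + 1);
        StringBuilder out = new StringBuilder();
        boolean upperNext = true;   // the first letter is always capitalised
        for (char ch : local.toCharArray()) {
            if (!Character.isLetterOrDigit(ch)) {
                upperNext = true;   // assume punctuation marks a word boundary
            } else {
                out.append(upperNext ? Character.toUpperCase(ch) : ch);
                upperNext = false;
            }
        }
        return out.toString();
    }

    // Same rule, but with a lower-case first letter, giving camelCase.
    static String toFieldName(String xmlName) {
        String name = toClassName(xmlName);
        return name.isEmpty()
                ? name
                : Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toClassName("car:ford-focus"));  // prints FordFocus
        System.out.println(toClassName("cat:TV-show"));     // prints TVShow
        System.out.println(toFieldName("release-year"));    // prints releaseYear
    }
}
```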
All the Java classes that make up your AST will be provided by you, in a Java
package. The unmarshalling reader ASTReader must be told
which package to use when mapping XML elements to Java classes. Similarly,
the marshalling writer ASTWriter must be told about this package.
Each Java AST class in this package will correspond to one named XML element.
It may declare a number of Java fields to store attribute data from the XML
element. It may also declare a number of Java fields to store dependent AST
nodes, which will be mapped to Java from other XML elements. It may declare
one field called content to store simple textual or numeric
content. The marshalling and unmarshalling tools work out how to set
attributes, set textual content, or attach dependent nodes by using Java
reflection, a means of interrogating an object to find out what methods
its class defines for setting and getting values.
Mapping XML Attributes to Java
In the XML data model above, the XML element called Film
declares the attributes year and rating
(representing the year of the film's release and its censorship rating).
In Java, you will provide a corresponding class called Film ,
with private fields corresponding to the attributes:
public class Film {
// fields storing XML attributes
private int year;
private String rating;
...
}
Note how these fields can be of any simple type - not just the String type.
The tools will convert between XML text data and any of the simple Java types.
In order for the unmarshalling ASTReader to recognise these Java
fields as the targets for mapping XML attributes, you should provide the class
with conventionally-named getter- and setter-methods for these fields:
public class Film {
// fields storing XML attributes
private int year;
private String rating;
// methods for accessing XML attributes
public int getYear() { // getter for year-attribute
return year;
}
public Film setYear(int year) { // setter for year-attribute
this.year = year;
return this;
}
public String getRating() {
return rating;
}
public Film setRating(String rating) {
this.rating = rating;
return this;
}
...
}
That is, for every attribute-field named X and having the type
T, you must provide two methods, T getX() and
setX(T val), which respectively return, and accept, a value with the same
type as the field. This style should be familiar if you have ever created Java
classes in the style of JavaBeans, a widely used Java component convention.
Any Java field which has these two methods will be marshalled as an XML attribute
by ASTWriter (unless the field is called content; see
below). If you wish for your class to have a secret, internal field, then you may
declare this field with the Java keyword transient, which will prevent it
from being serialised. The setter-methods conventionally return this,
the object being modified by the setter (but could return void).
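The attribute convention above can be checked with plain Java reflection. The following is a minimal sketch of the idea, not JAST's own discovery code: the class AttributeDiscovery, the method attributeNames and the extra field cacheHits are all hypothetical, introduced only for illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: finding the fields that follow the attribute convention.
final class AttributeDiscovery {

    static class Film {
        private int year;
        private String rating;
        private transient int cacheHits;   // internal field, never serialised

        public Film() {}
        public int getYear() { return year; }
        public Film setYear(int year) { this.year = year; return this; }
        public String getRating() { return rating; }
        public Film setRating(String rating) { this.rating = rating; return this; }
        public int getCacheHits() { return cacheHits; }
        public Film setCacheHits(int hits) { this.cacheHits = hits; return this; }
    }

    // Returns the names of the non-transient fields (other than "content")
    // that have both a T getX() and a setX(T) method.
    static List<String> attributeNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isTransient(field.getModifiers())) {
                continue;
            }
            String name = field.getName();
            if (name.equals("content")) {
                continue;   // reserved for simple content, not an attribute
            }
            String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
            try {
                Method getter = type.getMethod("get" + suffix);
                type.getMethod("set" + suffix, field.getType());
                if (getter.getReturnType() == field.getType()) {
                    names.add(name);
                }
            } catch (NoSuchMethodException e) {
                // no matching getter/setter pair, so not an attribute field
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Lists year and rating; cacheHits is skipped because it is transient.
        System.out.println(attributeNames(Film.class));
    }
}
```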
Mapping Dependent XML Elements to Java
In the XML data model above, the XML element called Film also has some
dependent children elements, called Title and Director .
These elements will be mapped recursively to Java objects, instances of other AST
classes. In the Film class, you will provide private Java fields to
attach these objects to the Film object:
public class Film {
...
// fields storing XML dependent elements
private Title title;
private Director director;
...
}
Note how each of these fields is strongly-typed with the respective type of the node
to be stored there. The fields happen to be declared in the order title ,
and then director . This order will be the order in which the marshalling
ASTWriter serialises the XML element children. If you wish the dependent
XML children to be marshalled in a different order, simply change the order in which
the Java fields are declared. In order that the unmarshalling
ASTReader may recognise these Java fields as the targets for attaching
dependent XML elements, you should provide the Film class with
conventionally-named adder- and getter-methods:
public class Film {
...
// fields storing XML dependent elements
private Title title;
private Director director;
...
// methods for accessing XML dependent elements
public Title getTitle() { // getter for Title-child
return title;
}
public Film addTitle(Title title) { // adder for the Title-child
this.title = title;
return this;
}
public Director getDirector() {
return director;
}
public Film addDirector(Director director) {
this.director = director;
return this;
}
}
That is, for every dependent AST node with the type T that is stored in
a field named X, you should provide an adder-method with the name
addT(T obj) accepting an object of this type; and a getter-method
named T getX() that returns a value of this type (getters of the form
Collection<T> getX() are also possible, as shown below). Note that
the adder-method is named after the type you are adding, whereas the getter-method is
named after the field in which you stored it. This is deliberately asymmetrical, to
distinguish dependent elements from attributes.
Any Java field which has these two methods will be marshalled as a dependent
XML child-element by ASTWriter. If you wish for your class to have a
secret field storing an internal reference to another object, then you may declare this
field with the Java keyword transient, which will prevent it from being
serialised. The getter-methods shown here return a single object. Below, we show how
they could also return any collection of objects of the type that was added. The
adder-methods conventionally return this, the object being modified by
the adder (although they could return void).
Mapping XML Text Content to Java
In the XML data model above, the XML element called Title only has
textual content, the title of the film. Similarly, the XML element called
Director only stores textual content, the name of the director.
These leaf-nodes in the XML tree store simple content, which could be text (as here)
or some other simple integer or real value. Any such class which has content
must declare a private Java field called content for storing this
information, with a pair of setter- and getter-methods to access the content. The
conventions are exactly the same as for storing XML attributes, except that the
reserved name for this field is always content:
public class Director {
...
// field storing simple content
private String content;
// methods to access simple content
public String getContent() {
return content;
}
public Director setContent(String content) {
this.content = content;
return this;
}
}
That is, if a class stores XML content of the basic type T, it must
provide a field named T content and two methods, T getContent()
and setContent(T val). The name of the field is what distinguishes this
content-field from other fields used to store XML attributes.
In this example, the content is naturally of the type String. If you wish to store
strongly-typed numeric content, then the field-type and the types returned and
accepted by the access-methods may be of the appropriate numeric type (similar to
attributes; see above).
The marshalling and unmarshalling tools can convert between XML text content and
any of the Java basic types. They use the declared types of the fields to work out
how to convert text data, raising an exception if the text cannot be
converted to this type. Furthermore, if you wish to store an arbitrary object as
content (or as an attribute), you may do so, provided that this object's class
has a constructor-from-String and a standard String conversion method
toString().
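The type-directed conversion described above might look like the following sketch. It is an illustration under stated assumptions rather than JAST's code: the class TextConversion and the method convert are hypothetical, only a few basic types are covered, and java.math.BigDecimal stands in for an arbitrary class with a constructor-from-String.

```java
import java.lang.reflect.Constructor;
import java.math.BigDecimal;

// Hypothetical sketch: converting XML text to the declared type of a field.
final class TextConversion {

    // Converts text to the target type, throwing IllegalArgumentException
    // (or its subclass NumberFormatException) if the text does not fit.
    static Object convert(String text, Class<?> target) {
        String trimmed = text.trim();
        if (target == String.class) {
            return text;
        }
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(trimmed);
        }
        if (target == long.class || target == Long.class) {
            return Long.valueOf(trimmed);
        }
        if (target == double.class || target == Double.class) {
            return Double.valueOf(trimmed);
        }
        if (target == boolean.class || target == Boolean.class) {
            return Boolean.valueOf(trimmed);
        }
        try {
            // Fall back on a public constructor taking a single String.
            Constructor<?> fromString = target.getConstructor(String.class);
            return fromString.newInstance(trimmed);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot convert \"" + text + "\" to " + target.getName(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("1977", int.class));
        System.out.println(convert("12.50", BigDecimal.class));
    }
}
```

Writing such a value back to XML only needs toString(), which is why the two methods are required together.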
Factoring Common Behaviour in AST Nodes
Sometimes, different AST classes may end up looking quite similar, and
it would be a chore to have to repeat similar coding for several classes.
For example, the Film and TVShow classes overlap
considerably, in terms of their dependent- and attribute-fields, and their
associated getter- and setter-methods. Fortunately, you may arrange your
AST classes in a hierarchy, according to their similarities, just as you
would expect in Java. The following Show class is intended as the
abstract superclass of both Film and TVShow :
public abstract class Show {
private int year;
private String rating;
private Title title;
private Director director;
// public default constructor
public Show() { ... }
// methods to add XML dependents
public Show addTitle(Title title) { ... }
public Show addDirector(Director director) { ... }
// methods to access XML dependents
public Title getTitle() { ... }
public Director getDirector() { ... }
// methods to set XML attributes
public Show setYear(int year) { ... }
public Show setRating(String rating) { ... }
// methods to access XML attributes
public int getYear() { ... }
public String getRating() { ... }
}
All the common fields and methods needed are defined in one place. Now,
it is very easy to define the AST classes for Film and
TVShow as subclasses of Show , using Java inheritance,
and obtain all the expected fields and construction methods from the
superclass:
public class Film extends Show {
public Film() {} // only needs a default constructor
}
public class TVShow extends Show {
public TVShow() {} // only needs a default constructor
}
The JAST toolkit makes it very easy for programmers to factor out common
behaviour in classes, as you would expect. Other XML Java-binding tools
cannot do this as easily (for example, JAXB will generate duplicated APIs
from an XML Schema).
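Inherited accessors can be found by reflection because a lookup of public methods on a subclass also returns the public methods of its superclasses. The sketch below demonstrates this with the standard JDK call Class.getMethod; the class InheritedAccessors and the method roundTripYear are hypothetical, and the Show and Film classes are cut down to a single attribute.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: accessor methods inherited from a superclass are
// visible when reflecting on the subclass.
final class InheritedAccessors {

    static abstract class Show {
        private int year;
        public int getYear() { return year; }
        public Show setYear(int year) { this.year = year; return this; }
    }

    static class Film extends Show {
        public Film() {}   // only needs a default constructor
    }

    // Sets and then reads back the year attribute, looking up both
    // methods reflectively on the runtime class of the node.
    static int roundTripYear(Object node, int year) {
        try {
            Method setter = node.getClass().getMethod("setYear", int.class);
            Method getter = node.getClass().getMethod("getYear");
            setter.invoke(node, year);
            return (Integer) getter.invoke(node);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Accessor not usable", e);
        }
    }

    public static void main(String[] args) {
        // Film declares no accessors itself, yet both lookups succeed.
        System.out.println(roundTripYear(new Film(), 1977));
    }
}
```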
Handling Heterogeneous Collections of AST Nodes
Furthermore, the JAST toolkit makes it easy to manipulate polymorphic lists
of AST nodes having heterogeneous types. For example, let us assume that, in
the root node Catalogue , we do not care to distinguish between
the action of adding a Film and that of adding a TVShow .
Instead, we are only interested in adding polymorphic Show objects.
Accordingly, we can design the construction API for
Catalogue in the following way:
public class Catalogue {
// field to store heterogeneous dependents
private List<Show> shows;
// default constructor creates the list field
public Catalogue() {
shows = new ArrayList<Show>();
}
// methods required to add/access dependents
public Catalogue addShow(Show show) {
shows.add(show);
return this;
}
public List<Show> getShows() {
return shows;
}
... // possibly other methods, as desired
}
Two things have happened here. Firstly, rather than providing
Catalogue with separate add-methods addFilm(Film) and
addTVShow(TVShow) , we have decided that a Catalogue
need not distinguish the two, and have simply provided addShow(Show)
that accepts a polymorphic Show argument. The JAST
reflection tools will automatically discover this more general
method, if you don't supply the more specific methods (which would take
priority).
Secondly, the get-method getShows() will now be used to
access the heterogeneous list of films and TV shows. This method will be
detected automatically, by reflecting the name of the field.
Notice how, in contrast to earlier examples, this dependent-field's
get-method returns a list of objects. These objects will be marshalled in
the same order that they were added to the list, as XML elements of mixed
kinds (the example XML file above illustrates the mixed children of the
Catalogue root node).
Although storing dependent nodes in a Java List is the most
common case, it is also possible to store them in a Set or a
Map . In the case of a Map , the dependent node
should be stored as a value in the Map , indexed against
some key (typically an identifying attribute of the stored node). The JAST
unmarshaller will seek to discover a suitable adder-method for the type of
node stored in any collection-typed field, and from this will also
determine that the field can be serialised as a collection of dependent XML
elements. Note that if unordered Set or Map Java
implementations are chosen, the order of saved nodes may not be stable.
Finally, note how this capability leverages the asymmetric adder-
and getter-methods. The adder-methods are sensitive to the type of AST
node being added; whereas the getter-methods are sensitive to the type
of the field being read. This is necessary in order to support all of
the Java collection-types in an intuitively natural way.
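One way to find the most specific adder-method for a node is to try the node's own class name first, and then each superclass in turn. The sketch below shows this search strategy; it is an assumption about how such a lookup could work, not a description of JAST's internals, and the names AdderLookup and findAdder are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: searching for add<Type>(Type), most specific type first.
final class AdderLookup {

    static abstract class Show {}
    static class Film extends Show {}
    static class TVShow extends Show {}

    static class Catalogue {
        private final List<Show> shows = new ArrayList<>();
        public Catalogue addShow(Show show) { shows.add(show); return this; }
        public List<Show> getShows() { return shows; }
    }

    // Tries addFilm(Film), then addShow(Show), and so on up the hierarchy;
    // returns null if the parent class offers no suitable adder.
    static Method findAdder(Class<?> parent, Class<?> child) {
        for (Class<?> type = child;
                type != null && type != Object.class;
                type = type.getSuperclass()) {
            try {
                return parent.getMethod("add" + type.getSimpleName(), type);
            } catch (NoSuchMethodException e) {
                // no adder for this type; try the superclass next
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Catalogue has no addFilm(Film), so the search settles on addShow(Show).
        System.out.println(findAdder(Catalogue.class, Film.class).getName());
        System.out.println(findAdder(Catalogue.class, TVShow.class).getName());
    }
}
```

Because the specific name is tried first, a class that did declare addFilm(Film) would have that method chosen in preference to addShow(Show).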
Unmarshalling from an XML File to a Java AST
The main class to use for unmarshalling an XML file into an in-memory
AST is ASTReader , found in the top-level package
uk.ac.sheffield.jast . This can be used to read XML from
a file or other input stream, using either the default, or a specified,
character set, and always discards extra formatting whitespace. If the
XML input has no information about Java binding, use the following style:
File file = new File("my/xml/input.xml"); // Or whatever file
ASTReader reader = new ASTReader(file);
reader.useDomain("mydomain.org"); // Or whatever domain
reader.usePackage("org.mydomain.catalogue"); // Or whatever package
Catalogue root = (Catalogue) reader.readDocument();
reader.close();
In this example, the useDomain() instruction tells the
reader about your company domain. The usePackage() instruction
tells the reader the name of the Java package, owned by this domain, which
defines the Java AST classes that you wish to use. This name must resolve
to a Java package in the usual way. The result returned by
the reader is always an instance of your own root class, here an instance of
Catalogue . However, since the reader can only guess that it has
the most general Java type Object , you must downcast the result
to your chosen AST class-type (in this example, we downcast to
Catalogue ).
If the XML input declares a default XML namespace and a Java-binding
processing instruction mapping this namespace to the desired package, then
the useDomain() and usePackage() instructions may
be omitted, giving the following shorter style:
File file = new File("my/xml/input.xml"); // Or whatever file
ASTReader reader = new ASTReader(file);
Catalogue root = (Catalogue) reader.readDocument(); // Uses UTF-8
reader.close();
We will assume, in the following examples, that the Java-binding information
is given in the XML files. If not, then invoke the two additional methods on
the reader, to set up the desired Java-binding.
By default, ASTReader reads input from a file stream using the
UTF-8 character set. However, when reading input from a URL stream, it uses
the Latin-1 (ISO-8859-1) character set by default. This is recommended when
reading from a URL input stream, since the HTTP protocol expects the
Latin-1 encoding by default:
URL url = new URL("https://www.my.site/input.xml"); // Any URL
ASTReader reader = new ASTReader(url);
Catalogue root = (Catalogue) reader.readDocument(); // Uses ISO-8859-1
reader.close();
This allows you to unmarshal XML files over the Internet. You may also
specify a non-default character set explicitly (so long as the XML document
declares that it uses the same character encoding). If there is a conflict
between the declared and actual character encoding, this will raise an
UnsupportedEncodingException . The following shows how to read
a file using the Latin-1 character set (overriding the default UTF-8):
File file = new File("my/xml/input.xml");
ASTReader reader = new ASTReader(file, "ISO-8859-1"); // Latin-1
Catalogue root = (Catalogue) reader.readDocument();
reader.close();
Note how the reader is instructed, either by a Java-binding instruction
in the XML file, or by usePackage() , to map XML elements to
Java classes from the named package, before reading the input file. The
reader will understand that your classes have fully qualified names, like:
org.mydomain.catalogue.Film . The Java runtime will attempt to
find classes in this package in the usual way, either seeking them in a JAR
library you included in your project, or by searching the package directories
under your working directory. If you fail to specify otherwise, the reader
will expect to find these classes in the default Java package (the
working directory) instead.
When unmarshalling a serial XML file into an arbitrarily-connected
object graph, circular or re-entrant structures may be restored, if the
XML file observes certain conventions on the use of id and
ref attributes. Whenever ASTReader encounters
an XML element with a new serial id value, it creates a
brand-new instance of the corresponding Java class. If it encounters a
reference XML element, with a ref attribute, then instead
of creating a new object, it restores the in-memory object reference to
point to the earlier object, whose id value matches the
ref value. (If the numbers get out of sequence, then the
XML file is corrupted and an exception is raised.)
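The id and ref convention can be illustrated with a small stand-alone reader built on the JDK's own DOM parser. This is a sketch of the technique only, not of ASTReader: the classes GraphRestoring and Item, and the label attribute, are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: restoring shared and circular references from XML.
final class GraphRestoring {

    static class Item {
        String label;
        final List<Item> links = new ArrayList<>();
    }

    static Item read(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return build(root, new HashMap<>());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Unreadable XML", e);
        }
    }

    private static Item build(Element element, Map<String, Item> byId) {
        if (element.hasAttribute("ref")) {
            // A reference element: reuse the object created earlier.
            Item earlier = byId.get(element.getAttribute("ref"));
            if (earlier == null) {
                throw new IllegalStateException(
                        "No earlier id matches ref " + element.getAttribute("ref"));
            }
            return earlier;
        }
        Item item = new Item();
        item.label = element.getAttribute("label");
        // Register before visiting children, so that cycles can resolve.
        byId.put(element.getAttribute("id"), item);
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child instanceof Element) {
                item.links.add(build((Element) child, byId));
            }
        }
        return item;
    }

    public static void main(String[] args) {
        Item a = read("<Item id=\"1\" label=\"a\">"
                + "<Item id=\"2\" label=\"b\"><Item ref=\"1\"/></Item></Item>");
        // The innermost reference points back to the root object itself.
        System.out.println(a.links.get(0).links.get(0) == a);
    }
}
```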
Marshalling from a Java AST to an XML File
The main class to use for marshalling an in-memory AST to a serialised
XML file is ASTWriter , found in the top-level package
uk.ac.sheffield.jast . This can be used to write the AST to
an XML file using either the default, or a specified, character set. The
mapping from Java identifiers to XML identifiers can be restored using the
mapping discovered during reading (see below). Marshalling will introduce
two extra attributes called id and ref , which are
reserved names for the JAST toolkit. They help flatten circular, or
re-entrant object-graph structures during marshalling, such that these
may be restored during unmarshalling.
If the Java AST has never been marshalled to XML before, then the writer
will need to know what Java-binding to use. In this case, use the following
style:
Catalogue root = ... ; // Created previously
File file = new File("my/xml/output.xml"); // Or whatever file
ASTWriter writer = new ASTWriter(file);
writer.useDomain("mydomain.org"); // Or whatever domain
writer.usePackage("org.mydomain.catalogue"); // Or whatever package
writer.writeDocument(root); // Uses UTF-8
writer.close();
In this example, the useDomain() instruction tells the
writer about your company domain. The usePackage() instruction
tells the writer the name of the Java package, owned by this domain, which
contains the Java AST classes that you wish to marshal to XML. All of the
classes should come from this domain (if not, see below). The writer will
pretty-print the XML file according to a standard layout, with newlines and
two-character indentation for nested XML structures.
By default the ASTWriter marshals an in-memory Java AST
to a serialised XML file using the UTF-8 character set. However, when
writing to a general Writer output stream, it uses the Latin-1
(ISO-8859-1) character set by default, since this is the recommended
character set for the HTTP protocol; and most web service applications use
this character set by default:
Catalogue root = ... ; // Created previously
HttpServletResponse response = ... ; // Created by a servlet
ASTWriter writer = new ASTWriter(response.getWriter());
writer.useDomain("mydomain.org");
writer.usePackage("org.mydomain.catalogue");
writer.writeDocument(root); // Uses ISO-8859-1
writer.close();
In this web-service example, we access the PrintWriter
from a Java HttpServletResponse object, which uses Latin-1
by default.
The two-argument constructor may also be used to specify a character set
explicitly, as the second argument. A safer way of generating output
to send via HTTP might be the following:
Catalogue root = ... ; // Created previously
HttpServletResponse response = ... ; // Created by a servlet
ASTWriter writer = new ASTWriter(
response.getWriter(),
response.getCharacterEncoding());
writer.useDomain("mydomain.org");
writer.usePackage("org.mydomain.catalogue");
writer.writeDocument(root); // Uses explicit encoding
writer.close();
in which case the same character encoding will be declared in the
serialised XML document as that used by the PrintWriter
output stream, which wrote the document.
When serialising an arbitrary graph of Java objects as XML elements,
every new object encountered will be written out as a named XML element.
When ASTWriter encounters this object for the first time,
it will add an id attribute, whose value is the next serial
identifier in sequence, starting from 1. If the object is encountered a
second time, it is not written out in full, but a reference XML element is
written, with the same name and a ref attribute, whose
value is the same as the object identifier.
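The numbering scheme can be sketched with an identity-based map that records each object the first time it is written. This illustrates the general technique and is not ASTWriter's code: the classes GraphFlattening and Node, and the label attribute, are hypothetical, and labels are assumed to need no XML escaping.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: flattening a circular object graph with id and ref.
final class GraphFlattening {

    static class Node {
        final String label;
        final List<Node> links = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    static String write(Node root) {
        StringBuilder out = new StringBuilder();
        write(root, new IdentityHashMap<>(), out, "");
        return out.toString();
    }

    private static void write(Node node, Map<Node, Integer> seen,
            StringBuilder out, String indent) {
        Integer known = seen.get(node);
        if (known != null) {
            // Seen before: write a short reference element instead.
            out.append(indent).append("<Node ref=\"").append(known).append("\"/>\n");
            return;
        }
        int id = seen.size() + 1;   // next serial identifier, starting from 1
        seen.put(node, id);
        out.append(indent).append("<Node id=\"").append(id)
                .append("\" label=\"").append(node.label).append("\"");
        if (node.links.isEmpty()) {
            out.append("/>\n");
            return;
        }
        out.append(">\n");
        for (Node child : node.links) {
            write(child, seen, out, indent + "  ");
        }
        out.append(indent).append("</Node>\n");
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        a.links.add(b);
        b.links.add(a);   // circular link back to the root
        System.out.print(write(a));
    }
}
```

The map compares objects by identity rather than by equals(), so two distinct but equal objects are still written out separately.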
When marshalling a graph of Java objects to XML, ASTWriter
will add an XML processing instruction declaring the Java-binding from
the default XML namespace to the Java package that you specified in the
usePackage() method. It will also add a default namespace
URI declaration to the root XML element, using the domain name that you
specified in the useDomain() method to determine how to
convert the Java package name into a unique namespace URI, using some
of the package name as a domain, and the rest as a project identifier.
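As an illustration, the catalogue example pairs the package org.mydomain.catalogue and the domain mydomain.org with the namespace URI https://mydomain.org/catalogue. The sketch below reproduces that one pairing. The exact rule used by JAST is not stated here, so the reversed-domain prefix, the https scheme and the conversion of dots to slashes are all assumptions, and the names NamespaceSynthesis and toNamespaceUri are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: deriving a namespace URI from a domain and a package.
final class NamespaceSynthesis {

    static String toNamespaceUri(String domain, String javaPackage) {
        // Reverse the domain labels: "mydomain.org" becomes "org.mydomain".
        List<String> labels = new ArrayList<>(Arrays.asList(domain.split("\\.")));
        Collections.reverse(labels);
        String prefix = String.join(".", labels);
        if (!javaPackage.equals(prefix) && !javaPackage.startsWith(prefix + ".")) {
            throw new IllegalArgumentException(
                    javaPackage + " is not owned by " + domain);
        }
        // The rest of the package name becomes the project path (assumed).
        String project = javaPackage.substring(prefix.length()).replace('.', '/');
        return "https://" + domain + project;
    }

    public static void main(String[] args) {
        // prints https://mydomain.org/catalogue
        System.out.println(toNamespaceUri("mydomain.org", "org.mydomain.catalogue"));
    }
}
```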
Controlling the Java-to-XML Mapping
Whenever an ASTReader unmarshals an XML file, it creates
an internal data structure recording everything about the XML-to-Java
mapping. This may include information about how certain XML names have
been normalised to ensure that they conform to legal Java names, or
information about which XML namespaces were mapped to different Java
packages used for different libraries of user-defined AST classes. All
of this information is stored in a single object of the type
Metadata .
Whenever an ASTWriter marshals an in-memory AST back to
serial XML, the easiest way of restoring the same Java-to-XML mappings is
to use the same Metadata instance that was constructed
during unmarshalling. This can remove the need, when writing, to declare
XML namespaces, or mapped packages explicitly. The following illustrates
this:
File in = new File("my/xml/input.xml"); // Declares a Java-binding
ASTReader reader = new ASTReader(in);
Catalogue root = (Catalogue) reader.readDocument();
Metadata metadata = reader.getMetadata(); // Extract the metadata
reader.close();
// Program does something with the AST
File out = new File("my/xml/output.xml"); // Whatever output file
ASTWriter writer = new ASTWriter(out);
writer.setMetadata(metadata); // Restore old metadata
writer.writeDocument(root);
writer.close();
That is, the Metadata object is extracted from the
reader using getMetadata() ; and the writer is instructed
to use the same metadata using setMetadata(metadata) .
For example, this will ensure that when an XML element with a namespace
prefix and non-standard Java name cat:TV-show is mapped to
a Java class org.mydomain.catalogue.TVShow , then this
class will be mapped back to cat:TV-show when it is written.
It will also ensure that the output XML file will have the same XML version
and encoding as used in the input file; and will declare the same XML
namespaces (mapping these to the same Java packages) as the input file.
If the metadata is not transferred, then the writer will use default
settings and the element will be serialised as TVShow ,
expected to be a Java class in the default Java package.
XML namespace prefixes can be used to identify XML elements that
should be mapped to Java classes from different user-defined packages.
This is a common requirement in some model-driven engineering
applications, where models are serialised as XML. Below, we imagine
a transport-related Java model, in which the AST classes are split over
a core package and a separate transport package.
We specify that XML elements from different namespaces
xmlns:core and xmlns:tran should be mapped
to different packages in the following way:
File file = new File("my/xml/input.xml"); // Whatever input file
ASTReader reader = new ASTReader(file);
reader.useDomain("mydomain.org");
reader.usePackage("org.mydomain.model.core", "xmlns:core");
reader.usePackage("org.mydomain.model.transport", "xmlns:tran");
Catalogue root = (Catalogue) reader.readDocument();
reader.close();
This tells the reader to use classes in the core model package
org.mydomain.model.core when mapping XML elements that begin
with the prefix core , for example, core:Container ,
and tells the reader to use classes in the transport package
org.mydomain.model.transport when mapping XML elements that
begin with the prefix tran , for example, tran:Vehicle .
Exactly the same information may be given to an ASTWriter ,
to ensure that classes from particular packages are mapped to XML
elements from different namespaces.
XML namespaces work much like Java packages, in that they provide
a scope for XML elements that might otherwise have the same name. We
leverage XML namespaces, in order to map XML elements from different
namespaces to Java classes from different packages. Notice how this
style of usePackage() has two arguments: the first
argument is the Java package name, and the second is an XML
namespace declaration, introducing the special prefix. In the earlier
one-argument usage of usePackage(), all XML elements
were assumed to come from the default XML namespace
xmlns.
Apart from this, it is possible to access Metadata directly,
using its own API. This allows you to set XML file properties and declare
explicit mappings between XML identifiers and Java class names.
Metadata properties map XML namespace attributes to their
corresponding URIs. Metadata bindings map XML namespaces
to Java packages. If Java-binding is declared, but no URI was declared
for a namespace, JAST will try to synthesise a URI from the domain and the
Java package name. Please see the
JAST 2.5 package APIs for more details.
Notification of Exceptions
ASTReader and ASTWriter may raise various kinds
of IOException, if a problem occurs with the underlying file
system. Ill-formed XML syntax is reported through SyntaxError,
whereas the inability to construct or manipulate an AST node class is
reported through SemanticError. This covers a variety of errors,
including missing constructors, missing methods, or failing methods.
In summary, faulty user code may raise the following:
FileNotFoundException - raised if the specified file cannot be found (wrong pathname given)
UnsupportedEncodingException - raised if the character set encodings are inconsistent
IOException - raised if a fault in the filesystem occurs while reading an XML input file
SyntaxError - raised if a syntax error is detected while parsing an XML input file
SemanticError - raised if any required construction or access method is not found, or fails
The last two are styled as errors, rather than exceptions, since the W3C
standard requires malformed XML to be rejected outright, and not handled
by exception-tolerant software.