Processing XML Document Object Models
The JAST 2.5 Toolkit provides standard
readers and writers for converting between XML text files and DOM-trees
stored in memory. The XML data can be read from any kind of text input
stream, whether a file or a URL. The main components to use are
XMLReader and XMLWriter, both found in the
top-level package uk.ac.sheffield.jast. The DOM-tree is
constructed out of Java nodes that exactly mimic the structure of the XML
document, with type names such as Document, Declaration, Instruction,
Doctype, Element, Attribute, Data, Text and Comment. These
are all found in the package uk.ac.sheffield.jast.xml.
This part of the user guide describes how to use the standard readers
and writers, and how to manipulate the DOM-tree nodes to insert and
retrieve data programmatically. The DOM-tree may also
be built directly within a Java program and then serialised as XML. Once
you have understood the basic DOM-processing concepts presented in this
quick-start introduction, please refer to the
JAST 2.5 package APIs for more detailed
information. Note also that the document validation and XPath search
engines act upon this DOM-tree model.
Designing an XML Data Model
The first thing you will need to do is decide what kind of data you wish
to model. Having done this, you will develop an XML markup scheme, using a
mixture of XML elements and attributes to describe and encode the data. For
example, a document that stores information about people in a family might
look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Family>
  <!-- The Smith family -->
  <Person role="father" age="45">
    John Smith
  </Person>
  <Person role="mother" age="41">
    Mary Smith
  </Person>
  <Person role="son" age="16">
    Ben Smith
  </Person>
  <Person role="daughter" age="14">
    Alice Smith
  </Person>
</Family>
So, the main XML element nodes used for markup are called Family
and Person; there is an XML declaration; and there is a comment.
The Person element also has attributes called role
and age. We assume that information like this is stored in a
text file.
Reading XML into a Document Object Model
The main API class to use is XMLReader. It reads an XML file
using either the default or a chosen character set, and can be
directed to preserve or discard extra formatting whitespace. It returns the
DOM-tree as an instance of the class Document. When constructed with a File,
XMLReader reads data from a file input stream using the UTF-8
character set by default:
File file = new File("my/xml/input.xml"); // Or whatever file
XMLReader reader = new XMLReader(file); // Uses UTF-8
Document document = reader.readDocument();
reader.close();
When constructed with a URL, XMLReader instead defaults to the Latin-1
(ISO-8859-1) character set. This is recommended when reading from
a URL input stream, since the HTTP protocol expects the Latin-1 encoding by
default:
URL url = new URL("https://www.my.site/input.xml"); // Any URL
XMLReader reader = new XMLReader(url); // Uses ISO-8859-1
Document document = reader.readDocument();
reader.close();
Both of the above one-argument constructors set the character encoding based
on whether a File or URL argument is supplied. It is
also possible to supply the character encoding explicitly as the second argument,
if a non-default encoding is used:
File file = new File("my/xml/input.xml"); // Or whatever file
XMLReader reader = new XMLReader(file, "ISO-8859-1");
Document document = reader.readDocument();
reader.close();
It is mandatory to supply the encoding if the XMLReader is constructed
with InputStream or Reader arguments, since in these
cases, it is not possible to infer the encoding. Note that the character encoding
declared in the XML file must match the character encoding used by the underlying
input stream.
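The effect of a mismatch can be seen with the standard Java library alone.
The following sketch does not use JAST at all; it encodes a short XML fragment
as UTF-8 and then decodes the bytes twice, once with the matching character set
and once as ISO-8859-1. The class name EncodingMismatchDemo, the helper
transcode() and the sample text are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    /** Encodes text with one charset and decodes the bytes with another. */
    static String transcode(String text, Charset written, Charset read) {
        byte[] bytes = text.getBytes(written);
        return new String(bytes, read);
    }

    public static void main(String[] args) {
        // The name contains one non-ASCII character (e with diaeresis).
        String xml = "<Person>Zo\u00EB Smith</Person>";

        String same = transcode(xml, StandardCharsets.UTF_8,
                                     StandardCharsets.UTF_8);
        String mixed = transcode(xml, StandardCharsets.UTF_8,
                                      StandardCharsets.ISO_8859_1);

        System.out.println(same.equals(xml));    // true: charsets match
        System.out.println(mixed.equals(xml));   // false: text is corrupted
        // UTF-8 uses two bytes for the accented letter, and Latin-1 decodes
        // each byte as a separate character, so the text grows by one.
        System.out.println(mixed.length() - xml.length());  // 1
    }
}
```

Plain ASCII content survives such a mismatch unchanged, which is why the
problem often goes unnoticed until accented or non-Latin text appears.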
By default, XMLReader creates a compact DOM-tree in memory. It
discards all extra formatting whitespace surrounding the XML markup text (to avoid
wasting memory by storing extra whitespace nodes in the DOM-tree), but preserves
any whitespace included within textual content. Alternatively, XMLReader
may be instructed to preserve all layout text, before reading the document:
File file = new File("my/xml/input.xml");
XMLReader reader = new XMLReader(file);
reader.preserveLayout(true); // Keep all layout text
Document document = reader.readDocument();
reader.close();
This preserves all formatting whitespace as extra Text nodes in the
DOM-tree. The method preserveLayout() may be called with the argument
true to preserve layout whitespace, or false to discard it (the default). When
layout is preserved, an XML document may be re-written with exactly the same
layout as when it was read.
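To get a feel for how many extra nodes layout whitespace can produce, the
following sketch counts whitespace-only text nodes using the standard JAXP
parser and the W3C DOM that ship with Java, not JAST. That parser keeps layout
text by default, so it behaves roughly like a reader with layout preservation
switched on. The class name LayoutNodeCount and the helper countLayoutNodes()
are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class LayoutNodeCount {

    /** Counts whitespace-only text nodes directly under the root element. */
    static int countLayoutNodes(String xml) {
        try {
            org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() == Node.TEXT_NODE
                        && node.getNodeValue().trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        } catch (Exception ex) {
            // Parser configuration, I/O and syntax problems all end up here.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String pretty = "<Family>\n  <Person>John Smith</Person>\n"
                      + "  <Person>Mary Smith</Person>\n</Family>";
        String compact = "<Family><Person>John Smith</Person>"
                       + "<Person>Mary Smith</Person></Family>";
        System.out.println(countLayoutNodes(pretty));   // 3
        System.out.println(countLayoutNodes(compact));  // 0
    }
}
```

Two child elements already cost three layout nodes in the pretty-printed form,
which illustrates why a compact tree is the more economical default.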
By default, XMLReader checks that a document is well-formed
XML, and raises an exception if the XML is ill-formed (missing tags, mismatched
tags, missing quotation marks around values, etc.). It is possible to direct
XMLReader to validate the document as it is read, using either a
Document Type Definition (DTD) or an XML Schema Definition (XSD). This is
described in the
XML Validation Guide.
Writing a Document Object Model to XML
The main API class to use is XMLWriter. It writes a Document
as an XML file using either the default or a chosen
character set, and may be directed to preserve the existing format or pretty-print
the document afresh. When constructed with a File, XMLWriter writes to a
file output stream using the UTF-8 character set by default:
Document document = ... ; // Created previously
File file = new File("my/xml/output.xml"); // Or whatever file
XMLWriter writer = new XMLWriter(file); // Uses UTF-8
writer.writeDocument(document);
writer.close();
When constructed with a general Writer output stream, XMLWriter instead
defaults to the Latin-1 (ISO-8859-1) character set, since this is the recommended
character set for the HTTP protocol, and most web service applications use this
character set by default:
Document document = ... ; // Created previously
Writer stream = ... ; // Created previously
XMLWriter writer = new XMLWriter(stream); // Uses ISO-8859-1
writer.writeDocument(document);
writer.close();
This assumes that the stream uses the same Latin-1 character encoding.
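A Writer only uses Latin-1 if it was created that way. As a minimal sketch
using only the standard java.io classes (no JAST involved), an OutputStream can
be wrapped in an OutputStreamWriter with an explicit character set, so that the
encoding of the stream is never left to the platform default. The class name
Latin1WriterDemo and the helper encode() are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1WriterDemo {

    /** Writes text through a Writer bound to a charset; returns the bytes. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The charset is fixed when the Writer is created.
        try (Writer stream = new OutputStreamWriter(sink, charset)) {
            stream.write(text);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sink.toByteArray();   // Writer is closed and flushed by now
    }

    public static void main(String[] args) {
        String text = "caf\u00E9";   // four characters, one outside ASCII
        System.out.println(
            encode(text, StandardCharsets.ISO_8859_1).length);  // 4
        System.out.println(
            encode(text, StandardCharsets.UTF_8).length);       // 5
    }
}
```

A Writer built like the Latin-1 one above is the kind of stream that the
one-argument Writer constructor is described as expecting.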
When writing to any kind of file or output stream, it is possible to specify that the
XML document should be written using a different, non-default character set, by adding
this explicitly as the second construction argument:
Document document = ... ; // Created previously
File xmlFile = new File("my/xml/output.xml");
XMLWriter writer = new XMLWriter(xmlFile, "ISO-8859-1"); // Latin-1
writer.writeDocument(document);
writer.close();
In this case, XMLWriter verifies that the Document
declares the same character encoding before proceeding; otherwise it raises an
exception. The two-argument constructor is also useful in web applications, for
example, when accessing the PrintWriter from a Java servlet's
HttpServletResponse response object:
Document document = ... ; // Created previously
HttpServletResponse response = ... ; // Created by a Servlet
XMLWriter writer = new XMLWriter(response.getWriter(),
    response.getCharacterEncoding());
writer.writeDocument(document);
writer.close();
This style of constructing the XMLWriter ensures that the writer uses
the same character encoding as the PrintWriter used by the servlet's
response. You should also ensure that the XML declaration at the head
of the Document uses the same character set.
By default, XMLWriter pretty-prints the output XML file using a
standard tree-structured layout with newlines and two-character indentation for
nested XML structures. Alternatively, XMLWriter may be instructed to
preserve the original layout of the DOM-tree in memory, before writing the
document:
File xmlFile = new File("my/xml/output.xml");
XMLWriter writer = new XMLWriter(xmlFile);
writer.preserveLayout(true); // Write native layout
writer.writeDocument(document);
writer.close();
The method preserveLayout() may be called with arguments true or false,
respectively to disable or enable pretty-printing. If layout is preserved both when
reading and writing the document, the output XML document will have exactly the same
layout as when it was read.
Accessing the Contents of the DOM-Tree
The main classes of interest to programmers are Document,
Element, Attribute and Text, although there
are other types of node that represent the XML declaration, a document type
definition, a special stylesheet instruction, or escaped character data.
The DOM-tree nodes are designed according to the Composite Design Pattern;
that is, everything in the DOM-tree is some kind of Content and
respects a common API. The more specific kinds of node extend this API in different
ways. Please see the full API descriptions for each of these node types, which
may be found in the package uk.ac.sheffield.jast.xml.
The following is just an example of how the nodes of the XML memory-tree
can be accessed within a Java program:
Declaration header = document.getDeclaration(); // XML declaration.
Doctype doctype = document.getDoctype(); // Optional doctype.
Comment comment = document.getComment(); // Optional comment
Element root = document.getRootElement(); // Root element.
List<Content> contents = document.getContents(); // All subnodes.
Content node = document.getContent(2); // Third sub-node.
String name = root.getName(); // Element name.
int contentType = root.getType(); // Bitmask type.
List<Element> allChildren = root.getChildren();
List<Element> someChildren = root.getChildren("Person");
Element child = root.getChild("Person"); // First so-named.
Content parent = child.getParent(); // Same object as root.
String text = child.getText(); // Textual content.
List<Attribute> properties = child.getAttributes();
Attribute property = child.getAttribute("age");
String ageStr = property.getValue(); // If property != null
int age = property.intValue(); // If property != null
String value = child.getValue("age"); // Access value directly
In addition, there is provision to iterate over all nodes in a memory-tree.
The iteration may include the starting node, or just all of its descendants.
For explicit access to different kinds of Content node, such as
Text, Data and Comment nodes, you
must use Filter and its subclasses to filter the contents of
a given node. Please see the package uk.ac.sheffield.jast.filter.
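The idea of walking a tree and filtering nodes by kind can be sketched
independently of JAST. The example below uses a hypothetical stand-in Node
class and a java.util.function.Predicate in place of JAST's Content and Filter
types, whose actual APIs are documented in the packages named above. The
includeStart flag mirrors the choice between iterating from the starting node
or over its descendants only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class TreeWalkSketch {

    /** Hypothetical stand-in for a tree node; not a JAST class. */
    static final class Node {
        final String kind;    // e.g. "Element", "Text", "Comment"
        final String value;
        final List<Node> children = new ArrayList<>();

        Node(String kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        Node add(Node child) {
            children.add(child);
            return this;      // allows nested, chained construction
        }
    }

    /** Depth-first walk that collects the nodes accepted by the filter. */
    static List<Node> collect(Node start, boolean includeStart,
                              Predicate<Node> filter) {
        List<Node> found = new ArrayList<>();
        if (includeStart && filter.test(start)) {
            found.add(start);
        }
        for (Node child : start.children) {
            found.addAll(collect(child, true, filter));
        }
        return found;
    }

    public static void main(String[] args) {
        Node root = new Node("Element", "Family")
            .add(new Node("Comment", "The Smith family"))
            .add(new Node("Element", "Person")
                .add(new Node("Text", "John Smith")))
            .add(new Node("Element", "Person")
                .add(new Node("Text", "Mary Smith")));

        // Prints the two names, in document order.
        for (Node text : collect(root, false, n -> n.kind.equals("Text"))) {
            System.out.println(text.value);
        }
    }
}
```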
Constructing and Manipulating the DOM-Tree
The DOM-tree can be constructed directly in Java, using API calls of the
relevant DOM-tree nodes. The main classes of interest are Document,
Element, Attribute and Text.
All construction methods are designed to nest, so that the Java code looks
somewhat like the structure of the XML file being created. Please see the
full API descriptions for how to construct each of these node types.
The following is just an example of how the nodes of the XML DOM-tree
can be created within a Java program, using the return value of the previous
setter as the target of the next setter (suitably nested):
Document document = new Document();            // Default encoding.
Element root = new Element("Family")
    .addContent(new Comment("The Smith family"))
    .addContent(new Element("Person")
        .setText("John Smith")                 // Sets all text.
        .setValue("role", "father")            // Sets attribute.
        .setValue("age", "45"))                // End of add John
    .addContent(new Element("Person")
        // Another way to add text content.
        .addContent(new Text("Mary Smith"))
        .setValue("role", "mother")
        .setValue("age", "41"))                // End of add Mary
    .addContent(new Element("Person")
        .setText("Ben Smith")
        // Another way to set an attribute.
        .setAttribute(new Attribute("role", "son"))
        .setValue("age", "16"))                // End of add Ben
    .addContent(new Element("Person")
        // Another way to add text incrementally.
        .addContent(new Text("Alice"))
        .addContent(new Text(" Smith"))
        .setValue("role", "daughter")
        .setValue("age", "14"));               // End of Family
document.setRootElement(root);
The node APIs contain many further manipulation-methods that remove specific
nodes, or all nodes of a given type, or the node at a given index. All
Content nodes may have at most one parent node. If you wish to
reuse part of an XML DOM-tree, you must detach the subtree from the source
document before adding it to the destination document. Alternatively, you may
clone() part of the source tree and add this copied subtree to
the destination. Attaching any node to more than one parent node will raise
an exception.
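The one-parent rule, and the two ways of working with it, can be sketched with
a small stand-in class. This is not JAST's implementation: the Node class, its
add(), detach() and copy() methods, and the use of IllegalStateException are
all assumptions made for illustration, and the real node types report such
violations in their own way.

```java
import java.util.ArrayList;
import java.util.List;

final class SingleParentSketch {

    /** Hypothetical node enforcing the one-parent rule; not a JAST class. */
    static final class Node {
        final String name;
        Node parent;                        // null while detached
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            if (child.parent != null) {
                throw new IllegalStateException(
                    child.name + " already has a parent");
            }
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Removes this node from its parent, if any, and returns it. */
        Node detach() {
            if (parent != null) {
                parent.children.remove(this);
                parent = null;
            }
            return this;
        }

        /** Deep copy of this subtree; the copy has no parent. */
        Node copy() {
            Node twin = new Node(name);
            for (Node child : children) {
                twin.add(child.copy());
            }
            return twin;
        }
    }

    public static void main(String[] args) {
        Node source = new Node("Family");
        Node person = new Node("Person");
        source.add(person);
        Node target = new Node("Archive");

        try {
            target.add(person);             // still attached to source
        } catch (IllegalStateException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
        target.add(person.copy());          // option 1: add a copy
        target.add(person.detach());        // option 2: move the original

        // Prints "0 2": source is now empty, target holds both nodes.
        System.out.println(source.children.size() + " "
            + target.children.size());
    }
}
```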
Notification of Exceptions
Both XMLReader and XMLWriter may raise various
kinds of IOException if a problem occurs with the underlying file
system. Ill-formed XML syntax is reported through SyntaxError,
whereas attempting to construct an illegal memory-tree is reported through
SemanticError. In general, faulty user code may raise the
following:
FileNotFoundException - raised if the specified file
  cannot be found (wrong pathname given)
UnsupportedEncodingException - raised if the character
  set encodings are inconsistent
IOException - raised if a fault in the filesystem occurs
  while reading an XML input file
SyntaxError - raised if a syntax error is detected while
  parsing an XML input file
SemanticError - raised if any construction method violates
  XML DOM-tree rules
The last two are styled as errors, rather than exceptions, since the W3C
standard requires malformed XML to be rejected outright, and not handled
by exception-tolerant software.
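The first three types in the list are standard java.io exceptions, and the
first two are subclasses of IOException, so a handler has to catch them before
the general case. The following sketch shows that ordering using only standard
library streams; it does not call JAST, and SyntaxError and SemanticError are
not shown. The class name ExceptionOrderSketch and the helper probe() are
illustrative assumptions.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

final class ExceptionOrderSketch {

    /** Opens a file with a named charset and reports what went wrong. */
    static String probe(File file, String charsetName) {
        try (FileInputStream in = new FileInputStream(file);
             Reader reader = new InputStreamReader(in, charsetName)) {
            reader.read();
            return "ok";
        } catch (FileNotFoundException ex) {         // most specific first
            return "file not found";
        } catch (UnsupportedEncodingException ex) {  // unknown charset name
            return "unsupported encoding";
        } catch (IOException ex) {                   // general case last
            return "I/O failure";
        }
    }

    public static void main(String[] args) throws IOException {
        File missing = new File("no/such/input.xml");
        System.out.println(probe(missing, "UTF-8"));  // file not found

        File empty = File.createTempFile("input", ".xml");
        empty.deleteOnExit();
        System.out.println(probe(empty, "NO-SUCH-CHARSET"));
        // prints: unsupported encoding
        System.out.println(probe(empty, "UTF-8"));    // ok
    }
}
```

Placing the IOException handler first would not compile, since the more
specific handlers after it could never be reached.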