Streaming Parser and Builder Interface for XML
The JAST 2.5 Toolkit provides a streaming
API for XML (SAX) that is able to process very large XML files serially.
The XML data can be read from any kind of text input stream, whether a file
or a URL. Rather than construct a DOM-tree in memory, the parser dispatches
events to a builder-interface. Events correspond to the start or end of an
element, or the detection of an attribute, or of content. It is entirely
up to the programmer what actions should be taken in response to these events.
The main components to use are the streaming
XMLParser , the Builder interface and the default
BasicBuilder implementation, all found in the
package: uk.ac.sheffield.jast.build .
This part of the user guide describes how to use the streaming XML parser
and how to respond selectively to the events detected by the parser.
Essentially, the programmer constructs a bespoke Builder object that takes
the desired actions in response to events. Once you have understood
the basic SAX-building concepts presented in this quick-start introduction,
please refer to the JAST 2.5 package APIs
for more detailed information.
Designing a Large XML Data Model
The first thing you will need to do is decide what kind of data you wish
to model. Having done this, you will develop an XML markup scheme, using a
mixture of XML elements and attributes to describe and encode the data. For
example, a document that stores information about a great many families in
the Electoral Register might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<ElectoralRegister>
  <Family surname="Smith">
    <Person role="father" age="45">John Smith</Person>
    <Person role="mother" age="41">Mary Smith</Person>
    <Person role="son" age="16">Ben Smith</Person>
    <Person role="daughter" age="14">Alice Smith</Person>
  </Family>
  <Family surname="Jones">
    <Person role="father" age="52">Alec Jones</Person>
    <Person role="mother" age="50">Gwen Jones</Person>
    <Person role="son" age="23">Tom Jones</Person>
  </Family>
  ...
</ElectoralRegister>
So, the main XML element nodes used for markup are called
ElectoralRegister, Family and Person.
The Family element has an attribute called surname.
The Person element has attributes called role
and age. We assume that information about these families, and many
more not shown here, is stored in a text file so large that it cannot be
held in memory.
The Builder Interface for Streamed Events
Since the whole DOM-tree cannot be held in memory, the programmer
must take some action in response to particular XML events
as the parser encounters them. The interface Builder provides
an API for responding to events:
public interface Builder {
    // Events that add text content
    public void addAttribute(String identifier, String value);
    public void addComment(String text);
    public void addEscapedData(String text);
    public void addLayoutText(String text);
    public void addPrintingText(String text);
    // Events that start/end a structure
    public void startDocument();
    public void endDocument();
    public void startDeclaration(String target);
    public void endDeclaration();
    public void startInstruction(String target);
    public void endInstruction();
    public void startDoctype(String root);
    public void endDoctype();
    public void startElement(String identifier);
    public void endElement();
    // Other access methods
    public Object getDocument();              // return whatever was built
    public XMLParser getParser();             // return the streaming parser
    public void setParser(XMLParser parser);
    public Lexicon getLexicon();              // return the entity lexicon
    public void setLexicon(Lexicon lexicon);
}
The Builder API describes a set of add-methods that
signal the arrival of attribute- or text-content; a set of start-methods
that signal the beginning of some kind of structure; and a set of
end-methods that signal the end of some kind of structure. The
programmer must provide a bespoke builder-class, which implements the
Builder -interface, and which, in its concrete methods, takes
appropriate actions in response to each of these events.
Since this might be a laborious coding task, the JAST toolkit provides a
default implementation of the Builder interface in the class
BasicBuilder . This class provides a default empty implementation
for each of the above event-processing methods. However, it also provides a
concrete implementation of the remaining access methods that access the
underlying parser and entity lexicon. By default, the streaming
XMLParser creates an instance of this BasicBuilder .
The programmer's task is then reduced to working out which of the above
methods should be overridden in a bespoke builder-class, which is designed
to inherit from BasicBuilder . This is useful, because the
programmer may choose to ignore those events which are not wanted. For
example, if the programmer only wishes to capture printing text and ignore
all layout text, comments and escaped data, then they only need to override
the addPrintingText(String text) method. This should do
something with the text argument, according to the intention of the
programmer.
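As a minimal sketch of this selective overriding, the following self-contained
example uses a hypothetical stand-in class, EmptyBuilder, in place of the real
BasicBuilder, so that it runs without the toolkit. Only five of the event
methods are reproduced. The TextCollector subclass and the hand-fed event
sequence in demo() are assumptions made for illustration; with the real
toolkit, the XMLParser would dispatch these events itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BasicBuilder: every event method is empty.
// Only a subset of the Builder methods is reproduced here.
class EmptyBuilder {
    public void startElement(String identifier) { }
    public void endElement() { }
    public void addAttribute(String identifier, String value) { }
    public void addLayoutText(String text) { }
    public void addPrintingText(String text) { }
}

// Overrides one method only; all other events fall through to the
// empty defaults and are silently ignored.
class TextCollector extends EmptyBuilder {
    private final List<String> texts = new ArrayList<String>();

    @Override
    public void addPrintingText(String text) {
        texts.add(text);
    }

    public List<String> getTexts() {
        return texts;
    }
}

class TextCollectorDemo {
    // Feeds events by hand, in the order a parser might plausibly
    // report them for one Person element (an assumption).
    static List<String> demo() {
        TextCollector builder = new TextCollector();
        builder.startElement("Person");
        builder.addAttribute("role", "father");
        builder.addAttribute("age", "45");
        builder.addPrintingText("John Smith");
        builder.endElement();
        builder.addLayoutText("\n");
        return builder.getTexts();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the printing text reaches the collected list; the attribute, layout and
element events are accepted and discarded by the inherited empty methods.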
Designing a Custom Builder for Streamed Events
As an example, we will design a custom builder that seeks to find every
person in the data aged 50 or over. We will call this class
ElderBuilder. We assume that we have a class Person
in our application, which provides suitable methods to set and get a person's
name and age; we are not interested in any other information.
import java.util.ArrayList;
import java.util.List;
import uk.ac.sheffield.jast.build.BasicBuilder;

public class ElderBuilder extends BasicBuilder {
    private List<Person> elders;    // save persons aged 50 or over
    private Person person = null;   // reuse for each person

    // default constructor
    public ElderBuilder() {
        elders = new ArrayList<Person>();
    }

    // check whether element is a Person
    public void startElement(String identifier) {
        if (identifier.equals("Person")) {
            person = new Person();  // start local person
        }
    }

    // check whether a Person was created
    public void endElement() {
        if (person != null) {
            elders.add(person);     // save completed person
            person = null;          // clear local variable
        }
    }

    // check whether the attribute was age; other attributes are ignored,
    // so that the age is only tested once it has actually been set
    public void addAttribute(String identifier, String value) {
        if (person != null && identifier.equals("age")) {
            try {
                person.setAge(Integer.parseInt(value));
                if (person.getAge() < 50) {
                    person = null;  // no longer interested
                }
            }
            catch (NumberFormatException ex) {
                person = null;      // ignore corrupt record
            }
        }
    }

    // capture the name of the elder Person
    public void addPrintingText(String text) {
        if (person != null) {
            person.setName(text);
        }
    }

    // return the list of elders
    public List<Person> getDocument() {
        return elders;
    }
}
This example ElderBuilder works by recognising when a
Person element is encountered, and then selectively
builds a Person object if the age of this
person is 50 or more. If the element is not a Person,
no object is created; if the age is less than 50, or cannot be parsed
as a number, then the local variable person is immediately
set to null, since we are no longer interested in it (Java
will eventually garbage-collect the forgotten object). If the person
instance survives until we reach the end of the element, then this
builder adds it to the list of elders. Finally, when
the XML file has been completely scanned, the parser returns
whatever was built, using the getDocument() method of
this builder, which returns a List<Person> here,
but in general could return any kind of Object.
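The Person class is assumed rather than supplied by the toolkit. A minimal
sketch of what it might look like is given below; the field names and the
toString() format are hypothetical, and only the set and get methods for
name and age are actually needed by the builder.

```java
// Hypothetical application class: a plain holder for a name and an age.
class Person {
    private String name;   // null until setName is called
    private int age;       // defaults to 0 until setAge is called

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }

    @Override
    public String toString() { return name + " (" + age + ")"; }
}
```

Note that the age of a freshly created Person is zero, so a builder can only
meaningfully compare the age against 50 once the age attribute has arrived.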
Streaming XML with the Streaming Parser
The main API class to use is XMLParser . By default,
XMLParser does nothing with the streamed data, since its
builder is a BasicBuilder , which defines empty responses
to streamed events. If the programmer directs the XMLParser
to use a different builder, then when the parser dispatches events to the
builder, it will do whatever the programmer has specified.
XMLParser can be used to stream data from a file, or from
a URL, or from some other input stream, using the default, or a chosen
character set. By default, XMLParser reads data from a file
input stream using the UTF-8 character set:
Builder builder = new ElderBuilder(); // Or whatever builder
File file = new File("my/xml/input.xml"); // Or whatever file
XMLParser reader = new XMLParser(file); // Uses UTF-8
reader.setBuilder(builder);
Object result = reader.readDocument();
reader.close();
When constructed from a URL, XMLParser by default reads data from the URL
input stream using the Latin-1 (ISO-8859-1) character set. This is recommended
when reading from a URL input stream, since HTTP/1.1 as originally specified
(RFC 2616) made ISO-8859-1 the default character set for text content:
Builder builder = new ElderBuilder(); // Any builder
URL url = new URL("https://www.my.site/input.xml"); // Any URL
XMLParser reader = new XMLParser(url); // Uses ISO-8859-1
reader.setBuilder(builder);
Object result = reader.readDocument();
reader.close();
Both of the above one-argument constructors set the character encoding based
on whether a File or URL argument is supplied. It is
also possible to supply the character encoding explicitly as the second argument,
if a non-default encoding is used:
Builder builder = new ElderBuilder(); // Or whatever builder
File file = new File("my/xml/input.xml"); // Or whatever file
XMLParser reader = new XMLParser(file, "ISO-8859-1");
reader.setBuilder(builder);
Object result = reader.readDocument();
reader.close();
It is mandatory to supply the character encoding if the XMLParser
is constructed with InputStream or Reader arguments,
since in these cases, it is not possible to infer the encoding. Note that the
character encoding declared in the XML file must match the character encoding
used by the underlying input stream.
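The following sketch shows why the two encodings must agree. It uses only
standard java.io classes, not the JAST API; the class name EncodingMismatch
and the sample name are invented for illustration. The same UTF-8 bytes are
decoded twice, once with the matching character set and once with Latin-1.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Decodes the first line of a byte stream with the named character set.
    static String firstLine(byte[] bytes, String charsetName) throws IOException {
        InputStream in = new ByteArrayInputStream(bytes);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charsetName))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The e-diaeresis (U+00EB) occupies two bytes when stored as UTF-8.
        byte[] stored = "<Person>Zo\u00eb Smith</Person>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(stored, "UTF-8"));       // decoded correctly
        System.out.println(firstLine(stored, "ISO-8859-1"));  // two wrong characters
    }
}
```

Decoded as Latin-1, each of the two UTF-8 bytes becomes a separate character,
so the name is silently corrupted rather than rejected.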
An XMLParser will check that a document is well-formed XML, and
will raise an exception if the XML is ill-formed (missing tags, mismatched
tags, missing quotation marks around values, etc.). An XMLParser
can perform document validation only if it is used with the provided
XMLBuilder, which constructs a complete DOM-tree (validation is
only performed against a full DOM-tree in memory). In this case, you set the
validation level of the builder (not the parser). Please refer to the
XML Validation Guide for more
information.
There is no equivalent streaming output mechanism, since the structures
built are entirely arbitrary, and it is not possible to insert or modify
information at arbitrary points in an XML file. However, if the programmer
uses either XMLBuilder to build a DOM-tree, or
ASTBuilder to build a bespoke AST, then these may be output
using the regular XMLWriter or ASTWriter writers.
Notification of Exceptions
XMLParser may raise various kinds of
IOException, if a problem occurs with the underlying file
system. Ill-formed XML syntax is reported through SyntaxError,
whereas attempting to construct an illegal memory-tree is reported through
SemanticError. In summary, the following may be raised:
FileNotFoundException - raised if the specified file
cannot be found (wrong pathname given)
UnsupportedEncodingException - raised if the character
set encodings are inconsistent
IOException - raised if a fault in the filesystem occurs
while reading an XML input file
SyntaxError - raised if a syntax error is detected while
parsing an XML input file
SemanticError - raised if any construction method violates
XML DOM-tree rules
The last two are styled as errors, rather than exceptions, since the W3C
standard requires malformed XML to be rejected outright, and not handled
by exception-tolerant software.
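In standard Java, FileNotFoundException and UnsupportedEncodingException are
both subclasses of IOException, so handlers for them must come before a
general IOException handler. The sketch below illustrates that ordering using
only java.io classes; the class OpenDiagnostics and its tryOpen method are
hypothetical, and the JAST-specific SyntaxError and SemanticError are not
shown, since their place in the class hierarchy is not stated here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

class OpenDiagnostics {
    // Tries to open a file with the named encoding and reports which
    // kind of failure occurred, most specific exception first.
    static String tryOpen(File file, String charsetName) {
        try (FileInputStream in = new FileInputStream(file);
             Reader reader = new InputStreamReader(in, charsetName)) {
            reader.read();   // force at least one read from the stream
            return "ok";
        }
        catch (FileNotFoundException ex) {
            return "file not found";
        }
        catch (UnsupportedEncodingException ex) {
            return "unsupported encoding";
        }
        catch (IOException ex) {
            return "I/O fault";
        }
    }
}
```

Placing the IOException handler first would not compile, because the two more
specific handlers would then be unreachable.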