Streaming Parser and Builder Interface for XML

The JAST 2.5 Toolkit provides a streaming, SAX-style API for XML that can process very large XML files serially. The XML data can be read from any kind of text input stream, whether a file or a URL. Rather than construct a DOM-tree in memory, the parser dispatches events to a builder-interface. Events correspond to the start or end of an element, or to the detection of an attribute or of content. It is entirely up to the programmer what actions should be taken in response to these events. The main components to use are the streaming XMLParser, the Builder interface and the default BasicBuilder implementation, all found in the package: uk.ac.sheffield.jast.build.

This part of the user guide describes how to use the streaming XML parser and how to respond selectively to the events detected by the parser. Essentially, the programmer constructs a bespoke Builder object that takes the desired actions in response to events. Once you have understood the basic SAX-building concepts presented in this quick-start introduction, please refer to the JAST 2.5 package APIs for more detailed information.

Designing a Large XML Data Model

The first thing you will need to do is decide what kind of data you wish to model. Having done this, you will develop an XML markup scheme, using a mixture of XML elements and attributes to describe and encode the data. For example, a document that stores information about a great many families in the Electoral Register might look like this:

	<?xml version="1.0" encoding="UTF-8"?>
	<ElectoralRegister>
	    <Family surname="Smith">
		<Person role="father" age="45">John Smith</Person>
		<Person role="mother" age="41">Mary Smith</Person>
		<Person role="son" age="16">Ben Smith</Person>
		<Person role="daughter" age="14">Alice Smith</Person>
	    </Family>
	    <Family surname="Jones">
		<Person role="father" age="52">Alec Jones</Person>
		<Person role="mother" age="50">Gwen Jones</Person>
		<Person role="son" age="23">Tom Jones</Person>
	    </Family>
	    ...
	</ElectoralRegister>

So, the main XML element nodes used for markup are called ElectoralRegister, Family and Person. The Family element has an attribute called surname. The Person element has attributes called role and age. We assume that information about these families, and many more not shown here, is stored in a very large text file, so large that it cannot be held in memory.

The Builder Interface for Streamed Events

Since the whole DOM-tree cannot be held in memory, the programmer must take some action in response to particular XML events, as the parser encounters them. The interface Builder provides an API for responding to events:

	public interface Builder {
		// Events that add text content
	    public void addAttribute(String identifier, String value);
	    public void addComment(String text);
	    public void addEscapedData(String text);
	    public void addLayoutText(String text);
	    public void addPrintingText(String text);
		// Events that start/end a structure
	    public void startDocument();
	    public void endDocument();
	    public void startDeclaration(String target);
	    public void endDeclaration();
	    public void startInstruction(String target);
	    public void endInstruction();
	    public void startDoctype(String root);
	    public void endDoctype();
	    public void startElement(String identifier);
	    public void endElement();
		// Other access methods
	    public Object getDocument();   // return whatever was built
	    public XMLParser getParser();  // return the streaming parser
	    public void setParser(XMLParser parser);
	    public Lexicon getLexicon();   // return the entity lexicon
	    public void setLexicon(Lexicon lexicon);
	}

The Builder API describes a set of add-methods that signal the arrival of attribute- or text-content; a set of start-methods that signal the beginning of some kind of structure; and a set of end-methods that signal the end of some kind of structure. The programmer must provide a bespoke builder-class, which implements the Builder-interface, and which, in its concrete methods, takes appropriate actions in response to each of these events.

Since this might be a laborious coding task, the JAST toolkit provides a default implementation of the Builder interface in the class BasicBuilder. This class provides a default empty implementation for each of the above event-processing methods. It also provides concrete implementations of the remaining access methods, which give access to the underlying parser and entity lexicon. By default, the streaming XMLParser creates an instance of this BasicBuilder.

The programmer's task is then reduced to working out which of the above methods should be overridden in a bespoke builder-class, which is designed to inherit from BasicBuilder. This is useful, because the programmer may choose to ignore those events which are not wanted. For example, if the programmer only wishes to capture printing text and ignore all layout text, comments and escaped data, then they only need to override the addPrintingText(String text) method. This should do something with the text argument, according to the intention of the programmer.
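
The selective-override idea can be tried out without the toolkit on the classpath. In the sketch below, NoOpBuilder is a hypothetical stand-in for BasicBuilder that declares only five of the event methods, with the same signatures as in the Builder interface above. TextCollector and the hand-written event sequence in collect() are likewise illustrative assumptions, not part of JAST. With the real toolkit, the class would extend BasicBuilder and the events would be dispatched by the parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BasicBuilder: every event method is empty.
class NoOpBuilder {
    public void startElement(String identifier) { }
    public void endElement() { }
    public void addAttribute(String identifier, String value) { }
    public void addLayoutText(String text) { }
    public void addPrintingText(String text) { }
}

// Overrides one method only; all other events are silently ignored.
class TextCollector extends NoOpBuilder {
    private final List<String> texts = new ArrayList<>();

    @Override
    public void addPrintingText(String text) {
        texts.add(text);   // keep printing text, nothing else
    }

    public List<String> getTexts() {
        return texts;
    }
}

class TextCollectorDemo {
    // Replays by hand the events assumed for a single Person element.
    static List<String> collect() {
        TextCollector builder = new TextCollector();
        builder.startElement("Person");
        builder.addAttribute("role", "father");
        builder.addAttribute("age", "45");
        builder.addPrintingText("John Smith");
        builder.endElement();
        builder.addLayoutText("\n    ");   // ignored: not overridden
        return builder.getTexts();
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Only the printing text survives. The attributes, the layout text and the element boundaries all reach the inherited empty methods and are dropped.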

Designing a Custom Builder for Streamed Events

As an example, we will design a custom builder that seeks to find every person in the data aged 50 or over. We will call this class ElderBuilder. We assume that we have a class Person in our application, which provides suitable methods to set and get a person's name and age; and we are not interested in any other information.

	import java.util.ArrayList;
	import java.util.List;
	import uk.ac.sheffield.jast.build.BasicBuilder;

	public class ElderBuilder extends BasicBuilder {
	    private List<Person> elders;    // save persons aged 50 or over
	    private Person person = null;   // reuse for each person
		// default constructor
	    public ElderBuilder() {
		elders = new ArrayList<Person>();
	    }
		// check whether element is a Person
	    public void startElement(String identifier) {
		if (identifier.equals("Person")) {
		    person = new Person();  // start local person
		}
	    }
		// check whether a Person was created
	    public void endElement() {
		if (person != null) {
		    elders.add(person);     // save completed person
		    person = null;          // clear local variable
		}
	    }
		// check whether the attribute was age; other attributes are ignored
	    public void addAttribute(String identifier, String value) {
		if (person != null && identifier.equals("age")) {
		    try {
			person.setAge(Integer.parseInt(value));
			if (person.getAge() < 50) {
			    person = null;  // no longer interested
			}
		    }
		    catch (NumberFormatException ex) {
			person = null;      // ignore corrupt record
		    }
		}
	    }
		// capture the name of the elder Person
	    public void addPrintingText(String text) {
		if (person != null) {
		    person.setName(text);
		}
	    }
		// return the list of elders
	    public List<Person> getDocument() {
		return elders;
	    }
	}

This example ElderBuilder works by recognising when a Person element is encountered, and then it selectively builds a Person object, if the age of this person is 50 or more. If the element is not a Person, the local variable person simply stays null. If the age is less than 50, or cannot be parsed, then person is immediately reset to null, since we are no longer interested in it (the discarded object becomes eligible for garbage collection). If the person instance survives until we reach the end of an element, then this builder adds it to the list of elders. Finally, when the XML file has been completely scanned, the parser will return whatever was built, using the getDocument() method of this builder, which returns a List<Person> here, but in general could return any kind of Object.
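
The Person class that ElderBuilder relies on is not shown in this guide. A minimal sketch might look like the following, assuming nothing more than the four accessor methods mentioned earlier; the field defaults and the toString() format are illustrative choices, not part of JAST.

```java
// Hypothetical application class assumed by ElderBuilder; not part of JAST.
class Person {
    private String name = "";
    private int age = 0;   // stays 0 until setAge() is called

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }
}
```

Note that a freshly constructed Person reports an age of 0 until setAge() is called, so any test on the age is only meaningful after the age attribute has been processed.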

Streaming XML with the Streaming Parser

The main API class to use is XMLParser. By default, XMLParser does nothing with the streamed data, since its builder is a BasicBuilder, which defines empty responses to streamed events. If the programmer directs the XMLParser to use a different builder, then when the parser dispatches events to the builder, it will do whatever the programmer has specified.

XMLParser can be used to stream data from a file, or from a URL, or from some other input stream, using the default, or a chosen character set. By default, XMLParser reads data from a file input stream using the UTF-8 character set:

	Builder builder = new ElderBuilder();      // Or whatever builder
	File file = new File("my/xml/input.xml");  // Or whatever file
	XMLParser reader = new XMLParser(file);    // Uses UTF-8
	reader.setBuilder(builder);
	Object result = reader.readDocument();
	reader.close();

By default, XMLParser reads data from a URL input stream using the Latin-1 (ISO-8859-1) character set. This is recommended when reading from a URL input stream, since HTTP has traditionally treated Latin-1 as the default encoding for text content:

	Builder builder = new ElderBuilder();                // Any builder
	URL url = new URL("https://www.my.site/input.xml");  // Any URL
	XMLParser reader = new XMLParser(url);               // Uses ISO-8859-1
	reader.setBuilder(builder);
	Object result = reader.readDocument();
	reader.close();

Both of the above one-argument constructors set the character encoding based on whether a File or URL argument is supplied. It is also possible to supply the character encoding explicitly as the second argument, if a non-default encoding is used:

	Builder builder = new ElderBuilder();      // Or whatever builder
	File file = new File("my/xml/input.xml");  // Or whatever file
	XMLParser reader = new XMLParser(file, "ISO-8859-1");
	reader.setBuilder(builder);
	Object result = reader.readDocument();
	reader.close();

It is mandatory to supply the character encoding if the XMLParser is constructed with InputStream or Reader arguments, since in these cases, it is not possible to infer the encoding. Note that the character encoding declared in the XML file must match the character encoding used by the underlying input stream.
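
To see why the encodings must match, the following sketch uses only standard JDK classes (no JAST classes) to decode the same bytes with two different character sets. The class name EncodingMismatchDemo, the helper decode() and the sample name containing an accented letter are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Decodes the given bytes to a String using the given character set.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u00eb is e-with-diaeresis, which UTF-8 encodes as two bytes
        String xml = "<Person role=\"son\" age=\"23\">Zo\u00eb Jones</Person>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        // Matching character set: the text is recovered exactly
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        // Mismatched character set: each byte becomes a separate character
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Decoding with the matching character set recovers the original text. Decoding the UTF-8 bytes as Latin-1 turns the single accented letter into two unrelated characters, so the content silently changes even though the markup itself still looks well-formed.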

An XMLParser will check that a document is well-formed XML, and will raise an exception if the XML is ill-formed (missing tags, mismatched tags, missing quotation marks around values, etc.). An XMLParser can only perform document validation if it is used with the provided XMLBuilder, which constructs a complete DOM-tree (validation is only performed against a full DOM-tree in memory). In this case, you set the validation level of the builder (not the parser). Please refer to the XML Validation Guide for more information.

There is no equivalent streaming output mechanism, since the structures built are entirely arbitrary, and it is not possible to insert or modify information at arbitrary points in an XML file. However, if the programmer uses either XMLBuilder to build a DOM-tree, or ASTBuilder to build a bespoke AST, then these may be output using the regular XMLWriter or ASTWriter.

Notification of Exceptions

XMLParser may raise various kinds of IOException, if a problem occurs with the underlying file system. Ill-formed XML syntax is reported through SyntaxError, whereas attempting to construct an illegal memory-tree is reported through SemanticError. In general, faulty user code may raise the following:

FileNotFoundException - raised if the specified file cannot be found (wrong pathname given)
UnsupportedEncodingException - raised if the character set encodings are inconsistent
IOException - raised if a fault in the filesystem occurs while reading an XML input file
SyntaxError - raised if a syntax error is detected while parsing an XML input file
SemanticError - raised if any construction method violates XML DOM-tree rules

The last two (SyntaxError and SemanticError) are styled as errors, rather than exceptions, since the W3C standard requires malformed XML to be rejected outright, and not handled by exception-tolerant software.
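
When handling the three I/O exceptions listed above, the order of the catch clauses matters, because FileNotFoundException and UnsupportedEncodingException are both subclasses of IOException in the JDK. The sketch below illustrates this with standard JDK streams only; it does not call XMLParser, and the class name OpenReport and the status strings are illustrative assumptions. SyntaxError and SemanticError are left out, since this guide does not state which class they extend.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

class OpenReport {
    // Tries to open a file with a named character set and reports the outcome.
    static String tryOpen(File file, String charsetName) {
        try (InputStream in = new FileInputStream(file);
             Reader reader = new InputStreamReader(in, charsetName)) {
            reader.read();   // touch the stream so that I/O faults surface
            return "ok";
        }
        catch (FileNotFoundException ex) {         // most specific first
            return "file not found";
        }
        catch (UnsupportedEncodingException ex) {  // also an IOException
            return "unsupported encoding";
        }
        catch (IOException ex) {                   // any other I/O fault
            return "I/O fault";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOpen(new File("my/xml/input.xml"), "UTF-8"));
    }
}
```

Placing the general IOException clause first would not compile, because the more specific clauses after it could never be reached.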