JAST: Java Abstract Syntax Trees

Natural Java idioms for processing XML data

Department of Computer Science

The XML Document Validation Engine

The JAST 2.5 Toolkit provides optional components for validating XML documents. By default, the XML parsers only check XML documents for well-formedness (viz. that correctly-paired markup tags are used, that XML elements are correctly nested and that the syntax for identifiers and punctuation is correct). Sometimes it is useful to perform a stronger check, to ensure that the expected kinds of element and attribute are present in a document. Validation can be requested by changing settings in the XML readers XMLReader or ASTReader or builders inheriting from BasicBuilder, found in the top-level package: uk.ac.sheffield.jast.

Two different validation schemes are supported. The tools can validate in-memory DOM-trees against either a Document Type Definition (DTD) or an XML Schema Definition (XSD). The DTD approach is older, but also offers the possibility of defining additional abbreviations for foreign characters, or boilerplate text, called Entity References. The XSD approach is newer, more complex and more flexible, although it does not handle Entity References. Both validation schemes use a common set of components for building a grammar, defined in the package: uk.ac.sheffield.jast.valid. Once you have understood the basic validation concepts presented in this quick-start introduction, please refer to the JAST 2.5 package APIs for more detailed information.

Levels of XML Document Validation

All the XML parsers, readers and builders offer the method: setValidation(Validation level), which controls the validation level. The argument to this method is an enumerated value of the Validation type. Different values correspond to different settings, instructing parsers to use a different level of validation:

  • IGNORE: is the default disabled setting, which means that all validation is turned off (and that referenced DTDs or XSDs will not be read, which saves on the document processing time);
  • EXPAND: is the lowest setting, which means that Entity References will be expanded (and referenced DTDs will be read, which will slightly increase the document processing time);
  • DOCTYPE: is the middle setting, which means that Entity References will be expanded and the document will also be validated against a Doctype Definition (and hence referenced DTDs will be read);
  • SCHEMA: is the highest setting, which means that Entity References will be expanded and the document will also be validated against an XML Schema Definition (and hence referenced DTDs and XSDs will be read).

Note that these settings are mostly cumulative - the higher settings all support prior expansion of Entity References. If Doctype validation is selected, the parser will look for a Doctype definition (either internal, or external) attached to the XML document. If Schema validation is selected, the parser will look for both the Doctype (to expand any Entity References) and for a referenced XML Schema document (to validate the XML document).
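The relationship between the four levels can be restated in code. The enum below is a hypothetical model written purely for illustration: the names Level, expandsEntities, readsDtd, readsXsd and the two validates... methods are assumptions, and this is not the JAST Validation type itself. It only encodes what the list above says about each setting.

```java
/** Hypothetical model of the four settings; NOT the JAST Validation enum. */
enum Level {
    IGNORE, EXPAND, DOCTYPE, SCHEMA;

    /** Every setting above IGNORE expands Entity References. */
    boolean expandsEntities() { return this != IGNORE; }

    /** Every setting above IGNORE reads a referenced DTD. */
    boolean readsDtd() { return this != IGNORE; }

    /** Only SCHEMA reads a referenced XSD. */
    boolean readsXsd() { return this == SCHEMA; }

    /** Only DOCTYPE checks the document against the DTD grammar. */
    boolean validatesAgainstDtd() { return this == DOCTYPE; }

    /** Only SCHEMA checks the document against the XSD grammar. */
    boolean validatesAgainstXsd() { return this == SCHEMA; }
}

class LevelDemo {
    public static void main(String[] args) {
        // Print one summary row per setting.
        for (Level level : Level.values()) {
            System.out.printf("%-8s expand=%b readDTD=%b readXSD=%b%n",
                level, level.expandsEntities(), level.readsDtd(), level.readsXsd());
        }
    }
}
```

The model makes the "mostly cumulative" point visible: SCHEMA still reads the Doctype in order to expand entities, but it validates against the XSD rather than the DTD.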

Enabling XML Document Validation

The main entry point to validation is by controlling the validation settings in the XML parsers XMLReader (for DOM-trees), ASTReader (for AST-trees) or in the builder BasicBuilder (for SAX-parsing) found in the top-level package: uk.ac.sheffield.jast. Since the validation engine acts on an in-memory DOM-tree, technically only DOM-trees may be validated. However, the lower validation settings are possible with other parsers or builders that create AST-trees, to allow expansion of Entity References as the XML file is being read.

In order to activate automatic document validation, one of the validation settings must be specified, before the XML document is actually read. The following example asks XMLReader to validate the XML document against a Document Type Definition (and will expand Entity References):

        File file = new File("my/xml/input.xml");  // Or whatever file
        XMLReader reader = new XMLReader(file);
        reader.setValidation(Validation.DOCTYPE);  // Request DTD validation
        Document document = reader.readDocument();
        reader.close();

The following example asks ASTReader just to expand Entity References as it reads the XML file (see below for an explanation of Entity References). This is the highest level of validation that can be supported, since the constructed AST is of a custom format, unknown to the JAST toolkit:

        File file = new File("my/xml/input.xml");  // Or whatever file
        ASTReader reader = new ASTReader(file);
        reader.setValidation(Validation.EXPAND);   // Request entity expansion
        reader.usePackage("org.my.catalogue");
        Catalogue catalogue = (Catalogue) reader.readDocument();
        reader.close();

The following example uses a SAX-builder strategy. Because the parser is given an XMLBuilder, which constructs a full DOM-tree, full validation is possible, so we may ask it to validate the XML document against an XML Schema Definition (and expand any Entity References):

        File file = new File("my/xml/input.xml");  // Or whatever file
        // Create the builder
        XMLBuilder builder = new XMLBuilder();
        builder.setValidation(Validation.SCHEMA);  // Request XSD validation
        // Set up the parser
        XMLParser parser = new XMLParser(file);
        parser.setBuilder(builder);                // Attach the builder
        Document document = (Document) parser.readDocument();
        parser.close();

Requesting any level of document validation will cause further files to be opened. An external Doctype definition must be read and analysed before the main XML file, so that Entity References may be expanded. An external XML Schema file may be read after the main XML file. A grammar will be compiled from either of these sources, then applied to the in-memory DOM-tree (if full validation was requested).

Expanding Entity References

General Entity References are supported in the JAST 2.5 toolkit. In XML, an Entity is any kind of non-XML datum that must be escaped within an XML document. Some examples of these are the punctuation characters used during mark-up, which cannot be included literally in a document, but which can be escaped:

 	&lt;    the less than character     <
	&gt;    the greater than character  >
	&amp;   the ampersand character     &
	&quot;  the quotation character     "
	&apos;  the apostrophe character    '

These punctuation characters (shown on the right) are escaped in XML, by encoding them using the standard Entity References (shown on the left). If it is desired to use any of these characters literally in a document, their escaped forms must be used instead. These five Entity References are pre-defined in XML, so may always be used.
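To make the mapping concrete, here is a minimal sketch in plain Java that escapes the five reserved characters using the pre-defined Entity References from the table above. The class and method names (XmlEscape, escape) are invented for this illustration and are not part of the JAST API; JAST readers and writers perform this conversion for you.

```java
class XmlEscape {
    /** Replaces each reserved character by its pre-defined Entity Reference. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // ordinary characters pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c"));
    }
}
```

Processing one character at a time avoids the classic mistake of escaping the ampersand after the other characters, which would corrupt references that had already been inserted.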

Further Entity References may be defined explicitly by the programmer, for example, to encode special foreign language characters, or even to encode standard boilerplate text by an abbreviation. These extra entity declarations will appear as part of a Doctype definition (see below). Some examples include:

	<!ENTITY copy "&#169;">
	<!ENTITY ajhs "Anthony J H Simons, MA PhD">

The first of these defines a Character Entity Reference that encodes the copyright © symbol. The second defines a String Entity Reference that encodes an abbreviation for my full name and qualifications. The idea is that the XML document may include the encoded references, which will be replaced automatically by their expanded forms when the text is read into memory. An Entity Reference always begins with the ampersand & symbol and ends with a semicolon ; punctuation mark:

 	&copy;    expands to:     ©
	&ajhs;    expands to:     Anthony J H Simons, MA PhD

All Entity References are stored in a Lexicon object, which is an integral part of every reader, writer and builder. These classes offer the methods: Lexicon getLexicon() and setLexicon(Lexicon obj) to retrieve, or reset their internal lexicon. While it is possible to manipulate lexicons through the Lexicon API directly, this is seldom necessary. It is more usual to allow a reader to populate its lexicon with any new Entity References discovered in its Doctype Definition; and then the programmer may transfer the lexicon as a whole to a writer, which will re-encode the output XML file using the same entity definitions.

When an XML document is read, all of its Entity References are expanded (if expansion has been requested). The text stored in-memory will contain the expanded characters or boilerplate text; that is, Entity References only exist in the serialised XML file. Assuming that the same Lexicon is used for both reading and writing, when an XML document is written out, the reverse process happens: any text that needs to be escaped will be escaped as Entity References in the XML file. If an expected Entity Reference definition is missing, a reader will raise an exception, whereas a writer will simply leave unencoded any text that has no corresponding Entity Reference definition.
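The read/write symmetry described above can be sketched with a small stand-in class. MiniLexicon and its methods define, expand and encode are hypothetical names chosen for this illustration; the real JAST Lexicon API may be organised quite differently. The sketch only shows the principle: expansion on reading, re-encoding on writing, an error for an undefined reference when reading, and untouched text when writing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a lexicon; not the JAST Lexicon class. */
class MiniLexicon {
    private static final Pattern REFERENCE =
        Pattern.compile("&([A-Za-z_][A-Za-z0-9_.-]*);");
    private final Map<String, String> entities = new LinkedHashMap<>();

    void define(String name, String expansion) {
        entities.put(name, expansion);
    }

    /** Reading: replaces each &name; by its expansion; unknown names are an error. */
    String expand(String serialised) {
        Matcher m = REFERENCE.matcher(serialised);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String expansion = entities.get(m.group(1));
            if (expansion == null) {
                throw new IllegalArgumentException("Undefined entity: " + m.group());
            }
            out.append(serialised, last, m.start()).append(expansion);
            last = m.end();
        }
        return out.append(serialised.substring(last)).toString();
    }

    /** Writing: replaces each known expansion by its reference (naive substring match). */
    String encode(String inMemory) {
        String result = inMemory;
        for (Map.Entry<String, String> e : entities.entrySet()) {
            result = result.replace(e.getValue(), "&" + e.getKey() + ";");
        }
        return result;   // text with no matching definition is left as it is
    }

    public static void main(String[] args) {
        MiniLexicon lexicon = new MiniLexicon();
        lexicon.define("copy", "\u00A9");
        lexicon.define("ajhs", "Anthony J H Simons, MA PhD");
        String text = lexicon.expand("&copy; &ajhs;");
        System.out.println(text);
        System.out.println(lexicon.encode(text));
    }
}
```

The substring-based encode is deliberately simplistic; a production encoder would need to handle overlapping expansions and the five pre-defined references with more care.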

Creating a Doctype Definition (DTD)

An XML document may optionally include a Doctype node, which declares a grammar of expected elements and attributes, starting from the root element. Grammar definitions may be provided as part of the Doctype node (the internal subset), or may be provided in an external DTD file referenced by the Doctype node (the external subset). The general shape of an internal Doctype Definition is as shown below:

	<!DOCTYPE Catalogue [
  	    <!ELEMENT Catalogue (Film | TVShow)*>
	    <!ELEMENT Film (Title, Director)>
	    <!ELEMENT TVShow (Title, Director)>
            <!ELEMENT Title (#PCDATA)>
	    <!ELEMENT Director (#PCDATA)>
            <!ATTLIST Film
		date CDATA #REQUIRED
		rating (U|PG|12|15|18) #IMPLIED>
            <!ATTLIST TVShow
		date CDATA #REQUIRED
		rating (U|PG|12|15|18) #IMPLIED>
	    <!ENTITY copy "&#169;">
	]>

This uses a BNF style to assert that a conforming document must have a particular grammar. The DOCTYPE node asserts that the root node must be Catalogue. The ELEMENT nodes assert that the Catalogue element contains a heterogeneous list of zero or more Film or TVShow elements. Each of these contains a sequence of Title and Director elements; and these leaf-nodes contain only text (indicated by #PCDATA). Furthermore, the ATTLIST nodes assert that the Film and TVShow elements must have a compulsory date attribute (indicated by #REQUIRED) and may have an optional rating attribute (indicated by #IMPLIED). Whereas the date value may be any text, the rating must be chosen from a restricted choice of symbols (enumerated above). Finally, the ENTITY node allows the copyright symbol to be abbreviated by &copy; within the XML document.
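As a point of comparison outside JAST, the same internal Doctype can be exercised with the DOM parser that ships with the JDK. The sketch below is not JAST code: it uses the standard javax.xml.parsers API, and the class name DtdCheck, the method isValid and the sample documents are invented for illustration. It shows the effect of the grammar: a Film with its required date is accepted, and one without is rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdCheck {
    /** The internal Doctype Definition from the text (single-quoted for Java). */
    static final String DOCTYPE =
          "<!DOCTYPE Catalogue [\n"
        + "  <!ELEMENT Catalogue (Film | TVShow)*>\n"
        + "  <!ELEMENT Film (Title, Director)>\n"
        + "  <!ELEMENT TVShow (Title, Director)>\n"
        + "  <!ELEMENT Title (#PCDATA)>\n"
        + "  <!ELEMENT Director (#PCDATA)>\n"
        + "  <!ATTLIST Film date CDATA #REQUIRED rating (U|PG|12|15|18) #IMPLIED>\n"
        + "  <!ATTLIST TVShow date CDATA #REQUIRED rating (U|PG|12|15|18) #IMPLIED>\n"
        + "  <!ENTITY copy '&#169;'>\n"
        + "]>\n";

    /** Returns true if the given root element conforms to the Doctype above. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);               // request DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = { true };
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { valid[0] = false; }  // validity error
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return valid[0];
        } catch (SAXException e) {
            return false;                              // not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String film = "<Title>A Title</Title><Director>A Director</Director></Film>";
        System.out.println(isValid("<Catalogue><Film date='1982' rating='15'>" + film + "</Catalogue>"));
        System.out.println(isValid("<Catalogue><Film rating='15'>" + film + "</Catalogue>"));
    }
}
```

In the JDK parser, validity errors are reported to the error handler rather than thrown, which is why the sketch records them in a flag.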

The same Doctype definition may be provided externally, as the DTD file Catalogue.dtd. In this case, the Doctype Definition within the XML file will be much shorter, and will refer to the external DTD file, in which all the above ELEMENT, ATTLIST and ENTITY declarations are listed (viz. the DTD file lists the contents enclosed by the square brackets [] above, excluding the brackets). There are two possible formats for linking to the external subset:

	<!DOCTYPE Catalogue SYSTEM "Catalogue.dtd">

In this simple format, the DOCTYPE node only names the root Catalogue element, and declares a system identifier, which is the pathname to the referenced DTD file. Here, we assume that the DTD file is found locally, relative to the XML file; in general, the path name could be longer, or even a URL. The second format is more complex:

	<!DOCTYPE Catalogue PUBLIC 
		"-//AJHS//DTD Catalogue 1.0//EN" "Catalogue.dtd">

This is used when a Doctype is published more generally, and is given a formal public identifier as well as the path to the DTD file. Formal public identifiers conform to a restricted syntax that indicates the owner, and the type of content referenced - see this explanation of Formal Public Identifiers on Wikipedia.

A Doctype may have both an external subset and an internal subset. If both parts are present, then the internal definitions take precedence over external definitions (that define the same thing). This allows the programmer to provide a general Doctype Declaration for a set of XML documents, but to customise what contents are allowed within specific documents, if they declare an internal subset. Please refer to the W3C Tutorial on DTD for more information.

Creating an XML Schema Definition (XSD)

An XML document may optionally refer to an external XML Schema Definition. An XML Schema is itself a well-formed XML document. An example of an XML Schema Definition, provided as a separate XSD file Catalogue.xsd, is given below:

	<?xml version="1.0" encoding="UTF-8"?>
	<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
		elementFormDefault="qualified">

  	    <xs:element name="Catalogue">
		<xs:complexType>
		    <xs:choice maxOccurs="unbounded">
			<xs:element name="Film" type="showType"/>
			<xs:element name="TVShow" type="showType"/>
		    </xs:choice>
		</xs:complexType>
	    </xs:element>
  
	    <xs:complexType name="showType">
		<xs:sequence>
		    <xs:element name="Title" type="xs:string"/>    
		    <xs:element name="Director" type="xs:string"/>
		</xs:sequence>
		<xs:attribute name="date" use="required" type="xs:integer"/>
		<xs:attribute name="rating" use="optional" type="ratingType"/>
	    </xs:complexType>
    
	    <xs:simpleType name="ratingType">
		<xs:restriction base="xs:string">
		    <xs:enumeration value="U"/>
		    <xs:enumeration value="PG"/>
		    <xs:enumeration value="12"/>
		    <xs:enumeration value="15"/>
		    <xs:enumeration value="18"/>
		</xs:restriction>
	    </xs:simpleType>
	</xs:schema>

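This defines a grammar that applies essentially the same constraints as the Doctype grammar above. The root Catalogue element may contain an unbounded choice of Film and TVShow elements, which are of the same showType. This type contains a sequence of elements Title and Director, both of the simple string type, and two attributes, one of which is a mandatory integer (date) and the other of which is an optional enumeration (rating), of the type ratingType. The schema is slightly stricter than the DTD in two respects: the date must be an integer rather than arbitrary text and, because minOccurs defaults to 1 in XSD, the Catalogue must contain at least one Film or TVShow.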
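For comparison outside JAST, the schema can also be tried with the JDK's standard javax.xml.validation API. The sketch below is not JAST code, and the names XsdCheck, compile and isValid are invented for illustration. It assumes the standard W3C namespace URI http://www.w3.org/2001/XMLSchema and writes the optional attribute as use="optional", which is what the JDK validator expects; the schema is otherwise the one described above, written with single quotes to keep the Java string readable.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XsdCheck {
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>"
        + "<xs:element name='Catalogue'><xs:complexType>"
        + "<xs:choice maxOccurs='unbounded'>"
        + "<xs:element name='Film' type='showType'/>"
        + "<xs:element name='TVShow' type='showType'/>"
        + "</xs:choice></xs:complexType></xs:element>"
        + "<xs:complexType name='showType'>"
        + "<xs:sequence>"
        + "<xs:element name='Title' type='xs:string'/>"
        + "<xs:element name='Director' type='xs:string'/>"
        + "</xs:sequence>"
        + "<xs:attribute name='date' use='required' type='xs:integer'/>"
        + "<xs:attribute name='rating' use='optional' type='ratingType'/>"
        + "</xs:complexType>"
        + "<xs:simpleType name='ratingType'><xs:restriction base='xs:string'>"
        + "<xs:enumeration value='U'/><xs:enumeration value='PG'/>"
        + "<xs:enumeration value='12'/><xs:enumeration value='15'/>"
        + "<xs:enumeration value='18'/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:schema>";

    /** Compiles the schema text; a broken schema is a programming error here. */
    static Schema compile() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema itself is invalid", e);
        }
    }

    /** Returns true if the XML text is well-formed and conforms to the schema. */
    static boolean isValid(String xml) {
        try {
            compile().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                  // malformed, or violates the schema
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String film = "<Title>A Title</Title><Director>A Director</Director></Film>";
        System.out.println(isValid("<Catalogue><Film date='1982' rating='PG'>" + film + "</Catalogue>"));
        System.out.println(isValid("<Catalogue><Film date='soon'>" + film + "</Catalogue>"));
    }
}
```

Unlike the DTD case, the JDK schema validator throws on the first violation by default, so no error handler is needed to detect an invalid document.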

There are many styles of presenting an XML Schema; this is just one, which chooses to decouple some element definitions from the separate type definitions of those elements. The various styles have evocative names, such as: Russian Doll, Salami Slice, or Venetian Blind. Please refer to the W3C Tutorial on XSD for more information.

The XML Schema Definition file is referenced through certain distinguished attributes of the root element of the XML document. There are two styles, with and without the declaration of a local namespace for the owner of the schema. The first format simply declares where the XSD file is located:

	<Catalogue 
	    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	    xsi:noNamespaceSchemaLocation="Catalogue.xsd">
        ...
	</Catalogue>

In this simpler format, the xmlns:xsi attribute declares a namespace (standing for XML schema instance), and the second attribute xsi:noNamespaceSchemaLocation is defined within this namespace and refers, through its value, to the path leading to the XSD file. In the second format, an additional default namespace is defined for the owner of the XML schema:

	<Catalogue xmlns="https://www.my.domain.org"
	    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	    xsi:schemaLocation="Catalogue.xsd">
        ...
	</Catalogue>

This declares that the elements defined within the XML document (and hence the schema also) come from the namespace https://www.my.domain.org. Note that the attribute referencing the pathname to the XSD file is given the shorter name xsi:schemaLocation instead. Either of the two formats above is appropriate. Please refer to the W3C Tutorial on XSD for more information.

Level of Compliance to W3C Standards

JAST supports the creation of DOCTYPE nodes that contain a sequence of ELEMENT, ATTLIST and ENTITY definitions. The full BNF syntax for ELEMENT definitions is supported, including sequence, selection and iteration of single items or bracketed () structures. Multiplicity markers may specify optional ?, zero-to-many *, or one-to-many + occurrences. ELEMENT definitions may contain the EMPTY category marker to indicate empty content, or ANY category marker to indicate arbitrary content. See the W3C Tutorial on DTD for further details.

ATTLIST definitions may declare single, or multiple attributes for each element within the same declaration. Each defined attribute has a name, an attribute type, and either an occurrence specifier or a default value. Attribute types may be symbolic types such as ID, NMTOKEN or CDATA; or they may be enumerated selections. An occurrence specifier is either #REQUIRED (compulsory) or #IMPLIED (optional). The specifier #FIXED must be followed by a fixed value. Any other value is interpreted as a default value.

ENTITY definitions may declare general entities (but not parameter entities, which are not supported). General entities may be a character entity reference (for an escaped character) or a string entity reference (for boilerplate text). An ENTITY may declare an internal entity as shown above, or through either SYSTEM or PUBLIC identifiers, it may refer to an external entity, whose text expansion is given in a separate file. As a security restriction, the expansion-text must be stored in a simple text file ending with the extension .txt, to prevent the malicious use of external entities to read secret password files.

JAST supports the analysis of XML schemas written in a variety of styles, and containing a wide range of W3C XSD constructions. In terms of style, it accepts any of the Russian Doll, Salami Slice or Venetian Blind conventions for presenting a schema and also handles mixtures of these styles. It supports simple types, complex types, groups, attributes and attribute groups. It supports sequence, choice, all and any (element) specifiers. It supports the iterative constructs minOccurs and maxOccurs. It supports extension and restriction of simple content and complex content. Currently, the anyAttribute construction is not supported.

The majority of the W3C XSD type system is implemented in terms of filters that can be used to constrain the values of attributes or elements. All IEEE numerical types are supported. Most XSD basic types are supported (except for NOTATION and base64Binary). XSD simple types are typically one of these basic types, or a restriction on one of these types, expressed either as an enumeration, a regular expression, or a numerical subrange. Both subranges and field-widths may be specified. Please refer to the W3C Tutorial on XSD for more information about the XSD type system. Please refer to the XML Filter Guide for information about JAST filters.

Visualisation of Compiled Grammars

A Doctype Definition is compiled into a single tree of grammar rules. An XML Schema Definition is compiled into a graph of grammar rules which may have more than one root entry point (XML Schemas do not identify one root node; so may be created to validate a set of related XML documents). When validating an XML document from a Doctype, the JAST toolkit applies the single-rooted grammar to the root element of the document, reporting whether it complies. When validating an XML document from a Schema, the JAST toolkit applies the relevant root rule chosen from the Schema, according to the name of the matching root element.

Once a grammar has been compiled (by analysing the DTD or XSD), the Doctype or XMLSchema node may be inspected, to access the compiled grammar. Doctype provides the method getGrammar(), and XMLSchema provides the method getGrammar(String root) to access the grammar for the named root element. Both of these methods return an ElementRule, the top-level element rule in the grammar.

It is possible to visualise the grammar compiled by a Doctype or XMLSchema. Every top-level ElementRule in the grammar is capable of writing a pretty-printed representation of itself, using a toString() method. This will print out one production from the grammar. To see the whole grammar, call the access method getProductions() on the top-level rule, and this will return a list of all the productions in the grammar. It is a simple matter to iterate through the list and print out each rule as a production in BNF format:

        Doctype doctype = ... ;         // Obtained somehow
        ElementRule dtdRule = doctype.getGrammar();
        for (ElementRule rule : dtdRule.getProductions()) {
            System.out.println(rule);
        }

        XMLSchema schema = ... ;        // Obtained somehow
        ElementRule xsdRule = schema.getGrammar("Catalogue");
        for (ElementRule rule : xsdRule.getProductions()) {
            System.out.println(rule);
        }
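To illustrate what collecting the productions of a grammar involves, here is a self-contained sketch using a hypothetical Rule class. It is not the JAST ElementRule: the field names, the productions() method and the printed "::=" notation are all assumptions made for this example. It shows one plausible way to gather every rule reachable from a top-level rule exactly once, even when rules are shared (as Title and Director are) or when the grammar is a graph rather than a tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical grammar rule; the real JAST ElementRule may differ. */
class Rule {
    final String name;
    final String content;        // content model, in DTD-like notation
    final List<Rule> children;   // rules for the elements named in the content model

    Rule(String name, String content, Rule... children) {
        this.name = name;
        this.content = content;
        this.children = Arrays.asList(children);
    }

    /** Returns this rule and every rule reachable from it, each once, in first-visit order. */
    List<Rule> productions() {
        Set<Rule> seen = new LinkedHashSet<>();
        collect(this, seen);
        return new ArrayList<>(seen);
    }

    private static void collect(Rule rule, Set<Rule> seen) {
        if (seen.add(rule)) {                    // skip rules already visited
            for (Rule child : rule.children) {
                collect(child, seen);
            }
        }
    }

    @Override
    public String toString() {
        return name + " ::= " + content;
    }

    public static void main(String[] args) {
        Rule title = new Rule("Title", "#PCDATA");
        Rule director = new Rule("Director", "#PCDATA");
        Rule film = new Rule("Film", "(Title, Director)", title, director);
        Rule tvShow = new Rule("TVShow", "(Title, Director)", title, director);
        Rule catalogue = new Rule("Catalogue", "(Film | TVShow)*", film, tvShow);
        for (Rule rule : catalogue.productions()) {
            System.out.println(rule);
        }
    }
}
```

Tracking visited rules in a set is what keeps the traversal finite if a schema-derived grammar contains recursive element definitions.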

It is possible to re-invoke validation programmatically. Both Doctype and XMLSchema have methods accept(Document doc) and accept(Element root), which return true if the document (or root element) is valid and false otherwise. They also have the methods validate(Document doc) and validate(Element root), which succeed silently if the document (or root element) is valid and raise an exception if it is invalid.

Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom