The XML Document Validation Engine
The JAST 2.5 Toolkit provides optional components
for validating XML documents. By default, the XML parsers only check XML documents
for well-formedness (viz. that correctly-paired markup tags are used, that XML
elements are correctly nested and that the syntax for identifiers and punctuation
is correct). Sometimes it is useful to perform a stronger check, to ensure that
the expected kinds of element and attribute are present in a document. Validation
can be requested by changing settings in the XML readers XMLReader or
ASTReader, or in builders inheriting from BasicBuilder,
found in the top-level package: uk.ac.sheffield.jast.
Two different validation schemes are supported. The tools can validate
in-memory DOM-trees against either a Document Type Definition (DTD) or an XML Schema
Definition (XSD). The DTD approach is older, but also offers the possibility of
defining additional abbreviations for foreign characters, or boilerplate text,
called Entity References. The XSD approach is newer, more complex and
more flexible, although it does not handle Entity References. Both
validation schemes use a common set of components for building a grammar, defined
in the package:
uk.ac.sheffield.jast.valid. Once you have understood the basic
validation concepts presented in this quick-start introduction, please refer
to the JAST 2.5 package APIs for more detailed
information.
Levels of XML Document Validation
All the XML parsers, readers and builders offer the method:
setValidation(Validation level), which controls the validation
level. The argument to this method is an enumerated value of the
Validation type. Different values correspond to different
settings, instructing parsers to use a different level of validation:
IGNORE: the default setting, in which all
validation is turned off (and referenced DTDs or XSDs will not be read,
which saves document processing time);
EXPAND: the lowest setting, in which Entity
References will be expanded (and referenced DTDs will be read, which will
slightly increase the document processing time);
DOCTYPE: the middle setting, in which Entity
References will be expanded and the document will also be validated
against a Doctype Definition (and hence referenced DTDs will be read);
SCHEMA: the highest setting, in which Entity
References will be expanded and the document will also be validated
against an XML Schema Definition (and hence referenced DTDs and XSDs will be read).
Note that these settings are mostly cumulative - the higher settings all
support prior expansion of Entity References. If Doctype validation
is selected, the parser will look for a Doctype definition (either internal, or
external) attached to the XML document. If Schema validation is selected, the
parser will look for both the Doctype (to expand any Entity References)
and for a referenced XML Schema document (to validate the XML document).
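The "mostly cumulative" behaviour can be summarised in a small table. The enum below is a hypothetical sketch written for this guide (the name LevelSketch and its fields are assumptions, and it is not the JAST Validation type); it simply records what the descriptions above say each level does:

```java
/** Hypothetical summary of the four levels described above; not the JAST Validation enum. */
enum LevelSketch {
    //      expand  DTD-valid  XSD-valid
    IGNORE (false,  false,     false),
    EXPAND (true,   false,     false),
    DOCTYPE(true,   true,      false),
    SCHEMA (true,   false,     true);

    final boolean expandsEntities;
    final boolean validatesAgainstDtd;
    final boolean validatesAgainstXsd;

    LevelSketch(boolean expandsEntities, boolean validatesAgainstDtd, boolean validatesAgainstXsd) {
        this.expandsEntities = expandsEntities;
        this.validatesAgainstDtd = validatesAgainstDtd;
        this.validatesAgainstXsd = validatesAgainstXsd;
    }

    /** A DTD is read whenever entities must be expanded, even if it is not used to validate. */
    boolean readsDtd() {
        return expandsEntities;
    }

    /** An XSD is read only when schema validation is requested. */
    boolean readsXsd() {
        return validatesAgainstXsd;
    }
}
```

Read this way, SCHEMA consults the Doctype only to expand Entity References, which is why the levels are described as mostly, rather than strictly, cumulative.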
Enabling XML Document Validation
The main entry point to validation is by controlling the validation settings
in the XML parsers XMLReader (for DOM-trees), ASTReader
(for AST-trees) or in the builder BasicBuilder (for SAX-parsing) found
in the top-level package: uk.ac.sheffield.jast. Since the validation
engine acts on an in-memory DOM-tree, technically only DOM-trees may be validated.
However, the lower validation settings are possible with other parsers or builders
that create AST-trees, to allow expansion of Entity References as the
XML file is being read.
In order to activate automatic document validation, one of the validation
settings must be specified, before the XML document is actually read. The
following example asks XMLReader to validate the XML document against
a Document Type Definition (and will expand Entity References):
File file = new File("my/xml/input.xml"); // Or whatever file
XMLReader reader = new XMLReader(file);
reader.setValidation(Validation.DOCTYPE); // Request DTD validation
Document document = reader.readDocument();
reader.close();
The following example asks ASTReader just to expand Entity
References as it reads the XML file (see below for an explanation of
Entity References). This is the highest level of validation that can
be supported, since the constructed AST is of a custom format, unknown to the
JAST toolkit:
File file = new File("my/xml/input.xml"); // Or whatever file
ASTReader reader = new ASTReader(file);
reader.setValidation(Validation.EXPAND); // Request entity expansion
reader.usePackage("org.my.catalogue");
Catalogue catalogue = (Catalogue) reader.readDocument();
reader.close();
The following example uses a SAX-builder strategy, but since the parser
is given the XMLBuilder that constructs a full DOM-tree, then
full validation is possible, so we may ask it to validate the XML document
against an XML Schema Definition (and expand any Entity References):
File file = new File("my/xml/input.xml"); // Or whatever file
// Create the builder
XMLBuilder builder = new XMLBuilder();
builder.setValidation(Validation.SCHEMA); // Request XSD validation
// Set up the parser
XMLParser parser = new XMLParser(file);
parser.setBuilder(builder); // Attach the builder
Document document = (Document) parser.readDocument();
parser.close();
Requesting any level of document validation will cause further files to
be opened. An external Doctype definition must be read and analysed before
the main XML file, so that Entity References may be expanded. An
external XML Schema file may be read after the main XML file. A grammar
will be compiled from either of these sources, then applied to the in-memory
DOM-tree (if full validation was requested).
Expanding Entity References
General Entity References are supported in the JAST 2.5 toolkit.
In XML, an Entity is any kind of non-XML datum that must be escaped
within an XML document. Some examples of these are the punctuation characters
used during mark-up, which cannot be included literally in a document, but
which can be escaped:
&lt; the less than character <
&gt; the greater than character >
&amp; the ampersand character &
&quot; the quotation character "
&apos; the apostrophe character '
These punctuation characters (shown on the right) are escaped in XML, by
encoding them using the standard Entity References (shown on the left).
If it is desired to use any of these characters literally in a document, their
escaped forms must be used instead. These five Entity References are
pre-defined in XML, so may always be used.
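As an illustration of the five pre-defined references, the following minimal sketch escapes literal punctuation using only the JDK. The class name PredefinedEntities and its method are assumptions made for this example; JAST performs this encoding itself when writing XML, so this is not the toolkit's own code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps the five special characters to their pre-defined references. */
final class PredefinedEntities {

    private static final Map<Character, String> ESCAPES = new LinkedHashMap<>();
    static {
        ESCAPES.put('&', "&amp;");
        ESCAPES.put('<', "&lt;");
        ESCAPES.put('>', "&gt;");
        ESCAPES.put('"', "&quot;");
        ESCAPES.put('\'', "&apos;");
    }

    /** Replaces each special character with its Entity Reference; other characters pass through. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            String reference = ESCAPES.get(c);
            if (reference != null) {
                out.append(reference);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c"));   // prints: a &lt; b &amp; c
    }
}
```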
Further Entity References may be defined explicitly by the programmer,
for example, to encode special foreign language characters, or even to encode standard
boilerplate text by an abbreviation. These extra entity declarations will appear as
part of a Doctype definition (see below). Some examples include:
<!ENTITY copy "©">
<!ENTITY ajhs "Anthony J H Simons, MA PhD">
The first of these defines a Character Entity Reference that encodes
the copyright © symbol. The second of these defines a String Entity
Reference that encodes an abbreviation for my full name and qualifications.
The idea is that the XML document may include the encoded references, which will
be replaced automatically by their expanded forms when the text is read into
memory. An Entity Reference always begins with
the ampersand & symbol and ends with a semicolon ;
punctuation mark:
&copy; expands to: ©
&ajhs; expands to: Anthony J H Simons, MA PhD
All Entity References are stored in a Lexicon object,
which is an integral part of every reader, writer and builder. These classes
offer the methods:
Lexicon getLexicon() and setLexicon(Lexicon obj)
to retrieve, or reset their internal lexicon. While it is possible to manipulate
lexicons through the Lexicon API directly, this is seldom necessary.
It is more usual to allow a reader to populate its lexicon with any new Entity
References discovered in its Doctype Definition; and then the programmer may
transfer the lexicon as a whole to a writer, which will re-encode the output XML
file using the same entity definitions.
When an XML document is read, all of its Entity References are
expanded (if expansion has been requested). The text stored in-memory will
contain the expanded characters or boilerplate text; that is, Entity
References only exist in the serialised XML file. Assuming that the same
Lexicon is used for both reading and writing, when an XML document
is written out, the reverse process happens: any text that needs to be escaped
will be escaped as Entity References in the XML file. If an expected
Entity Reference definition is missing, this will raise an exception
in a reader, but a writer will simply fail to encode text that has no
corresponding Entity Reference definition.
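The expansion step described above can be pictured as a lookup table applied to every &name; reference in the text. The sketch below is a hypothetical stand-in written with the JDK only; the class EntityExpansionSketch and its methods are assumptions and do not reproduce the JAST Lexicon API. It handles named references only (not numeric ones) and, like a reader, raises an exception when a definition is missing:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical table of entity definitions; not the JAST Lexicon class. */
final class EntityExpansionSketch {

    // Matches a named reference such as &copy; and captures the name.
    private static final Pattern REFERENCE =
            Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    private final Map<String, String> definitions = new HashMap<>();

    EntityExpansionSketch() {
        // The five references that are pre-defined in XML.
        define("lt", "<");
        define("gt", ">");
        define("amp", "&");
        define("quot", "\"");
        define("apos", "'");
    }

    void define(String name, String expansion) {
        definitions.put(name, expansion);
    }

    /** Replaces every known reference by its expansion; fails on an unknown one. */
    String expand(String text) {
        Matcher matcher = REFERENCE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            String expansion = definitions.get(matcher.group(1));
            if (expansion == null) {
                throw new IllegalArgumentException(
                        "Undefined entity reference: " + matcher.group());
            }
            out.append(text, last, matcher.start()).append(expansion);
            last = matcher.end();
        }
        return out.append(text.substring(last)).toString();
    }
}
```

For example, after define("ajhs", "Anthony J H Simons, MA PhD"), the call expand("By &ajhs;") yields "By Anthony J H Simons, MA PhD".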
Creating a Doctype Definition (DTD)
An XML document may optionally include a Doctype node, which
declares a grammar of expected elements and attributes, starting from the root
element. Grammar definitions may be provided as part of the Doctype
node (the internal subset), or may be provided in an external DTD file
referenced by the Doctype node (the external subset).
The general shape of an internal Doctype Definition is as shown below:
<!DOCTYPE Catalogue [
<!ELEMENT Catalogue (Film | TVShow)*>
<!ELEMENT Film (Title, Director)>
<!ELEMENT TVShow (Title, Director)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ATTLIST Film
date CDATA #REQUIRED
rating (U|PG|12|15|18) #IMPLIED>
<!ATTLIST TVShow
date CDATA #REQUIRED
rating (U|PG|12|15|18) #IMPLIED>
<!ENTITY copy "©">
]>
This uses a BNF style to assert that a conforming document must have a
particular grammar. The DOCTYPE node asserts that the root
node must be Catalogue. The ELEMENT nodes assert
that the Catalogue element contains a heterogeneous list of zero
or more Film or TVShow elements. Each of these
contains a sequence of Title and Director elements;
and these leaf-nodes contain only text (indicated by #PCDATA).
Furthermore, the ATTLIST nodes assert that the
Film and TVShow elements must have a compulsory
date attribute (indicated by #REQUIRED)
and may have an optional rating attribute (indicated by
#IMPLIED). Whereas the date value may be any
text, the rating must be chosen from a restricted choice
of symbols (enumerated above). Finally, the ENTITY node
allows the copyright symbol to be abbreviated by &copy;
within the XML document.
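To see these constraints in action without the toolkit, the sketch below checks two small documents against an abbreviated copy of the internal subset, using the DTD-validating parser that ships with the JDK (javax.xml.parsers). This is a stand-in for comparison only and is not the JAST validation engine; the class name DtdCheckSketch and the sample documents are assumptions:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical check using the JDK's own parser, not JAST. */
final class DtdCheckSketch {

    // Internal subset, abbreviated from the Catalogue example (ENTITY omitted).
    static final String DOCTYPE = String.join("\n",
        "<!DOCTYPE Catalogue [",
        "<!ELEMENT Catalogue (Film | TVShow)*>",
        "<!ELEMENT Film (Title, Director)>",
        "<!ELEMENT TVShow (Title, Director)>",
        "<!ELEMENT Title (#PCDATA)>",
        "<!ELEMENT Director (#PCDATA)>",
        "<!ATTLIST Film date CDATA #REQUIRED rating (U|PG|12|15|18) #IMPLIED>",
        "<!ATTLIST TVShow date CDATA #REQUIRED rating (U|PG|12|15|18) #IMPLIED>",
        "]>");

    /** Returns true if the body, prefixed by the Doctype above, is valid. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);              // request DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + "\n" + body)));
            return true;
        } catch (SAXException e) {
            return false;                             // not well-formed, or not valid
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String children = "<Title>A</Title><Director>B</Director>";
        System.out.println(isValid(
            "<Catalogue><Film date=\"2001\" rating=\"PG\">" + children + "</Film></Catalogue>"));
        System.out.println(isValid(
            "<Catalogue><Film>" + children + "</Film></Catalogue>"));
    }
}
```

The first document satisfies the grammar; the second omits the compulsory date attribute and is rejected.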
The same Doctype definition may be provided externally, as the DTD file
Catalogue.dtd. In this case, the Doctype Definition within
the XML file will be much shorter, and will refer
to the external DTD file, in which all the above ELEMENT,
ATTLIST and ENTITY declarations are listed
(viz. the DTD file lists the contents enclosed by the square brackets
[] above, excluding the brackets). There are two possible
formats for linking to the external subset:
<!DOCTYPE Catalogue SYSTEM "Catalogue.dtd">
In this simple format, the DOCTYPE node only names the root
Catalogue element, and declares a system identifier,
which is the pathname to the referenced DTD file. Here, we assume that the
DTD file is found locally, relative to the XML file; in general, the path
name could be longer, or even a URL. The second format is more complex:
<!DOCTYPE Catalogue PUBLIC
"-//AJHS//DTD Catalogue 1.0//EN" "Catalogue.dtd">
This is used when a Doctype is published more generally, and is
given a formal public identifier as well as the path to the DTD file.
Formal public identifiers conform to a restricted syntax that indicates the
owner, and the type of content referenced - see the explanation of
Formal Public Identifiers on Wikipedia.
A Doctype may have both an external subset and an internal
subset. If both parts are present, then the internal definitions take
precedence over external definitions (that define the same thing). This
allows the programmer to provide a general Doctype Declaration for a set of
XML documents, but to customise what contents are allowed within specific documents,
if they declare an internal subset. Please refer to the
W3C
Tutorial on DTD for more information.
Creating an XML Schema Definition (XSD)
An XML document may optionally refer to an external XML Schema Definition.
An XML Schema is itself a well-formed XML document. An example of an XML Schema
Definition, provided as a separate XSD file Catalogue.xsd, is given
below:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:element name="Catalogue">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="Film" type="showType"/>
<xs:element name="TVShow" type="showType"/>
</xs:choice>
</xs:complexType>
</xs:element>
<xs:complexType name="showType">
<xs:sequence>
<xs:element name="Title" type="xs:string"/>
<xs:element name="Director" type="xs:string"/>
</xs:sequence>
<xs:attribute name="date" use="required" type="xs:integer"/>
<xs:attribute name="rating" use="optional" type="ratingType"/>
</xs:complexType>
<xs:simpleType name="ratingType">
<xs:restriction base="xs:string">
<xs:enumeration value="U"/>
<xs:enumeration value="PG"/>
<xs:enumeration value="12"/>
<xs:enumeration value="15"/>
<xs:enumeration value="18"/>
</xs:restriction>
</xs:simpleType>
</xs:schema>
This defines a grammar that applies broadly the same constraints as the Doctype grammar
above (the date attribute is additionally restricted to an integer). The root
Catalogue element may contain an unbounded choice
of Film and TVShow elements, which are of the same
showType. This type contains a sequence of elements Title
and Director, both of the simple string type, and two attributes,
one of which is a mandatory integer (date) and the other of which is
an optional enumeration (rating), of the type ratingType.
There are many styles of presenting an XML Schema; this is just one, which chooses
to decouple some element definitions from the separate type definitions of those
elements. The various styles have evocative names, such as: Russian Doll,
Salami Slice, or Venetian Blind. Please refer to the
W3C
Tutorial on XSD for more information.
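For comparison, the sketch below checks documents against a condensed copy of this schema using the JDK's javax.xml.validation package rather than JAST. It is an illustration only: the class name XsdCheckSketch and the sample documents are assumptions, and the schema text is written in the form the JDK validator accepts, namely the W3C namespace URI http://www.w3.org/2001/XMLSchema, use="optional" for the rating attribute, and minOccurs="0" on the choice so that an empty catalogue is allowed:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical check using the JDK's schema validator, not JAST. */
final class XsdCheckSketch {

    // Condensed copy of the Catalogue schema (single quotes avoid Java escapes).
    static final String XSD = String.join("\n",
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>",
        "<xs:element name='Catalogue'><xs:complexType>",
        "<xs:choice minOccurs='0' maxOccurs='unbounded'>",
        "<xs:element name='Film' type='showType'/>",
        "<xs:element name='TVShow' type='showType'/>",
        "</xs:choice></xs:complexType></xs:element>",
        "<xs:complexType name='showType'>",
        "<xs:sequence>",
        "<xs:element name='Title' type='xs:string'/>",
        "<xs:element name='Director' type='xs:string'/>",
        "</xs:sequence>",
        "<xs:attribute name='date' use='required' type='xs:integer'/>",
        "<xs:attribute name='rating' use='optional' type='ratingType'/>",
        "</xs:complexType>",
        "<xs:simpleType name='ratingType'><xs:restriction base='xs:string'>",
        "<xs:enumeration value='U'/><xs:enumeration value='PG'/>",
        "<xs:enumeration value='12'/><xs:enumeration value='15'/><xs:enumeration value='18'/>",
        "</xs:restriction></xs:simpleType>",
        "</xs:schema>");

    /** Returns true if the XML text is valid against the schema above. */
    static boolean isValid(String xml) {
        Validator validator;
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            validator = schema.newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("Schema did not compile", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                 // the document violates the schema
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A Film with date='1999' and rating='15' is accepted, whereas date='soon' fails the integer type and rating='X' fails the enumeration.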
The XML Schema Definition file is referenced through certain distinguished
attributes of the root element of the XML document. There are two styles, with
and without the declaration of a local namespace for the owner of the schema.
The first format simply declares where the XSD file is located:
<Catalogue
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="Catalogue.xsd">
...
</Catalogue>
In this simpler format, the xmlns:xsi attribute declares a namespace
(standing for XML schema instance), and the second attribute
xsi:noNamespaceSchemaLocation is defined within this namespace and
refers, through its value, to the path leading to the XSD file. In the second
format, an additional default namespace is defined for the owner of the
XML schema:
<Catalogue xmlns="https://www.my.domain.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="Catalogue.xsd">
...
</Catalogue>
This declares that the elements defined within the XML document (and hence the
schema also) come from the
namespace https://www.my.domain.org. Note that the attribute referencing
the pathname to the XSD file is given the shorter name xsi:schemaLocation
instead. Either of the two formats above is appropriate. Please refer to the
W3C
Tutorial on XSD for more information.
Level of Compliance to W3C Standards
JAST supports the creation of DOCTYPE nodes that contain a sequence
of ELEMENT, ATTLIST and ENTITY definitions.
The full BNF syntax for ELEMENT definitions is supported, including
sequence, selection and iteration of single items or bracketed ()
structures. Multiplicity markers may specify optional ?, zero-to-many
*, or one-to-many + occurrences. ELEMENT
definitions may contain the EMPTY category marker to indicate empty
content, or the ANY category marker to indicate arbitrary content. See the
W3C
Tutorial on DTD for further details.
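One way to picture sequence, selection and iteration in an ELEMENT content model is to translate the model into a regular expression over the names of an element's children. The sketch below does this with the JDK only. It is a teaching aid under stated assumptions, not the way JAST compiles its grammar rules: the class ContentModelSketch is hypothetical, and it handles element names, brackets, commas, | and the ? * + markers, but not #PCDATA, EMPTY or ANY:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical translation of a DTD content model into a regular expression. */
final class ContentModelSketch {

    // Tokens: an element name, or one of the punctuation marks of the model.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_.-]*|[()|?*+,]");

    static Pattern compile(String contentModel) {
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(contentModel);
        while (token.find()) {
            String t = token.group();
            if (t.equals("(")) {
                regex.append("(?:");              // bracketed structure
            } else if (t.equals(",")) {
                // sequence: plain concatenation, nothing to emit
            } else if (t.length() == 1 && ")|?*+".contains(t)) {
                regex.append(t);                  // selection and multiplicity
            } else {
                // an element name, followed by the separator used in accepts()
                regex.append("(?:").append(Pattern.quote(t)).append(" )");
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Tests whether a list of child element names satisfies the compiled model. */
    static boolean accepts(Pattern model, List<String> childNames) {
        StringBuilder joined = new StringBuilder();
        for (String name : childNames) {
            joined.append(name).append(' ');
        }
        return model.matcher(joined).matches();
    }
}
```

For instance, the model "(Film | TVShow)*" accepts an empty list of children as well as Film, TVShow, Film, while "(Title, Director)" accepts only Title followed by Director.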
ATTLIST definitions may declare single, or multiple attributes for
each element within the same declaration. Each defined attribute has a name, an
attribute type, and either an occurrence specifier or a default value. Attribute
types may be symbolic types such as ID, NMTOKEN or
CDATA; or they may be enumerated selections. An occurrence specifier
is either #REQUIRED (compulsory) or #IMPLIED (optional).
The specifier #FIXED must be followed by a fixed value. Any other
value is interpreted as a default value.
ENTITY definitions may declare general entities (but not
parameter entities, which are not supported). General entities may be a
character entity reference (for an escaped character) or a string entity
reference (for boilerplate text). An ENTITY may declare an
internal entity as shown above, or through either SYSTEM
or PUBLIC identifiers, it may refer to an external entity,
whose text expansion is given in a separate file. As a security restriction, the
expansion-text must be stored in a simple text file ending with the extension
.txt, to prevent the malicious use of external entities to read
secret password files.
JAST supports the analysis of XML schemas written in a variety
of styles, and containing a wide range of W3C XSD constructions. In terms
of style, it accepts any of the Russian Doll, Salami Slice or
Venetian Blind conventions for presenting a schema and also handles
mixtures of these styles. It supports simple types, complex types, groups,
attributes and attribute groups.
It supports sequence, choice,
all and any (element) specifiers. It supports the
iterative constructs minOccurs and maxOccurs. It
supports extension and restriction of simple
content and complex content. Currently, the anyAttribute
construction is not supported.
The majority of the W3C XSD type system is implemented in terms of filters
that can be used to constrain the values of attributes or elements. All IEEE
numerical types are supported. Most XSD basic types are supported (except for
NOTATION and base64Binary).
XSD simple types are
typically one of these basic types, or
a restriction on one of these types, expressed either as an enumeration, a
regular expression, or a numerical subrange. Both subranges and field-widths
may be specified. Please refer to the
W3C
Tutorial on XSD for more information about the XSD type system.
Please refer to the
XML Filter Guide for information
about JAST filters.
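The three kinds of restriction mentioned above (enumeration, regular expression and numerical subrange) can each be thought of as a test applied to the text of an attribute or element. The following sketch expresses them as standard java.util.function.Predicate objects. The class RestrictionSketch and its method names are assumptions made for illustration; the actual JAST filter classes are described in the XML Filter Guide and are not reproduced here:

```java
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical value tests, modelled on XSD restrictions; not the JAST filter API. */
final class RestrictionSketch {

    /** Accepts only the listed values, like a run of xs:enumeration facets. */
    static Predicate<String> enumeration(String... allowed) {
        Set<String> values = Set.of(allowed);
        return values::contains;
    }

    /** Accepts values that match the whole regular expression, like xs:pattern. */
    static Predicate<String> pattern(String regex) {
        Pattern compiled = Pattern.compile(regex);
        return value -> compiled.matcher(value).matches();
    }

    /** Accepts integers within an inclusive subrange, like xs:minInclusive and xs:maxInclusive. */
    static Predicate<String> integerRange(long min, long max) {
        return value -> {
            try {
                long number = Long.parseLong(value.trim());
                return min <= number && number <= max;
            } catch (NumberFormatException e) {
                return false;             // not an integer at all
            }
        };
    }

    public static void main(String[] args) {
        Predicate<String> rating = enumeration("U", "PG", "12", "15", "18");
        System.out.println(rating.test("PG"));   // true
        System.out.println(rating.test("X"));    // false
    }
}
```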
Visualisation of Compiled Grammars
A Doctype Definition is compiled into a single tree of grammar rules.
An XML Schema Definition is compiled into a graph of grammar rules which
may have more than one root entry point (XML Schemas do not identify one
root node; so may be created to validate a set of related XML documents).
When validating an XML document from a Doctype, the JAST toolkit applies
the single-rooted grammar to the root element of the document, reporting
whether it complies. When validating an XML document from a Schema, the
JAST toolkit applies the relevant root rule chosen from the Schema,
according to the name of the matching root element.
Once a grammar has been compiled (by analysing the DTD or XSD), the
Doctype or XMLSchema node may be inspected, to
access the compiled grammar. Doctype provides the method
getGrammar(), and XMLSchema provides the method
getGrammar(String root) to access the grammar for the named
root element. Both of these methods return an ElementRule,
the top-level element rule in the grammar.
It is possible to visualise the grammar compiled by a Doctype
or XMLSchema. Every top-level ElementRule in
the grammar is capable of writing a pretty-printed representation of
itself, using a toString() method. This will print out
one production from the grammar. To see the whole grammar, call the
access method getProductions() on the top-level rule, and
this will return a list of all the productions in the grammar. It is
a simple matter to iterate through the list and print out each rule
as a production in BNF format:
Doctype doctype = ... ; // Obtained somehow
ElementRule topRule = doctype.getGrammar();
for (ElementRule rule : topRule.getProductions()) {
System.out.println(rule);
}
The same approach applies to an XML Schema, naming the root element whose grammar is wanted:
XMLSchema schema = ... ; // Obtained somehow
ElementRule topRule = schema.getGrammar("Catalogue");
for (ElementRule rule : topRule.getProductions()) {
System.out.println(rule);
}
It is possible to re-invoke validation programmatically.
Both Doctype and XMLSchema have
methods accept(Document doc) and accept(Element
root), which return true if the document (or root
element) is valid and false otherwise. They also have the
methods validate(Document doc) and validate(Element
root), which succeed silently if the document (or root element) is
valid, but raise an exception if it is invalid.