control structure

  do xml-parse document
        (with id-checking switch-expression)?
        (with utf-8 switch-expression)?
        (creating xml-dtds{string-expression})?
        scan string-source-expression
     local-declaration*
     action*
  done

  do xml-parse scan string-source-expression
     local-declaration*
     action*
  done

  do xml-parse instance
        (with document-element string-expression)?
        (with (xml-dtds{string-expression} | current xml-dtd))?
        (with id-checking switch-expression)?
        scan string-source-expression
     local-declaration*
     action*
  done

do xml-parse is used to invoke the XML parser. A number of activities must occur within a do xml-parse block. In particular, the block must consume the markup source, either directly as #content or by executing exactly one parse continuation operator (%c or suppress) to fire markup rules.
Well-formed parsing (that is, XML parsing without validating against a DTD) is invoked by leaving out the document keyword and all other do xml-parse arguments except the scan keyword followed by the input.

  do xml-parse scan file #args[1]
     output "%c"
  done

Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. This use of the keyword instance is deprecated: the instance keyword should be reserved for validating against a pre-compiled DTD.
The instance supplied to the parser when performing a well-formed parse may still include a DTD: the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.
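As a rough illustration of this behaviour, the sketch below runs a well-formed parse over an instance that carries an internal DTD subset. The entity name greeting, the element name a, and the catch-all element #implied rule are all assumptions made for the example. Based on the description above, the expectation is that the entity reference is expanded from the definition in the DTD, while no structural validation is performed.

```omnimark
; Hypothetical example: well-formed parse of an instance that includes a DTD.
; The entity "greeting" and the element "a" are invented for this sketch.
process
   do xml-parse
      scan '<!DOCTYPE a [<!ENTITY greeting "Hello, World!">]>'
        || '<a>&greeting;</a>'
      output "%c"
   done

; Catch-all rule (assumed for this sketch): pass element content through.
element #implied
   output "%c"
```

Because the document keyword is omitted, the DTD here serves only as a source of entity definitions.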
The document keyword is used to configure the XML parser to validate its input against a DTD. The DTD is supplied as part of the input.

  do xml-parse document scan file #args[1]
     output "%c"
  done

This assumes that the file whose name is specified on the command-line contains an XML document. If the DTD
and the instance are in different files, they can be joined:

  do xml-parse document scan file #args[1] || file #args[2]
     output "%c"
  done

If the same DTD is to be used to parse several input instances, it is best to pre-compile the DTD and store it on the built-in xml-dtds shelf:

  do xml-parse document creating xml-dtds{"my-dtd"} scan file #args[1]
     suppress
  done

If the instance file names are stored on a shelf my-instances, then each instance can be processed in turn:

  repeat over my-instances
     do xml-parse instance with xml-dtds{"my-dtd"} scan file my-instances
        output "%c"
     done
  again

A nested XML parse can use the same DTD as an outer XML parse to validate its own input: for instance,

  process
     using group "one"
        do xml-parse document
           scan "<!DOCTYPE a ["
             || "<!ELEMENT a (#PCDATA | b)*>"
             || "<!ELEMENT b (#PCDATA)>]>"
             || "<a><b>Hello, World!</b></a>"
           output "%c"
        done

  group "one"
     element "a"
        using group "b"
           do xml-parse instance with current xml-dtd scan "<a>Salut, Monde!</a>"
              output "%c"
           done
        output "%c"

     element "b"
        output "%c"

  group "b"
     element "a"
        output "%c"

In this program, the XML parse launched in the element rule for a inside group one uses the same DTD as the parse launched in the process rule.
It is possible to parse a partial instance: a piece of data comprising an element from a DTD which is not the doctype element of that DTD. In this case, the element to be used as the effective doctype for parsing the data is specified using the document-element argument:

  do xml-parse instance with document-element "lamb" with xml-dtds{"my-dtd"} scan file #args[1]
     output "%c"
  done

XML comments, processing instructions and even marked sections can precede and follow the element's start and end tags, but anything else (particularly other elements, data, or entity references) is an error.
By default, OmniMark checks all XML IDREF attributes to make sure they reference valid IDs. This checking may not be appropriate when processing a partial instance. It also takes time. It can be disabled using with id-checking followed by a switch expression that evaluates to false. The following code will parse the specified document without checking IDREFs:

  do xml-parse document with id-checking false scan file #args[1]
     output "%c"
  done

The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark can process documents in any of these encodings: see Character set encoding for details.
One character encoding issue that arises in markup processing is the question of how numeric character entities must be encoded. When doing either validating parsing or well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. This behaviour can be configured when doing validating parsing: the parser can be configured to use either UTF-8 encoding, or to flag a markup error when it encounters a character entity whose value is greater than 255. This has a number of consequences.
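To have the parser instead flag a markup error for character entities whose value is greater than 255, the with utf-8 modifier should be passed to do xml-parse, with a switch expression that evaluates to false: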

  do xml-parse document with utf-8 false scan file #args[1]
     output "%c"
  done