Markup processing can be conceptually divided into separate steps: parsing (with optional validation),
filtering, and generating the final output. Some applications may have other distinct steps, such as analysis,
aggregation, and reporting. The parsing of markup is typically performed by the OmniMark actions do sgml-parse
or do xml-parse. An example using the well-formed XML parser might look like:
  process
     do xml-parse scan #main-input
        output "%c"
     done
If validation is desired or required, then the validating XML parser, the Xerces parser, or the SGML parser (as appropriate) can be used instead of the non-validating XML parser. Validating parsers validate the input document as they parse it. The remaining processing steps are usually accomplished by markup rules.
Validation can be performed separately from parsing, using a schema library such as OMRELAXNG. Separation of parsing and validation steps makes the processing pipeline more flexible: the parser is not required to perform all possible validations, validators are not required to perform parsing, and the user is free to combine any of the available parsers and validators as needed.
Legacy OmniMark programs typically process the markup coming from the parser immediately, so the body of the
do sgml-parse action contains a single output "%c" action to fire the markup rules. We can accomplish the same
thing by applying do markup-parse to our markup source:
  process
     do xml-parse scan #main-input
        do markup-parse #content
           output "%c"
        done
     done
In order to validate the markup coming from the parser, we can pass #content to a markup validator. For
instance, we can validate using OMRELAXNG:
  import "omrelaxng.xmd" prefixed by relaxng.

  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
           output #content
     done
The function relaxng.compile-schema used above reads a given textual representation of a RELAX NG schema and
returns its compiled representation as an instance of the relaxng.relaxng-schema-type opaque type. The compiled
schema is then passed to the markup sink function relaxng.validator, which validates all markup written into it
against the schema.
The same compiled schema can be used to validate multiple document instances:
  import "omrelaxng.xmd" prefixed by relaxng.

  process
     local relaxng.relaxng-schema-type my-compiled-schema initial {relaxng.compile-schema file "my-schema.rng"}

     repeat over #args as input-file
        do xml-parse scan file input-file
           using output as relaxng.validator against my-compiled-schema
              output #content
        done
     again
In the examples above, #content has only been validated, without any further processing. To accomplish both,
we need to send the parsed markup in two directions. This can be done by adding a markup sink function that
does the processing, and using the & operator to split the stream in two directions:
  define markup sink function
     markup-processor into value string sink destination
  as
     using output as destination
     using group "process markup"
     do markup-parse #current-input
        output "%c"
     done

  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                      & markup-processor into #current-output
           output #content
     done
Keep in mind that #content is a markup source, not a DOM tree of the document. This means that the markup is
streamed, as the parser creates it, to be both validated and processed concurrently. We can send the markup to
an arbitrary number of destinations. For instance, we can validate the markup against two different schemas and
process it in two different ways, and the markup will stream to all four concurrently from the parser:
  do xml-parse scan #main-input
     using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                   & relaxng.validator against relaxng.compile-schema file "my-schema-2.rng"
                   & markup-processor into #current-output
                   & another-markup-processor
        output #content
  done
Apart from adding more destinations to widen the processing pipeline, we can also extend the pipeline by
breaking it into multiple steps. For example, we could direct the markup into a preparatory filtering phase
before the main processing:
  define markup sink function
     prepare-markup into value markup sink destination
  as
     using group "prepare markup"
     do markup-parse #current-input
        output "%c"
     done

  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                      & prepare-markup into markup-processor into #current-output
           output #content
     done
The easiest way, however, to augment a legacy OmniMark program with schema validation is by using the function
relaxng.validated. This function takes the markup source created by a parser as an argument, and produces
another markup source that can be processed further. Here's an example of its use:
  process
     do xml-parse scan #main-input
        do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
           output "%c"
        done
     done
relaxng.validated makes the markup-processor function defined earlier unnecessary: the markup source produced
by validated can be processed by do markup-parse directly. Another difference between the functions validator
and validated is that the latter inserts all validation errors into the markup it produces, so in the previous
example a markup-error rule would fire for them. The validator function, on the other hand, reports all
validation errors to OmniMark's log stream. This behavior can be modified by specifying a different markup sink
destination for validation errors:
  define markup sink function
     error-processor into value string sink destination
  as
     using output as destination
     do markup-parse #current-input
        output "%c"
     done

  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                          report-errors-to error-processor into #current-output
                      & markup-processor
           output #content
     done

  markup-error log
     output "<!--%n"
         || " Validation error: " || #message || "%n"
         || "-->"
We've already said that a markup parser converts a string source to a markup source. Another way
of looking at a markup parser is as a converter from a concrete representation of a markup stream (SGML or XML,
with or without DTD) to its abstract representation. The abstract markup stream is simply a sequence of data
characters interspersed with abstract markup events. The markup events are abstract because their original
textual representation is abstracted away and only its meaning is kept. As a consequence, a markup schema
designed for validating XML can be equally applied to validating SGML:
  do sgml-parse document scan #main-input
     do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
        output "%c"
     done
  done
One problem with this example is that SGML is case-insensitive by default, while the RELAX NG specification
treats all names as case-sensitive. There are different solutions to this mismatch: one can either modify the
schema to all-uppercase names, or use an SGML declaration to specify that SGML names should be case-sensitive.
Since these solutions are somewhat intrusive, both validator and validated have an optional argument,
case-insensitive, that can be used to specify that case should be ignored during validation:
  do sgml-parse document scan #main-input
     do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
                                                case-insensitive true
        output "%c"
     done
  done
External text entity events coming from a markup parser require special attention. If the markup parser is to
proceed with parsing, it has to be supplied with replacement text for the external text entity in question. In
other words, an external-text-entity rule must be run, even if only the default one. To prevent mixups,
OmniMark also requires that exactly one external-text-entity rule be run for each entity reference. For this
reason, it is good practice to split external text entity events out of the markup stream meant for other
processing as soon as the stream is produced by the markup parser, and to divert them into a different markup
sink responsible for resolving and expanding the entities.
The function split-external-text-entities in the library OMMARKUPUTILITIES can help with this task. It takes
two markup sinks as arguments, one responsible for resolving external text entities and the other to handle the
rest of the markup stream. The result of this function is a markup sink which can be directly fed all markup
produced by the parser:
  define markup sink function
     entity-resolver
  as
     using group "resolve entities"
     do markup-parse #current-input
        output "%c"
     done

  process
     do sgml-parse document scan #main-input
        using output as split-external-text-entities (entity-resolver,
                           relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                             case-insensitive true
                                             report-errors-to error-processor into #current-output
                         & markup-processor into #current-output)
                      & markup-processor into #current-output
           output #content
     done

  group "resolve entities"
     external-text-entity #implied
        output file "%eq"
If you are using the function validated, the external text entity events will be ignored by the schema and
reproduced in the returned markup source. If you apply do markup-parse to the result of validated,
external-text-entity rules will fire as usual.