Grammar-Driven Markup Generation
Mario Blažević
Senior Software Developer
Abstract
This paper defines the concept of grammar-driven normalization of incomplete instances, sketches its implementation
for RELAX NG schema and XML documents, and presents an example of its practical use for automated document
conversion.
Note
I want to thank my colleagues Jacques Légaré, Joe Gollner, Patrick Baker, and Roy Amodeo for their
valuable comments, and Stilo International for providing the initial motivation for the work and giving me time to
do it properly.
Introduction
A grammar is a set of rules that describes the composition of a language. Grammars have existed for at least two
and a half millennia, but their use was mainly didactic. The practical applications of grammars started with their
mathematical formalization in the 1950s [c56] and the emergence of computer programming languages
[b59].
There are three ways to apply a grammar:
Validate: a given sentence to determine if it is grammatically correct.
Generate: a correct sentence according to the grammar.
Parse: a given sentence to determine its structure and constituent parts.
The historical motivation for development of the theory of formal languages was to explain the method of
generation of valid sentences as well as their recognition and parsing. Generative grammars are still an active area
of research in linguistics. In computer science, however, it is parsing that is the primary role of grammars; their
use for generation of valid sentences appears to be forgotten.
One reason for this lack of interest in grammar-driven generation is that the problem is poorly defined. The
output of a generator is a grammatically valid sentence. But what is its input, other than the grammar? The task of a
parser is clearer: while its output in practice can be anything, it is usually defined as a parse tree of some
form.
While many kinds of generator inputs are possible and interesting to contemplate, we shall concentrate on the
specific task of generating a grammatically valid sentence from a sequence of tokens (i.e., terminals) allowed by the grammar. Furthermore, the generator must exactly reproduce any
input that is already a grammatically valid sentence. We shall use the term normalization to refer to this form of sentence generation, and the term normalizer for the generator that performs it.
Given a normalizer that maps a set of token sequences T into a set of
canonical, grammatically valid sequences S, one can define a new language S' as the union of the core language S and the non-core
language T.
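Stated as a function signature, and only as an illustration (the Token type and the validity predicate below are placeholders, not part of any concrete implementation), a normalizer has roughly the following shape in Haskell:
type Token = String              -- placeholder; in our setting, XML events
-- A normalizer maps token sequences into grammatically valid sentences.
-- Its defining property: input that is already valid is reproduced
-- exactly. This stub exhibits only that property; a real normalizer
-- also inserts tokens to repair invalid input.
normalize :: ([Token] -> Bool) -> [Token] -> Maybe [Token]
normalize valid ts
  | valid ts  = Just ts          -- core language S: the identity
  | otherwise = Nothing          -- non-core language T: repair happens here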
For an actual example of this kind of language definition, we need look no further than SGML. An SGML parser
performs parsing, normalization, and validation at the same time. For example, given the following SGML DTD that
declares the end tag of element a and both tags of element b to be omissible:
<!element a - o (b)>
<!element b o o (#pcdata)>
an SGML parser with the OMITTAG feature enabled will parse the instance <a>Hello, World!
into a syntax tree corresponding to <a><b>Hello, World!</b></a>. Any conformant SGML parser
with support for omissible tags will consider these two instances as equivalent, as long as the same DTD is used to
parse them. If the DTD should change, however, an instance that omits element tags may become not only invalid but
unparsable.
For this reason and others, an XML parser does not perform any normalization. A well-formed XML instance is
always parsed in the same way according to a fixed grammar, regardless of what DTD or schema the instance happens to
also conform to. Various types of XML schemas are used only for validation, not for parsing or normalization.
This paper will demonstrate how an XML schema (a RELAX NG schema specifically) can be used to generate a valid
instance of the schema from an invalid but well-formed XML instance. In other words, we shall normalize a well-formed
XML instance so it conforms to a given schema.
The only element tags allowed in the input are those that occur in the target schema. If normalization succeeds
with no errors, the output of the normalizer will:
- be well-formed XML,
- conform to the target schema, and
- contain the entire input instance as its subsequence, interleaved with additional element tags.
The normalizer is much more forgiving than the SGML parser in what kinds of input it can normalize. The target
grammar is allowed to be ambiguous, though this is not recommended for performance reasons. There is no unique particle attribution constraint [w04], nor is the generation
restricted to contextually required elements [s86]. If the input
instance is ambiguous, i.e., there is more than one way to normalize it, the
normalization with the least amount of added markup will be chosen. This choice is not made at the point where the
particle is encountered, because subsequent content could invalidate it, but is deferred until the
enclosing element ends. The biggest constraint is that only plain element tags with no attributes can be
inserted. Even this constraint could be overcome by automatically generating some arbitrary grammatically correct
values for all required attributes, but we felt that this might cause more harm than benefit in practice.
The primary purpose of SGML's minimization features was to make SGML easier to author. We do not recommend using
the XML normalizer in this way; we feel that it is important that every XML document in permanent storage should
conform to a clearly expressed schema.
The legitimate place for the normalization process is inside a processing pipeline, where it can usefully be
applied to or produce a transient XML document instance. Normalization can thus be applied before another process in
order to transform the input document into a form easier to process. The target schema for the normalizer may be just
a stricter form of the input document's schema, or it may specify additional disambiguating markup.
Alternatively, a normalizer can appear at the opposite end of the processing pipeline, where it ensures that the
output of the preceding process conforms to the target schema. We shall demonstrate this latter arrangement in
practice in a later section.
The rest of the paper is organized as follows: the next section presents an example of the normalizer's use,
after which its design and implementation are explained. The section after that describes the purpose and
results of the normalizer in production, and the paper concludes with a discussion of related work.
Example
This section will demonstrate the capabilities of our normalizer
through an example. Let us take the following RELAX NG schema as our target:
start = document
block = p | ol | ul
document = element document { title, block+, section* }
section = element section { title, block+, section* }
title = element title { text }
p = element p { text }
ol = element ol { li+ }
ul = element ul { li+ }
li = element li { block+ }
The goal is to convert our input to a valid instance of this grammar. As for the input, it may start as a plain
text file like the following one:
Normalizer test input
This is a short test input for the normalizer.
Purpose
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
Constraints
The goal of the normalizer is to produce output that conforms to the schema,
while preserving complete input. It does not
always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
We first need to ensure that our input is a well-formed XML document, which can be achieved simply by putting the
entire text inside the <document>...</document> tags. If we then run this XML instance
through the normalizer, it will be converted to
<document>
<title></title>
<p>
Normalizer test input
This is a short test input for the normalizer.
Purpose
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
Constraints
The goal of the normalizer is to produce output that conforms to the schema,
while preserving complete input. It does not
always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</p>
</document>
As we can verify, the output indeed does conform to the target schema. The content model of
document requires a title element, for example, and this element has been inserted. Since it
has no content, though, it may not be all that useful.
In order to produce a more appropriate instance, we need to place certain constraints on the solution. One way
to do this would be to make the schema stricter. We could require the title content to be non-empty, and the paragraph
content to contain no line-breaks. Alternatively, we can specify the constraints by adding the desired markup to the
input document. For example, we can mark up all input lines that appear to be titles:
<document>
<title>Normalizer test input</title>
This is a short test input for the normalizer.
<title>Purpose</title>
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
<title>Constraints</title>
The goal of the normalizer is to produce output that conforms to the schema,
while preserving complete input. It does not
always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</document>
Given this enriched input together with the original schema, the normalizer produces the following
output:
<document>
<title>Normalizer test input</title>
<p>
This is a short test input for the normalizer.
</p>
<section>
<title>Purpose</title>
<p>
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
</p>
<section>
<title>Constraints</title>
<p>
The goal of the normalizer is to produce output that conforms to the schema,
while preserving complete input. It does not
always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</p>
</section>
</section>
</document>
The good news is that the normalizer has not only preserved our title markup, it has also
inferred that every title but the first one has to start a new section. The bad news is that the second section has
been nested within the first one. While this does conform to the target schema, the two sections belong at the same
level. Furthermore, we want every line to be a separate paragraph, and the apparent list at the end of the
document should be marked up as such.
In this simple example we could just visually determine where each section, paragraph, or list item starts and
ends and mark them up with XML element tags. In practice, however, we'd rather have this work done automatically. A
simple pattern-matching rule would be capable of inserting a start tag wherever a hyphen appears at the beginning of
a line, but determining where the element ends is much more difficult. To make this marking-up process easier, the normalizer
recognizes a number of processing instructions as output constraints. Once we apply them, our input document might
look like this:
<document>
<title>Normalizer test input</title>
<?derivative:start-anew <p>?>This is a short test input for the normalizer.
<?derivative:start-anew <section>?><title>Purpose</title>
<?derivative:start-anew <p>?>The purpose of this document is
to demonstrate the capabilities of the normalizer.
<?derivative:start-anew <p>?>The normalizer is schema-driven,
and it tries to fit its input into the target schema.
<?derivative:start-anew <section>?><title>Constraints</title>
<?derivative:start-anew <p>?>The goal of the normalizer is
to produce output that conforms to the schema,
while preserving complete input. It does not always succeed:
<?derivative:proceed-with <ul>?><?derivative:start-anew <li>?>
A piece of input that cannot be fit into the schema will be dropped.
<?derivative:proceed-with <ul>?><?derivative:start-anew <li>?>
The normalizer does not attempt to conjure any required attribute values nor text data.
<?derivative:start-anew <p>?>These must be present in the input.
</document>
The processing instructions act as placeholders for element start tags, letting the normalizer know
which element is supposed to enclose the following content. The exact location of the element's end tag will be
inferred by the normalizer. The details of the processing instructions and the inference algorithm are explained in
the next section. The normalizer's output, based on our last enriched input above, looks as follows:
<document>
<title>Normalizer test input</title>
<p>This is a short test input for the normalizer.
</p>
<section>
<title>Purpose</title>
<p>The purpose of this document is to demonstrate the capabilities of the normalizer.
</p>
<p>The normalizer is schema-driven,
and it tries to fit its input into the target schema.
</p>
</section>
<section>
<title>Constraints</title>
<p>The goal of the normalizer is to produce output that conforms to the schema,
while preserving complete input. It does not always succeed:
</p>
<ul>
<li>
<p>
A piece of input that cannot be fit into the schema will be dropped.
</p>
</li>
<li>
<p>
The normalizer does not attempt to conjure any required attribute values nor text data.
</p>
<p>These must be present in the input.
</p>
</li>
</ul>
</section>
</document>
Design and Implementation
The basic idea that drove the normalizer design came after implementing the validator automaton [c02] for RELAX NG [c01]. This automaton uses a flexible validation algorithm based on
Brzozowski derivatives [b64]. We shall now sketch the basic idea of this validation algorithm; a
more thorough treatment can be found in the literature [a72].
Informally, the derivative of a grammar G with respect to a sequence of tokens
S is the grammar G', such that a sequence S' is valid according to G' if and only if the concatenation
of S and S' is valid according to G.
We can think of the grammar G as the initial state of a validating automaton,
and the grammar derivative calculation as its transition function. A very simple and generic validator algorithm can
be built on top of this concept [s05], and one such validator is the standard RELAX NG validation
algorithm.
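To make the derivative concrete, here is a minimal sketch of derivative-based validation, written for regular expressions over characters rather than RELAX NG patterns over XML events; the algorithm in Appendix A generalizes exactly this scheme:
-- Regular expressions over characters.
data Regex = Fail | Eps | Sym Char
           | Alt Regex Regex | Seq Regex Regex | Star Regex

-- Does the language of the expression contain the empty string?
nullable :: Regex -> Bool
nullable Eps       = True
nullable (Sym _)   = False
nullable Fail      = False
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable (Star _)  = True

-- Brzozowski derivative: deriv c r matches exactly those strings s
-- for which r matches c followed by s.
deriv :: Char -> Regex -> Regex
deriv _ Fail      = Fail
deriv _ Eps       = Fail
deriv c (Sym c')  = if c == c' then Eps else Fail
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b) | nullable a = Alt (Seq (deriv c a) b) (deriv c b)
                  | otherwise  = Seq (deriv c a) b
deriv c (Star r)  = Seq (deriv c r) (Star r)

-- The validating automaton: the regex is the state, deriv is the
-- transition function, and nullability of the final state is acceptance.
matches :: Regex -> String -> Bool
matches r = nullable . foldl (flip deriv) r
For example, matches (Star (Alt (Sym 'a') (Sym 'b'))) "abba" evaluates to True, while any string containing another character is rejected.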
While working on the implementation of a RELAX NG validator library [s09b] for OmniMark
[s09a], it became clear that the automaton could be relatively easily extended in two orthogonal
ways: with additional pattern constructs, and with additional actions to perform. In particular, the validating
automaton could be modified to produce arbitrary output, in the same way a deterministic finite automaton can be
extended to a Moore machine [m56].
The original RELAX NG validation algorithm [c02] represents the grammar as a tree data
structure. Here is its definition in Haskell [h02]:
data Pattern = Empty
| NotAllowed
| Text
| Choice Pattern Pattern
| Interleave Pattern Pattern
| Group Pattern Pattern
| OneOrMore Pattern
| List Pattern
| Data Datatype ParamList
| DataExcept Datatype ParamList Pattern
| Value Datatype String Context
| Attribute NameClass Pattern
| Element NameClass Pattern
| After Pattern Pattern
The constructors of this data type, with the exception of the last one, correspond to the constructs of the
RELAX NG schema language. The last constructor, After, appears when a start tag of an element is
consumed, and matches the remainder of the XML input together with the corresponding end tag.
We need two extensions to this data structure in order to enable the automaton to perform normalization. First,
we need to be able to assume the presence of an omitted element tag. Secondly, we must be able to output the
normalized version of our input, together with the omitted and inferred tags. So we add two more data
constructors:
data Pattern = Empty
| ...
| Region Bool QName Pattern
| Result [ResultDelta] Pattern
data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta
The Result constructor keeps track of the output that should be emitted if the pattern it wraps
succeeds. Whenever the current pattern (i.e., the state of the automaton) has a
Result node at its root, the accumulated chunk of output is flushed and the root node is replaced by its
child pattern.
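As a minimal sketch, assuming the Pattern and ResultDelta types from Appendix A, the flushing step can be pictured as a single pattern match:
-- When the automaton state has a Result node at its root, emit the
-- accumulated deltas and continue with the child pattern; otherwise
-- there is nothing to flush yet.
flush :: Pattern -> ([ResultDelta], Pattern)
flush (Result r p) = (r, p)
flush p            = ([], p)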
The Region constructor denotes an open element region, which may have come from the input or may
have been inserted in order to normalize it. In the former role, it acts as a replacement for the After
node. To accommodate this change in the representation of open element regions, we have to make another adjustment to
the Interleave constructor. It needs an extra integer field to keep track of which of its patterns has
consumed unmatched start tags, and how many of them.
As long as the input conforms to the target schema, the normalizer acts the same way as the RELAX NG validator,
except it builds the Result nodes and flushes them at the output. When a particle that doesn't match the
current pattern is encountered, the normalizer attempts to recover by descending into the content models of elements
that are allowed at the current point. The derivative of the current pattern is then the sum of derivatives of all
content models that accept the particle, with each content derivative wrapped into a Region node and a
Result node; these keep track of the inferred element start tag.
This modification of the RELAX NG validation algorithm results in a normalizer whose complete source code is
listed in Appendix A. This simple normalizer is completely correct, but far from
perfect. It easily runs into a combinatorial explosion of the pattern tree, and it cannot be customized to resolve the
ambiguity in a desired way. While space does not permit describing all of the additional features and
optimizations, the role of the processing-instruction guides is too important to leave out.
Many interesting target schemas contain a large number of elements which have nearly identical content models
and which are allowed to appear in the same places. Such is the case, for example, with DocBook 5 elements
glossary, bibliography, index, toc, dedication,
acknowledgements, preface, chapter, appendix,
article, and colophon. This represents a problem for the normalizer because it must keep
track of all the alternative elements in case of recovery. If the user can specify which of the elements should be
used for recovery, this improves the output and the performance at the same time.
At the moment, the normalizer accepts the following processing instructions:
<?derivative:start-anew depth? start-tag?>
<?derivative:start-nested depth? start-tag?>
<?derivative:proceed-with depth? start-tag?>
<?derivative:ensure-inside element-name?>
<?derivative:ensure-outside element-name?>
where depth has the form region-identifier: integer.
Processing instructions are used instead of element tags because the input document has to be well-formed XML,
so it cannot contain unmatched element start tags. Using empty element tags in a special namespace could cause
difficulties because the target schema may or may not allow them in that position. Finally, since these instructions
target one specific tool, the normalizer, this is an appropriate use of processing instructions if there ever was
one.
The normalizer evaluates these processing instructions as follows:
start-anew: Any existing element region that was opened with the same region-identifier and a greater or equal integer depth will be closed. If depth is not specified, any existing element region with the same element name will be closed. A new element region is then started as if the specified start-tag appeared instead of the instruction.
start-nested: Any existing element region that was opened with the same region-identifier and a greater integer depth will be closed. A new element region is then started as if the specified start-tag appeared instead of the instruction.
proceed-with: Any existing element region that was opened with the same region-identifier and a greater integer depth will be closed. If there is an existing element region with the same region-identifier and an equal integer depth, that region is kept and nothing else happens. Otherwise a new element region is started as if the specified start-tag appeared instead of the instruction.
ensure-inside: The current pattern is trimmed to leave only those possibilities which assume the specified element-name is currently open.
ensure-outside: The schema of the remainder of the document is trimmed to leave only those possibilities which assume the specified element-name is not currently open.
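To illustrate the stack discipline these instructions imply, here is a sketch of start-anew with an explicit depth, modelling an open region as a bare identifier/depth pair rather than the real pattern structure:
type RegionId   = String
type OpenRegion = (RegionId, Int)   -- (region-identifier, integer depth)

-- Pop open regions, innermost first, until none with the same identifier
-- and a greater or equal depth remains; popping a region implicitly
-- closes everything nested inside it. Then open the new region.
startAnew :: RegionId -> Int -> [OpenRegion] -> [OpenRegion]
startAnew ident depth open
  | any matches open = startAnew ident depth (tail open)
  | otherwise        = (ident, depth) : open
  where matches (i, d) = i == ident && d >= depth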
Results
The schema-driven normalizer described above has been implemented in the OmniMark programming language, to serve
as a component in various document conversion solutions. We shall now briefly describe this environment in order to
illustrate the practical benefits of the approach.
The wider purpose is to convert documents from a legacy presentation-oriented format into a semantically richer
XML format. Examples of the source document formats are FrameMaker, HTML, InDesign, and QuarkXPress. The target
format of the conversion can be DITA, DocBook, or a user-specified XML format.
One obvious simplification when faced with converting M source formats to N target formats is to introduce an
intermediate format, and thus replace M*N direct converters by M input converters to the intermediate format
plus N output converters from the intermediate to the target format. Another benefit of having the intermediate
document is that it can be enriched with user annotations to guide the output converter.
Experience we gained in building specific output converters showed that hand-written implementations were
difficult to verify and maintain, so we needed a more robust and more flexible solution. Some customers required
outputs that conform to a specialization of DITA, others wanted DocBook with various extensions.
The core of the output converters is the topic of this paper: the schema-driven normalizer. Its draft
implementation, given in the appendix, was coded in Haskell and tested using the QuickCheck library [c10]. After this proof of concept, the production implementation was developed in OmniMark. Once the
normalizer was ready, the old hand-coded output converters for DITA and DocBook were replaced with new converters in a
matter of days.
Compared to the hand-written output converters they replaced, the converters based on the schema-driven
normalizer provide several benefits:
- Assuming the normalizer itself is correct, the output of the converter is guaranteed to conform to the
target schema. The correctness of hand-written converters' output is more difficult to verify.
- A new target format can be supported very quickly as long as we have a schema for it, whereas before it would
require rewriting the entire converter. If the new target language is a strict subset of an existing one, no coding is
required whatsoever. For a small extension of an already supported schema, only a few lines of code must be
added.
- Many users have various stylistic requirements that can be supported simply by restricting the target
schema, with no code modifications required.
- The new converters are capable of automatically reordering parts of the input to make it conform to the
target schema, with no user intervention required beyond the usual annotations. One example that we have often
encountered in practice is that of DITA task topics: the DITA 1.1 schema requires the prereq,
context, steps, result, example, and postreq elements
to be listed in this specific order. Most existing technical documentation is not written this way.
One downside of the new converters is their performance, especially their worst-case memory consumption. The
worst case in question is generally an input document which is annotated in a way that makes it non-conformant with
the target schema. The hand-coded converters would in this situation either report errors (which is the desired
behaviour) or generate invalid output (which is not). The schema-driven converters try to recover by inserting the
missing element tags and by rearranging the input. The recovery eventually fails and the normalizer produces an error
report, but its attempt costs a lot of time and memory. We are still looking for a way to fix this issue.
Related Work
Automatic grammar-driven sentence normalization has been done in linguistics, for the automatic correction of
grammatically incorrect English sentences [v88].
For markup languages, particularly XML, there has been some related work in the area of automatic markup [f93] [k97] [t98]. The main difference from our work is that the input of an automatic markup system is typically plain
text, usually produced by OCR. As a consequence, even those automatic markup systems whose output schema is not fixed
require additional configuration in the form of pattern-matching rules. The normalizer approach presented in this paper
can be coupled with a pattern-matching engine as well, but it does not depend on one.
The automatic markup system that seems nearest to ours is the Vasari project
[a03], which is also schema-driven to some extent. The target schema must be enriched using a
special pattern-matching language which determines where the required element tags can be inserted. It is not clear if
the location of every tag required by the target schema must be specified this way.
There is a large body of work on the automatic translation between structured documents that conform to similar
schemas [a72] [m96] [t03]. The approaches based on the
concept of syntax-directed translation schema [l68] and its
extensions, such as extended syntax-directed translation schema [k95] and tree transformation grammars [l97], seem to
be most related to our work. The syntax-directed translation is simpler and more direct than our approach to
translation, but it requires both the source and target format to be well structured. Our problem, quite common in
practice, is that the source format is only weakly structured. The transformation process presented in this paper
proceeds in two stages: the weakly structured source document is first translated into another weakly structured
document which uses the tags allowed by the target schema. This intermediate document is then normalized to a valid
schema instance.
Appendix A. Sample implementation
This appendix contains a sample normalizer implementation in Haskell. The implementation depends on the sample
code for the RELAX NG validator [c02], which it expects to find in the module file
Validator.hs.
module Normalizer where
import Control.Monad (liftM2)
import Control.Exception (assert)
import Data.List (minimumBy)
-- The RELAX NG validator automaton code from
-- http://www.thaiopensource.com/relaxng/derivative.html
import Validator (contains, datatypeAllows, datatypeEqual, strip, whitespace,
Uri, LocalName, ParamList, Prefix, Context, Datatype,
NameClass(..), QName(..), ChildNode(..), AttributeNode(..))
data Pattern = Empty
| NotAllowed
| Text
| Choice Pattern Pattern
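-- the Int tracks which of the two patterns has consumed unmatched start tags, and how many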
| Interleave Int Pattern Pattern
| Group Pattern Pattern
| OneOrMore Pattern
| List Pattern
| Data Datatype ParamList
| DataExcept Datatype ParamList Pattern
| Value Datatype String Context
| Attribute NameClass Pattern
| Element NameClass Pattern
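-- an open element region; the Bool is True when its start tag was inferred rather than read from the input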
| Region Bool QName Pattern
| Result [ResultDelta] Pattern
deriving (Eq, Show)
data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta
deriving (Eq, Show)
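-- Does the pattern match the empty sequence of events? A region whose
-- start tag came from the input can never be closed implicitly, while an
-- inferred region can be.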
nullable:: Pattern -> Bool
nullable (Group p1 p2) = nullable p1 && nullable p2
nullable (Interleave balance p1 p2) = balance == 0 &&
nullable p1 && nullable p2
nullable (Choice p1 p2) = nullable p1 || nullable p2
nullable (OneOrMore p) = nullable p
nullable (Element _ _) = False
nullable (Attribute _ _) = False
nullable (List _) = False
nullable (Value _ _ _) = False
nullable (Data _ _) = False
nullable (DataExcept _ _ _) = False
nullable NotAllowed = False
nullable Empty = True
nullable Text = True
nullable (Region False _ _) = False
nullable (Region True _ p) = nullable p
nullable (Result _ p) = nullable p
inferred :: ResultDelta -> Bool
inferred (Inferred _) = True
inferred _ = False
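-- All alternative outputs the pattern could emit if it were completed
-- with no further input; each inferred open region contributes an
-- inferred end tag to the completion.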
nullableResults:: Pattern -> [[ResultDelta]]
nullableResults (Group p1 p2) = case nullableResults p2 of [] -> []
l -> assert (all null l) (nullableResults p1)
nullableResults (Interleave 0 p1 p2) = liftM2 (++)
(nullableResults p1) (nullableResults p2)
nullableResults Interleave{} = []
nullableResults (Choice p1 p2) = nullableResults p1 ++ nullableResults p2
nullableResults (OneOrMore p) = nullableResults p
nullableResults (Element _ _) = []
nullableResults (Attribute _ _) = []
nullableResults (List _) = []
nullableResults (Value _ _ _) = []
nullableResults (Data _ _) = []
nullableResults (DataExcept _ _ _) = []
nullableResults NotAllowed = []
nullableResults (Region False _ _) = []
nullableResults (Region True q p) = map
(++ [Inferred EndTag]) (nullableResults p)
nullableResults Empty = [[]]
nullableResults Text = [[]]
nullableResults (Result r p) = map (r ++) (nullableResults p)
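-- Of all the possible completions, pick the one with the fewest result
-- deltas, i.e., the least amount of inferred markup.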
bestNullableResult :: Pattern -> [ResultDelta]
bestNullableResult p = if nullableResults p == []
then error ("bestNullableResult: " ++ show p)
else minimumBy innermostLength (nullableResults p)
where innermostLength r1 r2 = compare (length r1) (length r2)
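-- Derivative with respect to a single child node; an element node is
-- consumed as its start tag, attributes, content, and end tag in turn.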
childDeriv :: Context -> Pattern -> ChildNode -> Pattern
childDeriv cx p (TextNode s) = textDeriv cx False p s
childDeriv _ p (ElementNode qn cx atts children) =
let p1 = startTagOpenDeriv p (ElementNode qn cx atts [])
p2 = attsDeriv cx p1 atts
p3 = startTagCloseDeriv p2
p4 = childrenDeriv cx p3 children
in endTagDeriv p4
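-- Derivative with respect to text; the Bool flag is True when the text
-- is inferred rather than taken from the input.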
textDeriv :: Context -> Bool -> Pattern -> String -> Pattern
textDeriv cx inf (Choice p1 p2) s = choice (textDeriv cx inf p1 s)
(textDeriv cx inf p2 s)
textDeriv cx False (Interleave 0 p1 p2) s =
choice (interleave 0 (textDeriv cx False p1 s) p2)
(interleave 0 p1 (textDeriv cx False p2 s))
textDeriv cx False (Interleave balance p1 p2) s = if balance < 0
then interleave balance (textDeriv cx False p1 s) p2
else interleave balance p1 (textDeriv cx False p2 s)
textDeriv cx True (Interleave 0 p1 p2) s = choice
(interleave 0 (textDeriv cx True p1 s) p2)
(interleave 0 p1 (textDeriv cx True p2 s))
textDeriv cx True (Interleave balance p1 p2) s = if balance < 0
then interleave balance (textDeriv cx True p1 s) p2
else interleave balance p1 (textDeriv cx True p2 s)
textDeriv cx inf (Group p1 p2) s = let p = group (textDeriv cx inf p1 s) p2
in if nullable p1
then choice p (result (bestNullableResult p1) (textDeriv cx inf p2 s))
else p
textDeriv cx inf (OneOrMore p) s = group (textDeriv cx inf p s)
(choice (OneOrMore p) Empty)
textDeriv cx inf Text s = result [(if inf then Inferred else id)
(Add (TextNode s))] Text
textDeriv cx1 inf (Value dt value cx2) s = if datatypeEqual dt value cx2 s cx1
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (Data dt params) s = if datatypeAllows dt params s cx
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (DataExcept dt params p) s =
if datatypeAllows dt params s cx && not (nullable (textDeriv cx inf p s))
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (List p) s = if nullable (listDeriv cx p (words s))
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (Element (Name uri local) p) s =
let qn = QName uri local
in region True qn (textDeriv cx inf (startRecovery qn cx p) s)
textDeriv cx inf (Element (NameClassChoice n1 n2) p) s =
choice (textDeriv cx inf (Element n1 p) s) (textDeriv cx inf (Element n2 p) s)
textDeriv cx inf (Element _ _) s = NotAllowed
textDeriv cx inf (Region inferred qn p) s = region inferred qn (textDeriv cx inf p s)
textDeriv cx inf (Result r p) s = result r (textDeriv cx inf p s)
textDeriv _ _ _ _ = NotAllowed
listDeriv :: Context -> Pattern -> [String] -> Pattern
listDeriv _ p [] = p
listDeriv cx p (h:t) = listDeriv cx (textDeriv cx False p h) t
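-- Begin recovery: emit an inferred start tag for the named element and
-- descend into its content model.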
startRecovery :: QName -> Context -> Pattern -> Pattern
startRecovery qn cx p = result
[Inferred (StartTag (ElementNode qn cx [] []))] (startTagCloseDeriv p)
region :: Bool -> QName -> Pattern -> Pattern
region inferred qname NotAllowed = NotAllowed
region inferred qname p = Region inferred qname p
result :: [ResultDelta] -> Pattern -> Pattern
result _ NotAllowed = NotAllowed
result r (Choice p1 p2) = Choice (result r p1) (result r p2)
result r1 (Result r2 p) = Result (r1 ++ r2) p
result [] p = p
result r p = Result r p
addResult :: Pattern -> [ResultDelta] -> Pattern
addResult NotAllowed _ = NotAllowed
addResult (Choice p1 p2) r = Choice (addResult p1 r) (addResult p2 r)
addResult (Group p1 p2) r = group (addResult p1 r) p2
addResult p@(Interleave balance p1 p2) r = interleave balance (addResult p1 r) p2
addResult (Result r1 p) r2 = result r1 (addResult p r2)
addResult (Region inferred qn p) r = Region inferred qn (addResult p r)
addResult p r = Result r p
choice :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice Empty Text = Text
choice Text Empty = Text
choice p1 p2 | p1 == p2 = p1
choice p1 p2 = Choice p1 p2
group :: Pattern -> Pattern -> Pattern
group p NotAllowed = NotAllowed
group NotAllowed p = NotAllowed
group p Empty = p
group Empty p = p
group Text Text = Text
group _ (Result _ _) = error "Can't have Result on the RHS of Group."
group (Result r p1) p2 = result r (group p1 p2)
group p1 p2 = Group p1 p2
interleave :: Int -> Pattern -> Pattern -> Pattern
interleave _ p NotAllowed = NotAllowed
interleave _ NotAllowed p = NotAllowed
interleave _ p Empty = p
interleave _ Empty p = p
interleave _ Text Text = Text
interleave balance p1 p2 = Interleave balance p1 p2
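-- Derivative with respect to an element start tag; when the tag is not
-- allowed directly, recovery descends into the content models of allowed
-- elements, wrapping each candidate in an inferred Region.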
startTagOpenDeriv :: Pattern -> ChildNode -> Pattern
startTagOpenDeriv (Choice p1 p2) node = choice (startTagOpenDeriv p1 node)
(startTagOpenDeriv p2 node)
startTagOpenDeriv (Element (NameClassChoice n1 n2) p) node =
choice (startTagOpenDeriv (Element n1 p) node) (startTagOpenDeriv (Element n2 p) node)
startTagOpenDeriv (Element nc@(Name uri local) p) node@(ElementNode qn cx _ _) =
if contains nc qn
then Region False qn (result [StartTag node] p)
else let qn = QName uri local
in region True qn (startTagOpenDeriv (startRecovery qn cx p) node)
startTagOpenDeriv (Element nc p) node@(ElementNode qn cx _ _) =
if contains nc qn then Region False qn (result [StartTag node] p) else NotAllowed
startTagOpenDeriv (Interleave 0 p1 p2) node =
choice
(interleave (-1) (startTagOpenDeriv p1 node) p2)
(interleave 1 p1 (startTagOpenDeriv p2 node))
startTagOpenDeriv (Interleave balance p1 p2) node =
if balance < 0
then interleave (balance-1) (startTagOpenDeriv p1 node) p2
else interleave (balance+1) p1 (startTagOpenDeriv p2 node)
startTagOpenDeriv (OneOrMore p) node = group (startTagOpenDeriv p node)
(choice (OneOrMore p) Empty)
startTagOpenDeriv (Group p1 p2) node =
let x = group (startTagOpenDeriv p1 node) p2
in if nullable p1
then choice x (result (bestNullableResult p1) (startTagOpenDeriv p2 node))
else x
startTagOpenDeriv (Region inferred qn p) node = region inferred qn
(startTagOpenDeriv p node)
startTagOpenDeriv (Result r p) node = result r (startTagOpenDeriv p node)
startTagOpenDeriv _ _ = NotAllowed
attsDeriv :: Context -> Pattern -> [AttributeNode] -> Pattern
attsDeriv cx p [] = p
attsDeriv cx p ((AttributeNode qn s):t) = attsDeriv cx
(attDeriv cx p (AttributeNode qn s)) t
attDeriv :: Context -> Pattern -> AttributeNode -> Pattern
attDeriv cx (Choice p1 p2) att = choice (attDeriv cx p1 att) (attDeriv cx p2 att)
attDeriv cx (Group p1 p2) att = choice (group (attDeriv cx p1 att) p2)
(group p1 (attDeriv cx p2 att))
attDeriv cx (Interleave balance p1 p2) att =
choice (interleave balance (attDeriv cx p1 att) p2) (interleave balance p1
(attDeriv cx p2 att))
attDeriv cx (OneOrMore p) att = group (attDeriv cx p att)
(choice (OneOrMore p) Empty)
attDeriv cx (Attribute nc p) (AttributeNode qn s) = if
contains nc qn && valueMatch cx p s then Empty else NotAllowed
attDeriv cx (Region inferred qn p) att = Region inferred qn (attDeriv cx p att)
attDeriv cx (Result r p) att = result r (attDeriv cx p att)
attDeriv _ _ _ = NotAllowed
valueMatch :: Context -> Pattern -> String -> Bool
valueMatch cx p s = (nullable p && whitespace s) ||
nullable (textDeriv cx False p s)
startTagCloseDeriv :: Pattern -> Pattern
startTagCloseDeriv (Choice p1 p2) = choice (startTagCloseDeriv p1)
(startTagCloseDeriv p2)
startTagCloseDeriv (Group p1 p2) = group (startTagCloseDeriv p1)
(startTagCloseDeriv p2)
startTagCloseDeriv (Interleave balance p1 p2) = interleave balance
(startTagCloseDeriv p1)
(startTagCloseDeriv p2)
startTagCloseDeriv (OneOrMore p) = oneOrMore (startTagCloseDeriv p)
startTagCloseDeriv (Attribute _ _) = NotAllowed
startTagCloseDeriv (Region inferred qn p) = Region inferred qn (
startTagCloseDeriv p)
startTagCloseDeriv (Result r p) = result r (startTagCloseDeriv p)
startTagCloseDeriv p = p
oneOrMore :: Pattern -> Pattern
oneOrMore NotAllowed = NotAllowed
oneOrMore Empty = Empty
oneOrMore Text = Text
oneOrMore p@OneOrMore{} = p
oneOrMore (Choice p Empty) = Choice (oneOrMore p) Empty
oneOrMore (Choice Empty p) = Choice Empty (oneOrMore p)
oneOrMore (Result r p) = result r (oneOrMore p)
oneOrMore p = OneOrMore p
childrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
childrenDeriv _ NotAllowed _ = NotAllowed
childrenDeriv cx p [] = let p1 = textDeriv cx True p ""
in choice (addResult p [Inferred (Add (TextNode ""))]) p1
childrenDeriv cx p [(TextNode s)] = let p1 = childDeriv cx p (TextNode s)
in if whitespace s then choice (addResult p [Add (TextNode s)]) p1 else p1
childrenDeriv cx p children = stripChildrenDeriv cx p children
stripChildrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
stripChildrenDeriv _ p [] = p
stripChildrenDeriv cx p (h:t) = stripChildrenDeriv cx
(if strip h then (addResult p [Add h]) else childDeriv cx p h) t
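-- Derivative with respect to an end tag: a region whose start tag came
-- from the input is closed once its content is nullable, emitting its
-- best completion; inferred regions are closed implicitly elsewhere.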
endTagDeriv :: Pattern -> Pattern
endTagDeriv (Choice p1 p2) = choice (endTagDeriv p1) (endTagDeriv p2)
endTagDeriv (Region False qn p) = if nullable p
then result (bestNullableResult p ++ [EndTag]) Empty
else region False qn (endTagDeriv p)
endTagDeriv (Region True qn p) = region True qn (endTagDeriv p)
endTagDeriv (Result r p) = result r (endTagDeriv p)
endTagDeriv (Group p1 p2) = group (endTagDeriv p1) p2
endTagDeriv (Interleave balance p1 p2)
| balance < 0 = interleave (balance+1) (endTagDeriv p1) p2
| balance == 0 = NotAllowed
| balance > 0 = interleave (balance-1) p1 (endTagDeriv p2)
endTagDeriv _ = NotAllowed
Bibliography
[a72]
AV Aho, JD Ullman, 1972.
The Theory of Parsing, Translation, and Compiling.
Prentice Hall
[a03]
Mohammad Abolhassani, Norbert Fuhr and Norbert Gövert,
Information extraction and automatic markup for XML documents,
In Blanken et al, 2003, 159--174,
Springer
[b59]
Backus, J.W.,
The Syntax and Semantics of the Proposed International Algebraic Language of the Zürich ACM-GAMM Conference,
Proceedings of the International Conference on Information Processing, UNESCO,
1959, pp.125-132.
[b64]
Brzozowski, J. A. 1964. Derivatives of Regular Expressions. J. ACM 11,
4 (Oct. 1964), 481-494.
DOI= http://doi.acm.org/10.1145/321239.321249
[c56]
Chomsky, Noam (1956). "Three models for the description of language".
IRE Transactions on Information Theory 2: 113-124.
doi:10.1109/TIT.1956.1056813
[c01]
James Clark and Makoto Murata. RELAX NG Specification.
http://relaxng.org/spec-20011203.html, 2001. ISO/IEC 19757-2:2003.
[c02]
James Clark. An algorithm for RELAX NG validation
http://www.thaiopensource.com/relaxng/derivative.html
[c10]
QuickCheck: Automatic testing of Haskell programs
http://hackage.haskell.org/package/QuickCheck-2.1.0.3
[f93]
Peter Fankhauser and Yi Xu,
MarkItUp! - An incremental approach to document structure recognition,
Electronic Publishing, 1993, pages 447-456
[k95]
Eila Kuikka and Martti Penttonen,
Transformation of Structured Documents,
Electronic Publishing Origination, Dissemination and Design, 8(4), 1995.
[k97]
Bertin Klein and Peter Fankhauser,
Error tolerant Document Structure Analysis,
International Journal on Digital Libraries,
1997, volume 1, pages 344-357
[l68]
Lewis, P. M. and Stearns, R. E. 1968. Syntax-Directed Transduction.
J. ACM 15, 3 (Jul. 1968), 465-488.
DOI= http://doi.acm.org/10.1145/321466.321477
[l97]
Greger Lindén,
Structured Document Transformations,
1997
[m56]
Moore, E. F., [1956]. Gedanken experiments on sequential machines,
Automata Studies, Princeton Univ. Press,
Princeton, New Jersey, pp. 129-153.
[m96]
Makoto Murata,
Transformation of Documents and Schemas by Patterns and Contextual Conditions,
Proceedings of the Third International Workshop on Principles of Document Processing (PODP 96),
1997, pages 153-169, Springer-Verlag
[s05]
Sperberg-McQueen, C. M. Applications of Brzozowski derivatives to XML schema processing.
In Extreme Markup Languages 2005, page 26, Internet, 2005. IDEAlliance.
[t98]
Kazem Taghva, Allen Condit, and Julie Borsack,
Autotag: A tool for creating structured document collections from printed materials,
Electronic Publishing, Artistic Imaging, and Digital Typography, Proc. of the EP '98 and RIDT
'98 Conferences, 1998, pages 420-431,
Springer-Verlag
[t03]
Tang, X. 2003 A High-Level Specification Language for Structured Document Transformation. Doctoral Thesis.
UMI Order Number: AAINQ84932., University of Waterloo.
[v88]
Ikuo Kudo, Hideya Koshino, Moonkyung Chung, Tsuyosi Morimoto,
Schema method: a framework for correcting grammatically ill-formed input,
Proceedings of the 12th Conference on Computational Linguistics - Volume 1,
Computer and Automation Institute, Hungarian Academy of Sciences,
pages 341-347,
Association for Computational Linguistics, Morristown, NJ, USA, 1988.
ISBN: 9638431563. doi:10.3115/991635.991705
[h02]
Haskell 98 Language and Libraries, the Revised Report. December 2002.
http://haskell.org/onlinereport/
[s86]
Standard Generalized Markup Language (SGML)
International Organization for Standardization ISO 8879:1986
[s09a]
OmniMark language documentation
http://developers.stilo.com/docs/html/index.html
[s09b]
OmniMark RELAX NG (OMRELAXNG) library documentation
https://developers.omnimark.com/docs/html/library/125.html
[w04]
XML Schema Part 1: Structures Second Edition, Analysis of the Unique Particle Attribution Constraint
W3C Recommendation 28 October 2004
http://www.w3.org/TR/xmlschema-1/#non-ambig