Building markup pipelines with OmniMark 9
By Roy Amodeo
1. Introduction
One of the primary motivations in the development of OmniMark 9 was to make it possible to create reusable
markup processing components.
Since OmniMark was first released twenty years ago, typical applications have tended to become more
sophisticated, and thus modularization has become more important. You can create code that is less complex, easier
to maintain, and more suitable for reuse.
OmniMark is a streaming language. Modular OmniMark programs tend to be organized differently than programs in
other programming languages. A traditional, non-streaming program is usually organized into a hierarchy:
lower-level modules export functions used by higher-level modules, and so on to the main module sitting on top.
When a streaming program is modularized, the dominant architecture is often not so much a hierarchy as a pipeline
whose components pass data to each other.
In OmniMark 8, you could build component pipelines, but those components were connected together with string
sources or string sinks. Each XML-to-XML processing component had to parse the input, transform the document
structure with element rules, and then re-encode the output as XML so it could be passed as input to the next
component. A modular pipeline in OmniMark 8 therefore alternated between parsing and re-encoding at every stage.
Repeated parsing and encoding adds significant overhead. OmniMark 8 tempts you to combine several processing
components into one, improving efficiency at the cost of modularity.
The main feature of OmniMark 9 is markup event streaming. Components can now be connected directly with markup
sources or markup sinks.
There is no more need for encoding and re-parsing between pipeline components. Pipelines are more efficient and
components become simpler and easier to develop.
2. Prerequisites for streaming markup processing
2.1. Gluing the components: Markup sources and sinks
OmniMark 9 introduces two new stream types, markup source and markup sink.
Markup sources are like string sources, except that they can also contain markup events. String source is a
subtype of markup source: any function that reads a markup source can also read a string source. With source
functions, pipelines read right to left, as in the following example:
define markup source function parse from value string source s elsewhere
define markup source function transform1 from value markup source m elsewhere
define markup source function transform2 from value markup source m elsewhere
define string source function encode from value markup source m elsewhere
process
   output encode from transform2 from transform1 from parse from #main-input
Markup sinks are like string sinks except that you can
also write markup events to them. Markup sink is a
subtype of string sink: any function that writes to a string
sink can also write to a markup sink. With sink functions, pipelines
read left to right:
define string sink function parse into value markup sink m elsewhere
define markup sink function transform1 into value markup sink m elsewhere
define markup sink function transform2 into value markup sink m elsewhere
define markup sink function encode into value string sink m elsewhere
process
   using output as (parse into transform1 into transform2 into encode into #main-output)
      output #main-input
Sources can be joined, with input taken from each source sequentially.
output encode from
   transform2 from
   transform1 from
   parse from
   (file "my.dtd" || #main-input)
Sinks can be forked, and output written to each sink in parallel.
using output as parse into
   transform1 into
   transform2 into
   encode into
   (#main-output & relaxng.validator against my-schema)
2.2. Encoding XML: the OMXMLWRITE library
The OMXMLWRITE library provides two encoders that convert a well-formed markup event stream into XML. These
encoders are typically placed at the end of the pipeline.
written is a string source function that creates XML out of events read from a markup source:
export string source function
   written from value markup source m
elsewhere
writer is a markup sink function that writes the XML to a string sink:
export markup sink function
   writer into value string sink destination
elsewhere
2.3. #content — creating a markup source
Before you can begin to process a markup stream, it has to be created at the beginning of the pipeline. The
easiest way to create a markup stream is to parse a markup document. This simple example function shows how:
define markup source function
   parse from value string source s
as
   do xml-parse scan s
      output #content
   done
The built-in #content variable is a markup source. More specifically, it is the source of text and markup
events resulting from do xml-parse. You can use #content in any rule in which "%c" can be used, though
unlike "%c", #content does not fire element rules.
2.4. Processing markup events: do markup-parse
Processing of markup streams is best done using markup rules, just as in previous versions of OmniMark. To
subject a markup source to rules, use do markup-parse:
define markup source function
   transform1 from value markup source ms
as
   do markup-parse ms
      output "%c"
   done
The construct do markup-parse is like do xml-parse, except that it works on a markup source instead of a
string source. No scan keyword is required, as the input has already been scanned into events. While "%c"
invokes markup rules on the event stream, #content does not.
3. Putting the pipeline together
The minimal complete pipeline is the identity pipeline: it simply parses the input, re-encodes it, and writes
it out again.
import "omxmlwrite.xmd" prefixed by xml.
define markup source function
   parse from value string source s
as
   do xml-parse scan s
      output #content
   done
process
   output xml.written from parse from #main-input
Or more simply:
import "omxmlwrite.xmd" prefixed by xml.
process
   do xml-parse scan #main-input
      output xml.written from #content
   done
To have our pipeline do any useful work, we must replace #content with "%c", which will then invoke our
element rules. These are not your grandfather's element rules, though. Our goal is to make a component that
can be plunked into the middle of any markup-processing pipeline, and that means we must emit a markup source.
3.1. Example: removing paragraphs from lists
In a recent project, the input format had list items that contained paragraph tags. The output format did not
allow that, so the paragraph tags inside the list items had to be removed.
<doc>
   <paragraph>Here's a list:</paragraph>
   <list>
      <item>
         <paragraph>First item</paragraph>
      </item>
      <item>
         <paragraph>Second item</paragraph>
      </item>
      <item>
         <paragraph>Third item</paragraph>
         <paragraph>has <b>two</b> paragraphs</paragraph>
      </item>
   </list>
</doc>

Here's how we removed the element tags from paragraph elements in list items:
element "paragraph" when parent is "item"
output #content
We copied all other elements as they were found.
element #implied
   signal throw #markup-start #current-markup-event
   output "%c"
   signal throw #markup-end #current-markup-event
An element is a markup-region-event. You have to signal both the start and the end of the event. The action
signal throw sends an event to the #current-output. In this case, the #current-output feeds into the
function xml.written, which will convert the signal into XML. This rule is the equivalent of the ubiquitous
pre-OmniMark 9 rule:
element #implied
   output "<%q>%c</%q>"
Note: if #current-output is not a markup sink, the markup event will be thrown. This is also what
happens if you use #content as a string source when there are markup events. (We will use this property
later.)
Other events can be copied in a similar fashion:
markup-comment
   signal throw #markup-start #current-markup-event
   output #content
   signal throw #markup-end #current-markup-event

processing-instruction any*
   signal throw #markup-point #current-markup-event
Comments are region events, like elements. Processing instructions are point events: they consist of
one signal with no content.
So this is what the whole pipeline component looks like:
define markup source function
   strip-paragraph-tags from value markup source ms
as
   do markup-parse ms
      output "%c"
   done

element "paragraph" when parent is "item"
   output #content

element #implied
   signal throw #markup-start #current-markup-event
   output "%c"
   signal throw #markup-end #current-markup-event

markup-comment
   signal throw #markup-start #current-markup-event
   output #content
   signal throw #markup-end #current-markup-event

processing-instruction any*
   signal throw #markup-point #current-markup-event
Finally, we can insert this transformation component into the identity pipeline:
process
   output xml.written from
      strip-paragraph-tags from
      parse from #main-input
4. Creating new markup events
There are two types of markup events, as we saw above:
markup-region-event
   Elements
   Comments
   Marked sections
markup-point-event
   Processing instructions
   Markup errors
There are also three built-in catch targets that can be used to signal and catch markup events:
catch #markup-start value markup-region-event e
catch #markup-end value markup-region-event e
catch #markup-point value markup-point-event e
Trying to use #content as a string source causes a throw if a markup event is encountered, as markup events
are not allowed in string sources. We can rely on this behavior to create a new element event.
define markup-region-event function
   create-element-event (value string element-name)
as
   do xml-parse scan "<" || element-name || "/>"
      output #content drop any*
   done
catch #markup-start e
   return e
The drop operator treats #content as a string source. It can't handle the #markup-start event from the
element tag. Therefore, the event is thrown, caught by the catch clause (terminating the parse), and returned.
Here are a few more simple but useful helper functions for generating or copying markup events.
make-element allows you to create a new element region.
define markup source function
   make-element (value string element-name,
                 value markup source ms)
as
   output make-markup-region (create-element-event (element-name), ms)
make-markup-region allows you to copy existing events to the next component in the chain. You would use this to
copy an element event, a comment event, or any other kind of region event.
define markup source function
   make-markup-region (value markup-region-event event,
                       value markup source ms)
as
   signal throw #markup-start event
   output ms
   signal throw #markup-end event
You can use make-markup-point to pass a point event to the next component in the chain.
define markup source function
   make-markup-point (value markup-point-event event)
as
   signal throw #markup-point event
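These helper functions compose naturally. As a purely illustrative sketch (the wrap-in-note function and the
"note" element name are invented here, not part of any library), a component that wraps its entire input in one
new element is a one-liner:

define markup source function
   wrap-in-note from value markup source ms
as
   output make-element ("note", ms)

Dropped into a pipeline between two other components, it surrounds everything that flows through it with note
start and end tags once the stream is re-encoded.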
4.1. Example: renaming elements
For this example, we will use the output of the previous component as our input. We have removed the paragraph
tags from inside the list items, but left them elsewhere. Now, we want to convert the element names to HTML
style. The example input for our new pipeline component looks like this:
<doc>
   <paragraph>Here's a list:</paragraph>
   <list>
      <item>
         First item
      </item>
      <item>
         Second item
      </item>
      <item>
         Third item
         has <b>two</b> paragraphs
      </item>
   </list>
</doc>

First, we define a function to create a markup source, using do markup-parse and "%c" so that element
rules will be invoked. Next, we create element rules to substitute the new element names. When we encounter the
doc element, we will output the html element, and then the body element. "%c" lets us invoke element
rules on the content of the doc element. Similarly, we output ul when we encounter list, and use "%c" to
invoke element rules on the content of the list. For the item and paragraph rules, we use #content, as
there is no need, in our example, for invoking element rules within them. Last, we have the functions to pass
anything else through to the next component. For elements, we use "%c", but for comments, which will not
contain any other markup events, we can use #content.
define markup source function
   rename-elements from value markup source ms
as
   do markup-parse ms
      output "%c"
   done

element "doc"
   output make-element ("html", make-element ("body", "%c"))

element "list"
   output make-element ("ul", "%c")

element "item"
   output make-element ("li", #content)

element "paragraph"
   output make-element ("p", #content)

element #implied
   output make-markup-region (#current-markup-event, "%c")

markup-comment
   output make-markup-region (#current-markup-event, #content)

processing-instruction any*
   output make-markup-point (#current-markup-event)
Our pipeline that performs both paragraph stripping and element renaming now looks like this:
process
   output xml.written from
      rename-elements from
      strip-paragraph-tags from
      parse from #main-input
Note that the four pipeline components we have put together do not depend on each other. We can, for example,
remove strip-paragraph-tags from the pipeline; the pipeline output will still be valid HTML, only with more
p tags. Furthermore, each component can be easily reused in other similar pipelines.
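To make that concrete, here is the same pipeline with strip-paragraph-tags removed; only the glue code changes,
and the remaining components (all defined above) are untouched:

process
   output xml.written from
      rename-elements from
      parse from #main-input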
5. New syntax summary
This is the syntax you need to create new transformation components:
Markup streams
   Types: markup source, markup sink
   Built-in source: #content
   Actions: do markup-parse
   Libraries: OMXMLWRITE, OMSGMLWRITE (beta)
Markup events
   Types: markup-point-event, markup-region-event
   Built-in event: #current-markup-event
   Actions: signal throw
   Catch targets: #markup-point, #markup-start, #markup-end
But wait, there's more!
6. OMMARKUPUTILITIES library
The OMMARKUPUTILITIES library introduces some very useful but non-trivial functionality.
Markup buffers store markup events and string content. We need markup buffers because referents cannot be
written to a parser. Referents are not resolved until the scope (or the program) ends. The parser would not be
able to resume until all referents were resolved.
Moreover, referents can only contain text (and other referents). Strings (and buffers) can only store string
content, not markup events. So efficiency would be lost by having to re-encode markup if we relied on buffers.
Markup buffers are easy to use:
To create an empty markup-buffer, declare a variable or push a new item on a markup-buffer shelf.
import "ommarkuputilities.xmd" prefixed by markuputilities.
global markuputilities.markup-buffer titles variable
Write to a markup-buffer by using it as a markup sink:
element "title"
   using output as new titles
      output make-markup-region (#current-markup-event,
                                 #content)
Read from a markup-buffer as a markup source:
process
   …
   repeat over titles as ms
      output xml.written from ms
   again
All the usual operators work on markup buffers: & (fork), || (join), =, and !=.
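For example, joining works on markup buffers just as it does on other markup sources. As a sketch (assuming
the titles shelf populated above holds at least two items), the first two buffered titles could be re-encoded
as a single XML stream:

output xml.written from (titles[1] || titles[2])

Because the buffers store markup events rather than encoded text, no re-parsing is needed when the joined
stream is written out.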
6.1. Table normalization example
In this example, we have a table with a different number of cells in each row. The
output requires each row to have the same number of cells.
The solution can be achieved in two passes.
domain-bound global integer max-cells initial {0}

element "table"
   local markuputilities.markup-buffer table-content
   save max-cells
   using output as table-content
      output #content
   count-cells from table-content
   output make-markup-region (#current-markup-event,
                              add-cells from table-content)
Note that max-cells is declared domain-bound. In this simple example that is not necessary, but it is
generally good practice to declare all globals modified from a coroutine as domain-bound, and to save them
at the top level of the coroutine. This way each coroutine instance has its own copy of the variable, so they
won't step on each other's toes.
In the first pass, function count-cells runs through all table rows, keeping track of the maximum cell count
in the max-cells global variable.
define function
   count-cells from value markup source ms
as
   using group "count cells"
      do markup-parse ms
         suppress
      done

group "count cells"
element "row"
   put #suppress #content
   do when children > max-cells
      set max-cells to children
   done
Note the use of suppress to fire element rules in do markup-parse, and of put #suppress #content to
avoid firing the element rules for table cells in element "row".
In the second pass, function add-cells uses the calculated max-cells to add the missing cells.
define markup source function
   add-cells from value markup source ms
as
   using group "add cells"
      do markup-parse ms
         output "%c"
      done
group "add cells"
element "row"
signal throw #markup-start #current-markup-event
output #content
repeat to max-cells - children
output make-element ("cell", "") || "%n"
again
signal throw #markup-end #current-markup-event
The glue code for the table normalization pipeline would be:
process
   output xml.written from
      normalize-tables from
      parse from #main-input
6.2. Split-external-text-entities
Remember put #suppress #content in count-cells? Markup rules do not fire when #content is sent to the
#suppress stream. Therefore external text entity references will not be resolved.
Unfortunately, in SGML, text entities may contain unbalanced start and end tags. The only way to know the
context after the external text entity reference is to resolve the reference and parse the content of the
entity. The parser cannot proceed without resolving the entity references. In this pipeline, an external text
entity reference in a table cell will throw a #program-error.
There is a ready solution to this problem: import the ommarkuputilities library and use its function
split-external-text-entities . This function lets you separate the external text entity events and send them to
a destination that knows how to handle them. All other content can be forwarded to the normal processing chain.
First create a markup sink to resolve external text entities:
define markup sink function
   resolver
as
   do markup-parse #current-input
      suppress
   done
Now we can define a function that will filter out all external text entity references:
define markup source function
   filter-entities from value markup source ms
as
   using output as markuputilities.split-external-text-entities (resolver, #current-output)
      output ms
And here is the complete pipeline, including entity resolution:
process
   output xml.written from
      normalize-tables from
      filter-entities from
      parse from #main-input
7. Conclusion
OmniMark 9 removes the efficiency barrier that blocked modularization of markup processing pipelines. No longer
must we re-encode and re-parse between components. And pipeline components are independent of the input and
output markup language.
These are some example components from recent projects we've worked on:
Apply a unique identifier to each element in content
Merge consecutive elements (of a certain kind)
Break elements at specific sub-elements (for example, break paragraph elements containing <br/> into two paragraphs)
Remove empty paragraphs and spans
Insert content before or after elements with specified identifiers
Strip all tags from content
Resolve external text entities
Return the first element from the content
Retain only the specified elements
Strip tables to cell contents
Convert markup errors into processing instructions
Streaming merge of multi-part documents
Produce statistics on the number of characters, words, and different elements in the markup stream
Prettify the markup by inserting newlines and indentation where it is not significant
8. For more information
If you would like more information on the new features of OmniMark 9, here are some places to look: