Stilo — OmniMark Developer Resources

Building markup pipelines with OmniMark 9

1. Introduction

One of the primary motivations in the development of OmniMark 9 was to enable the creation of reusable markup-processing components.

Since OmniMark was first released twenty years ago, typical applications have grown more sophisticated, and modularization has therefore become more important: modular code is less complex, easier to maintain, and more suitable for reuse.

OmniMark is a streaming language. Modular OmniMark programs tend to be organized differently than programs in other programming languages. A traditional, non-streaming program is usually organized into a hierarchy: lower-level modules export functions used by higher-level modules, and so on to the main module sitting on top. When a streaming program is modularized, the dominant architecture is often not so much a hierarchy as a pipeline whose components pass data to each other.

In OmniMark 8, you could build component pipelines, but those components were connected together with string sources or string sinks. Each XML-to-XML processing component had to parse the input, transform the document structure with element rules, and then re-encode the output as XML so it could be passed as input to the next component. A modular pipeline in OmniMark 8 would therefore look like this:

[Figure: Re-encoding pipeline in OmniMark 8]

Repeated parsing and encoding adds significant overhead. OmniMark 8 tempts you to combine several processing components into one, improving efficiency at the cost of modularity:

[Figure: Single-shot pipeline in OmniMark 8]

The main feature of OmniMark 9 is markup event streaming. Components can now be connected with markup sources or markup sinks:

[Figure: Pipeline in OmniMark 9]

There is no longer any need for encoding and re-parsing between pipeline components: pipelines are more efficient, and components become simpler and easier to develop.

2. Prerequisites for streaming markup processing

2.1. Gluing the components: Markup sources and sinks

OmniMark 9 introduces two new stream types, markup source and markup sink.

Markup sources are like string sources, except that they can also contain markup events. String source is a subtype of markup source: any function that reads a markup source can also read a string source. With source functions, pipelines read right to left, as in the following example:

[Figure: Source pipeline]

define markup source function parse      from value string source s elsewhere
define markup source function transform1 from value markup source m elsewhere
define markup source function transform2 from value markup source m elsewhere
define string source function encode     from value markup source m elsewhere

process
   output encode from transform2 from transform1 from parse from #main-input


Markup sinks are like string sinks except that you can also write markup events to them. Markup sink is a subtype of string sink: any function that writes to a string sink can also write to a markup sink. With sink functions, pipelines read left to right:

[Figure: Sink pipeline]

define string sink   function parse      into value markup sink m elsewhere
define markup sink   function transform1 into value markup sink m elsewhere
define markup sink   function transform2 into value markup sink m elsewhere
define markup sink   function encode     into value string sink m elsewhere

process
   using output as (parse into transform1 into transform2 into encode into #main-output)
      output #main-input


Sources can be joined, with input taken from each source sequentially.

output encode from
       transform2 from
       transform1 from
       parse from
       (file "my.dtd" || #main-input)


Sinks can be forked, and output written to each sink in parallel.

using output as parse into
                transform1 into
                transform2 into
                encode into
                (#main-output & relaxng.validator against my-schema)

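As a rough Python analogy (invented names, not OmniMark code), joining sources corresponds to chaining iterators so each is read to exhaustion in turn, while forking a sink corresponds to writing every item to several destinations:

```python
from itertools import chain

def join(*sources):
    # Join (like ||): read each source sequentially, in order.
    return chain(*sources)

def fork(*sinks):
    # Fork (like &): return a write function that sends each item to all sinks.
    def write(item):
        for sink in sinks:
            sink.append(item)
    return write

# Join two sources: the "DTD" stream is consumed before the main input.
joined = list(join(["dtd events"], ["document events"]))

# Fork one stream into the main output and a validator, in parallel.
main_output, validator = [], []
write = fork(main_output, validator)
for item in joined:
    write(item)
print(main_output)  # ['dtd events', 'document events']
```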

2.2. Encoding XML: the OMXMLWRITE library

The OMXMLWRITE library provides two encoders that convert a well-formed markup event stream into XML. These encoders are typically placed at the end of the pipeline.

  • written is a string source function that creates XML out of events read from a markup source:

    export string source function
       written from value markup source m
    elsewhere
    
    
    
  • writer is a markup sink function that writes the XML to a string sink:

    export markup sink function
       writer into value string sink destination
    elsewhere
    
    
    

2.3. #content — creating a markup source

Before you can begin to process a markup stream, it has to be created at the beginning of the pipeline. The easiest way to create a markup stream is to parse a markup document. This simple example function shows how:

define markup source function
   parse from value string source s
as
   do xml-parse scan s
      output #content
   done


The built-in #content variable is a markup source. More specifically, it is the source of text and markup events resulting from do xml-parse. You can use #content in any rule in which "%c" can be used, though unlike "%c", #content does not fire element rules.

2.4. Processing markup events: do markup-parse

Processing of markup streams is best done using markup rules, just like in previous versions of OmniMark. To subject a markup source to rules, use do markup-parse:

define markup source function
   transform1 from value markup source ms
as
   do markup-parse ms
      output "%c"
   done


The construct do markup-parse is like do xml-parse, except that it works on a markup source instead of a string source. No scan keyword is required, as the input has already been scanned into events. While "%c" invokes markup rules on the event stream, #content would not.

3. Putting the pipeline together

The minimal complete pipeline is the identity pipeline. It just parses the input and then re-encodes it and writes it out again.

import "omxmlwrite.xmd" prefixed by xml.

define markup source function
   parse from value string source s
as
   do xml-parse scan s
      output #content
   done

process
   output xml.written from parse from #main-input


Or more simply:

import "omxmlwrite.xmd" prefixed by xml.

process
   do xml-parse scan #main-input
      output xml.written from #content
   done


To have our pipeline do any useful work, we must replace #content by "%c", which will then invoke our element rules. These are not your grandfather's element rules, though. Our goal is to make a component that can be plunked in the middle of any markup-processing pipeline, and that means we must emit a markup source.

3.1. Example: removing paragraphs from lists

In a recent project, the input format had list items that contained paragraph tags. The output format did not allow that, so the paragraph tags inside the list items had to be removed.

        <doc>
           <paragraph>Here's a list:</paragraph>
           <list>
              <item>
                <paragraph>First item</paragraph>
              </item>
              <item>
                <paragraph>Second item</paragraph>
              </item>
              <item>
                <paragraph>Third item</paragraph>
                <paragraph>has <b>two</b> paragraphs</paragraph>
              </item>
           </list>
        </doc>

Here's how we removed the element tags from paragraph elements in list items:

element "paragraph" when parent is "item"
   output #content


We copied all other elements as they were found.

element #implied
   signal throw #markup-start #current-markup-event
   output "%c"
   signal throw #markup-end #current-markup-event


An element is a markup-region-event: you must signal both the start and the end of the event. The signal throw action sends an event to #current-output. In this case, #current-output feeds into the function xml.written, which converts the signal into XML. This rule is the equivalent of the ubiquitous pre-OmniMark 9 rule:

element #implied
   output "<%q>%c</%q>"


Note: if #current-output is not a markup sink, the markup event will be thrown. This is also what happens if you use #content as a string source when there are markup events. (We will use this property later.)

Other events can be copied in a similar fashion:

markup-comment
   signal throw #markup-start #current-markup-event
   output #content
   signal throw #markup-end #current-markup-event

processing-instruction any*
   signal throw #markup-point #current-markup-event


Comments are region events, like elements. Processing instructions are point events — they consist of one signal with no content.

So this is what the whole pipeline component looks like:

define markup source function
   strip-paragraph-tags from value markup source ms
as
   do markup-parse ms
      output "%c"
   done

element "paragraph" when parent is "item"
   output #content

element #implied
   signal throw #markup-start #current-markup-event
   output "%c"
   signal throw #markup-end #current-markup-event

markup-comment
   signal throw #markup-start #current-markup-event
   output #content
   signal throw #markup-end #current-markup-event

processing-instruction any*
   signal throw #markup-point #current-markup-event

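As a rough Python analogy (invented names and event representation, not OmniMark code), this component can be modeled as a filter over a stream of (kind, name) tuples that drops paragraph start and end events whose parent element is an item, while passing everything else through:

```python
def strip_paragraph_tags(events):
    # events: ("start", name), ("end", name), or ("text", data) tuples.
    stack = []  # open-element stack, so each event's parent is known
    for kind, name in events:
        if kind == "start":
            parent = stack[-1] if stack else None
            stack.append(name)
            if name == "paragraph" and parent == "item":
                continue          # drop the start tag, keep the content
        elif kind == "end":
            stack.pop()
            if name == "paragraph" and stack and stack[-1] == "item":
                continue          # drop the matching end tag
        yield kind, name

events = [("start", "item"), ("start", "paragraph"),
          ("text", "First item"), ("end", "paragraph"), ("end", "item")]
print(list(strip_paragraph_tags(events)))
# [('start', 'item'), ('text', 'First item'), ('end', 'item')]
```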

Finally, we can insert this transformation component into the identity pipeline:

process
   output xml.written from
          strip-paragraph-tags from
          parse from #main-input


4. Creating new markup events

There are two types of markup events, as we saw above:

  • markup-region-event

    • Elements

    • Comments

    • Marked sections

  • markup-point-event

    • Processing instructions

    • Markup errors

There are also three built-in catch targets that can be used to signal and catch markup events:

  • catch #markup-start value markup-region-event e

  • catch #markup-end value markup-region-event e

  • catch #markup-point value markup-point-event e

Trying to use #content as a string source causes a throw if a markup event is encountered, as markup events are not allowed in string sources. We can rely on this behavior to create a new element event.

define markup-region-event function
   create-element-event  (value string element-name)
as
   do xml-parse scan "<" || element-name || "/>"
      output #content drop any*    ; Any string operation will do
   done

catch #markup-start e
   return e


The drop operator treats #content as a string source, and a string source cannot carry the #markup-start event from the element tag. The event is therefore thrown, caught by the catch clause (terminating the parse), and returned.
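The same throw-to-extract trick can be imitated in Python (a rough analogy with invented names, not OmniMark code): a text-only consumer raises when it meets a markup event, and the caller catches the exception to obtain the event:

```python
class MarkupStart(Exception):
    """Carries a markup start event out of a text-only consumer."""
    def __init__(self, event):
        self.event = event

def drain_as_text(events):
    # Treat the stream as plain text: any markup event is "thrown".
    for kind, payload in events:
        if kind == "start":
            raise MarkupStart((kind, payload))
        # plain text content would simply be consumed (dropped) here

def create_element_event(name):
    # Build the events a parse of "<name/>" would produce, then
    # capture the start event as it escapes the text-only consumer.
    events = [("start", name), ("end", name)]
    try:
        drain_as_text(events)
    except MarkupStart as e:
        return e.event

print(create_element_event("cell"))  # ('start', 'cell')
```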

Here are a few more simple but useful helper functions for generating or copying markup events.

  • make-element allows you to create a new element region.

    define markup source function
       make-element   (value   string               element-name,
                       value   markup source        ms)
    as
       output make-markup-region (create-element-event (element-name), ms)
    
    
    
  • make-markup-region allows you to copy existing events to the next component in the chain. You would use this to copy an element event, a comment event, or any other kind of region event.

    define markup source function
       make-markup-region   (value   markup-region-event  event,
                             value   markup source        ms)
    as
       signal throw #markup-start event
       output ms
       signal throw #markup-end event
    
    
    
  • You can use make-markup-point to pass a point event to the next component in the chain.

    define markup source function
       make-markup-point  (value   markup-point-event   event)
    as
       signal throw #markup-point event
    
    
    

4.1. Example: renaming elements

For this example, we will use the output of the previous component as our input. We have removed the paragraph tags from inside the list items, but left them elsewhere. Now, we want to convert the element names to HTML style. The example input for our new pipeline component looks like this:

        <doc>
           <paragraph>Here's a list:</paragraph>
           <list>
              <item>
                First item
              </item>
              <item>
                Second item
              </item>
              <item>
                Third item
                has <b>two</b> paragraphs
              </item>
           </list>
        </doc>

First, we define a function to create a markup source, using do markup-parse and "%c" so that element rules will be invoked. Next, we create element rules to substitute the new element names. When we encounter the doc element, we output the html element and then the body element; "%c" lets us invoke element rules on the content of the doc element. Similarly, we output ul when we encounter list, and use "%c" to invoke element rules on the content of the list. For the item and paragraph rules we use #content, as there is no need, in our example, to invoke element rules within them. Last, we have the rules that pass anything else through to the next component. For elements we use "%c", but for comments, which cannot contain any other markup events, we can use #content.

define markup source function
   rename-elements from value markup source ms
as
   do markup-parse ms
      output "%c"
   done

element "doc"
   output make-element ("html", make-element ("body", "%c"))

element "list"
   output make-element ("ul", "%c")

element "item"
   output make-element ("li", #content)

element "paragraph"
   output make-element ("p", #content)

element #implied
   output make-markup-region (#current-markup-event, "%c")

markup-comment
   output make-markup-region (#current-markup-event, #content)

processing-instruction any*
   output make-markup-point (#current-markup-event)


Our pipeline that performs both paragraph stripping and element renaming now looks like this:

process
   output xml.written from
          rename-elements from
          strip-paragraph-tags from
          parse from #main-input


Note that the four pipeline components we have put together do not depend on each other. We can, for example, remove strip-paragraph-tags from the pipeline; the pipeline output will still be valid HTML, only with more p tags. Furthermore, each component can be easily reused in other similar pipelines.

5. New syntax summary

This is the syntax you need to create new transformation components:

  • Markup streams

    • Types: markup source, markup sink

    • Built-in source: #content

    • Actions: do markup-parse

    • Libraries: OMXMLWRITE, OMSGMLWRITE (beta)

  • Markup events

    • Types: markup-point-event, markup-region-event

    • Built-in event: #current-markup-event

    • Actions: signal throw

    • Catch targets: #markup-point, #markup-start, #markup-end

But wait, there's more!

6. OMMARKUPUTILITIES library

The OMMARKUPUTILITIES library introduces some very useful but non-trivial functionality:

  • Type: markup-buffer

  • Function: split-external-text-entities

Markup buffers store markup events and string content. We need markup buffers because referents cannot be written to a parser: referents are not resolved until the scope (or the program) ends, and the parser would not be able to resume until all referents were resolved.

Moreover, referents can only contain text (and other referents), while strings and buffers can only store string content, not markup events. If we relied on ordinary buffers, we would lose efficiency by having to re-encode the markup.

Markup buffers are easy to use:

  1. To create an empty markup-buffer, declare a variable or push a new item on a markup-buffer shelf.

    import "ommarkuputilities.xmd" prefixed by markuputilities.
    global markuputilities.markup-buffer titles variable
    
    
    
  2. Write to a markup-buffer by using it as a markup sink:

    element "title"
       using output as new titles
          output make-markup-region (#current-markup-event,
                                     #content)
    
    
    
  3. Read from a markup-buffer as a markup source:

    process
       repeat over titles as ms
          output xml.written from ms
       again
    
    
    

All the usual operators work on markup buffers: & (fork), || (join), =, !=.

6.1. Table normalization example

In this example, we have a table with a different number of cells on each row. The output requires each row to have the same number of cells.

The solution can be achieved in two passes.

domain-bound global integer max-cells initial {0}

element "table"
   local  markuputilities.markup-buffer table-content
   save   max-cells

   ; Store the table contents into the markup buffer
   using output as table-content
      output #content

   ; Pass 1: Find the maximum number of cells in a row
   count-cells from table-content

   ; Pass 2: Output table, adding missing cells
   output make-markup-region (#current-markup-event,
                              add-cells from table-content)


Note that max-cells is declared domain-bound. In this simple example that is not necessary, but it is generally a good practice to declare all globals modified from a coroutine as domain-bound, and to save them at the top level of the coroutine. This way each coroutine instance has its own copy of the variable, so they won't step on each other's toes.

In the first pass, function count-cells runs through all table rows, keeping track of the maximum cell count in the max-cells global variable.

define function
   count-cells from value markup source ms
as
   using group "count cells"
      do markup-parse ms
         suppress
      done

group "count cells"
element "row"
   put #suppress #content
   do when children > max-cells
      set max-cells to children
   done


Note the use of suppress to fire element rules in do markup-parse, and of put #suppress #content to avoid firing the element rules for table cells inside element "row".

In the second pass, function add-cells uses the calculated max-cells to add the missing cells.

define markup source function
   add-cells from value markup source ms
as
    using group "add cells"
       do markup-parse ms
          output "%c"
       done

group "add cells"

element "row"
   signal throw #markup-start #current-markup-event
   output #content
   repeat to max-cells - children
      output make-element ("cell", "") || "%n"
   again
   signal throw #markup-end #current-markup-event

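As a rough Python analogy (invented names, not OmniMark code), the buffer-and-two-pass strategy looks like this: buffer the rows, compute the maximum cell count in a first pass, then replay the buffer in a second pass, padding short rows with empty cells:

```python
def normalize_table(rows):
    # rows: list of lists of cell strings; the list stands in for the
    # markup buffer holding the table content.
    buffered = list(rows)                                   # buffer the content
    max_cells = max((len(r) for r in buffered), default=0)  # pass 1: count
    for row in buffered:                                    # pass 2: pad
        yield row + [""] * (max_cells - len(row))

table = [["a"], ["b", "c", "d"], ["e", "f"]]
print(list(normalize_table(table)))
# [['a', '', ''], ['b', 'c', 'd'], ['e', 'f', '']]
```

The buffer matters for the same reason as in the OmniMark version: the second pass cannot begin until the first pass has seen every row, so the content must be held somewhere replayable in between.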

The glue code for the table normalization pipeline would be:

process
   output xml.written from
          normalize-tables from
          parse from #main-input


6.2. Split-external-text-entities

Remember put #suppress #content in count-cells? Markup rules do not fire when #content is sent to the #suppress stream. Therefore external text entity references will not be resolved.

Unfortunately, in SGML, text entities may contain unbalanced start and end tags. The only way to know the context after the external text entity reference is to resolve the reference and parse the content of the entity. The parser cannot proceed without resolving the entity references. In this pipeline, an external text entity reference in a table cell will throw a #program-error.

There is a ready solution to this problem: import the OMMARKUPUTILITIES library and use its function split-external-text-entities. This function lets you separate the external text entity events and send them to a destination that knows how to handle them. All other content can be forwarded to the normal processing chain.

First create a markup sink to resolve external text entities:

define markup sink function
   resolver
as
   do markup-parse #current-input
      suppress
   done


Now we can define a function that will filter out all external text entity references:

define markup source function
   filter-entities from value markup source ms
as
   using output as markuputilities.split-external-text-entities (resolver, #current-output)
      output ms


And here is the complete pipeline, including entity resolution:

process
   output xml.written from
          normalize-tables from
          filter-entities from
          parse from #main-input


7. Conclusion

OmniMark 9 removes the efficiency barrier blocking modularization of markup processing pipelines. No longer must we re-encode and re-parse between components, and pipeline components are now independent of the input and output markup language.

These are some example components from recent projects we've worked on:

  • Apply a unique identifier to each element in content

  • Merge consecutive elements (of a certain kind)

  • Break elements at specific sub-elements (for example, break paragraph elements containing <br/> into two paragraphs)

  • Remove empty paragraphs and spans

  • Insert content before or after elements with specified identifiers

  • Strip all tags from content

  • Resolve external text entities

  • Return the first element from the content

  • Retain only the specified elements

  • Strip tables to cell contents

  • Convert markup errors into processing instructions

  • Streaming merge of multi-part documents

  • Produce statistics on the number of characters, words, and different elements in the markup stream

  • Prettify the markup by inserting newlines and indentation where it is not significant

8. More information

If you would like more information on the new features of OmniMark 9, the OmniMark documentation and the other developer resources on this site are good places to look.
