Linking chains of streaming filters

It is as easy to create a set of independent filters and stream data through them sequentially as it is to write a single filter with multiple rules. This allows you to choose the most natural algorithm for each content engineering challenge you encounter.

To enable streaming in this fashion, OmniMark provides sink and source types. Here is a function of type string source, which means that the function returns a source of string data. It also takes an argument of type string source, meaning that it expects to be passed a source of string data. The purpose of the function is to remove excess white space from string data:

  define string source function 
     compress-whitespace (value string source s)
  as
     repeat scan s
     match blank* "%n" blank*
       output "%n"
  
     match blank+
       output "%_"
  
     match [any \ white-space]+ => chars
       output chars
     again

This function can be called in any context that expects a data source, such as a submit action. It can accept any source as an argument, such as #main-input.

  process
     submit compress-whitespace (#main-input)
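
The find rules on the receiving end are ordinary find rules; nothing in them needs to know that the data passed through a filter first. As a minimal sketch, assume a hypothetical input convention (not part of the original example) in which a heading line begins with "Title:":

```
  ; Hypothetical rule for illustration only.
  ; By the time data reaches it, compress-whitespace () has already
  ; reduced each run of blanks to a single space, so the pattern
  ; only needs to allow for one space after the colon.
  find line-start "Title: " any-text+ => title "%n"
     output "<title>" || title || "</title>%n"
```

Without the filter, the same rule would need something like blank* around every token to tolerate irregular spacing.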

This program streams input from #main-input, through the function compress-whitespace (), to submit, where it can be processed by find rules. The find rules receive a stream of data from which all excess white space has been removed by compress-whitespace (). Data flows through the program in a completely streaming fashion, with no buffering of data.

This means that you can connect any number of streaming filters in a chain. Suppose that you want to process an unstructured document to create an XML representation and then create an HTML output. You could do this with a traditional OmniMark context-translate program; however, this would mean that you could make only one find rule pass and one markup rule pass over the data. With string source functions, you can connect as many text filters or markup parsers together as you want. In this case, the most natural algorithm might be:

Filter the input text to remove excess white space. This makes it easier to write the next filter, by simplifying white-space handling (compress-whitespace ()):
```
  define string source function 
     compress-whitespace (value string source s)
  as
     repeat scan s
     ; ...
```
Filter the output of compress-whitespace () to wrap XML tags around the elements of the input data in the simplest possible fashion (text2xml ()):
```
  define string source function 
     text2xml (value string source s)
  as
     submit s
     ; ...
```

Parse the output of text2xml () to tidy up the XML, removing unneeded elements and adding structure and ID attributes (tidy-xml ()):

  define string source function 
     tidy-xml (value string source s)
  as
     do xml-parse scan s
     ; ...

Parse the output of tidy-xml () to create HTML (xml2html ()):

  define string source function 
     xml2html (value string source s)
  as
     do xml-parse scan s
     ; ...

You would then invoke those functions as a chain of streaming filters with a simple output action:

  process
     output xml2html (tidy-xml (text2xml (compress-whitespace (#main-input))))

The flow of data here is from right to left (as the program is written). Each function, starting with compress-whitespace () on the right, takes a string source as its input and returns a string source to the function on its left.
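
Because every stage has the same shape (a string source in, a string source out), a stage can be added to or removed from the chain without touching the others. As a sketch, here is a hypothetical extra stage, escape-markup (), which is not part of the example above. It assumes that text2xml () does not escape markup characters itself, so it is placed between compress-whitespace () and text2xml ():

```
  ; Hypothetical stage: escapes "&" and "<" in the raw text so that
  ; the tags added later by text2xml () are the only markup present.
  define string source function 
     escape-markup (value string source s)
  as
     repeat scan s
     match "&"
        output "&amp;"

     match "<"
        output "&lt;"

     match [any \ "&<"]+ => chars
        output chars
     again

  process
     output xml2html (tidy-xml (text2xml (escape-markup (compress-whitespace (#main-input)))))
```

Reading from the inside out, the data passes through compress-whitespace (), then escape-markup (), then the three stages of the original chain.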

Another way to structure this program would be to write the xml2html () function as a string sink rather than as a string source. This means that the function becomes a destination to which data is sent, and processes that data before sending it on to another sink. Here is the xml2html () function written as a sink function:

  define string sink function 
     xml2html (value string sink s)
  as
     using output as s
     do xml-parse scan #current-input
        output "%c"
     done

This function can be used anywhere a string sink (a destination for string data) is expected, such as in a using output as prefix, and it can accept any string sink expression as an argument, such as #main-output:

  process
     using output as xml2html (#main-output)
        output tidy-xml (text2xml (compress-whitespace (#main-input)))

Here again, data is streamed through the chain of streaming filters implemented by the string source functions to the current output scope, which is the string sink function xml2html (), which in turn streams it to #main-output. Once again, the data is never buffered. The output data streams from left to right (as the program is written) from the xml2html () function to the main output.
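
The output "%c" inside xml2html () only sets the parse in motion. The HTML itself comes from the markup rules that fire as elements are parsed, and whatever those rules output goes to the sink that was passed to the function. The text does not show those rules. A minimal sketch, assuming hypothetical element names title and para in the tidied XML:

```
  ; Hypothetical element names; substitute those produced by tidy-xml ().
  element "title"
     output "<h1>%c</h1>%n"

  element "para"
     output "<p>%c</p>%n"

  ; Any other element: pass its content through without a wrapper.
  element #implied
     output "%c"
```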

Since the current output scope of an OmniMark program can include more than one sink, you can define multiple string sink functions and stream data to them simultaneously. In the following example, the original source is converted to XML, then that XML is streamed directly to a file, to an HTML output function, and to an XSL/FO output function, creating three different output formats simultaneously:

  process
     using output as xml2html (file #args[2])
                   & xml2fo (file #args[3])
                   & file #args[4]
        output tidy-xml (text2xml (compress-whitespace (file #args[1])))
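
The xml2fo () function is not defined above. Assuming it follows the same pattern as the sink version of xml2html (), a skeleton might look like this:

```
  ; Sketch only: mirrors the sink version of xml2html ().
  define string sink function 
     xml2fo (value string sink s)
  as
     using output as s
     do xml-parse scan #current-input
        output "%c"
     done
```

In a real program the two functions would need different sets of markup rules, one producing HTML and one producing XSL/FO, for example by keeping them in separate rule groups. That detail is left out of this sketch.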
