About OmniMark

OmniMark is designed to make it easy for you to write programs using the streaming programming model.

The streaming programming model is an approach to programming that concentrates on describing the process to be applied to a piece of data, and on processing data directly as it streams from one location to another. In the streaming model, the use of data structures to model input data is eliminated, and the use of data structures to model output is greatly reduced. For instance, here is an OmniMark program that takes a document and converts all references to monetary amounts from the English style ($29.95) to the French style (29,95$):

  process
     submit "The doggy in the window costs $24.95."

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

This program outputs:

  The doggy in the window costs 24,95$.

Here is how this program works:

  • The word process starts a process rule. An OmniMark program is a collection of rules. A process rule fires when the program is run. It is the equivalent of the main function in other languages.
  • The word submit creates an OmniMark source. In this case, the content of the source is the literal string "The doggy in the window costs $24.95."
  • submit also initiates scanning of the source it creates. Scanning is a process in which data is moved from a source to a destination, potentially applying a transformation to it as it moves.
  • The word find defines a find rule. A find rule is a filter for data that is being scanned. The find rule specifies a pattern to be matched in the data and the actions to be applied when the data is matched. When a source is scanned by find rules, data that is not matched streams through to the current output scope. Data that is matched by a pattern is consumed and does not stream through to output. Output generated by the rule is merged with the data streaming to the current output scope.

The pattern used in this find rule is designed to match English style dollar values. Leaving out the pattern variable assignments, which we'll discuss in a moment, it looks like this:

  "$" digit+ "." digit{2}

The pattern reads as follows:

  • match a literal dollar sign ("$"),
  • then match one or more digits (the keyword digit with a plus sign after it, meaning "one or more"),
  • then match a literal period (".") followed by exactly 2 digits (digit{2}).

In order for the program to create the proper output, it needs to capture the digits that represent the dollars and the cents portions of the matched data. This is done by assigning the matched data to pattern variables, using the pattern variable assignment operator =>. This is the pattern with the pattern variables in place:

  find "$" digit+ => dollars "." digit{2} => cents

When the scanning process encounters data that matches this pattern, it fires the find rule. The data matched by digit+ is assigned to dollars, and the data matched by digit{2} is assigned to cents.

Next, the actions associated with the find rule are executed. The output action writes the dollars and cents values with the "," and "$" characters in the appropriate places. This output goes to the current output scope. Since the unmatched data is also going to this scope, the output of the rule is merged into the source data as it flows to its destination.
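
To see which data the pattern consumes and which data streams through untouched, consider a small sketch with two amounts, only one of which has a cents part. The input string is invented for illustration; the find rule is the one discussed above.

  process
     submit "A bone costs $5 and a collar costs $12.50."

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

Here "$5" is not followed by a period and two digits, so the pattern fails at that point and the text passes through unchanged. Only "$12.50" is consumed and rewritten, so the program should produce:

  A bone costs $5 and a collar costs 12,50$.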

There is a lot of detail in this explanation. To get a better idea of how this program works, paste the program into the OmniMark Studio for Eclipse, create an appropriate input file, and trace through the program.

Taking control of input and output

To process data other than literal strings, you need to be able to create a scanning source from external data sources. You also want to be able to send output somewhere other than the screen. In this revised version of the program the input comes from a file named on the command line and the output goes to another file named on the command line.

  process
     using output as file #args[1]
        submit file #args[2]

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

Note that only the process rule has changed. The find rule that does the actual work of processing the data remains the same no matter where the data comes from or where it goes. Here's how this new process rule works:

  • The first line uses the qualifier using output as to make file #args[1] the current output scope. This makes it the target of all output actions that are executed in that output scope.
  • submit is now prefixed by the using output as qualifier, which means that all output generated as a result of submit will go to that output scope.
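
If the program should instead act as a filter in a command pipeline, the process rule can be sketched as follows. This assumes the built-in source #main-input for standard input, and relies on output going to standard output when no other output scope has been established, as in the first example. Again, only the process rule differs.

  process
     submit #main-input

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"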

The streaming model at work

Beyond the details of the program, notice the streaming model at work:

Firstly, notice that the input data is not buffered. No data structure is created to represent it. The process of replacing the English form with the French form is carried out as the data flows from source to destination. The output is not buffered either. This program will run with equal success on a 2 kilobyte file or a 2 gigabyte file.

Secondly, notice how the program describes the process it performs. A reasonable description of the function of this program would be: "It finds the English format for expressing currency and replaces it with the French format. The input comes from one file and the output goes to another." And when we look at the code, we see that the process rule describes the path the data takes from input file to output file, and the find rule says "find the English currency format and output the French currency format".

Thirdly, notice the abstraction involved in dealing with sources and destinations of information. The find rule does not specify what data it is acting on: it is the current input data, whatever source that may flow from. The output action does not say where the output goes to; it goes to the current output scope, whatever that may be attached to. This means that the same scanning techniques can be applied to any piece of data from program variables, to files, to network data streams, in exactly the same manner. Scanning is a fully general data-processing technique, independent of the source or destination of the data to be scanned.
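
As an illustration of this independence, the sketch below sends the result of the same find rule to a program variable instead of a file. It assumes a local stream opened as a buffer; the stream name converted and the "Captured: " label are made up for the example.

  process
     local stream converted

     open converted as buffer
     using output as converted
        submit "The doggy in the window costs $24.95."
     close converted

     output "Captured: " || converted || "%n"

  find "$" digit+ => dollars "." digit{2} => cents
     output dollars || "," || cents || "$"

The find rule is unchanged from the earlier versions. It still writes to the current output scope, which here happens to be attached to the buffer.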

Fourthly, notice how much work is done for you by the scanning mechanism. There is no data movement code in this program. There is no need to maintain pointers or offsets into the data. There is no memory management to worry about. There is no need to explicitly buffer input and output. There is not even any need to worry about the opening and closing of files. All these things are done for you, in a highly robust and optimized way.

Streaming parsing

The same streaming techniques apply to XML parsing. Here is a simple XML document:

    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>

Here is a program that processes this XML document to produce HTML output:

  process
     do xml-parse scan file "input.xml"
        output "<HTML>%c</HTML>"
     done

  element "person"
     output "<BODY>%c</BODY>"

  element "name"
     output "<H1>%c</H1>"

  element "bio"
     output "%c"

  element "p"
     output "<p>%c</p>"

You should step through this program in the OmniMark Studio for Eclipse to observe how it works. The output of the program is:

    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>

The process rule plays the same role in this program as in the previous one. It establishes an input source and an output destination and it starts the scanning process. The difference here is that it is the parser that scans the data, not find rules. When the parser finds element markup in the source it is scanning, it fires an element rule. Just as with find rules, the unmatched data—the "data content" in XML terms—streams through to the current output. Thus each element rule can output into the current output stream just the way a find rule does.

Since the program is creating HTML, its element rules output HTML markup:

  element "person"
     output "<BODY>%c</BODY>"

This rule outputs the start and end tags for the HTML BODY element. In the final output, however, there will be a good deal of markup and data between <BODY> and </BODY>. Because XML data is hierarchical in nature, element rules fire hierarchically as well. The person element rule is suspended at the point %c occurs in the output action. All the contents of the person element are then parsed, with the appropriate rules being fired. This results in the other markup and data being sent to the output. Once this is done, the person element rule resumes and </BODY> is output.
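
The suspension is easier to see if the single output action is split into three. The following sketch is intended as an equivalent way of writing the person rule, on the assumption that %c may appear on its own in an output action:

  element "person"
     output "<BODY>"
     output "%c"
     output "</BODY>"

The first action runs when the person start tag is encountered. The second hands control back to the parser until the whole content of the element has been processed. The third runs only after that.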

Going further

To learn more about the basic principles of OmniMark programming see:

To learn about specific OmniMark syntax, just follow the links in the code samples, or consult the index to this documentation.

Visit the Stilo website for information on OmniMark training courses near you.

The OmniMark Users Group Mail list (OMUG-L) provides an opportunity for the OmniMark community to discuss issues related to OmniMark programming and related technologies. To subscribe, visit the OmniMark developers website.