OmniMark is designed to make it easy for you to write programs using the streaming programming model.
The streaming programming model is an approach to programming that concentrates on describing the process to be
applied to a piece of data, and on processing data directly as it streams from one location to another. In the
streaming model, the use of data structures to model input data is eliminated, and the use of data structures to
model output is greatly reduced. For instance, here is an OmniMark program that takes a document and converts all
references to monetary amounts from the English style ($29.95) to the French style (29,95$):
    process
       submit "The doggy in the window costs $24.95."

    find "$" digit+ => dollars "." digit{2} => cents
       output dollars || "," || cents || "$"
This program outputs:
The doggy in the window costs 24,95$.
Here is how this program works:
process starts a process rule. An OmniMark program is a collection of rules. A process rule fires when the program is run. It is the equivalent of the main function in other languages.
submit creates an OmniMark source. In this case, the content of the source is the literal string "The doggy in the window costs $24.95."
submit also initiates scanning of the source it creates. Scanning is a process in which data is moved from a source to a destination, potentially applying a transformation to it as it moves.
find defines a find rule. A find rule is a filter for data that is being scanned. The find rule specifies a pattern to be matched in the data and the actions to be applied when the data is matched. When a source is scanned by find rules, data that is not matched streams through to the current output scope. Data that is matched by a pattern is consumed and does not stream through to output. Output generated by the rule is merged with the data streaming to the current output scope.
The pattern used in this find rule is designed to match English style dollar values. Leaving out the
pattern variable assignments, which we'll discuss in a moment, it looks like this:
    "$" digit+ "." digit{2}
The pattern reads as follows: a dollar sign ("$"), followed by one or more digits (digit with a plus sign after it, meaning "one or more"), followed by a period (".") followed by exactly 2 digits (digit{2}).
In order for the program to create the proper output, it needs to capture the digits that represent the dollars
and the cents portions of the matched data. This is done by assigning the matched data to pattern variables, using the
pattern variable assignment operator =>. This is the pattern with the pattern variables in place:
    find "$" digit+ => dollars "." digit{2} => cents
When the scanning process encounters a piece of data that matches this pattern, it will fire the find rule. The data matched by digit+ will be assigned to dollars and the data matched by digit{2} will be assigned to cents.
Next, the actions associated with the find rule will be executed. The output action writes the dollars and cents values with the "," and "$" characters in the appropriate places. This output goes to the current output scope. Since the unmatched data is also going to this scope, the output of the rule is merged into the source data as it flows to its destination.
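For readers who know regular expressions better than OmniMark patterns, the find rule can be loosely compared to a substitution with named capture groups. The following Python sketch is only an analogy, not OmniMark: the names ENGLISH_AMOUNT and to_french are invented for illustration, and unlike an OmniMark scan it holds the whole input string in memory.

```python
import re

# Rough analogue of: "$" digit+ => dollars "." digit{2} => cents
# The named groups play the role of the pattern variables.
ENGLISH_AMOUNT = re.compile(r"\$(?P<dollars>[0-9]+)\.(?P<cents>[0-9]{2})")


def to_french(text: str) -> str:
    """Rewrite each matched amount; unmatched text passes through unchanged."""
    return ENGLISH_AMOUNT.sub(
        lambda m: f"{m.group('dollars')},{m.group('cents')}$", text
    )


print(to_french("The doggy in the window costs $24.95."))
# prints: The doggy in the window costs 24,95$.
```

As in the OmniMark program, only the matched text is replaced; everything else reaches the output untouched.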
There is a lot of detail in this explanation. To get a better idea of how this program works, paste the program into the OmniMark Studio for Eclipse, create an appropriate input file, and trace through the program.
To process data other than literal strings, you need to be able to create a scanning source from external data
sources. You also want to be able to send output somewhere other than the screen. In this revised version of the
program the input comes from a file named on the command line and the output goes to another file named on the
command line.
    process
       using output as file #args[1]
          submit file #args[2]

    find "$" digit+ => dollars "." digit{2} => cents
       output dollars || "," || cents || "$"
Note that only the process rule has changed. The find rule that does the actual work of processing the data remains the same no matter where the data comes from or where it goes. Here's how this new process rule works:
The using output as qualifier makes file #args[1] the current output scope. This makes it the target of all output actions that are executed in that output scope.
submit now takes file #args[2] as its source and is prefixed by the using output as qualifier, which means that all output generated as a result of submit will go to that output scope.
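The idea of an output scope has a loose counterpart in other languages. The Python sketch below is an analogy only: redirect_stdout temporarily changes where print sends its text, much as using output as changes where output actions send theirs. The function name emit_amount and the variable names are invented for illustration.

```python
import contextlib
import io


def emit_amount() -> None:
    # Like an output action: it writes to whatever the current output is,
    # without naming a destination itself.
    print("24,95$", end="")


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):  # loosely like: using output as ...
    emit_amount()

captured = buffer.getvalue()  # the text went to the buffer, not the screen
```

The code that produces the output does not change when the destination changes, which is the property the process rule above relies on.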
Beyond the details of the program, notice the streaming model at work:
Firstly, notice that the input data is not buffered. No data structure is created to represent it. The process of replacing the English form with the French form is carried out as the data flows from source to destination. The output is not buffered either. This program will run with equal success on a 2 kilobyte file or a 2 gigabyte file.
Secondly, notice how the program describes the process it performs. A reasonable description of the function of this program would be: "It finds the English format for expressing currency and replaces it with the French format. The input comes from one file and the output goes to another." And when we look at the code, we see that the process rule describes the path the data takes from input file to output file, and the find rule says find the English currency format and output the French currency format.
Thirdly, notice the abstraction involved in dealing with sources and destinations of information. The find rule does not specify what data it is acting on: it is the current input data, whatever source that may flow from. The output action does not say where the output goes to; it goes to the current output scope, whatever that may be attached to. This means that the same scanning techniques can be applied in exactly the same manner to any piece of data, from program variables to files to network data streams. Scanning is a fully general data-processing technique, independent of the source or destination of the data to be scanned.
Fourthly, notice how much work is done for you by the scanning mechanism. There is no data movement code in this program. There is no need to maintain pointers or offsets into the data. There is no memory management to worry about. There is no need to explicitly buffer input and output. There is not even any need to worry about the opening and closing of files. All these things are done for you, in a highly robust and optimized way.
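To make the contrast concrete, here is a rough Python approximation of the same source-to-destination flow. It is a sketch under stated assumptions, not a description of how OmniMark works internally: the function name convert_stream is invented, the data is processed one line at a time, and an amount is assumed never to span a line break. Note that even this small version has to spell out the loop and the write that OmniMark's scanning mechanism supplies for you.

```python
import io
import re
from typing import TextIO

ENGLISH_AMOUNT = re.compile(r"\$([0-9]+)\.([0-9]{2})")


def convert_stream(source: TextIO, destination: TextIO) -> None:
    """Copy source to destination, rewriting amounts line by line.

    Only one line is held in memory at a time, so input size does not matter.
    """
    for line in source:
        # \1 is the dollars group, \2 the cents group; "$" is literal here.
        destination.write(ENGLISH_AMOUNT.sub(r"\1,\2$", line))


# The same function works for in-memory text, files, or any text stream.
src = io.StringIO("Bone: $3.50\nLeash: $12.00\n")
dst = io.StringIO()
convert_stream(src, dst)
print(dst.getvalue(), end="")
# prints:
# Bone: 3,50$
# Leash: 12,00$
```

Because convert_stream accepts any text stream, the conversion logic stays the same whether the data comes from a string, a file opened with open(), or another stream.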
The same streaming techniques apply to XML parsing. Here is a simple XML document:
    <person>
    <name>Mary</name>
    <bio>
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
    </bio>
    </person>
Here is a program that processes this XML document to produce HTML output:
    process
       do xml-parse scan file "input.xml"
          output "<HTML>%c</HTML>"
       done

    element "person"
       output "<BODY>%c</BODY>"

    element "name"
       output "<H1>%c</H1>"

    element "bio"
       output "%c"

    element "p"
       output "<p>%c</p>"
You should step through this program in the OmniMark Studio for Eclipse to observe how it works. The output of the program is:
    <HTML><BODY>
    <H1>Mary</H1>
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
    </BODY></HTML>
The process rule plays the same role in this program as in the previous one. It establishes an input source and an output destination and it starts the scanning process. The difference here is that it is the parser that scans the data, not find rules. When the parser finds element markup in the source it is scanning, it fires an element rule. Just as with find rules, the unmatched data (the "data content" in XML terms) streams through to the current output. Thus each element rule can output into the current output stream just the way a find rule does.
Since the program is creating HTML, its element rules output HTML markup:
    element "person"
       output "<BODY>%c</BODY>"
This rule outputs the start and end tags for the HTML BODY element. In the final output, however, there will be a good deal of markup and data between <BODY> and </BODY>.
Because XML data is hierarchical in nature, element rules fire hierarchically as well. The person element rule is suspended at the point %c occurs in the output action. All the contents of the person element are then parsed, with the appropriate rules being fired. This results in the other markup and data being sent to the output. Once this is done, the person element rule resumes and </BODY> is output.
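The suspend-and-resume behaviour around %c resembles the start and end callbacks of an event-driven parser. The Python sketch below uses the standard xml.sax module as an approximation only: the RULES table, the HtmlHandler class, and the convert function are invented for illustration, and each rule is reduced to the text before and after %c.

```python
import io
import xml.sax

# (before, after) pairs stand in for the text on either side of %c.
RULES = {
    "person": ("<BODY>", "</BODY>"),
    "name": ("<H1>", "</H1>"),
    "bio": ("", ""),
    "p": ("<p>", "</p>"),
}


class HtmlHandler(xml.sax.ContentHandler):
    """Write HTML as parse events arrive; no document tree is built."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startDocument(self):
        self.out.write("<HTML>")

    def endDocument(self):
        self.out.write("</HTML>")

    def startElement(self, name, attrs):
        self.out.write(RULES[name][0])  # the part of the rule before %c

    def endElement(self, name):
        self.out.write(RULES[name][1])  # the rule "resumes" after its content

    def characters(self, content):
        self.out.write(content)  # data content streams straight through


def convert(xml_text: str) -> str:
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), HtmlHandler(out))
    return out.getvalue()


print(convert("<person><name>Mary</name><bio><p>Mary had a little lamb</p></bio></person>"))
# prints: <HTML><BODY><H1>Mary</H1><p>Mary had a little lamb</p></BODY></HTML>
```

In OmniMark a single element rule covers both halves, with %c marking where the element's content is processed; the handler above has to split that into two separate callbacks.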
To learn about specific OmniMark syntax, just follow the links in the code samples, or consult the index to this documentation.
Visit the Stilo website for information on OmniMark training courses near you.
The OmniMark Users Group Mail list (OMUG-L) provides an opportunity for the OmniMark community to discuss issues related to OmniMark programming and related technologies. To subscribe, visit the OmniMark developers website.