Types of OmniMark Programs


OmniMark® Programmer's Guide Version 3

2. Types of OmniMark Programs


Previous chapter is Chapter 1, "Introduction".

Next chapter is Chapter 3, "Generalized Document Processing".

OmniMark provides a variety of ways to configure programs, and a lot of help in doing so. There are two basic types of programs:

Batch processing programs typically perform a single task or a set of tasks which are determined at the time that a program is run. Batch processing tasks can involve:
- processing information from a number of sources, such as files, database records and communication ports,
- combining the information from these sources,
- producing one or more results, and
- delivering the results to an equal variety of destinations.
Server-based processing programs typically wait for requests and perform tasks on demand. Most often they communicate over networks with other programs and with users, and with file systems and databases that provide information and receive updates.

OmniMark makes it easy to write both types of programs. Although all types of OmniMark programs can make good use of most facilities in the language, there are some features which are primarily designed to support one type of programming or the other.

Batch processing programs can take advantage of OmniMark's built-in translation types, which aid the programmer by pre-configuring OmniMark to different types of rule-based conversions.

Server-based processing programs directly issue OmniMark's easy-to-use "DO SGML-PARSE" actions for SGML processing, and SUBMIT actions for general text processing.

2.1 OmniMark Subsystems

Essentially, OmniMark programs run with one or two program-controlled subsystems or threads, called domains. Programs can have one domain, or they can have two domains, joined together by OmniMark's built-in SGML parser.

General text processing is usually done with SUBMIT actions and FIND rules. SUBMIT actions feed input data to OmniMark, and FIND rules use pattern matching to analyse the input data. These features are enhanced by OmniMark's SGML support -- general text processing can be used to help create SGML documents, to further process the results of SGML parsing, or do both. (FIND rules are described in Section 3.1, "General Document Processing Rules". SUBMIT is described in Section 3.2.1, "Submitting Input to FIND Rules".)

Additionally, OmniMark's unique "source" functions and "output" functions mean that input data and output data can be transmitted from and to any location: the local file system, databases, across networks, or across the Internet. Source functions are described in Section 12.3.3, "Externally-Defined Sources". Output functions are described in Section 12.3.4, "External Output Functions".

Multi-domain programs can initiate SGML parsing themselves (as is typical in server-based processing) or use one of OmniMark's built-in translation types. They can even combine a built-in translation type with program-controlled initiation of further SGML parsing.

The OmniMark run-time environment is a system composed of one or more of the following three cooperating subsystems:

- the input processor (or find domain),
- the SGML parser, and
- the output processor (or element domain).

Figure 1 -- OmniMark Subsystems

The input processor provides the input to the SGML parser. In the process it may need to convert non-SGML data into SGML. The output processor converts the result of SGML parsing into some other form (which may even be SGML conforming to a different DTD).

The input processor has traditionally been called the find domain, because FIND rules figure largely in the kind of processing done there. Similarly, the output processor has traditionally been called the element domain, because that's where ELEMENT rules and other SGML processing is done. The terms input processor and output processor better characterize the role of the domains.

For translation programs, the translation type determines which subsystems are involved. For process programs, subsystems are started and suspended dynamically, under the explicit control of the actions in the program.

2.2 Batch Translation Program Types

Writing batch processing programs is made easier using a translation type, which automatically sets up the interaction of the OmniMark subsystems and sets up the input and output processing. By choosing different translation types, the programmer can control whether the input, the output, or an intermediate stage is parsed by the SGML parser, and what is produced as the "main output" of the program.

Batch processing programs are also referred to as translation programs because they are classified according to their translation type.

The translation type is specified by a single keyword at the start of the program. There are four translation types:

- CROSS-TRANSLATE: processes input using FIND rules.
  This is the only one of the four translation types that does not make use of the SGML parser. Cross translations consequently use only one domain.
- DOWN-TRANSLATE: processes an SGML input document with all the processing done post-SGML parsing.
  A DOWN-TRANSLATE is typically used to convert SGML documents into non-SGML forms, or into other SGML documents.
- UP-TRANSLATE: converts an input document into an SGML document.
  An UP-TRANSLATE differs from the other translation types in that the input to the SGML parser is also the main output of the program. The SGML parser is used for validation, and, often more importantly, to provide contextual information that is used to drive the conversion into SGML.
- CONTEXT-TRANSLATE: converts an input document into SGML, parses the result using the SGML parser, and then does further processing based on the parsed SGML.
  This further processing can be used to convert the document into some other form, to "clean up" the markup and data in the converted document, or to convert it into an SGML document conforming to another DTD (or Document Type Definition).
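
The differences between the four translation types can be summarised as a small lookup table. The following Python sketch is only a model of the descriptions in this chapter, not OmniMark code; the names TranslationType, TRANSLATION_TYPES and describe are invented for the illustration.

```python
from typing import NamedTuple


class TranslationType(NamedTuple):
    uses_sgml_parser: bool
    main_output_owner: str  # the domain that produces the "main output"


# One entry per translation-type keyword, as described in this chapter.
TRANSLATION_TYPES = {
    "CROSS-TRANSLATE": TranslationType(False, "input processor"),
    "DOWN-TRANSLATE": TranslationType(True, "output processor"),
    "UP-TRANSLATE": TranslationType(True, "input processor"),
    "CONTEXT-TRANSLATE": TranslationType(True, "output processor"),
}


def describe(keyword):
    """Return a one-line summary of a translation type (hypothetical helper)."""
    t = TRANSLATION_TYPES[keyword]
    parser = "uses" if t.uses_sgml_parser else "does not use"
    return (f"{keyword} {parser} the SGML parser; "
            f"main output comes from the {t.main_output_owner}")


for keyword in TRANSLATION_TYPES:
    print(describe(keyword))
```

The table makes the special position of UP-TRANSLATE visible: it is the only type that uses the SGML parser while still taking its main output from the input processor.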

Where a programmer does not want an OmniMark program to be configured as one of the above types of translations -- for example in a server-based application -- the programmer can specify directly what is to be subjected to FIND rule processing and what are to be the inputs and outputs of SGML parsing.

To avoid automatic configuration of the program, the programmer merely omits the translation type at the start.

The following subsections describe each of OmniMark's four "built-in" translation types in more detail.

2.2.1 Cross-Translation: General Document Translation

A cross-translation is a translation that converts a document from one arbitrary form to another. A cross-translation program does not make any use of the SGML parser. A cross-translation must begin with:

Syntax

   CROSS-TRANSLATE

Figure 2 -- Cross Translation Block Diagram

Figure 2 shows a simple block diagram of a cross-translation.

In a cross-translation, the OmniMark programmer must define their own conversion events. To this end, OmniMark provides a rich language for specifying patterns to match text in the input. When text in the input matches a pattern, the associated rule is executed. OmniMark also provides a very expressive mechanism for saving the text matched by pieces of the pattern, so that the matched text can play a role in the actions that will be executed.

The operation of a cross-translation is:

OmniMark examines each FIND rule in turn looking for a rule which can match text at the current position. It selects the first such rule with no condition or whose condition is true.
If a rule is selected, the actions in that rule are performed in order, and those actions are responsible for outputting the "conversion" of the text matched by the pattern. If the programmer wishes to use any of the matched text in an action, they can capture that text in pattern variables.
If no rules can be selected, OmniMark allows text at the current point to "fall through" to the output -- unrecognized data is output without being converted.
The text that "fell through" or was matched is considered to be "consumed", and the cycle begins again with the text that follows.
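
The cycle described above can be modelled in a few lines of Python. This is a sketch of the selection logic only, not of OmniMark itself: Python regular expressions stand in for OmniMark patterns, the function name cross_translate is invented, and letting unmatched text fall through one character at a time is an assumption of the model.

```python
import re


def cross_translate(text, rules):
    """Model of the FIND-rule cycle.

    rules is a list of (compiled_pattern, condition, action) tried in order;
    condition is None or a zero-argument callable, and action maps a match
    object to the text to output.
    """
    out = []
    pos = 0
    while pos < len(text):
        for pattern, condition, action in rules:
            m = pattern.match(text, pos)
            # Skip rules that do not match here (or match nothing).
            if m is None or m.end() == pos:
                continue
            # Skip rules whose condition is false.
            if condition is not None and not condition():
                continue
            out.append(action(m))   # first selectable rule wins
            pos = m.end()           # the matched text is consumed
            break
        else:
            out.append(text[pos])   # no rule selected: text falls through
            pos += 1
    return "".join(out)


rules = [
    (re.compile(r"[ \t]{2,}"), None, lambda m: " "),
    (re.compile(r"\b[jJ][A-Za-z]*"), None, lambda m: m.group(0).upper()),
]
print(cross_translate("jack  and   Jill", rules))  # prints "JACK and JILL"
```

Because the rules are tried strictly in order, a general rule placed before a more specific one hides the specific rule. The same consideration applies when ordering FIND rules.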

The following is an example of a simple OmniMark CROSS-TRANSLATE that removes all spaces at the starts and ends of lines, collapses runs of spaces between words into a single space character, and upper-cases every word starting with the letter "j". In the process tabs are converted into spaces:

   CROSS-TRANSLATE

   FIND LINE-START BLANK+ | BLANK+ LINE-END
      ; just ignore spaces and tabs at the start or end of lines

   FIND BLANK {2}+ ; two or more "blank" (space or tab) characters
      OUTPUT " "

   FIND WORD-START (UL "J" LETTER*) => word
      OUTPUT "%ux(word)" ; The "u" modifier upper-cases the word.

OmniMark's pattern recognition capability is not limited by line boundaries, nor does it arbitrarily break up the text into fields. Because of this, cross-translation is a technique that is applicable to a wide range of data analysis and conversion tasks.

2.2.2 Down-Translation: Translating SGML Documents

A down-translation is a translation whose input is a complete SGML document or an SGML document instance corresponding to a specified Document Type Definition.

The output of a down-translation can be SGML or some other format. It could be a document suitable for input into a text formatter, for example. Or a down-translation can be used to enter information from the SGML document into a database. A down-translation can even be used to transform an SGML document into another SGML document, for instance by "cleaning up" the input, or by restructuring it.

A down-translation is defined by entering:

Syntax

   DOWN-TRANSLATE

at the start of an OmniMark program.

Figure 3 -- Down-Translation Block Diagram

A down-translation is composed of rules that recognize SGML events, like elements. The basic operation of a down-translation program is:

The SGML parser builds up information as it processes the input document.
When the SGML parser recognizes a new component in the document, (an element or a processing instruction, for example), it informs OmniMark.
OmniMark then examines each rule in the program, in order, looking for rules that apply to that event. It selects the first one with no condition or with a condition that is true.
OmniMark executes all of the actions in the selected rule in order.
Control is returned to the SGML parser so that more input can be processed, and the cycle begins again.

If the component found is one that has content, such as an element or an SGML comment, then the actions in the rule control when the content is processed. The content is processed when a "%c" format item or a SUPPRESS action is encountered.

Should the SGML parser detect any errors in the markup of the SGML document, it will report them. The OmniMark program can customize the manner in which these errors are reported. (See Chapter 15, "Processing SGML Errors".)

The SGML parser will always recover from a markup error and return meaningful information to OmniMark, allowing processing to continue. Because of this, as many errors as possible will be detected in one run of the program.

The following simple example displays the titles of all the chapter elements in a document, prefixed by chapter numbers. All other elements are suppressed:

   DOWN-TRANSLATE

   GLOBAL COUNTER chapter-count INITIAL {0}

   ELEMENT title WHEN PARENT IS chapter
      INCREMENT chapter-count
      PUT #MAIN-OUTPUT "Chapter %d(chapter-count): %c%n"

   ELEMENT #IMPLIED
      SUPPRESS ; Suppress the output of all other elements.

When the content of a component is processed by a rule, that rule is temporarily suspended. Events within the component's content (such as subelements) can cause other rules to execute. When the content has been completely processed, the suspended rule is resumed, and the remainder of the actions are executed. This behaviour gives the execution of an OmniMark program the same hierarchical structure that an SGML document has.
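
The suspend-and-resume behaviour can be pictured as ordinary recursion. The following Python sketch is a model under stated assumptions, not OmniMark code: a document is represented as nested (name, children) tuples, calling content() plays the role of the "%c" format item, and the names process_element, title_rule and default_rule are invented. The model raises an error when no rule can be selected; that choice belongs to the sketch and says nothing about what OmniMark does in that case.

```python
def process_element(node, parent, rules):
    """Select the first applicable rule for an element and run it."""
    name, children = node

    def content():
        # Processing the content suspends the calling rule until all
        # of the element's data and subelements have been handled.
        parts = []
        for child in children:
            if isinstance(child, str):
                parts.append(child)  # character data
            else:
                parts.append(process_element(child, name, rules))
        return "".join(parts)

    for rule_name, condition, action in rules:
        if rule_name != name and rule_name != "#IMPLIED":
            continue
        if condition is not None and not condition(parent):
            continue
        return action(name, content)
    raise LookupError(f"no rule selected for element {name!r}")


log = []


def title_rule(name, content):
    log.append("title: before content")
    text = content()                     # rule is suspended here
    log.append("title: after content")   # rule resumes here
    return f"Title: {text}\n"


def default_rule(name, content):
    log.append(f"{name}: before content")
    text = content()
    log.append(f"{name}: after content")
    return text


# The catch-all rule is listed last so that it acts as a fallback.
rules = [
    ("title", lambda parent: parent == "chapter", title_rule),
    ("#IMPLIED", None, default_rule),
]
doc = ("doc", [("chapter", [("title", ["Getting Started"])])])
result = process_element(doc, None, rules)
print(result, end="")
print(log)
```

The log shows the nesting: the doc rule starts first and finishes last, with the chapter and title rules running entirely inside it.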

Care must be taken with programs written for earlier releases of OmniMark, which didn't support programs without a translation type. For the earlier releases, a program without a translation type was assumed to be a DOWN-TRANSLATE. If the program cannot be modified to add the translation type, then the current releases of OmniMark can still be made to process such programs correctly using the -herald command-line option. See Section 19.1.4.8, "Version 2 Compatibility".

2.2.3 Up-Translation: Translating Documents to SGML

An up-translation is a translation whose output is generally a complete SGML document. OmniMark parses the SGML document as it is generated, and any errors are reported. The same SGML document that is parsed is the "main output" of the program.

OmniMark also provides the ability to send information to the main output or the SGML parser individually. This allows programmers to send the SGML prolog to the SGML parser without sending it to the main output, for example. That way the output consists solely of the document instance. This is very useful for environments where the document instances are stored separately from the DTDs.

OmniMark places no restrictions on the format of the input to an up-translation; most often the input is a data file compatible with a non-SGML text processing system.

An up-translation must begin with:

Syntax

   UP-TRANSLATE

Figure 4 -- Up Translation Block Diagram

Figure 4 shows a simple block diagram for an up-translation.

When writing an up-translation, the OmniMark programmer uses FIND rules to describe the patterns of interest in a document and the actions to take to transform the document into an SGML document.

The operation of an up-translation is:

OmniMark examines each rule in turn looking for a rule which can match text at the current position. It selects the first rule with no condition or whose condition is true.
If a rule is selected, the actions in that rule are performed in order. If the programmer wishes to use any of the matched text in an action, they can capture that text in pattern variables.
If no rules can be selected, OmniMark allows text at the current point to "fall through" to the currently selected output targets, which are typically both the main output and the SGML parser.
The text that "fell through" or was matched is consumed, and the cycle begins again.

As markup is found and submitted to the parser, OmniMark will collect context information; that is, it will collect information about the document hierarchy being formed. This context information can be used in FIND rules to qualify subsequent FIND rules.

In an up-translation, the SGML document created is strictly a result of the patterns which can be found, in context, in the input document. The final SGML document provided at the output is identical to the document provided to the parser.

If there are errors in the generated markup, the parser will report the markup errors and perform as much error correction as possible. The error reports can be customized and even acted upon by the OmniMark programmer, to help the program recover from such markup errors.

The following example of an UP-TRANSLATE program converts RTF (Rich Text Format) into SGML. It:

recognizes style number 23 as the start of a paragraph,
recognizes a variety of alternative encodings of characters (protected by "\" and identified symbolically or as hexadecimal),
and ignores all other RTF commands.

In practice, preamble material (style sheets) will need to be skipped over, and other styles and paragraph commands (such as "\par") will need to be recognized.

   UP-TRANSLATE

   FIND "\s23" LOOKAHEAD ! DIGIT
      OUTPUT "<P>"

   ; The two specific "\" rules come before the catch-all rule below,
   ; because OmniMark selects the first rule that matches.
   FIND "\" ["{}\"] => protected-character
      OUTPUT protected-character ; Some characters are protected by \

   FIND "\'" ANY {2} => hex-code ; Some characters are in hexadecimal
      LOCAL COUNTER character-value
      SET character-value TO hex-code BASE 16
      OUTPUT "&#%d(character-value);"

   FIND "\" LETTER [LETTER | DIGIT | "-"]* | ; RTF command
        "\" ANY |                            ; other RTF code
        ["{}"]                               ; RTF grouping
      ; Output nothing for these.

   FIND "<"
      OUTPUT "<<!>" ; "<" often needs protecting in the SGML

   FIND "&"
      OUTPUT "&<!>" ; Likewise "&"
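
The hexadecimal rule in this example is a plain base-16 conversion followed by formatting as an SGML numeric character reference. The Python sketch below models just that one step and is not OmniMark code. The function name rtf_hex_to_char_ref is invented, and unlike the ANY {2} pattern above, the model accepts only hexadecimal digits after the "\'" prefix.

```python
import re


def rtf_hex_to_char_ref(text):
    """Replace each RTF \\'hh escape with a numeric character reference."""
    def convert(match):
        # Same arithmetic as: SET character-value TO hex-code BASE 16
        value = int(match.group(1), 16)
        return f"&#{value};"
    return re.sub(r"\\'([0-9A-Fa-f]{2})", convert, text)


print(rtf_hex_to_char_ref(r"caf\'e9"))  # prints "caf&#233;"
```

Hexadecimal E9 is decimal 233, so the escape becomes the character reference "&#233;", which the SGML parser can resolve.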

Up-translations work well for relatively simple documents. For complex documents context-translations are almost always preferable.

2.2.4 Context-Translation: Using SGML as an Intermediate Form

A context-translation is the most general of the built-in translation types. A context-translation is a translation that converts data from one form to another, using SGML as an intermediate form. A context-translation can be viewed as an up-translation to produce an intermediate SGML document combined with a simultaneous down-translation of that SGML document.

Patterns in the original document suggest its structure and allow a (possibly partial) conversion to SGML. OmniMark parses the SGML form and, using the SGML parser, corrects structure errors. The final output makes use of the structure discovered by the parser to produce a fully marked-up document, a minimized document, or some other form of data.

A context-translation begins with:

Syntax

   CONTEXT-TRANSLATE

Figure 5 -- Context Translation Block Diagram

Figure 5 shows a simple block diagram of a context-translation. A context-translation combines the best features of an up-translation and a down-translation with the powerful error recovery and context tracking capability of the parser.

Although the following example is simple, it nonetheless illustrates the typical roles of the input and output processors in a context-translation.

Consider an input document like the following:

   Context-Translation


   This is a simple context-translation. It
   takes an ASCII text file and produces SGML.

   The find rules just insert the markup. The
   element rules add white-space to make the
   document look more readable.


   The Input Document


   The input document consists of paragraphs
   and chapter titles.

   Chapter titles are preceded and followed
   by two blank lines to make them stand out.

   Paragraphs are separated from each other
   by a single blank line.

If the file "my.dtd" contains the element declarations:

   <!ELEMENT doc      - o (chapter+)>
   <!ELEMENT chapter  - o (title, p+)>
   <!ELEMENT title    - o (#PCDATA)>
   <!ELEMENT p        - o (#PCDATA)>

The following program will convert the input to an SGML document conforming to those element declarations.

   CONTEXT-TRANSLATE

   FIND-START
      OUTPUT "<!DOCTYPE doc SYSTEM 'my.dtd'>%n"_
             "<DOC><CHAPTER><TITLE>"

   FIND "%n"{2}+ ANY-TEXT+ => title-text "%n"{2}+
      OUTPUT "<CHAPTER><TITLE>%x(title-text)</TITLE><P>"

   FIND "%n"{2}+
      OUTPUT "<P>"

   ELEMENT doc
      OUTPUT "%c"

   ELEMENT chapter
      OUTPUT "<CHAPTER>%n%c"

   ELEMENT #IMPLIED
      OUTPUT "<%q>%sc</%q>%n"

The FIND-START rule ensures that the first line of the document is interpreted as a chapter title. Following that, each single line of text surrounded by blank lines is interpreted as a further chapter title. ("%n"{2}+ matches a sequence of two or more line end characters.) All other blocks of text are interpreted as paragraphs.

On the output processor side, the resulting SGML is "cleaned up":

doc elements are stripped of their tags,
chapter start tags are placed on lines by themselves (end tags omitted), and
all other elements are formatted with:
- both tags,
- a line break following the end tag, and
- automatic white space cleanup (the "s" format modifier on the "%c" format item). (Format modifiers for element content are described in Section 4.1.2, "Processing Content".)

A detailed explanation of how OmniMark manages a context-translation can be found in Chapter 18, "How Asynchronous Concurrent Context Translations Work".

2.3 Process Programs: Server-Based Translation Programs

A programmer can take control of whether an OmniMark program does SGML parsing, where its main input comes from, and where its main output goes to, by omitting the translation type. A program without a translation type is called a "process program", because its "main" processing is specified in PROCESS rules. A process program can explicitly invoke SGML parsing using the "DO SGML-PARSE" action (see Section 17.2, "The "DO SGML-PARSE" Action").

Unlike translation programs, process programs do not perform any automatic processing of files named on the command-line. If files are named on the command-line, the names must be accessed using the #COMMAND-LINE-NAMES shelf (see Section 2.6, "Accessing The Command-Line Arguments").

The main processing in a process program is performed by PROCESS rules:

Syntax

   PROCESS condition?
      local-declaration*
      action*

A process program can contain more than one PROCESS rule. These rules are examined in the order that they occur in the program. When a rule is examined, it is executed if it has no condition or if its condition is true. OmniMark then examines the next PROCESS rule.
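
Unlike FIND and ELEMENT rules, where only the first selectable rule is performed for each event, every selectable PROCESS rule is performed once, in program order. The following Python sketch models that sequence; it is not OmniMark code, and run_process_rules and the sample rules are invented for the illustration.

```python
def run_process_rules(rules):
    """rules: list of (condition, action); condition is None or a callable."""
    for condition, action in rules:
        if condition is None or condition():
            action()
        # Whether or not the rule ran, continue with the next PROCESS rule.


log = []
verbose = False
rules = [
    (None, lambda: log.append("always runs")),
    (lambda: verbose, lambda: log.append("only when verbose")),
    (None, lambda: log.append("runs last")),
]
run_process_rules(rules)
print(log)  # prints ['always runs', 'runs last']
```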

The following is a simple program fragment that individually SGML-parses the files named on the command line, and puts the result of processing each of them into a file with ".out" appended to its name:

   PROCESS
      REPEAT OVER #COMMAND-LINE-NAMES
         DO SGML-PARSE DOCUMENT WITH SCAN FILE #COMMAND-LINE-NAMES
            SET FILE (#COMMAND-LINE-NAMES || ".out") TO "%c"
         DONE
      AGAIN

Of course, the program also requires ELEMENT rules to process the contents of each document. This program is the equivalent of performing a down-translation on each file individually.

It is not an error for a process program to contain no PROCESS rules, or for all of the PROCESS rules to be unselectable because they either have conditions that cannot be satisfied or are in an inactive group. In this case only the PROCESS-START and PROCESS-END rules, if any, are performed. Such programs are frowned upon: the PROCESS rules in a program are intended to contain its "main" processing. (PROCESS-START and PROCESS-END rules are described in Section 2.4, "Program Initialization and Termination".)

2.4 Program Initialization and Termination

One of the main advantages of translation programs is that they will automatically take their input from the files named on the command-line. These files are processed as if they are all components of one big document that has been broken down into files. This can be especially convenient for processing SGML documents where the SGML declaration, DTD, and document instance are in separate files.

Because the processing of these files begins automatically, OmniMark provides initialization and termination rules to allow the programmer to gain control before the first file is processed, and after the last one is processed.

The initialization and termination rules that may be used depend on the domain in which they are used:

The input processor generally uses FIND-START rules for initialization and FIND-END rules for termination. They can be used in cross-translations, up-translations, and context-translations. See Section 3.1, "General Document Processing Rules".
In up-translations and context-translations, an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule can be used to perform initialization and termination instead of FIND-START and FIND-END.
The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule replaces the automatic processing of command-line arguments, so it is used when the programmer wishes to manage the command-line explicitly. Otherwise, if the files named on the command-line are to be processed automatically, FIND-START and FIND-END rules should be used.
The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is described in Section 2.5.3, "Controlling Input to the SGML Parser".
The output processor uses DOCUMENT-START rules for initialization and DOCUMENT-END rules for termination. They can be used in down-translations, up-translations, and context-translations. See Section 4.5, "Initializing and Terminating SGML Processing".
In addition to these, PROCESS-START and PROCESS-END can be used as universal initialization and termination rules. Unlike the other initialization and termination rules, PROCESS-START and PROCESS-END can be used in any kind of program, including process programs.

2.4.1 Universal Program Initialization and Termination

Initialization and termination rules that can be used in any kind of OmniMark program are needed to allow programs to provide generic services that are applicable to any type of program.

PROCESS-START and PROCESS-END rules fulfil this need.

Syntax

   PROCESS-START condition?
      local-declaration*
      action*

Syntax

   PROCESS-END condition?
      local-declaration*
      action*

PROCESS-START and PROCESS-END are provided to allow initialization and termination processing to be placed adjacent to the GLOBAL declarations, function definitions and processing rules with which it is associated. This promotes a declarative style of programming.

Because they are independent of program type, PROCESS-START and PROCESS-END rules can be used in translation programs as well as process programs. This makes them suitable for use in INCLUDE files that, for example, contain:

GLOBAL shelf declarations,
functions which use those shelves,
PROCESS-START rules to initialize those shelves, and
possibly, PROCESS-END rules to perform final processing on those shelves.

Such an INCLUDE file can be used in a DOWN-TRANSLATE (which doesn't allow FIND-START and FIND-END rules), in a CROSS-TRANSLATE (which doesn't allow DOCUMENT-START and DOCUMENT-END rules), and in a process program (which does not allow FIND-START, FIND-END, DOCUMENT-START, or DOCUMENT-END rules).

2.4.2 Properties of PROCESS-START and PROCESS-END Rules

PROCESS-START and PROCESS-END rules have the following properties:

Streams opened in PROCESS-START or PROCESS-END rules without the DOMAIN-FREE open modifier can be written to by:
- in cross-translations and down-translations:
  all rules.
- in up-translations, context-translations, and process programs:
  all output processor rules.
Similarly, PROCESS-START and PROCESS-END rules can write to streams opened with the DOMAIN-FREE open modifier, or to any stream opened:
- in cross-translations and down-translations:
  in any rule.
- in up-translations, context-translations, and process programs:
  in any output processor rule.
Changing groups in a PROCESS-START or PROCESS-END rule:
- in a cross-translation or down-translation:
  can affect any following rule.
- in an up-translation, context-translation, or process program:
  can only affect output processor rules.
Of course, changing groups never affects a subsequent rule if it is done within an action preceded by a "USING GROUP AS" prefix. See Chapter 5, "Organizing Rules With Groups" for more information on groups.
Changing groups can affect a PROCESS-START or PROCESS-END rule:
- in a cross-translation or down-translation:
  if it is done in any rule.
- in an up-translation, context-translation, or process program:
  only if it is done in an output processor rule.

2.4.3 The Order of Initialization and Termination Rules

Many programs can contain more than one type of initialization or termination rule. Context-translations, for instance, can contain FIND-START, DOCUMENT-START, and PROCESS-START rules, all in the same program.

In general, the order in which rules are performed (if the rule is available in that program type) is:

PROCESS-START rules
DOCUMENT-START rules
FIND-START rules
The main processing rules:
- PROCESS rules in a process program,
- FIND rules in a cross-translation,
- FIND rules or "EXTERNAL-TEXT-ENTITY #DOCUMENT" rules in an up-translation or a context-translation,
- ELEMENT rules in a context-translation (see below), and
- ELEMENT rules in a down-translation.
FIND-END rules
DOCUMENT-END rules
PROCESS-END rules

The list above is a general guideline, but there are special cases which fall outside this ordering:

In a context translation, ELEMENT rules are fired when they are needed.
In some situations, the FIND-START rule may supply the SGML declaration (if any), the DTD, and the initial part of a document instance to the #SGML stream. In this case, ELEMENT rules will begin to execute before all of the FIND-START rules have completed.
Similarly, a program may use a FIND-END rule to provide the trailing part of a document instance. In this case, ELEMENT rules will continue to execute even though the FIND-END rules have begun.
FIND rules can also be executed under programmer control at any time by invoking a SUBMIT from a FIND-START, FIND-END, or "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule.
Similarly, ELEMENT rules can be executed under programmer control at any time by performing a "DO SGML-PARSE" from any DOCUMENT-START, PROCESS-START, DOCUMENT-END, or PROCESS-END rule.
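
Leaving the special cases aside, the general ordering can be derived mechanically from the program type. The Python sketch below is a model of the guideline in this section, not OmniMark code: GENERAL_ORDER, ALLOWED and rule_order are invented names, and "main processing" stands for the program type's main rules.

```python
GENERAL_ORDER = [
    "PROCESS-START", "DOCUMENT-START", "FIND-START",
    "main processing",
    "FIND-END", "DOCUMENT-END", "PROCESS-END",
]

# Families of start/end rules available in each program type,
# as stated in Section 2.4.
ALLOWED = {
    "cross-translation": {"PROCESS", "FIND"},
    "down-translation": {"PROCESS", "DOCUMENT"},
    "up-translation": {"PROCESS", "DOCUMENT", "FIND"},
    "context-translation": {"PROCESS", "DOCUMENT", "FIND"},
    "process program": {"PROCESS"},
}


def rule_order(program_type):
    """Return the general rule order for one program type."""
    allowed = ALLOWED[program_type]
    order = []
    for step in GENERAL_ORDER:
        # Keep the main processing step and any start/end rule whose
        # family (the part before the hyphen) is available.
        if step == "main processing" or step.split("-")[0] in allowed:
            order.append(step)
    return order


print(rule_order("down-translation"))
print(rule_order("process program"))
```

For a down-translation the model yields PROCESS-START, DOCUMENT-START, main processing, DOCUMENT-END, PROCESS-END. It does not capture the interleaving of ELEMENT rules with FIND-START and FIND-END rules in a context-translation.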

In a process program, there is no real distinction between the PROCESS-START, PROCESS and PROCESS-END rules, except that they are performed in that order. OmniMark doesn't distinguish what can be done in these rules. However, PROCESS-START and PROCESS-END rules should only be used for whole-program initialization and termination, and PROCESS rules should be used for "main" processing and for the initialization and termination of individual transactions in a multi-transaction process.

2.5 Program Input and Output

OmniMark helps the programmer by dealing with a lot of the details of input and output, including, if the programmer wants, most of the reading and writing of files. However, OmniMark also allows the programmer to take control by giving them direct access to the command-line arguments, and by making the input sources and output streams explicitly available.

The OmniMark program reads data from the outside world through input sources, and writes data to the outside world through streams attached to files and externally-defined output streams.

OmniMark also provides a predefined set of streams and sources, which can be thought of as data's "main roads" into and out of an OmniMark program, and between the input and output processor.

The following subsections describe how the input sources and output streams are accessed and used.

2.5.1 Program Output Streams

The built-in streams that provide the output for the program are:

#PROCESS-OUTPUT
#PROCESS-OUTPUT identifies the default output destination supplied to the OmniMark program by the system. This corresponds to what is usually referred to as "standard output" ("stdout") on UNIX systems.
#PROCESS-OUTPUT can be written to from either domain.
The "DECLARE #PROCESS-OUTPUT" declarations can be used to change some of the characteristics of the #PROCESS-OUTPUT stream.
#PROCESS-OUTPUT was identified by the name #CONSOLE in earlier releases of OmniMark, and #CONSOLE can be used as a synonym for #PROCESS-OUTPUT. However, the use of #CONSOLE is deprecated -- there's no reason to use two names for the same thing.
#MAIN-OUTPUT
#MAIN-OUTPUT identifies the output destination described by the -of or -aof directive on the OmniMark program's command line. When there isn't an -of or -aof on the command line, then #MAIN-OUTPUT identifies the same destination as #PROCESS-OUTPUT (i.e. "standard output").
Even when #MAIN-OUTPUT and #PROCESS-OUTPUT identify the same destination they are considered distinct streams. They can each have their own set of properties. For example, #MAIN-OUTPUT could be written in TEXT-MODE and #PROCESS-OUTPUT in BINARY-MODE (although in most cases it would be strange to do so).
#MAIN-OUTPUT is the default output (#CURRENT-OUTPUT) in an OmniMark program. It is "owned" by (and part of the #CURRENT-OUTPUT set in) the output processor in a context-translation, down-translation or process program and the input processor in a cross-translation or up-translation.
The "DECLARE #MAIN-OUTPUT" declarations can be used to change some of the characteristics of the #MAIN-OUTPUT stream.
#MAIN-OUTPUT was identified by the name OUTPUT in earlier releases of OmniMark. OUTPUT can still be used as a synonym for #MAIN-OUTPUT (except in contexts where the programmer has declared a shelf, argument or function with the name OUTPUT) but its use is deprecated. Using #MAIN-OUTPUT produces easier-to-understand programs.
#SGML
#SGML identifies the input to the SGML parser. It is part of the current output set (#CURRENT-OUTPUT) in the following contexts:
- in FIND-START, FIND and FIND-END rules that process the main input in context-translation and up-translation programs,
- in the body of EXTERNAL-TEXT-ENTITY rules,
- in the body of an SGML-ERROR rule, and
- in the input function of a "DO SGML-PARSE".
Furthermore, it is available in any function called from the above contexts, or any FIND rules performed as a result of a SUBMIT in any of the above contexts.
It cannot be written to in any other contexts.
#SGML was identified by the name SGML in earlier releases of OmniMark. SGML can still be used as a synonym for #SGML (except in contexts where the programmer has declared a shelf, argument or function with the name SGML) but its use is deprecated. Using #SGML produces easier-to-understand programs, because it clearly identifies the #SGML stream to be an OmniMark artifact and not a programmer-declared name.
#SUPPRESS
The #SUPPRESS stream is used to discard data. Output can be directed to #SUPPRESS for actions which may produce output in situations where output is not wanted. If #SUPPRESS is specified with other destination streams in an OUTPUT-TO, "USING OUTPUT AS", or PUT action, then the data is still written to the other streams. The data is only discarded if #SUPPRESS is the only stream being written to.
Output processor rules which process content may use the SUPPRESS action to discard the content. The SUPPRESS action sets the current output stream set to #SUPPRESS so that any rules invoked during the processing of the content will also discard their output.
The #SUPPRESS stream can be written from either domain, but it is the initial current output set for the output processor in an up-translation.
#ERROR
In a similar manner to the #PROCESS-OUTPUT special stream, #ERROR always refers to "standard error" ("stderr") and can be used in the same manner as #PROCESS-OUTPUT. (If no -log command-line argument is given, #ERROR is where OmniMark places all error and informative messages. Defining -log does not change the definition of #ERROR: the latter remains the "standard error" output.) If a system does not distinguish between "standard output" and "standard error", they are defined to be the same destination.
An example of using the #ERROR stream to report errors is:
```
   LOCAL COUNTER list-items
   DO WHEN NUMBER OF list-items != list-count
     LOCAL COUNTER temp
     PUT #ERROR "Found a condition that shouldn't have happened:%n"
     SET temp TO NUMBER OF list-items
     PUT #ERROR "    list-count = %d(list-count), but " _
                "list-items has %d(temp) item"
     PUT #ERROR "s" WHEN temp != 1
     PUT #ERROR ".%n"
     HALT WITH 2        ; signal an error condition while stopping
   DONE
```

The built-in streams can be heralded with STREAM, in contexts where such names are allowed to be heralded. The following two examples are equivalent:

Example A

   DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
      PUT s "%n"

   FIND-START
      put-nl STREAM #MAIN-OUTPUT

Example B

   DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
      PUT s "%n"

   FIND-START
      put-nl #MAIN-OUTPUT

2.5.2 Program Input Sources

The built-in input sources are:

#PROCESS-INPUT
#PROCESS-INPUT identifies the default input source supplied to the OmniMark program by the system. This corresponds to what is usually referred to as "standard input" ("stdin") on UNIX systems.
When the -term command-line option is given, #PROCESS-INPUT is unavailable to the program, and access of #PROCESS-INPUT is an error.
How #PROCESS-INPUT is used is described in Section 2.5.2.1, "Making Use of Built-In Input Sources".
#MAIN-INPUT
In a translation program, #MAIN-INPUT identifies the text that will be automatically processed. Thus, when files are named on the command-line, #MAIN-INPUT supplies the text of each of the files in the order that their names appear on the command line. When there isn't any file named on the command line, then #MAIN-INPUT identifies the same source as #PROCESS-INPUT (i.e. "standard input").
In a process program, #MAIN-INPUT always identifies the same source as #PROCESS-INPUT.
In earlier releases of OmniMark, FIND rule pattern matching was unable to match across file "boundaries" -- a pattern couldn't match part of one file and part of the following. As of OmniMark V3, the files are joined together as if by the JOIN string concatenation operator, and pattern matching can match across any number of input files.
OmniMark gives the programmer control over how the command-line files are read by not actually opening any of them until absolutely required. The OmniMark program opens a command-line file if:
- the program does a "DO SCAN", "REPEAT SCAN", or SUBMIT on #MAIN-INPUT,
- the program does a "DO SCAN", "REPEAT SCAN", or SUBMIT of a file which is also named on the command line, or
- the program is a translation program and it does not halt before the #MAIN-INPUT starts being processed automatically.
Otherwise, a programmer can be sure that OmniMark does not open any of the named files.
How #MAIN-INPUT is used is described in Section 2.5.2.1, "Making Use of Built-In Input Sources".

The appropriate type herald for #PROCESS-INPUT and #MAIN-INPUT is SOURCE, as in:

   SUBMIT SOURCE #MAIN-INPUT

2.5.2.1 Making Use of Built-In Input Sources

#PROCESS-INPUT and #MAIN-INPUT are built-in input sources. They explicitly identify sources of input and can be used as the scanning source in:

a "DO SCAN",
"REPEAT SCAN",
the input function of a "DO SGML-PARSE",
the SCAN source of a "DO SGML-PARSE", or
a SUBMIT.

For example, the following simple rule exchanges square brackets in the input for tag open/tag close characters and vice versa, and provides the result to the SGML parser:

   EXTERNAL-TEXT-ENTITY #DOCUMENT
      REPEAT SCAN #MAIN-INPUT
      MATCH "["
         OUTPUT "<"
      MATCH "]"
         OUTPUT ">"
      MATCH "<"
         OUTPUT "["
      MATCH ">"
         OUTPUT "]"
      MATCH [ANY EXCEPT "[]<>"]+ => other-text
         OUTPUT other-text
      AGAIN

Other uses can be made of #PROCESS-INPUT and #MAIN-INPUT, as required by a programmer.

2.5.2.2 Restrictions on Built-In Input Sources

#PROCESS-INPUT and #MAIN-INPUT are each subject to a variety of constraints:

Each of #PROCESS-INPUT and #MAIN-INPUT can only be used once per run of a program. It is an error to use either of them a second time. So each one can only be processed by either "DO SCAN", "REPEAT SCAN", or SUBMIT once.
Both #PROCESS-INPUT and #MAIN-INPUT are normally read incrementally, in the same way that files and other input sources are normally read. This allows a large amount of data to be processed without overloading the memory resources of a machine. However, both #PROCESS-INPUT and #MAIN-INPUT must be the whole of the scanning source immediately following the keywords SCAN or SUBMIT, or must be directly output to an output stream, to be read "normally".
If #PROCESS-INPUT or #MAIN-INPUT is used in any other way, an attempt is made to read in the whole of their input data into a string buffer in memory, and that string buffer is used in further operations. For example, using #MAIN-INPUT as the first argument of the "||" (JOIN) operator in the following action causes it to be read in its entirety prior to concatenating the period:
```
   OUTPUT #MAIN-INPUT || "."
```
Further difficulty arises where either #PROCESS-INPUT or #MAIN-INPUT does not have an "end". This can happen when the input is piped from a keyboard or other such device, where the input can wait forever for another character. A program that attempts to read in all of #PROCESS-INPUT or #MAIN-INPUT at once will then "hang".
As a consequence of these difficulties, care should be taken that #PROCESS-INPUT and #MAIN-INPUT are normally read in an incremental manner.
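As a minimal sketch of the incremental alternative, the JOIN example above can be split into two actions. This relies only on the rule stated above that a built-in source which is directly output to an output stream is read "normally"; it is an illustration, not the only way to avoid buffering:

```
   ; #MAIN-INPUT is output directly, so it is read incrementally.
   OUTPUT #MAIN-INPUT
   ; The period is written by a separate action instead of being
   ; joined to #MAIN-INPUT, so no in-memory copy of the input is needed.
   OUTPUT "."
```

Because each built-in source can be used only once per run, this fragment replaces the JOIN form rather than accompanying it.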

2.5.3 Controlling Input to the SGML Parser

Syntax

   EXTERNAL-TEXT-ENTITY #DOCUMENT

An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used to take control of where input comes from in a down-translation, in much the same way that the #COMMAND-LINE-NAMES shelf gives input control to process programs and cross-translations. Up-translations and context-translations can use either technique for input control.

An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule provides a framework for explicitly providing an entire "SGML document entity" to the SGML parser.

The following example demonstrates how the names on the OmniMark command line can be interpreted as URLs of HTML documents. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule uses SUBMIT to make sure FIND rules can be used to convert the HTML into appropriately conforming SGML. An externally defined "source" function called get-url is assumed to be available for getting the text of HTML files via the Internet:

   CONTEXT-TRANSLATE

   EXTERNAL-TEXT-ENTITY #DOCUMENT
      REPEAT OVER #COMMAND-LINE-NAMES
         SUBMIT get-url ("http://" || #COMMAND-LINE-NAMES)
      AGAIN
   ...

If a CONTEXT-TRANSLATE or UP-TRANSLATE program contains an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule, then there is no automatic SUBMIT of either the files named on the command line or #PROCESS-INPUT. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used for processing the "SGML document entity". It can examine the #COMMAND-LINE-NAMES built-in stream shelf (Section 2.6, "Accessing The Command-Line Arguments") if it needs to access files named on the command line.
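The same technique can be sketched for a down-translation, where the rule's current output is the #SGML stream. The fragment below is a hedged illustration: the file name "glossary.dtd" is hypothetical and is assumed to hold the document type declaration, and the ELEMENT rules are elided.

```
   DOWN-TRANSLATE

   EXTERNAL-TEXT-ENTITY #DOCUMENT
      ; "glossary.dtd" is a hypothetical file name, assumed to hold
      ; the document type declaration for the files that follow.
      OUTPUT FILE "glossary.dtd"
      ; Pass the text of each file named on the command line to the
      ; SGML parser, in the order the names appear.
      REPEAT OVER #COMMAND-LINE-NAMES
         OUTPUT FILE #COMMAND-LINE-NAMES
      AGAIN
   ...
```

Each file is output directly, so none of them has to be held in memory as a whole.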

The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule differs from other kinds of EXTERNAL-TEXT-ENTITY rules in that:

The name #DOCUMENT cannot be mixed with entity names or #DTD, #CHARSET, #CAPACITY or #SYNTAX in the header of the rule. For example:
```
   EXTERNAL-TEXT-ENTITY #DOCUMENT | #DTD
      ...
```
is not allowed.
If there are any "EXTERNAL-TEXT-ENTITY #DOCUMENT" rules in the program, then one of them must be selected to provide the SGML document. It is an error for none of them to be selectable, and it is an error for more than one to be selectable.
A program that contains an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule must not contain any FIND-START or FIND-END rules (because there is no reason for them ever to be performed).
An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule only applies to the main SGML document, and not to documents whose parsing was initiated by a "DO SGML-PARSE". Thus, the #CURRENT-OUTPUT inherited by an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is always the #SGML stream created by the CONTEXT-TRANSLATE, DOWN-TRANSLATE or UP-TRANSLATE.
The "%q" format item cannot be used in an "EXTERNAL-TEXT-ENTITY #DOCUMENT" because such a text entity does not have a name.

The other forms of the EXTERNAL-TEXT-ENTITY rule are described in Section 16.2.2, "Processing External Text Entities".

2.6 Accessing The Command-Line Arguments

OmniMark defines a global unkeyed read-only built-in stream shelf, #COMMAND-LINE-NAMES, that contains, as its values, the "words" on the command-line that are not recognized as OmniMark command-line options.

The #COMMAND-LINE-NAMES shelf can be used:

to get the name of the file currently being processed.
This can be useful in programmer-generated messages. The current item of the #COMMAND-LINE-NAMES shelf can be used in the messages to identify the file which caused the message to be generated.
to get the names of all of the files that will be or have been processed.
This can be useful for checking the arguments to a program which takes a long time to execute. By immediately checking for any files named on the command-line that do not exist, the program can let the user running it know immediately when they have made a mistake typing in the command-line.
to allow the programmer to take over processing of the files named on the command-line.
This is useful when the programmer wishes to avoid processing the files automatically. The most prominent cases where automatic file reading is bypassed are in process programs, and in programs that use an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule to provide the text of the SGML document entity to be parsed.
to allow the programmer to provide a customized interface to the program, where some of the command-line names represent options that the program can recognize, and some of them may name files to be processed.

#COMMAND-LINE-NAMES is available to all types of OmniMark programs.
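The early check on command-line names described above might look like the following sketch. It assumes that the FILE ... EXISTS test is available in the release being used and that a PROCESS-START rule is an acceptable place for the check; the wording of the message and the exit code 1 are illustrative choices, not part of OmniMark.

```
   PROCESS-START
      ; Examine every name before any input is processed.
      REPEAT OVER #COMMAND-LINE-NAMES
         DO UNLESS FILE #COMMAND-LINE-NAMES EXISTS
            ; Report the missing file and stop with an error code.
            PUT #ERROR "Cannot find the file named " ||
                       #COMMAND-LINE-NAMES || "%n"
            HALT WITH 1
         DONE
      AGAIN
```

Since the program halts before #MAIN-INPUT starts being processed, none of the named files is opened when one of them is missing.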

In the process of populating #COMMAND-LINE-NAMES, the following components of the command line are recognized as command-line options and are not placed in the shelf:

any existing OmniMark command-line option together with its arguments.
any other word that begins with, or consists entirely of, a single dash (-). These are reserved for future expansion, and, thus, are currently treated as errors.

All other words on the command-line are recognized as "names" and not commands. In particular, the following, so long as they are not recognized as the arguments that follow a dash command, are placed on the #COMMAND-LINE-NAMES shelf:

any "word" that starts with anything other than a dash, and
any "word" that starts with two or more dashes.

If there are no words on the command line recognized as names, then the #COMMAND-LINE-NAMES shelf has no items.
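Because a word that starts with two dashes is treated as a name, it can serve as a program-defined option. The following is a sketch only: the switch verbose and the option "--verbose" are hypothetical, and the approach suits programs that do not process the command-line names automatically (a process program, or one with an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule), since otherwise the option word would also be treated as a file name.

```
   GLOBAL SWITCH verbose            ; hypothetical program option

   PROCESS-START
      REPEAT OVER #COMMAND-LINE-NAMES
         DO WHEN #COMMAND-LINE-NAMES = "--verbose"
            ; A name the program chooses to interpret as an option.
            ACTIVATE verbose
         ELSE
            ; Any other name is taken to be a file to process.
            PUT #ERROR "Will process: " || #COMMAND-LINE-NAMES || "%n"
         DONE
      AGAIN
```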

Variables initialized on the command-line are often referred to as "command-line arguments" (such as a stream value set using -d; see the companion manual, Using OmniMark 3 [eum13]). These are different from the #COMMAND-LINE-NAMES shelf: the former are entered on the command line together with an identification of the global shelf they parameterize, whereas the latter is the set of "unidentified" names on the command line.

2.7 More Examples of OmniMark Programs

This section provides a sample set of programs that illustrate some basic uses of OmniMark.

The first example shows how an SGML document might be translated into an HTML document for publishing on the Internet. The second example shows how legacy TeX documents may be converted into SGML.

2.7.1 Translating SGML Documents: An Example

Because OmniMark rules are defined in terms of document structure rather than markup in a down-translation, output is not affected by markup minimization, or ignored record ends in the SGML source document. Unless the program has rules for comments and marked sections, they also do not affect the output.

The predominant rule in a down-translation is the ELEMENT rule. An SGML element is described as the part of the document that spans from the beginning of a start tag to the end of the corresponding end tag for a particular element name. Elements may contain text, entity references, processing instructions, and more elements, thus forming a hierarchy.

It is the OmniMark programmer's task to identify the elements which may occur in an SGML document and set up a rule for each one. The elements which may occur are identified by examining the Document Type Definition. The OmniMark programmer may define more than one rule for any one element. In this case, the programmer must specify the conditions or qualifications under which the rule becomes relevant.

For a simple but practical example, suppose a programmer wishes to present simple glossaries using a Web browser. The most straightforward way of doing this is to convert the glossaries into HTML. A glossary begins with a title that is followed by one or more entries. Entries in turn consist of the term being defined followed by a single, one-paragraph definition. The input is entered in SGML to correspond to the following Document Type Definition:

   <!DOCTYPE glossary [
   <!ELEMENT glossary o o (title, entry+)>
   <!ELEMENT title o o (#PCDATA)>
   <!ELEMENT entry - o (term, def)>
   <!ELEMENT term o - (#PCDATA)>
   <!ELEMENT def o o (#PCDATA)>
   <!ENTITY end-term ENDTAG "term">
   <!SHORTREF term-map "&#RE;" end-term>
   <!USEMAP term-map term>
   ]>

This Document Type Definition permits some markup minimization. Since it is assumed, for instance, that the defined terms are never longer than one input record, a term is ended by a record end ("&#RE;"). Various start- and end-tags may be omitted. Using these conventions, a typical source document might appear as shown below:

   SGML Definitions
   <entry>containing element
   An element within which a subelement occurs.
   <entry>data entity
   An entity that was declared to be data and therefore is
   not parsed when referenced.
   <entry>name
   A name token whose first character is a name start
   character.

An OmniMark program to process this glossary contains a rule for each element type. The actions in the OmniMark rules below indicate how HTML tags are inserted around the contents of each element. In these actions, "%c" represents an element's content (possibly including the content of subelements), and "%n" indicates insertion of a line break in the output.

   DOWN-TRANSLATE

   ELEMENT glossary
     OUTPUT "<HTML>%n<HEAD>%n%c</UL>%n" ||
            "</BODY></HTML>%n"

   ELEMENT title
     LOCAL STREAM title-text
     SET title-text TO "%c"
     OUTPUT "<TITLE>" || title-text || "</TITLE>%n" ||
            "</HEAD><BODY>%n" ||
            "<H1>" || title-text || "</H1>%n" ||
            "<UL>%n"

   ELEMENT entry
     OUTPUT "<LI>%c%n"

   ELEMENT term
     OUTPUT "<STRONG>%c</STRONG>%n"

   ELEMENT def
     OUTPUT "%c"

The first rule specifies that the glossary's content is to be output prefixed by the HTML start tags "<HTML>" and "<HEAD>", and followed by the HTML end tags "</UL>", "</BODY>" and "</HTML>". The title rule specifies the tags surrounding the glossary's title, closes the "<HEAD>", and opens the "<BODY>" and the "<UL>" list. The title is output twice, once for the top of the browser window and once within the text area, so a temporary variable, title-text, is defined and used. The rule for entries simply indicates that the content of each entry is output as a "<LI>" list item. Each entry consists of a term and a definition: the term's text is output surrounded by "<STRONG>" tags, and the definition's text is output unchanged.

The translation of the sample glossary source document shown above is the following HTML source file:

   <HTML>
   <HEAD>
   <TITLE>SGML Definitions</TITLE>
   </HEAD><BODY>
   <H1>SGML Definitions</H1>
   <UL>
   <LI><STRONG>containing element</STRONG>
   An element within which a subelement occurs.
   <LI><STRONG>data entity</STRONG>
   An entity that was declared to be data and therefore is
   not parsed when referenced.
   <LI><STRONG>name</STRONG>
   A name token whose first character is a name start
   character.
   </UL>
   </BODY></HTML>

It is important to observe that identical output is generated if the source document is edited by inserting all omitted tags and placing the existing entry start-tags on separate lines. When translating SGML documents, the writer of an OmniMark program need never be concerned with variations of an SGML source document that, according to the provisions of ISO 8879, do not affect its interpretation.

2.7.2 Translating Documents into SGML: An Example

An up-translation starts with an arbitrary data file and produces an SGML document or document instance. Since the SGML document is parsed as it is generated, the translation can be guided by the structure of the SGML document.

Suppose the glossary described in the previous section was just one of many similar documents originally written in TeX. Rather than convert all of them to SGML by hand, which would be an error-prone task, it makes sense to write a program to do the conversion.

The TeX document from the previous example may look like this:

   \input glossmac
   \title{SGML Definitions}
   \term{containing element}{%
   An element within which a subelement occurs.}
   \term{data entity}{%
   An entity that was declared to be data and therefore is
   not parsed when referenced.}
   \term{name}{%
   A name token whose first character is a name start
   character.}
   \bye

Later, an application arises for the same material in SGML, in an environment that does not support the OMITTAG feature. The following OmniMark program defines the translation:

   UP-TRANSLATE

   ; pass the DTD to the SGML parser
   FIND-START
     OUTPUT FILE "file.dtd"

   ; to start the translation
   FIND "\input glossmac" WHITE-SPACE*
     OUTPUT "<glossary>%n"

   ; look for the start of the title
   FIND "\title{"
     OUTPUT "<title>"

   ; translate }
   FIND "}" "%n"?
     DO WHEN ELEMENT IS title
        OUTPUT "</title>%n"
     ELSE WHEN ELEMENT IS term
        OUTPUT "</term>%n"
     ELSE WHEN ELEMENT IS def
        OUTPUT "</def>%n</entry>%n"
     DONE

   ; look for start of term
   FIND "\term{"
     OUTPUT "<entry>%n<term>"

   ; look for start of definition
   FIND "{%%%n"
     OUTPUT "<def>"

   ; look for end of glossary
   FIND "\bye" ANY
     OUTPUT "</glossary>%n"

The program begins by identifying the translation type. As mentioned earlier, this is an up-translation, whose result is an SGML document corresponding to a given Document Type Definition. The bulk of the translation consists of FIND rules.

As it reads the TeX file, OmniMark looks for strings corresponding to the patterns defined by the FIND rules. When one is found, the actions in the rule are performed. As the output is generated, the SGML parser verifies that it corresponds to the Document Type Definition.

The first rule is a FIND-START rule that passes the DTD to the SGML parser. It assumes that the file named "file.dtd" contains the DTD used to guide the translation.

The first FIND rule is

   FIND "\input glossmac" WHITE-SPACE*
     OUTPUT "<glossary>%n"

This rule tells OmniMark to look for the string \input glossmac followed by any number of spaces, tabs, or end-of-line sequences. When the pattern is found, the action within the rule writes the <glossary> start-tag. The next rule uses a similar technique to search for the start of the title. The third rule is a little more complicated:

   FIND "}" "%n"?
     OUTPUT "</title>%n" WHEN ELEMENT IS title
     OUTPUT "</term>%n" WHEN ELEMENT IS term
     OUTPUT "</def>%n</entry>%n" WHEN ELEMENT IS def

This rule searches for a right brace, possibly followed by an end-of-line sequence. The action taken when this pattern is found depends on the context. The appropriate end-tags are written according to the state of the SGML parser. This ability to qualify an action distinguishes OmniMark from other pattern-matching languages.

The remainder of the program is straightforward. It produces the following document instance:

   <glossary>
   <title>SGML Definitions</title>
   <entry>
   <term>containing element</term>
   <def>An element within which a subelement occurs.</def>
   </entry>
   <entry>
   <term>data entity</term>
   <def>An entity that was declared to be data and therefore is
   not parsed when referenced.</def>
   </entry>
   <entry>
   <term>name</term>
   <def>A name token whose first character is a name start
   character.</def>
   </entry>
   </glossary>
