OmniMark® Programmer's Guide Version 3

2. Types of OmniMark Programs

OmniMark provides a variety of ways to configure programs, and a lot of help in doing so. There are two basic types of programs: batch processing programs and server-based processing programs.

OmniMark makes it easy to write both types of programs. Although all types of OmniMark programs can make good use of most facilities in the language, there are some features which are primarily designed to support one type of programming or the other.

Batch processing programs can take advantage of OmniMark's built-in translation types, which aid the programmer by pre-configuring OmniMark for different types of rule-based conversions.

Server-based processing programs directly issue OmniMark's easy-to-use "DO SGML-PARSE" actions for SGML processing, and SUBMIT actions for general text processing.

2.1 OmniMark Subsystems

Essentially, OmniMark programs run with one or two program-controlled subsystems or threads, called domains. Programs can have one domain, or they can have two domains, joined together by OmniMark's built-in SGML parser.

General text processing is usually done with SUBMIT actions and FIND rules. SUBMIT actions feed input data to OmniMark, and FIND rules use pattern matching to analyse the input data. These features are enhanced by OmniMark's SGML support -- general text processing can be used to help create SGML documents, to further process the results of SGML parsing, or do both. (FIND rules are described in Section 3.1, "General Document Processing Rules". SUBMIT is described in Section 3.2.1, "Submitting Input to FIND Rules".)

Additionally, OmniMark's unique "source" functions and "output" functions mean that input data and output data can be transmitted from and to any location: the local file system, databases, across networks, or across the Internet. Source functions are described in Section 12.3.3, "Externally-Defined Sources". Output functions are described in Section 12.3.4, "External Output Functions".

Multi-domain programs can initiate SGML parsing themselves (as is typical in server-based processing) or use one of OmniMark's built-in translation types. They can even combine a built-in translation type with program-controlled initiation of further SGML parsing.

The OmniMark run-time environment is a system composed of one or more of the following three cooperating subsystems:

  1. the input processor (or find domain)
  2. the SGML parser
  3. the output processor (or element domain)

[omsubsys]
Figure 1 -- OmniMark Subsystems

The input processor provides the input to the SGML parser. In the process it may need to convert non-SGML data into SGML. The output processor converts the result of SGML parsing into some other form (which may even be SGML conforming to a different DTD).

The input processor has traditionally been called the find domain, because FIND rules figure largely in the kind of processing done there. Similarly, the output processor has traditionally been called the element domain, because that's where ELEMENT rules and other SGML processing is done. The terms input processor and output processor better characterize the role of the domains.

For translation programs, the translation type determines which subsystems are involved. For process programs, subsystems are started and suspended dynamically, under the explicit control of the actions in the program.


2.2 Batch Translation Program Types

Writing batch processing programs is made easier using a translation type, which automatically sets up the interaction of the OmniMark subsystems and sets up the input and output processing. By choosing different translation types, the programmer can control whether the input, the output, or an intermediate stage is parsed by the SGML parser, and what is produced as the "main output" of the program.

Batch processing programs are also referred to as translation programs because they are classified according to their translation type.

The translation type is specified by a single keyword at the start of the program. There are four translation types: CROSS-TRANSLATE, DOWN-TRANSLATE, UP-TRANSLATE, and CONTEXT-TRANSLATE.

Where a programmer does not want an OmniMark program to be configured as one of the above types of translations -- for example in a server-based application -- the programmer can specify directly what is to be subjected to FIND rule processing and what are to be the inputs and outputs of SGML parsing.

To avoid automatic configuration of the program, the programmer merely omits the translation type at the start.

The following subsections describe each of OmniMark's four "built-in" translation types in more detail.

2.2.1 Cross-Translation: General Document Translation

A cross-translation is a translation that converts a document from one arbitrary form to another. A cross-translation program does not make any use of the SGML parser. A cross-translation must begin with:

Syntax

   CROSS-TRANSLATE

[omcross]
Figure 2 -- Cross Translation Block Diagram

Figure 2 shows a simple block diagram of a cross-translation.

In a cross-translation, the OmniMark programmer must define their own conversion events. To this end, OmniMark provides a rich language for specifying patterns to match text in the input. When text in the input matches a pattern, the associated rule is executed. OmniMark also provides a very expressive mechanism for saving the text matched by pieces of the pattern, so that the matched text can play a role in the actions that will be executed.

The operation of a cross-translation is:

  1. OmniMark examines each FIND rule in turn looking for a rule which can match text at the current position. It selects the first such rule with no condition or whose condition is true.
  2. If a rule is selected, the actions in that rule are performed in order, and those actions are responsible for outputting the "conversion" of the text matched by the pattern. If the programmer wishes to use any of the matched text in an action, they can capture that text in pattern variables.
  3. If no rules can be selected, OmniMark allows text at the current point to "fall through" to the output -- unrecognized data is output without being converted.
  4. The text that "fell through" or was matched is considered to be "consumed", and the cycle begins again with the text that follows.
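
The cycle in the steps above can be modelled outside OmniMark. The following is a minimal Python sketch, not OmniMark code: the function name cross_translate and the two illustrative rules are assumptions made for this model, and Python regular expressions stand in for OmniMark patterns.

```python
import re

# Each rule pairs a pattern with an action. Order matters: as with FIND
# rules, the first rule that matches at the current position is selected.
RULES = [
    (re.compile(r"[ \t]{2,}"), lambda m: " "),                      # collapse blank runs
    (re.compile(r"\b[jJ][A-Za-z]*"), lambda m: m.group().upper()),  # upper-case "j" words
]

def cross_translate(text, rules=RULES):
    out = []
    pos = 0
    while pos < len(text):
        for pattern, action in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                out.append(action(m))   # the action emits the "conversion"
                pos = m.end()           # the matched text is consumed
                break
        else:
            out.append(text[pos])       # no rule selected: text falls through
            pos += 1
    return "".join(out)

print(cross_translate("jack  and   Jill jog"))
```

Text that no rule matches is copied one character at a time, which mirrors the fall-through in step 3.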

The following is an example of a simple OmniMark CROSS-TRANSLATE that removes all spaces at the starts and ends of lines, collapses runs of spaces between words into a single space character, and upper-cases every word starting with the letter "j". In the process tabs are converted into spaces:

   CROSS-TRANSLATE

   FIND LINE-START BLANK+ | BLANK+ LINE-END
      ; just ignore spaces and tabs at the start or end of lines

   FIND BLANK {2}+ ; two or more "blank" (space or tab) characters
      OUTPUT " "

   FIND WORD-START (UL "J" LETTER*) => word
      OUTPUT "%ux(word)" ; The "u" modifier upper-cases the word.

OmniMark's pattern recognition capability is not limited by line boundaries, nor does it arbitrarily break up the text into fields. Because of this, cross-translation is a technique that is applicable to a wide range of data analysis and conversion tasks.

2.2.2 Down-Translation: Translating SGML Documents

A down-translation is a translation whose input is a complete SGML document or an SGML document instance corresponding to a specified Document Type Definition.

The output of a down-translation can be SGML or some other format. It could be a document suitable for input into a text formatter, for example. Or a down-translation can be used to enter information from the SGML document into a database. A down-translation can even be used to transform an SGML document into another SGML document, for instance by "cleaning up" the input, or by restructuring it.

A down-translation is defined by entering:

Syntax

   DOWN-TRANSLATE

at the start of an OmniMark program.

[omdown]
Figure 3 -- Down-Translation Block Diagram

A down-translation is composed of rules that recognize SGML events, like elements. The basic operation of a down-translation program is:

  1. The SGML parser builds up information as it processes the input document.
  2. When the SGML parser recognizes a new component in the document (an element or a processing instruction, for example), it informs OmniMark.
  3. OmniMark then examines each rule in the program, in order, looking for rules that apply to that event. It selects the first one with no condition or with a condition that is true.
  4. OmniMark executes all of the actions in the selected rule in order.
  5. Control is returned to the SGML parser so that more input can be processed, and the cycle begins again.

If the component found is one that has content, such as an element or an SGML comment, then the actions in the rule control when the content is processed. The content is processed when a "%c" format item or a SUPPRESS action is encountered.

Should the SGML parser detect any errors in the markup of the SGML document, it will report them. The OmniMark program can customize the manner in which these errors are reported. (See Chapter 15, "Processing SGML Errors".)

The SGML parser will always recover from a markup error and return meaningful information to OmniMark, allowing processing to continue. Because of this, as many errors as possible will be detected in one run of the program.

The following simple example displays the titles of all the chapter elements in a document, prefixed by chapter numbers. All other elements are suppressed:

   DOWN-TRANSLATE

   GLOBAL COUNTER chapter-count INITIAL {0}

   ELEMENT title WHEN PARENT IS chapter
      INCREMENT chapter-count
      PUT #MAIN-OUTPUT "Chapter %d(chapter-count): %c%n"

   ELEMENT #IMPLIED
      SUPPRESS ; Suppress the output of all other elements.

When the content of a component is processed by a rule, that rule is temporarily suspended. Events within the component's content (such as subelements) can cause other rules to execute. When the content has been completely processed, the suspended rule is resumed, and the remainder of the actions are executed. This behaviour gives the execution of an OmniMark program the same hierarchical structure that an SGML document has.
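
The suspend-and-resume behaviour described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not OmniMark code: an element is represented as a (name, children) pair, and the function run and its trace list are hypothetical names introduced for this illustration.

```python
def run(element, trace):
    """Process one element; children are strings or nested (name, children) pairs."""
    name, children = element
    trace.append(f"start {name}")           # actions before the content is processed
    parts = []
    for child in children:                  # content processing: this rule waits here
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(run(child, trace)) # a rule for the subelement executes
    trace.append(f"resume {name}")          # the suspended rule resumes
    return f"[{name}:" + "".join(parts) + "]"

trace = []
document = ("chapter", [("title", ["Intro"]), ("p", ["Text"])])
result = run(document, trace)
print(result)
print(trace)
```

The trace shows the rule for chapter starting first and resuming last, with the rules for title and p running in between, which is the same nesting that the document itself has.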

Care must be taken with programs written for earlier releases of OmniMark, which didn't support programs without a translation type. For the earlier releases, a program without a translation type was assumed to be a DOWN-TRANSLATE. If the program cannot be modified to add the translation type, then the current releases of OmniMark can still be made to process such programs correctly using the -herald command-line option. See Section 19.1.4.8, "Version 2 Compatibility".

2.2.3 Up-Translation: Translating Documents to SGML

An up-translation is a translation whose output is generally a complete SGML document. OmniMark parses the SGML document as it is generated, and any errors are reported. The same SGML document that is parsed is the "main output" of the program.

OmniMark also provides the ability to send information to the main output or the SGML parser individually. This allows programmers to send the SGML prolog to the SGML parser without sending it to the main output, for example. That way the output consists solely of the document instance. This is very useful for environments where the document instances are stored separately from the DTDs.

OmniMark places no restrictions on the format of the input to an up-translation; most often the input is a data file compatible with a non-SGML text processing system.

An up-translation must begin with:

Syntax

   UP-TRANSLATE

[omup]
Figure 4 -- Up Translation Block Diagram

Figure 4 shows a simple block diagram for an up-translation.

When writing an up-translation, the OmniMark programmer uses FIND rules to describe the patterns of interest in a document and the actions to take to transform the document into an SGML document.

The operation of an up-translation is:

  1. OmniMark examines each rule in turn looking for a rule which can match text at the current position. It selects the first rule with no condition or whose condition is true.
  2. If a rule is selected, the actions in that rule are performed in order. If the programmer wishes to use any of the matched text in an action, they can capture that text in pattern variables.
  3. If no rules can be selected, OmniMark allows text at the current point to "fall through" to the currently selected output targets, which are typically both the main output and the SGML parser.
  4. The text that "fell through" or was matched is consumed, and the cycle begins again.

As markup is found and submitted to the parser, OmniMark will collect context information; that is, it will collect information about the document hierarchy being formed. This context information can be used to qualify subsequent FIND rules.

In an up-translation, the SGML document created is strictly a result of the patterns which can be found, in context, in the input document. The final SGML document provided at the output is identical to the document provided to the parser.

If there are errors in the generated markup, the parser will report the markup errors and perform as much error correction as possible. The error reports can be customized and even acted upon by the OmniMark programmer, to help the program recover from such markup errors.

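The following example of an UP-TRANSLATE program converts RTF (Rich Text Format) into SGML. It turns the paragraph style "\s23" into a "<P>" start tag, discards other RTF commands and grouping braces, outputs characters protected by "\" without the backslash, converts hexadecimal character codes into numeric character references, and protects "<" and "&" in the SGML.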

In practice, preamble material (style sheets) will need to be skipped over, and other styles and paragraph commands (such as "\par") will need to be recognized.

   UP-TRANSLATE

   FIND "\s23" LOOKAHEAD ! DIGIT
      OUTPUT "<P>"

   FIND "\" ["{}\"] => protected-character
      OUTPUT protected-character ; Some characters are protected by \

   FIND "\'" ANY {2} => hex-code ; Some characters are in hexadecimal
      LOCAL COUNTER character-value
      SET character-value TO hex-code BASE 16
      OUTPUT "&#%d(character-value);"

   ; This catch-all rule follows the two more specific "\" rules above,
   ; because the first rule that matches at a position is the one selected.
   FIND "\" LETTER [LETTER | DIGIT | "-"]* | ; RTF command
        "\" ANY |                            ; other RTF code
        ["{}"]                               ; RTF grouping
      ; Output nothing for these.

   FIND "<"
      OUTPUT "<<!>" ; "<" often needs protecting in the SGML

   FIND "&"
      OUTPUT "&<!>" ; Likewise "&"
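
The hexadecimal rule in the program above performs a base-16 conversion and emits a numeric character reference. As a cross-check of that arithmetic, here is a minimal Python sketch. It is not OmniMark code: the function name hex_to_char_ref is hypothetical, and the pattern is restricted to two hexadecimal digits, which is narrower than the ANY {2} used in the rule.

```python
import re

# A backslash, an apostrophe, then exactly two hexadecimal digits.
HEX_CODE = re.compile(r"\\'([0-9A-Fa-f]{2})")

def hex_to_char_ref(text):
    # int(..., 16) plays the role of "SET character-value TO hex-code BASE 16".
    return HEX_CODE.sub(lambda m: f"&#{int(m.group(1), 16)};", text)

print(hex_to_char_ref(r"caf\'e9"))
```

Hexadecimal e9 is decimal 233, so the RTF code \'e9 becomes the character reference "&#233;".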

Up-translations work well for relatively simple documents. For complex documents context-translations are almost always preferable.

2.2.4 Context-Translation: Using SGML as an Intermediate Form

A context-translation is the most general of the built-in translation types. A context-translation is a translation that converts data from one form to another, using SGML as an intermediate form. A context-translation can be viewed as an up-translation to produce an intermediate SGML document combined with a simultaneous down-translation of that SGML document.

Patterns in the original document suggest its structure and allow a (possibly partial) conversion to SGML. OmniMark parses the SGML form and, using the SGML parser, corrects structure errors. The final output makes use of the structure discovered by the parser to produce a fully marked-up document, a minimized document, or some other form of data.

A context-translation begins with:

Syntax

   CONTEXT-TRANSLATE

[omcontex]
Figure 5 -- Context Translation Block Diagram

Figure 5 shows a simple block diagram of a context-translation. A context-translation combines the best features of an up-translation and a down-translation with the powerful error recovery and context tracking capability of the parser.

Although the following example is simple, it nonetheless illustrates the typical roles of the input and output processors in a context-translation.

Consider an input document like the following:

   Context-Translation


   This is a simple context-translation. It
   takes an ASCII text file and produces SGML.

   The find rules just insert the markup. The
   element rules add white-space to make the
   document look more readable.


   The Input Document


   The input document consists of paragraphs
   and chapter titles.

   Chapter titles are preceded and followed
   by two blank lines to make them stand out.

   Paragraphs are separated from each other
   by a single blank line.

If the file "my.dtd" contains the element declarations:

   <!ELEMENT doc      - o (chapter+)>
   <!ELEMENT chapter  - o (title, p+)>
   <!ELEMENT title    - o (#PCDATA)>
   <!ELEMENT p        - o (#PCDATA)>

The following program will convert the input to an SGML document conforming to those element declarations.

   CONTEXT-TRANSLATE

   FIND-START
      OUTPUT "<!DOCTYPE doc SYSTEM 'my.dtd'>%n"_
             "<DOC><CHAPTER><TITLE>"

   FIND "%n"{2}+ ANY-TEXT+ => title-text "%n"{2}+
      OUTPUT "<CHAPTER><TITLE>%x(title-text)</TITLE><P>"

   FIND "%n"{2}+
      OUTPUT "<P>"

   ELEMENT doc
      OUTPUT "%c"

   ELEMENT chapter
      OUTPUT "<CHAPTER>%n%c"

   ELEMENT #IMPLIED
      OUTPUT "<%q>%sc</%q>%n"

The FIND-START rule ensures that the first line of the document is interpreted as a chapter title. Following that, each single line of text surrounded by blank lines is interpreted as a further chapter title. ("%n"{2}+ matches a sequence of two or more line end characters.) All other blocks of text are interpreted as paragraphs.
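
The input processor side of this program can be approximated in Python to see which blocks of text become titles and which become paragraphs. This is a rough model under stated assumptions, not OmniMark code: mark_up is a hypothetical name, Python regular expressions stand in for the FIND patterns, and the DOCTYPE line that the FIND-START rule also emits is left out for brevity.

```python
import re

TITLE = re.compile(r"\n{2,}([^\n]+)\n{2,}")  # one line between runs of line ends
PARA = re.compile(r"\n{2,}")                 # any other run of two or more line ends

def mark_up(text):
    out = ["<DOC><CHAPTER><TITLE>"]          # FIND-START: first line is a title
    pos = 0
    while pos < len(text):
        m = TITLE.match(text, pos)           # the title rule is tried first
        if m:
            out.append(f"<CHAPTER><TITLE>{m.group(1)}</TITLE><P>")
            pos = m.end()
            continue
        m = PARA.match(text, pos)
        if m:
            out.append("<P>")
            pos = m.end()
            continue
        out.append(text[pos])                # ordinary text falls through
        pos += 1
    return "".join(out)

sample = "One\n\n\nFirst line\nsecond line\n\n\nTwo\n\n\nBody text\nmore."
print(mark_up(sample))
```

In this model, as in the rules above, a one-line paragraph that stands between blank lines would be treated as a chapter title, so the input convention matters.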

On the output processor side, the resulting SGML is "cleaned up":

A detailed explanation of how OmniMark manages a context-translation can be found in Chapter 18, "How Asynchronous Concurrent Context Translations Work".


2.3 Process Programs: Server-Based Translation Programs

A programmer can take control of whether an OmniMark program does SGML parsing, where its main input comes from, and where its main output goes to, by omitting the translation type. A program without a translation type is called a "process program", because its "main" processing is specified in PROCESS rules. A process program can explicitly invoke SGML parsing using the "DO SGML-PARSE" action (see Section 17.2, "The "DO SGML-PARSE" Action").

Unlike translation programs, process programs do not perform any automatic processing of files named on the command-line. If files are named on the command-line, the names must be accessed using the #COMMAND-LINE-NAMES shelf (see Section 2.6, "Accessing The Command-Line Arguments").

The main processing in a process program is performed by PROCESS rules:

Syntax

   PROCESS condition?
      local-declaration*
      action*

A process program can contain more than one PROCESS rule. These rules are examined in the order that they occur in the program. When a rule is examined, if it has no condition, or if it has a condition which is true, the rule is executed. OmniMark then examines the next PROCESS rule.

The following is a simple program fragment that individually SGML-parses the files named on the command line, and puts the result of processing each of them into a file with ".out" appended to its name:

   PROCESS
      REPEAT OVER #COMMAND-LINE-NAMES
         DO SGML-PARSE DOCUMENT WITH SCAN FILE #COMMAND-LINE-NAMES
            SET FILE (#COMMAND-LINE-NAMES || ".out") TO "%c"
         DONE
      AGAIN

Of course, the program also requires ELEMENT rules to process the contents of each document. This program is the equivalent of performing a down-translation on each file individually.
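
The control flow of that fragment, one conversion per named file with the result written next to the input, can be mirrored in Python. This is only a sketch of the loop: process_each is a hypothetical name, and the convert argument is a stand-in for the SGML parsing and ELEMENT rule processing that OmniMark would perform.

```python
from pathlib import Path

def process_each(names, convert):
    """Convert each named file separately and write the result to name + '.out'."""
    written = []
    for name in names:                          # one pass per command-line name
        result = convert(Path(name).read_text())
        target = Path(name + ".out")
        target.write_text(result)               # each file gets its own output
        written.append(str(target))
    return written
```

For example, process_each(["a.sgm", "b.sgm"], str.upper) would write a.sgm.out and b.sgm.out, each produced from its own input only.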

It is not an error for a process program to contain no PROCESS rules, or for all of the PROCESS rules to be unselectable, either because they have conditions that cannot be satisfied or because they are in an inactive group. In this case only the PROCESS-START and PROCESS-END rules, if any, are performed. Such programs are discouraged: it is intended that the PROCESS rules in a program contain its "main" processing. (PROCESS-START and PROCESS-END rules are described in Section 2.4, "Program Initialization and Termination".)


2.4 Program Initialization and Termination

One of the main advantages of translation programs is that they will automatically take their input from the files named on the command-line. These files are processed as if they are all components of one big document that has been broken down into files. This can be especially convenient for processing SGML documents where the SGML declaration, DTD, and document instance are in separate files.

Because the processing of these files begins automatically, OmniMark provides initialization and termination rules to allow the programmer to gain control before the first file is processed, and after the last one is processed.

The initialization and termination rules that may be used depend on the domain in which they are used:

2.4.1 Universal Program Initialization and Termination

Some programs provide generic services that apply to every type of program. Such services need initialization and termination rules that can be used in any kind of OmniMark program.

PROCESS-START and PROCESS-END rules fulfil this need.

Syntax

   PROCESS-START condition?
      local-declaration*
      action*

Syntax

   PROCESS-END condition?
      local-declaration*
      action*

PROCESS-START and PROCESS-END are provided to allow initiation and termination processing to be placed adjacent to the GLOBAL declarations, function definitions and processing rules with which it is associated. This promotes a declarative style of programming.

Because they are independent of program type, PROCESS-START and PROCESS-END rules can be used in translation programs as well as process programs. This makes them suitable for use in INCLUDE files that, for example, contain:

Such an INCLUDE file can be used in a DOWN-TRANSLATE (which doesn't allow FIND-START and FIND-END rules), in a CROSS-TRANSLATE (which doesn't allow DOCUMENT-START and DOCUMENT-END rules), and in a process program (which does not allow FIND-START, FIND-END, DOCUMENT-START, or DOCUMENT-END rules).

2.4.2 Properties of PROCESS-START and PROCESS-END Rules

PROCESS-START and PROCESS-END rules have the following properties:

2.4.3 The Order of Initialization and Termination Rules

Many programs can contain more than one type of initialization or termination rule. Context-translations, for instance, can contain FIND-START, DOCUMENT-START, and PROCESS-START rules, all in the same program.

In general, the order in which rules are performed (if the rule is available in that program type) is:

  1. PROCESS-START rules
  2. DOCUMENT-START rules
  3. FIND-START rules
  4. The main processing rules:
  5. FIND-END rules
  6. DOCUMENT-END rules
  7. PROCESS-END rules

The list above is a general guideline, but there are special cases which fall outside this ordering:

In a process program, there is no real distinction between the PROCESS-START, PROCESS and PROCESS-END rules, except that they are performed in that order. OmniMark doesn't distinguish what can be done in these rules. However, PROCESS-START and PROCESS-END rules should only be used for whole-program initiation and termination functions, and PROCESS rules should be used for "main" processing, and for initiation and termination for individual transactions in a multi-transaction process.
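
The general ordering listed in this section can be expressed as a small Python model. It is a sketch of the guideline only and ignores the special cases: execution_order is a hypothetical name, "MAIN" is a placeholder for the main processing rules, and keeping rules of the same type in program order is an assumption of this model.

```python
PHASE_ORDER = [
    "PROCESS-START", "DOCUMENT-START", "FIND-START",
    "MAIN",                                   # placeholder for main processing rules
    "FIND-END", "DOCUMENT-END", "PROCESS-END",
]

def execution_order(rules):
    """rules: list of (rule_type, label) pairs in the order they appear in the program."""
    rank = {name: i for i, name in enumerate(PHASE_ORDER)}
    # sorted() is stable, so rules of the same type keep their program order.
    return [label for kind, label in sorted(rules, key=lambda r: rank[r[0]])]

program = [
    ("FIND-START", "b"), ("PROCESS-END", "z"), ("PROCESS-START", "a"),
    ("MAIN", "m"), ("DOCUMENT-START", "c"),
]
print(execution_order(program))
```

Whatever order the rules are written in, the model performs "a" (PROCESS-START) first and "z" (PROCESS-END) last.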


2.5 Program Input and Output

OmniMark helps the programmer by dealing with a lot of the details of input and output, including, if the programmer wants, most of the reading and writing of files. However, OmniMark also allows the programmer to take control by giving them direct access to the command-line arguments, and by making the input sources and output streams explicitly available.

The OmniMark program reads data from the outside world through input sources, and writes data to the outside world through streams attached to files and externally-defined output streams.

OmniMark also provides a predefined set of streams and sources, which can be thought of as data's "main roads" into and out of an OmniMark program, and between the input and output processor.

The following subsections describe how the input sources and output streams are accessed and used.

2.5.1 Program Output Streams

The built-in streams that provide the output for the program are:

The built-in streams can be heralded with STREAM, in contexts where such names are allowed to be heralded. The following two examples are equivalent:

Example A

   DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
      PUT s "%n"

   FIND-START
      put-nl STREAM #MAIN-OUTPUT

Example B

   DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
      PUT s "%n"

   FIND-START
      put-nl #MAIN-OUTPUT

2.5.2 Program Input Sources

The built-in input sources are:

The appropriate type herald for #PROCESS-INPUT and #MAIN-INPUT is SOURCE, as in:

   SUBMIT SOURCE #MAIN-INPUT

2.5.2.1 Making Use of Built-In Input Sources

#PROCESS-INPUT and #MAIN-INPUT are built-in input sources. They explicitly identify sources of input and can be used as the scanning source in:

For example, the following simple rule exchanges square brackets in the input for tag open/tag close characters and vice versa, and provides the result to the SGML parser:

   EXTERNAL-TEXT-ENTITY #DOCUMENT
      REPEAT SCAN #MAIN-INPUT
      MATCH "["
         OUTPUT "<"
      MATCH "]"
         OUTPUT ">"
      MATCH "<"
         OUTPUT "["
      MATCH ">"
         OUTPUT "]"
      MATCH [ANY EXCEPT "[]<>"]+ => other-text
         OUTPUT other-text
      AGAIN
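
The effect of the rule above on the text itself can be checked with a few lines of Python. This is a minimal sketch of the character exchange only, not of the scanning mechanism; swap_brackets is a hypothetical name.

```python
# Exchange "[" with "<" and "]" with ">" in both directions;
# every other character passes through unchanged.
SWAP = str.maketrans("[]<>", "<>[]")

def swap_brackets(text):
    return text.translate(SWAP)

print(swap_brackets("[p]a < b[/p]"))
```

Applying the function twice returns the original text, since the exchange is symmetric.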

Other uses can be made of #PROCESS-INPUT and #MAIN-INPUT, as required by a programmer.

2.5.2.2 Restrictions on Built-In Input Sources

#PROCESS-INPUT and #MAIN-INPUT are each subject to a variety of constraints:

2.5.3 Controlling Input to the SGML Parser

Syntax

   EXTERNAL-TEXT-ENTITY #DOCUMENT

An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used to take control of where input comes from in a down-translation, in much the same way that the #COMMAND-LINE-NAMES shelf gives input control to process programs and cross-translations. Up-translations and context-translations can use either technique for input control.

An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule provides a framework for explicitly providing an entire "SGML document entity" to the SGML parser.

The following example demonstrates how the names on the OmniMark command line can be interpreted as URLs of HTML documents. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule uses SUBMIT to make sure FIND rules can be used to convert the HTML into appropriately conforming SGML. An externally defined "source" function called get-url is assumed to be available for getting the text of HTML files via the Internet:

   CONTEXT-TRANSLATE

   EXTERNAL-TEXT-ENTITY #DOCUMENT
      REPEAT OVER #COMMAND-LINE-NAMES
         SUBMIT get-url ("http://" || #COMMAND-LINE-NAMES)
      AGAIN
   ...

If a CONTEXT-TRANSLATE or UP-TRANSLATE contains an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule then there is no automatic SUBMIT of either the files named on the command line, or the #PROCESS-INPUT. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used for processing the "SGML document entity". It can examine the #COMMAND-LINE-NAMES built-in stream shelf (see Section 2.6, "Accessing The Command-Line Arguments") if it needs to access files named on the command-line.

The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is different from other kinds of EXTERNAL-TEXT-ENTITY rules, in that:

The other forms of the EXTERNAL-TEXT-ENTITY rule are described in Section 16.2.2, "Processing External Text Entities".


2.6 Accessing The Command-Line Arguments

OmniMark defines a global unkeyed read-only built-in stream shelf, #COMMAND-LINE-NAMES, that contains, as its values, the "words" on the command-line that are not recognized as OmniMark command-line options.

The #COMMAND-LINE-NAMES shelf can be used to:

#COMMAND-LINE-NAMES is available to all types of OmniMark programs.

In the process of populating #COMMAND-LINE-NAMES, the following components of the command line are recognized as command-line options and are not placed in the shelf:

All other words on the command-line are recognized as "names" and not commands. In particular, the following, so long as they are not recognized as the arguments that follow a dash command, are placed on the #COMMAND-LINE-NAMES shelf:

If there are no words on the command line recognized as names, then the #COMMAND-LINE-NAMES shelf has no items.

Variables initialized on the command-line are often referred to as "command-line arguments" (such as a stream value set using -d; see the companion manual, Using OmniMark 3 [eum13]). These are different from the items on the #COMMAND-LINE-NAMES shelf: a command-line argument is entered with an identification of the global shelf it parameterizes, whereas the shelf holds the set of "unidentified" names on the command line.


2.7 More Examples of OmniMark Programs

This section provides a sample set of programs that illustrate some basic uses of OmniMark.

The first example shows how an SGML document might be translated into an HTML document for publishing on the Internet. The second example shows how legacy TeX documents may be converted into SGML.

2.7.1 Translating SGML Documents: An Example

Because OmniMark rules in a down-translation are defined in terms of document structure rather than markup, output is not affected by markup minimization or ignored record ends in the SGML source document. Comments and marked sections also do not affect the output unless the program has rules for them.

The predominant rule in a down-translation is the ELEMENT rule. An SGML element is described as the part of the document that spans from the beginning of a start tag to the end of the corresponding end tag for a particular element name. Elements may contain text, entity references, processing instructions, and more elements, thus forming a hierarchy.

It is the OmniMark programmer's task to identify the elements which may occur in an SGML document and set up a rule for each one. The elements which may occur are identified by examining the Document Type Definition. The OmniMark programmer may define more than one rule for any one element. In this case, the programmer must specify the conditions or qualifications under which the rule becomes relevant.

For a simple but practical example, suppose a programmer wishes to present simple glossaries using a Web browser. The most straightforward way of doing this is to convert the glossaries into HTML. A glossary begins with a title that is followed by one or more entries. Entries in turn consist of the term being defined followed by a single, one-paragraph definition. The input is entered in SGML to correspond to the following Document Type Definition:

   <!DOCTYPE glossary [
   <!ELEMENT glossary o o (title, entry+)>
   <!ELEMENT title o o (#PCDATA)>
   <!ELEMENT entry - o (term, def)>
   <!ELEMENT term o - (#PCDATA)>
   <!ELEMENT def o o (#PCDATA)>
   <!ENTITY end-term ENDTAG "term">
   <!SHORTREF term-map "&#RE;" end-term>
   <!USEMAP term-map term>
   ]>

This Document Type Definition permits some markup minimization. Since it is assumed, for instance, that the defined terms are never longer than one input record, a term is ended by a record end ("&#RE;"). Various start- and end-tags may be omitted. Using these conventions, a typical source document might appear as shown below:

   SGML Definitions
   <entry>containing element
   An element within which a subelement occurs.
   <entry>data entity
   An entity that was declared to be data and therefore is
   not parsed when referenced.
   <entry>name
   A name token whose first character is a name start
   character.

An OmniMark program to process this glossary contains a rule for each element type. The actions in the OmniMark rules below indicate how HTML tags are inserted around the contents of each element. In these actions, "%c" represents an element's content (possibly including the content of subelements), and "%n" indicates insertion of a line break in the output.

   DOWN-TRANSLATE

   ELEMENT glossary
     OUTPUT "<HTML>%n<HEAD>%n%c</UL>%n" ||
            "</BODY></HTML>%n"

   ELEMENT title
     LOCAL STREAM title-text
     SET title-text TO "%c"
     OUTPUT "<TITLE>" || title-text || "</TITLE>%n" ||
            "</HEAD><BODY>%n" ||
            "<H1>" || title-text || "</H1>%n" ||
            "<UL>%n"

   ELEMENT entry
     OUTPUT "<LI>%c%n"

   ELEMENT term
     OUTPUT "<STRONG>%c</STRONG>%n"

   ELEMENT def
     OUTPUT "%c"

The first rule specifies that the glossary's content is to be output prefixed by the HTML start tags "<HTML>" and "<HEAD>", and followed by the HTML end tags "</UL>", "</BODY>" and "</HTML>". The title rule specifies the tags surrounding the glossary's title, closes the head, and opens the body and the list. The title is output twice: once for the top of the browser window, and once within the text area. For this reason a temporary variable, title-text, is defined and used. The rule for entries simply indicates that the content of each entry is output as a "<LI>" list item. Each entry consists of a term and a definition, whose text is output with "<STRONG>" tagging surrounding the term.

The translation of the sample glossary source document shown above is the following HTML source file:

   <HTML>
   <HEAD>
   <TITLE>SGML Definitions</TITLE>
   </HEAD><BODY>
   <H1>SGML Definitions</H1>
   <UL>
   <LI><STRONG>containing element</STRONG>
   An element within which a subelement occurs.
   <LI><STRONG>data entity</STRONG>
   An entity that was declared to be data and therefore is
   not parsed when referenced.
   <LI><STRONG>name</STRONG>
   A name token whose first character is a name start
   character.
   </UL>
   </BODY></HTML>
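
The way "%c" nests the output of subelements inside the output of their parent can be imitated outside OmniMark. The following Python sketch is an analogy only: it assumes the glossary has already been parsed into a hand-written tree (DOC), so it does none of the SGML parsing, tag inference, or short-reference handling that OmniMark performs. The names DOC, RULES and translate are invented for this sketch.

```python
# The sample glossary as an already-parsed tree, written out by hand.
# Each element is (name, children); a child is a string or another element.
DOC = ("glossary", [
    ("title", ["SGML Definitions"]),
    ("entry", [
        ("term", ["containing element"]),
        ("def", ["An element within which a subelement occurs."]),
    ]),
    ("entry", [
        ("term", ["data entity"]),
        ("def", ["An entity that was declared to be data and therefore is\n"
                 "not parsed when referenced."]),
    ]),
    ("entry", [
        ("term", ["name"]),
        ("def", ["A name token whose first character is a name start\n"
                 "character."]),
    ]),
])

# One template per element type, mirroring the ELEMENT rules above.
# Here the content is an ordinary string, so the title template can simply
# use %c twice instead of saving it in a variable first.
RULES = {
    "glossary": "<HTML>%n<HEAD>%n%c</UL>%n</BODY></HTML>%n",
    "title": "<TITLE>%c</TITLE>%n</HEAD><BODY>%n<H1>%c</H1>%n<UL>%n",
    "entry": "<LI>%c%n",
    "term": "<STRONG>%c</STRONG>%n",
    "def": "%c",
}


def translate(element):
    """Expand an element's template, recursing into subelements for %c."""
    name, children = element
    content = "".join(
        child if isinstance(child, str) else translate(child)
        for child in children
    )
    # Expand %n in the template first so that the content is never rescanned.
    return RULES[name].replace("%n", "\n").replace("%c", content)


if __name__ == "__main__":
    print(translate(DOC), end="")
```

Run as a script, the sketch prints the same HTML as the listing above, without the listing's indentation.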

It is important to observe that identical output is generated if the source document is edited by inserting all omitted tags and placing the existing entry start-tags on separate lines. When translating SGML documents, the writer of an OmniMark program need never be concerned with variations of an SGML source document that, according to the provisions of ISO 8879, do not affect its interpretation.

2.7.2 Translating Documents into SGML: An Example

An up-translation starts with an arbitrary data file and produces an SGML document or document instance. Since the SGML document is parsed as it is generated, the translation can be guided by the structure of the SGML document.

Suppose the glossary described in the previous section was just one of many similar documents originally written in TeX. Rather than convert all of them to SGML by hand, which would be an error-prone task, it makes sense to write a program to do the conversion.

The TeX source for the glossary in the previous example might look like this:

   \input glossmac
   \title{SGML Definitions}
   \term{containing element}{%
   An element within which a subelement occurs.}
   \term{data entity}{%
   An entity that was declared to be data and therefore is
   not parsed when referenced.}
   \term{name}{%
   A name token whose first character is a name start
   character.}
   \bye

Suppose the same material is now needed in SGML, in an environment that does not support the OMITTAG feature, so that every start- and end-tag must be written out. The following OmniMark program defines the translation:

   UP-TRANSLATE

   ; pass the DTD to the SGML parser
   FIND-START
     OUTPUT FILE "file.dtd"

   ; to start the translation
   FIND "\input glossmac" WHITE-SPACE*
     OUTPUT "<glossary>%n"

   ; look for the start of the title
   FIND "\title{"
     OUTPUT "<title>"

   ; translate }
   FIND "}" "%n"?
     DO WHEN ELEMENT IS title
        OUTPUT "</title>%n"
     ELSE WHEN ELEMENT IS term
        OUTPUT "</term>%n"
     ELSE WHEN ELEMENT IS def
        OUTPUT "</def>%n</entry>%n"
     DONE

   ; look for start of term
   FIND "\term{"
     OUTPUT "<entry>%n<term>"

   ; look for start of definition
   FIND "{%%%n"
     OUTPUT "<def>"

   ; look for end of glossary
   FIND "\bye" ANY
     OUTPUT "</glossary>%n"

The program begins by identifying the translation type. As mentioned earlier, this is an up-translation, whose result is an SGML document corresponding to a given Document Type Definition. The bulk of the translation consists of FIND rules.

As it reads the TeX file, OmniMark looks for strings corresponding to the patterns defined by the FIND rules. When one is found, the actions in the rule are performed. As the output is generated, the SGML parser verifies that it corresponds to the Document Type Definition.

The first rule is a FIND-START rule that passes the DTD to the SGML parser. It assumes that the file named "file.dtd" contains the DTD used to guide the translation.

The first FIND rule is

   FIND "\input glossmac" WHITE-SPACE*
     OUTPUT "<glossary>%n"

This rule tells OmniMark to look for the string \input glossmac followed by any number of spaces, tabs, or end-of-line sequences. When the pattern is found, the action within the rule writes the <glossary> start-tag. The next rule uses a similar technique to search for the start of the title. The third rule, shown here in an equivalent form with a condition attached to each action, is a little more complicated:

   FIND "}" "%n"?
     OUTPUT "</title>%n" WHEN ELEMENT IS title
     OUTPUT "</term>%n" WHEN ELEMENT IS term
     OUTPUT "</def>%n</entry>%n" WHEN ELEMENT IS def

This rule searches for a right brace, possibly followed by an end-of-line sequence. The action taken when this pattern is found depends on the context. The appropriate end-tags are written according to the state of the SGML parser. This ability to qualify an action distinguishes OmniMark from other pattern-matching languages.

The remainder of the program is straightforward. It produces the following document instance:

   <glossary>
   <title>SGML Definitions</title>
   <entry>
   <term>containing element</term>
   <def>An element within which a subelement occurs.</def>
   </entry>
   <entry>
   <term>data entity</term>
   <def>An entity that was declared to be data and therefore is
   not parsed when referenced.</def>
   </entry>
   <entry>
   <term>name</term>
   <def>A name token whose first character is a name start
   character.</def>
   </entry>
   </glossary>
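
The idea of qualifying an action by the current element can be imitated in other languages by tracking the open elements by hand. The Python sketch below is an analogy only, not OmniMark code: it uses regular expressions in place of FIND rules and a plain list (open_elements) in place of the SGML parser's state, and it neither emits the DTD nor validates the result. The names TEX, up_translate and open_elements are invented for this sketch, and the pattern used for \bye is a simplification.

```python
import re

TEX = r"""\input glossmac
\title{SGML Definitions}
\term{containing element}{%
An element within which a subelement occurs.}
\term{data entity}{%
An entity that was declared to be data and therefore is
not parsed when referenced.}
\term{name}{%
A name token whose first character is a name start
character.}
\bye
"""


def up_translate(text):
    out = []
    open_elements = []  # hand-kept stand-in for the SGML parser's state

    def emit(markup, push=()):
        """Build an action that writes markup and records opened elements."""
        def action(_match):
            open_elements.extend(push)
            return markup
        return action

    def close_brace(_match):
        """Choose the end-tags according to the innermost open element."""
        current = open_elements[-1] if open_elements else None
        if current in ("title", "term"):
            open_elements.pop()
            return f"</{current}>\n"
        if current == "def":
            del open_elements[-2:]  # closes def and its enclosing entry
            return "</def>\n</entry>\n"
        return ""

    def end_glossary(_match):
        open_elements.clear()
        return "</glossary>\n"

    # Patterns are tried in order at each input position.
    rules = [
        (re.compile(r"\\input glossmac\s*"), emit("<glossary>\n", ["glossary"])),
        (re.compile(r"\\title\{"), emit("<title>", ["title"])),
        (re.compile(r"\}\n?"), close_brace),
        (re.compile(r"\\term\{"), emit("<entry>\n<term>", ["entry", "term"])),
        (re.compile(r"\{%\n"), emit("<def>", ["def"])),
        (re.compile(r"\\bye\s*"), end_glossary),
    ]

    pos = 0
    while pos < len(text):
        for pattern, action in rules:
            match = pattern.match(text, pos)
            if match:
                out.append(action(match))
                pos = match.end()
                break
        else:
            out.append(text[pos])  # in this sketch, unmatched input is copied
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(up_translate(TEX), end="")
```

Applied to the TeX source shown earlier, the sketch yields the same document instance as the listing above, without the listing's indentation. The single right-brace pattern produces three different results depending on which element is open, which is the point of the context-qualified rule.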


Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.