OmniMark® Programmer's Guide Version 3

3. Generalized Document Processing

This chapter describes the central components of OmniMark's generalized document processing facilities:

All of these facilities use OmniMark's powerful pattern recognition language to describe information structures in input documents and to extract information from those structures for further manipulation.

OmniMark's pattern recognition works equally well on textual data and binary data. For this reason, whenever text processing is discussed, the same points usually apply to binary data processing as well.

FIND rules are the primary way to convert any kind of input to SGML in up-translations and context-translations, and they are used to perform more general text conversions in cross-translations.

In these translations, the FIND rules are automatically invoked on the main input. The SUBMIT action allows the programmer to feed other sources of input to these rules as well.

Because down-translations assume correct SGML input, FIND rules are not permitted in a down-translation.

3.1 General Document Processing Rules

Generalized document processing in a batch translation program is performed by three kinds of rules: FIND-START rules, FIND rules, and FIND-END rules.

FIND-START, FIND, and FIND-END rules are all automatically invoked on the main input of a cross-translation, up-translation, or context-translation.

Any input that is not consumed by a FIND rule is automatically written to the current output set. This means that programmers can use rules economically. Only the rules which change the input need be specified.
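
As a minimal sketch of this economy, the following cross-translation contains a single FIND rule. The spelling change it makes is purely illustrative and is not taken from any real application; all other input is expected to pass through to the output unchanged.

   CROSS-TRANSLATE

   ; Only the text that must change needs a rule.
   FIND "colour"
      OUTPUT "color"

Given the input line "The colour red", a program of this shape should write "The color red".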

FIND-START rules can be used for initialization, and FIND-END rules can be used for termination.

3.1.1 Initializing Document Processing

Syntax

   FIND-START condition?
      local-declaration*
      action*

A FIND-START rule in an OmniMark program specifies actions to be performed at the beginning of processing an input document.

The condition in the FIND-START rule can depend on switches set on the command line, streams opened on the command line, and on the actions in previously performed FIND-START rules.

More than one FIND-START rule can appear in a program: those whose conditions succeed are performed in the order in which they appear in the OmniMark program. No condition need be specified for a FIND-START rule if none is relevant.
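
The following sketch shows two FIND-START rules, one conditional and one unconditional. The switch name add-banner and the text that is output are hypothetical, and the example assumes that the switch is tested with a "WHEN ACTIVE" condition.

   CROSS-TRANSLATE

   GLOBAL SWITCH add-banner    ; hypothetical switch, set on the command line

   ; Performed first, and only when the switch is active.
   FIND-START WHEN ACTIVE add-banner
      OUTPUT ";; converted by OmniMark%n"

   ; Performed second, because it appears later in the program.
   FIND-START
      OUTPUT ";; begin%n"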

In context-translations, initialization actions analogous to those performed in a FIND-START rule can also be performed using DOCUMENT-START rules (see Section 4.5.1, "Initializing SGML Processing"). All DOCUMENT-START rules are performed before FIND-START rules. This allows FIND-START rules to generate SGML markup which is processed by ELEMENT rules while still ensuring that DOCUMENT-START rules are processed before any markup.

A FIND-START rule is not allowed in a down-translation or in a process program.

3.1.2 Terminating Document Processing

Syntax

   FIND-END condition?
      local-declaration*
      action*

Actions to be taken at the end of processing an input document are entered in a FIND-END rule.

Several FIND-END rules can be selected at the end of an OmniMark run. If multiple rules apply, the actions of those whose conditions succeed are performed in the order the rules appear in the OmniMark program.

In a context-translation, FIND-END rules are performed before any DOCUMENT-END rules so that the FIND-END rules can output SGML markup that is processed by ELEMENT rules before the DOCUMENT-END rules are processed.

A FIND-END rule is not allowed in a down-translation or in a process program.
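
A typical use of a FIND-END rule is to report something accumulated while the input was processed. The sketch below counts paragraph instructions and reports the total at the end. The counter name par-count is hypothetical, and the example assumes that a global counter starts at zero when no initial value is given.

   CROSS-TRANSLATE

   GLOBAL COUNTER par-count    ; hypothetical counter; assumed to start at zero

   FIND "\par"
      INCREMENT par-count
      OUTPUT "\par"            ; re-emit the matched text unchanged

   FIND-END
      OUTPUT "%n;; %d(par-count) paragraph instructions seen%n"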

3.1.3 Recognizing and Processing Arbitrary Input

Syntax

   FIND pattern condition?
      local-declaration*
      action*

A FIND rule is used to scan the input document for an input string matching a specified pattern. Patterns are described in Section 3.3, "Pattern Recognition".

FIND rules are the main processing rules in cross-translations and up-translations, and are very important in context-translations. They direct input to the SGML parser in up-translations and context-translations.

As OmniMark reads the input to these translations, it looks for a FIND rule that matches the input at that point. If such a rule is found, and if it either has a condition that is satisfied, or no condition, the rule is selected. When there is more than one FIND rule that can be selected, OmniMark will choose the one that appears earliest in the program.

When a FIND rule is selected, the input matched by the pattern is consumed. (OmniMark provides the LOOKAHEAD operator to allow patterns to match input without consuming it.) However, input matched by the pattern can be captured in PATTERN variables for later use. OmniMark then executes the actions associated with the selected FIND rule.

A FIND rule must match at least one character. It is an error for a FIND rule to match zero characters.
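
The sketch below illustrates both rule selection and capture. The entity names in the output are only illustrative. Because the earliest matching rule is chosen, the rule for "<<" is placed before the rule for "<"; in the opposite order, the shorter rule would always be selected first.

   CROSS-TRANSLATE

   ; Listed first so that it is chosen over the shorter rule below
   ; whenever both could match at the same point.
   FIND "<<"
      OUTPUT "&laquo;"

   FIND "<"
      OUTPUT "&lt;"

   ; The matched digits are consumed, but remain available
   ; through the pattern variable.
   FIND DIGIT+ => number
      OUTPUT "[%x(number)]"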

Any input that does not cause a FIND rule to be selected is copied unchanged to the current output set. (See Section 2.5.1, "Program Output Streams" and Chapter 13, "The Current Output Stream Set".)

Examples of FIND rules can be found in the sample up-translation in Section 2.7.2, "Translating Documents into SGML: An Example".

A FIND rule is not allowed in a down-translation.


3.2 Programmer-Directed Input Processing

The automatic invocation of FIND rules on the main input of a translation program is very useful and results in expressive programs that can be quite small. However, a programmer may need more flexibility when:

In such cases, OmniMark provides the programmer with several options. This section describes those options.

3.2.1 Submitting Input to FIND Rules

Syntax

   SUBMIT string-expression

The SUBMIT action supplies the specified string-expression as input to the FIND rules.

SUBMIT is especially useful when:

Because SUBMIT can be performed from within the actions of a FIND rule, more than one FIND rule can be active at a time, including several instances of the same FIND rule.

SUBMIT never invokes FIND-START or FIND-END rules on the submitted input. Those rules are only available for the main input to the program. Any initialization required can be done before the SUBMIT, and any cleanup can be done after it returns. If the same SUBMIT must be done in a number of places in the program, it can be placed in a function.

The consequence of performing a SUBMIT while in a FIND-START rule (or in an EXTERNAL-TEXT-ENTITY rule which was triggered by input written to the #SGML stream inside a FIND-START rule) is that FIND rules may execute before all of the FIND-START rules have been processed. The programmer should ensure that any initialization required by those FIND rules is complete before the SUBMIT is performed.

Similarly, a SUBMIT in a FIND-END rule may cause FIND rules to fire. Any cleanup required by those FIND rules must be done after the SUBMIT returns.

A SUBMIT action is not permitted in a down-translation or in the output processor.
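
As a small sketch of submitting a string, the following program hands the text captured between braces back to the FIND rules. The input notation and the output markup are hypothetical. While the submitted text is being processed, the first rule is still active, and the second rule (or another instance of the first) may be selected within it. The sketch does not handle nested braces.

   CROSS-TRANSLATE

   ; The text between the braces is resubmitted, so the rule
   ; for "*" below also applies inside a group.
   FIND "{" [ANY EXCEPT "}"]+ => inner "}"
      OUTPUT "<group>"
      SUBMIT "%x(inner)"
      OUTPUT "</group>"

   FIND "*"
      OUTPUT "&times;"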

3.2.1.1 Submitting Files

SUBMIT is often used to apply the FIND rules to the contents of files determined by the program. In those cases, it has the form:

Syntax

   SUBMIT FILE string-expression

Here, the string-expression is used as the name of the file.

When SUBMIT is invoked on a file, it causes this chain of events:

  1. OmniMark suspends processing of the current input.
  2. The named file is opened and read.
  3. FIND rules are immediately applied to the content of the named file.
  4. When the entire input has been processed, OmniMark resumes processing of the suspended input.

When SUBMIT is only processing a FILE, it does not need to read the whole file at once. It provides input to the FIND rules as they need it. This allows very large files to be processed with a SUBMIT.

When SUBMIT processes a string-expression, it evaluates the string-expression completely, before supplying it to the FIND rules. The following example shows a SUBMIT that uses a whole file as part of a string-expression. (The "||" operator joins the two operands together. See Section 9.2.2.4, "Dynamic String Concatenation".) This usage will cause the whole file to be read into memory before the string-expression is submitted.

   SUBMIT FILE "in.txt" || "%n"

A simple example of using SUBMIT is the following program that processes an OmniMark program, expanding INCLUDE files in-line.

   CROSS-TRANSLATE

   FIND (";" ANY-TEXT+ "%n") => comment
      OUTPUT comment

   FIND ("'" [ANY-TEXT EXCEPT "'"]* "'") => quoted-string
      OUTPUT quoted-string

   FIND ("%"" [ANY-TEXT EXCEPT "%""]* "%"") => quoted-string
      OUTPUT quoted-string

   FIND UL "include" WHITE-SPACE+
        ["%"'"] => quote
        [ANY-TEXT EXCEPT "%"'"]* => file-name
        quote
      OUTPUT "%n;; INCLUDE %"%x(file-name)%"%n"
      SUBMIT FILE file-name
      OUTPUT "%n;; end of include %"%x(file-name)%"%n"

The first three rules match comments and quoted strings and output them unchanged. Comments and strings are handled specially because they may contain text which looks like an include declaration.

A complete program would also recognize escaped quote characters inside of strings, and file names consisting of quoted strings that are concatenated with an underscore ("_").

3.2.2 Scanning Actions

"DO SCAN" and "REPEAT SCAN" are compound actions that can be used to process input in ways that are very similar to FIND rules.

Each "DO SCAN" or "REPEAT SCAN" can contain a number of MATCH alternatives that behave very similarly to FIND rules. Each MATCH specifies a pattern that may be recognized in the input, and a sequence of actions to perform if that pattern is recognized.

The scanning actions have advantages over SUBMIT and the FIND rules when the transformation:

3.2.2.1 Scanning Input With a Single Pattern

Syntax

   DO SCAN string-expression
       (MATCH pattern condition?
           local-declaration*
           action*)+
       (ELSE
           local-declaration*
           action*)?
   DONE

This scanning action specifies a number of MATCH alternatives, each of which contains a sequence of actions to be performed when a string-expression matches the given pattern. The ELSE phrase is optional. Only one ELSE phrase may be used in a "DO SCAN" action.

The actions in a MATCH alternative are performed if the pattern matches all or part of the string-expression. Each time a MATCH alternative is tried, the pattern-matching begins again at the first character of the string-expression.

If a MATCH alternative specifies a condition, then the condition must also be satisfied before the actions in that alternative will be executed.

If a pattern is preceded by the keyword UNANCHORED, then the match is successful if the pattern appears anywhere in the input being scanned, not just at the beginning. Input skipped before the beginning of the pattern being matched is ignored.

A final sequence of actions can be preceded by the keyword ELSE. Actions in the ELSE part will be performed if none of the other match alternatives are selected.

For example, suppose a color attribute specifies the color for the background of a graphic:

   DO SCAN "%v(color)"
     MATCH UL "black"
       OUTPUT "\background(black)"
     MATCH UL "gray" | UL "grey"
       OUTPUT "\background(gray)"
     MATCH UL "blue" | UL "cyan"
       OUTPUT "\background(cyan)"
     ELSE
       OUTPUT "\background(black)" ; default color
   DONE

If the value of the color attribute matches none of the MATCH alternatives (that is, it does not begin with black, gray, grey, blue, or cyan), then the ELSE phrase will be selected.
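
The UNANCHORED keyword can be sketched in the same way. In the following illustrative program, each line of input is captured and then scanned for the word "todo" anywhere within it. The "!!" marker is hypothetical.

   CROSS-TRANSLATE

   FIND ANY-TEXT+ => line
      DO SCAN "%x(line)"
        MATCH UNANCHORED UL "todo"
          OUTPUT "!! %x(line)"     ; flag lines that mention "todo" anywhere
        ELSE
          OUTPUT "%x(line)"        ; all other lines pass through unchanged
      DONE

Without UNANCHORED, only lines that begin with "todo" would be flagged.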

The "%c" format item is permitted in the actions of "DO SCAN" and "REPEAT SCAN" blocks, but not in their patterns or conditions.

3.2.2.2 Repeatedly Matching Input

Syntax

   REPEAT SCAN string-expression
      (MATCH pattern condition?
         local-declaration*
         action*)+
   AGAIN

The "REPEAT SCAN" action scans a string-expression for specified patterns in the same manner as the "DO SCAN" action. When a match is found, the corresponding actions are performed. However, the string is then scanned again, from the first character to the right of the part matched by the previous iteration.

This process continues until one of the following conditions has been met:

Note that an ELSE phrase may not be used in a "REPEAT SCAN" action. If an "if all else fails" case is required, a "MATCH ANY" alternative can be used with an EXIT at the end of it to terminate the loop.
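
A minimal sketch of this technique follows. The scanned string and the output format are only illustrative. The last alternative catches any character that the earlier alternatives do not recognize, reports it, and leaves the loop.

   PROCESS

   REPEAT SCAN "120 45 x7"
      MATCH DIGIT+ => number
         OUTPUT "[%x(number)]"
      MATCH BLANK+
         OUTPUT " "
      MATCH ANY => bad             ; the "if all else fails" case
         OUTPUT "?%x(bad)"
         EXIT
   AGAIN
   OUTPUT "%n"

Under the rules described above, this program should write "[120] [45] ?x" and stop scanning at the letter "x".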

In a "REPEAT SCAN", patterns which can match zero characters are treated specially. Zero characters cannot be matched successfully twice in a row. OmniMark enforces this by interpreting a match of zero characters to be successful only if the previous iteration matched one or more characters.

Consider the following examples:

Example A

   REPEAT SCAN "some expression"
      MATCH LETTER
         OUTPUT "x"
   AGAIN
   OUTPUT "%n"

Example B

   REPEAT SCAN "some expression"
      MATCH LETTER
         OUTPUT "x"
      MATCH DIGIT?
         OUTPUT "d"
   AGAIN
   OUTPUT "%n"

Example C

   REPEAT SCAN "some expression"
      MATCH DIGIT?
         OUTPUT "d"
      MATCH LETTER
         OUTPUT "x"
   AGAIN
   OUTPUT "%n"

The first example will print "xxxx", the second "xxxxd", and the third will print "dxdxdxdxd".

In the second example:

  1. At the letter "s":
    1. The "MATCH LETTER" succeeds.
    2. "x" is printed.
    3. A new iteration begins.
  2. The same sequence follows for each of the letters "o", "m", and "e":
    1. The "MATCH LETTER" succeeds.
    2. "x" is printed.
    3. A new iteration begins.
  3. At the space:
    1. The "MATCH LETTER" fails.
    2. The "MATCH DIGIT?" is tried. It successfully matches zero digits.
    3. "d" is printed.
    4. A new iteration begins.
  4. Still at the space (because zero characters were matched last time):
    1. The "MATCH LETTER" fails.
    2. The "MATCH DIGIT?" is tried. It can match zero digits again, but because zero characters were matched in the last iteration, zero characters cannot be matched again this iteration. So this MATCH alternative fails too.
    3. Since no MATCH alternative has matched on this iteration, the "REPEAT SCAN" terminates.

The third example further illustrates the fact that zero characters can only be matched if there were one or more characters matched on the previous iteration.

The "REPEAT SCAN" functions this way because it is often a useful thing to match zero characters once. It gives the programmer a MATCH alternative in which things can be changed, but no input is consumed. Repeatedly matching zero characters, however, will lead to an infinite loop. OmniMark prevents infinite loops by exiting the "REPEAT SCAN" on the second match of zero characters.

3.2.2.3 Side Effects in MATCH Alternatives

When trying to determine the impact of function side effects, it can be important to know how OmniMark evaluates the parts of each MATCH alternative.

The rules that determine which parts of each MATCH alternative are evaluated are as follows:

3.2.3 Skipping Input

Syntax

   DO SKIP ((PAST numeric-expression) |
         (OVER pattern) |
         (PAST numeric-expression OVER pattern))
      local-declaration*
      action*
   (ELSE
      local-declaration*
      action*)?
   DONE

When scanning input, sometimes a large block of data can be ignored. The "DO SKIP" action offers an efficient way of skipping that block to the data of interest.

The block of input to be skipped is described by specifying its length as a numeric-expression or by specifying the pattern that terminates it. When both are given, the specified number of characters are skipped first, and then the "DO SKIP" looks for the pattern.

A common application for the "DO SKIP" action is skipping header information. The following rule skips over the four characters following the matched "*HEADER/", and then skips up to and over the next slash:

   FIND "*HEADER/"
      DO SKIP PAST 4 DONE
      DO SKIP OVER "/" DONE

This would be equivalent to

   FIND "*HEADER/"
      DO SKIP PAST 4 OVER "/" DONE

When the OVER form is used and the matched pattern is not to be consumed (i.e. a skip up to but not over is desired), LOOKAHEAD can be used in the pattern:

   DO SKIP PAST 4 OVER LOOKAHEAD "/" DONE

Characters skipped are not revisited if the skip fails, as is normally done if a pattern match fails. This means that if the ELSE actions are evaluated, the OmniMark program can be sure that there is no more input in the file or submitted string being scanned. For example:

   DO SKIP PAST 4
       ; No actions if we skipped
   ELSE
       OUTPUT "Ran off the end of the file!%n"
       HALT WITH 2
   DONE

A "DO SKIP" action with no actions in it simply continues with the following action, whether or not it succeeded.

In a "DO SKIP ... OVER" action, any pattern variables defined in pattern are available for use only within the actions up to the ELSE or DONE keywords (whichever comes first). The relationship between the pattern and the actions is the same as between the pattern and the actions in a MATCH part of a "DO SCAN" action. The following example shows where pattern variable last-word can be referenced and where it cannot be.

   TRANSLATE "*HEADER/"
     DO SKIP OVER (WORD-START LETTER* WORD-END) => last-word "/"
       OUTPUT last-word                ; allowed
     ELSE
       OUTPUT last-word                ; not allowed
     DONE
     OUTPUT last-word                  ; not allowed
   ...

3.2.3.1 Where Skipping Can Be Done

The "DO SKIP" action can be used in two contexts with somewhat different effects:

  1. In a FIND rule, TRANSLATE rule, "DO SCAN" action or "REPEAT SCAN" action, the "DO SKIP" action continues scanning input from the point at which it last stopped. For example, the following FIND rule skips the character following a "/":
       FIND "/"
          DO SKIP PAST 1
          DONE

     In a TRANSLATE rule, only the input in the current "chunk" of data content can be skipped. A "DO SKIP" cannot skip more input than a "TRANSLATE ANY+" would match.
  2. In a FIND-START rule, the "DO SKIP" action scans the main input normally examined by the FIND rules. Such actions can be used to match the first part of a file. For example, the following rule discards the first line of an input file:
       FIND-START
          DO SKIP OVER "%n"
          DONE

     It is meaningless to use a "DO SKIP" action in a FIND-END rule (unless within another scanning action) as there is no more input to be skipped.


3.3 Pattern Recognition

This section describes OmniMark's powerful pattern recognition features. A pattern describes a string of text or binary data in an input source that can be selected for replacement or copying.

Since a pattern may include alternatives and repeated sequences, one pattern can select a variety of strings in the document. To recognize a string in context, a pattern can also specify the data that follows.

All or part of the input matching a pattern can be saved as an OmniMark pattern variable. Pattern variables can be output and used in conditions; pattern variables that consist of possibly-signed strings of digits can be used as numeric values.

Patterns are used in:

Patterns can consist of the following components:

This chapter describes the simplest forms first. More complex patterns are combinations of simpler ones.

3.3.1 Patterns Consisting of Strings

The simplest kind of pattern is a quoted string. For example,

   FIND "\par"

might be used to search for a formatter's paragraph instruction.

The keyword UL can prefix a string to indicate that any letter in the string should be matched in either upper-case or lower-case. The pattern

   UL "the"

for instance, matches "the", "The", and "THE" as well as "tHe", "tHE", "thE", "ThE" and "THe".

Any string expression can also be used as a pattern in the same way. The following example would search for a keyword in the input language followed by the contents of the stream hold. As with quoted strings, preceding it with a UL causes case to be ignored.

   FIND "\index{%g(hold)}"

A more complex example is:

   FIND UL ATTRIBUTE colwid OF ANCESTOR row @ current-col

3.3.2 Character Classes

Character classes are patterns that are matched by single characters selected from a programmer-specified set.

Character classes are usually enclosed by square brackets. In their simplest form, the brackets enclose a string. The pattern matches if the next input character is any of the characters that appear in the string.

For example, the pattern

   ["+-*/"]

matches any of the characters representing the four usual arithmetic operators.

Note the significance of the brackets: While FIND "\par" matches a four-character string, FIND ["\par"] matches one of four characters.

The "%n" format is not valid in a character class when a newline sequence of two or more characters has been specified with the NEWLINE declaration, because a character class matches exactly one character. The formats "%_" and "%t" are always allowed, because they always represent a string of one character.

3.3.2.1 Predefined Character Classes

The following keywords can be used in a pattern to represent a single character:

By default, the newline sequence is a single line feed character (ASCII 10). The NEWLINE declaration can be used to change this, but is deprecated.

For example:

   FIND [UC]

matches every upper-case character in the input file, one at a time.

The names of predefined character classes can be used with or without surrounding square brackets. The above example is equivalent to:

   FIND UC

3.3.2.2 Compound Character Classes

The operator "|" (OR) can be used within a character class to join strings or predefined character classes. The resulting character class matches a character in any of the listed subclasses. For example,

   ["+-*/." | DIGIT]

matches any character likely to appear on a primitive calculator keyboard: a digit, decimal point, or arithmetic operator.

The EXCEPT keyword is used to exclude particular characters from a broader character class. For example,

   [ANY EXCEPT "}"]

matches any character except a right brace. Similarly,

   [UC EXCEPT "QZ"]

matches any letter on a push-button telephone keypad.

Several subclasses can be excepted from a character class. Thus, special characters are described by

   [ANY EXCEPT LETTER | DIGIT | WHITE-SPACE]
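
As a brief sketch of this class in use, the following program wraps each run of special characters in markup. The sym element name is hypothetical.

   CROSS-TRANSLATE

   ; Letters, digits and white space are not matched by any rule,
   ; so they are copied to the output unchanged.
   FIND [ANY EXCEPT LETTER | DIGIT | WHITE-SPACE]+ => specials
      OUTPUT "<sym>%x(specials)</sym>"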

3.3.2.3 Character Set Ranges

Ranges of characters can also be specified inside a character class by specifying the initial character of the range, the TO keyword, and the final character of the range. For example, the following matches any character between the lower-case letters "a" and "z":

   FIND ["a" TO "z"]
      ...

Ranges can be combined with other items in a character set, including other ranges. For example, the following matches any lower-case letter from "a" to "z", or ".", "," or "?", but does not match the lower-case letters from "i" to "n" or the lower-case letter "t":

   FIND ["a" TO "z" | ".,?" EXCEPT "i" TO "n" | "t"]

Care must be taken when using character set ranges because the letters of the alphabet are not always contiguous in a character set. In the EBCDIC character encoding, for example, there are non-alphabetic characters between "A" and "Z".

As an example, on an EBCDIC machine, the range:

   FIND ["a" TO "m"]

would include some non-alphabetic characters. A more portable way to write this would be:

   FIND [LC EXCEPT "n" TO "z"]
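
The same concern applies to other ranges of letters. One way to sidestep it, sketched below for hexadecimal digits, is to list the letters explicitly rather than to write them as ranges. The "0x" prefix and the hex element name are assumptions made for the example.

   CROSS-TRANSLATE

   ; Listing the letters explicitly avoids any dependence on
   ; how the character set orders them.
   FIND "0x" [DIGIT | "abcdefABCDEF"]+ => hex-digits
      OUTPUT "<hex>%x(hex-digits)</hex>"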

3.3.2.4 Character Sets and Case Insensitivity

Preceding a character set with a UL keyword causes the pattern to match the characters in the set, as well as any lower-case and upper-case mappings the stated characters have. For example,

   FIND UL ["abc" EXCEPT "B"] ...

matches the letters "a", "A", "c", and "C", but not "b" or "B". Lower-case "b" is not recognized because the above is equivalent to the pattern

   FIND ["abcABC" EXCEPT "bB"] ...

which is equivalent to

   FIND ["acAC"] ...

The lower-case and upper-case mappings of characters can be changed with the "DECLARE DATA-LETTERS" declaration.

Note that "UL LC" is usually, but not always equivalent to LETTER. They are different when there is a "DECLARE DATA-LETTERS" declaration which remaps an existing lower-case letter onto a new upper-case character. This leaves the old upper-case letter without a corresponding lower-case letter, even though it is still a member of the LETTER character class.

An example of such an exceptional "DECLARE DATA-LETTERS" declaration is the following:

   DECLARE DATA-LETTERS "a" "1"

In this case, the declaration changes the upper-case version of "a" to "1", so that no lower-case letter has "A" as its upper-case form. Needless to say, such declarations are deprecated.

3.3.3 Repetition and Optionality

Repetition and optionality are indicated in OmniMark patterns with the same occurrence indicators used in SGML. In particular, any string or character class pattern described in the previous section, or any compound pattern enclosed in parentheses, can be followed by one of the following characters: "?" (zero or one occurrence), "+" (one or more occurrences), or "*" (zero or more occurrences).

For example,

   LETTER+

represents one or more letters. Since OmniMark always matches the longest possible sequence of characters described by a pattern, this simple pattern can be used to match words.

As a second example, suppose a word processor's instructions consist of words preceded by a backslash. The following pattern recognizes a sequence of such instructions:

   ("\" LETTER+)*

Repetition must be used cautiously. It is fairly easy for a programmer to inadvertently write a pattern that matches more input than is intended. For instance, the pattern in the following FIND rule header matches the remainder of the document:

   FIND ANY+

OmniMark does not consider any following subpattern when processing a repeated subpattern. Thus, a FIND rule beginning with the following rule header can never be selected:

   FIND ANY* "!"

The "ANY*" subpattern matches all unprocessed input and, of course, no exclamation point can occur after the end of a document. The desired effect can be achieved by one of the following alternatives:

Example A

   FIND [ANY EXCEPT "!"]* "!"

Example B

   FIND ([ANY EXCEPT "!"]* "!")+

The first one matches input up to and including the first exclamation point, and the second one matches input up to and including the last exclamation point.

Programmers who are used to line-oriented pattern-matching in other languages must remember that ANY will match the newline characters as well. To confine matching to a single line, use ANY-TEXT instead of ANY.
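
For instance, the following sketch removes everything from a semicolon to the end of its line. It is an illustration only: it makes no attempt to recognize semicolons inside quoted strings. Because ANY-TEXT does not match the newline, the line structure of the input is preserved.

   CROSS-TRANSLATE

   ; The rule has no actions, so the matched text is simply
   ; consumed. The newline itself is not matched and is
   ; copied to the output.
   FIND ";" ANY-TEXT*

Written with ANY in place of ANY-TEXT, the same rule would consume the rest of the input.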

3.3.3.1 Numeric Occurrence Indicators

OmniMark provides more control over the number of times a pattern should be matched with occurrence indicators:

Syntax

   pattern { numeric-expression }

succeeds only if it can match pattern exactly numeric-expression times. Even if there are more occurrences of pattern, no more attempts are made. For example, given text "abc;abc;abc;...", the pattern

   "abc;" {2}

would match only the first two occurrences of "abc;".

Syntax

   pattern { numeric-expression }+

succeeds only if it can match pattern at least numeric-expression times. If it cannot match it at least numeric-expression times, it fails. If it can, it continues to consume its input until there are no more matches, and then succeeds. In this case, given text "abc;abc;abc;def", the pattern

   "abc;" {2}+

would match the first three occurrences of "abc;".

Syntax

   pattern { minimum-numeric-expression
      TO maximum-numeric-expression }

succeeds only if it can match pattern at least minimum-numeric-expression times. If it cannot match it at least minimum-numeric-expression times, it fails. If it can, it continues to consume its input until either there are no more matches, or it has matched maximum-numeric-expression occurrences of pattern, and then succeeds. In this case, given text "abc;abc;abc;abc;abc;def", the pattern

   "abc;" {2 to 4}

would match the first four occurrences of "abc;".

For example, the following gives a template for matching most MS-DOS file names, which require at least one name character, up to eight, followed by an optional extension of up to three name characters, separated from the main name by a period:

   FIND [DIGIT | LETTER | "-_"] {1 TO 8}          ; base part
        ("." [DIGIT | LETTER | "-_"] {1 TO 3})?   ; extension

(Compound patterns such as this one are explained in Section 3.3.9, "Compound Patterns".)

Note that these numeric occurrence indicators are just generalizations of the repetition and optionality indicators:

   P?   is equivalent to   P {0 TO 1}
   P+   is equivalent to   P {1}+
   P*   is equivalent to   P {0}+

where P is some OmniMark pattern.

Note the difference between the following two examples:

Example A

   FIND ANY {2}+

Example B

   FIND (ANY {2})+

The first example matches two or more characters (2, 3, 4, ...). The second example matches an even number of characters (2, 4, 6, ...). Always remember to use parentheses to combine occurrence indicators.
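
A fixed-width field is a natural use for the exact form of the indicator. The sketch below assumes dates written as four digits, a hyphen, two digits, a hyphen and two digits, and rearranges them. The date layout and the pattern variable names are assumptions made for the example. Each counted subpattern is enclosed in parentheses so that the pattern variable clearly captures all of its digits.

   CROSS-TRANSLATE

   ; Rewrite dates of the assumed form YYYY-MM-DD as DD/MM/YYYY.
   FIND (DIGIT {4}) => year "-" (DIGIT {2}) => month "-" (DIGIT {2}) => day
      OUTPUT "%x(day)/%x(month)/%x(year)"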

3.3.4 Positional Patterns

Compound patterns can include positional patterns. Instead of matching data characters, positional patterns match positions within the input. The positional patterns are defined by the following OmniMark keywords:

For example, while the pattern "or" could match characters within the words "order", "keyword", and "translator", the pattern

   WORD-START "or" WORD-END

is only matched by the word "or".
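
Used in a complete rule, the pattern might look like the following sketch, in which the replacement text is only illustrative:

   CROSS-TRANSLATE

   ; Without the positional patterns, this rule would also
   ; alter words such as "order" and "translator".
   FIND WORD-START "or" WORD-END
      OUTPUT "OR"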

The CONTENT-START and CONTENT-END patterns can be used to locate input at the boundary of an SGML element. They cannot be used to describe material that overlaps elements. Thus, no subpattern can precede CONTENT-START or follow CONTENT-END. These positional patterns may only appear in TRANSLATE rules.

Similarly VALUE-START and VALUE-END can only be used in the MATCH part of a "DO SCAN" or "REPEAT SCAN" action. VALUE-START cannot be preceded by a subpattern and VALUE-END cannot be followed by a subpattern.

Once a particular position has been matched, the condition that caused the match no longer holds. For example, the following OmniMark program will add only one "%n" to each line:

   CROSS-TRANSLATE

   FIND LINE-END
     OUTPUT "%n"

3.3.4.1 Multiple Positional Patterns at One Location

Only one positional pattern may be matched at a single location. Once a positional pattern has been matched, no more positional patterns may be matched at that location. For instance:

   PROCESS

   REPEAT SCAN "foo foo"
      MATCH WORD-START  OUTPUT "[ws]"
      MATCH VALUE-START OUTPUT "[vs]"
      MATCH WORD-END    OUTPUT "[we]"
      MATCH VALUE-END   OUTPUT "[ve]"
      MATCH ANY => x    OUTPUT x
   AGAIN
   OUTPUT "%n"

   REPEAT SCAN "foo foo"
      MATCH VALUE-START OUTPUT "[vs]"
      MATCH WORD-START  OUTPUT "[ws]"
      MATCH VALUE-END   OUTPUT "[ve]"
      MATCH WORD-END    OUTPUT "[we]"
      MATCH ANY => x    OUTPUT x
   AGAIN
   OUTPUT "%n"

produces the following output:

   [ws]foo[we] [ws]foo[we]
   [vs]foo[we] [ws]foo[ve]

Once a positional pattern matches, the next thing to match must be a character. There is no precedence to positional patterns. The first one that matches will succeed, even if other positional patterns may also apply.

3.3.5 Looking Ahead

The keyword LOOKAHEAD in a compound pattern precedes data that is to be recognized but not consumed by the pattern-matching process. For example, the pattern

   DIGIT+ LOOKAHEAD BLANK* "+"

matches a string of digits that is followed by optional spaces and tabs and then a plus sign. However, only the digits are selected. The white space characters, if any, and plus sign remain in the document and can be selected by other patterns.

LOOKAHEAD can also be used to verify that selected data is not followed by input matching a given pattern. For example,

   DIGIT+ LOOKAHEAD ! LETTER

selects a string of digits as long as the digits are not immediately followed by letters. (The "!" operator is the symbol for the keyword NOT.) Note that only one letter needs to be found for the LOOKAHEAD test to fail, so there is no need to put a "+" following LETTER in the example above.

Positive and negative look-ahead can be combined in one pattern. For example, in data files for the TeX formatter, instructions (called "control sequences") consist of a backslash followed by letters.

The control sequence to end a paragraph is \par. However, standard control sequences such as \parskip or \parindent as well as programmer-defined macro names can begin with the same string.

Suppose paragraphs consist only of letters, punctuation, and space characters; in other words, suppose that no control sequences occur within a paragraph. The following pattern matches paragraph text terminated by the \par control sequence; it fails to match input terminated by another control sequence beginning with the characters \par:

   [LETTER | ".,!?" | BLANK]+ LOOKAHEAD "\par" ! LETTER

Recall that any pattern can be enclosed in parentheses and used as a subpattern. Look-ahead patterns can be used in this way. For example,

   ((LOOKAHEAD ! "xyz") ANY)+

matches any input string that does not contain the sequence "xyz" as a substring. Note that both sets of parentheses are necessary. Without the inner set, ANY becomes part of the look-ahead pattern. Without the outer set, the look-ahead is not repeated as successive characters are selected.

The above example works in the following manner, beginning at the current point in the file, data content or value being scanned:

  1. If the next three characters are "xyz", then the look-ahead pattern fails, and the pattern terminates.
  2. If the next three characters do not match "xyz", or if there are fewer than three characters left, then the look-ahead pattern succeeds. The current position is not advanced.
  3. If there are no more characters, then the pattern ANY will fail, and the whole pattern terminates.
  4. Otherwise, the pattern ANY matches the next character.
  5. The current point in the input is advanced a single character.
  6. The "+" indicator causes the above steps to be repeated.

If the pattern ANY has matched at least one character, then the pattern succeeds. Otherwise it fails.
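The steps above can be sketched in Python as a small hand-written loop, shown next to the closest regular-expression equivalent. This is only an illustrative model: the function names are hypothetical, and Python's re module stands in for OmniMark's pattern matcher.

```python
import re

def match_without(text, stop="xyz", start=0):
    """Model of ((LOOKAHEAD ! stop) ANY)+ following the numbered steps."""
    pos = start
    while True:
        if text.startswith(stop, pos):   # step 1: look-ahead fails, stop
            break
        # step 2: look-ahead succeeds; the position is not advanced
        if pos >= len(text):             # step 3: ANY fails at end of input
            break
        pos += 1                         # steps 4-5: ANY consumes one character
        # step 6: the "+" repeats the loop
    # The pattern succeeds only if ANY matched at least once.
    return text[start:pos] if pos > start else None

def match_without_re(text, stop="xyz"):
    """The same idea written as a Python regular expression."""
    m = re.match(r"(?:(?!" + re.escape(stop) + r").)+", text, re.DOTALL)
    return m.group() if m else None
```

Both functions return the selected text, or None when not even one character can be matched.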

3.3.6 Parentheses in Patterns

The pattern:

   LETTER LETTER | DIGIT+

looks for either a pair of letters or a sequence of digits. In other words, it is as if it were entered as:

   (LETTER LETTER) | (DIGIT+)

As is illustrated by this example, "|" has a lower precedence than the "sequence" operator.

The precedence of an operation can be changed by using parentheses. If the intent of the above example were to look for a letter, followed by one or more letters or digits, it would be entered as:

   LETTER (LETTER | DIGIT)+
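Regular expressions in Python follow the same convention (alternation binds more loosely than sequence), so the difference between the two groupings can be checked there. The sketch below is an analogy under that assumption, with LETTER and DIGIT approximated by ASCII character classes; the variable names are hypothetical.

```python
import re

# Analogue of:  LETTER LETTER | DIGIT+
# "|" binds more loosely than sequence, so this reads (LETTER LETTER) | (DIGIT+).
pair_or_digits = re.compile(r"[A-Za-z][A-Za-z]|[0-9]+")

# Analogue of:  LETTER (LETTER | DIGIT)+
# The parentheses limit the alternation to the repeated part.
letter_then_alnum = re.compile(r"[A-Za-z](?:[A-Za-z]|[0-9])+")
```

With these definitions, "ab" and "2024" are accepted in full by the first pattern but "a1b2" is not, whereas the second pattern accepts "a1b2" and rejects "2024".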

The following list gives the precedence of all pattern operations when there are no parentheses, starting from the highest precedence. Parentheses must be used to change the precedence of an operation.

  1. UL
  2. optionality, repetition, and general occurrence indicators ("?", "*", "+", and "{" ..."}")
  3. entity-type prefixes (CDATA, TEXT, etc.) (See Section 4.2.2.3, "Patterns for CDATA and SDATA Entities".)
  4. pattern variable assignments (using "=>")
  5. sequences (pattern followed by pattern, etc.)
  6. LOOKAHEAD and "LOOKAHEAD !" ("LOOKAHEAD NOT")
  7. "|" (OR)

The following pattern illustrates all of these precedences:

   TRANSLATE "%n" DIGIT => key-digit
             LOOKAHEAD TEXT LETTER+ SPACE ! UL "KEY" "WORD"? |
             SDATA "[sect]" LOOKAHEAD "." ! " " |
             DIGIT+ => number

It is interpreted as if it were entered as:

   TRANSLATE (("%n" (DIGIT => key-digit))
              LOOKAHEAD ((TEXT (LETTER+)) SPACE)
                 ! ((UL "KEY") ("WORD"?))) |
             ((SDATA "[sect]") LOOKAHEAD "." ! " ") |
             ((DIGIT+) => number)

The precedence of pattern operations requires parentheses in some situations:

The common factor in these two situations is that there is a lower-precedence operation embedded within a higher-precedence operation.

3.3.7 Avoiding Patterns that Loop

OmniMark does not accept patterns that could be matched repeatedly at the same point in a document. For example, if the rule header

   FIND ""

were permitted, this rule would always be selected, and the program would run forever. A similar difficulty arises with

   FIND LETTER*

which matches zero characters if the next character is not a letter. To avoid this situation, OmniMark generally requires selected patterns to match at least one character or a positional pattern other than CONTENT-START or CONTENT-END.

There are a number of exceptions to this general principle:

  1. The restriction does not apply to the MATCH parts of "DO SCAN" and "REPEAT SCAN" actions.
  2. The restriction does not apply to PROCESSING-INSTRUCTION rules. That is because the rule processes the whole processing instruction, and not just the characters inside that are matched by the pattern.
  3. A pattern can check for references to data entities whose replacement text contains no characters. For example, the following is permitted:

       TRANSLATE SDATA ""

  4. A pattern can match a position other than CONTENT-START or CONTENT-END. This allows:

       FIND WORD-START

     which is defined to be matched only once.

  5. A pattern can check for elements that have no content by using both CONTENT-START and CONTENT-END in the same pattern (neither is allowed alone):

       TRANSLATE CONTENT-START CONTENT-END

3.3.8 Capturing Input Matched By Patterns

Input matched by patterns can be saved in pattern variables. Unlike other programmer-defined variables in OmniMark, pattern variables do not require a separate declaration.

3.3.8.1 Assigning Pattern Variables

Syntax

   pattern => pattern-variable-name

A pattern or subpattern can be followed by an equal sign, greater-than sequence ("=>") and an OmniMark name indicating a pattern variable. The matching input is then saved under the given name.

The example:

   FIND LETTER+ => word WHITE-SPACE*

matches a word followed by any number of white space characters. The letters in the word are saved under the pattern variable word. Actions in a rule which includes this pattern can refer to the matched word with the name word.

When more than one subpattern ends at the point where "=>" appears, only the immediately preceding subpattern is saved under the specified name. For example, in the pattern

   FIND LETTER+ WHITE-SPACE* => save

the white space characters following the letters are saved under the name save. Parentheses can be used to save a longer part of a pattern. If the previous example is modified to

   FIND (LETTER+ WHITE-SPACE*) => save

all the matched input (letters followed by white space) is saved.
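The effect of the parentheses resembles the placement of a capture group in a regular expression. The following Python sketch is an analogy only; the sample string is hypothetical, and LETTER and WHITE-SPACE are approximated by "[A-Za-z]" and "\s".

```python
import re

sample = "word   next"   # hypothetical input

# Analogue of:  LETTER+ WHITE-SPACE* => save
# Only the immediately preceding subpattern (the white space) is captured.
trailing_only = re.match(r"[A-Za-z]+(\s*)", sample).group(1)

# Analogue of:  (LETTER+ WHITE-SPACE*) => save
# The parentheses extend the capture to the letters as well.
whole = re.match(r"([A-Za-z]+\s*)", sample).group(1)
```

In the first case only the three spaces are saved; in the second, the word and the spaces after it are saved together.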

When a pattern is being matched, input can be saved in a given pattern variable at most once. The following pattern is allowed because the optional subpattern is matched at most once, so input is either saved in the pattern variable save exactly once or not saved at all:

   FIND ((LETTER+ WHITE-SPACE*) => save) ?

The following pattern is not allowed because input could be saved in the pattern variable save more than once:

   FIND ((LETTER+ WHITE-SPACE*) => save) *

Versions of OmniMark prior to V3 used the "=" operator for pattern assignments. For versions of OmniMark from V3 forward, the "=" is used for equality comparisons, and is therefore deprecated in pattern assignments. Modern programs should always use the "=>" operator.

3.3.8.1.1 Pattern Variable Assignment Within Repeated Or Optional Patterns

A pattern variable assignment may occur inside a pattern with an occurrence count only if there is no possibility that input will be saved in a pattern variable more than once. An assignment may not take place inside any pattern that can iterate more than once, including patterns followed by a "*" or "+" occurrence indicator.

The following examples are invalid because the assignment can occur more than once.

Example A

   CROSS-TRANSLATE
   GLOBAL COUNTER x
   ...

   FIND ((LETTER+ WHITE-SPACE+) => words) {2 TO x}  ...

Example B

   CROSS-TRANSLATE
   GLOBAL COUNTER x
   ...

   FIND ((LETTER+ WHITE-SPACE+) => words) {x}+  ...

Although the following patterns are syntactically correct, OmniMark checks to make sure that the occurrence indicators are either 0 or 1, preventing a repeated assignment:

   GLOBAL COUNTER x
   ...
   FIND ((LETTER+ WHITE-SPACE+) => words) {x}  ...

   FIND ((LETTER+ WHITE-SPACE+) => words) {0 TO x}  ...

   FIND ((LETTER+ WHITE-SPACE+) => words) {x TO 1}  ...

   FIND ((LETTER+ WHITE-SPACE+) => words) {x TO y}  ...

3.3.8.2 Referencing Pattern Variables

Pattern variables can be referenced with the syntax:

Syntax

   PATTERN? pattern-variable-name

The keyword PATTERN is an optional type herald.

Pattern variable references can be used as string expressions in the actions that immediately follow their assignment. Pattern variables can be considered to be local variables in the rule or the MATCH alternative in which they are defined.

In other words, the following is not legal:

   CROSS-TRANSLATE

   FIND LETTER+ => word

   FIND DIGIT+ => value
      OUTPUT "Found %x(word) & %x(value)%n"

The pattern variable word can only be referenced in the first rule.

Patterns in MATCH parts of actions cannot use names already used in enclosing rules or actions. For example, the following is an error. Because the pattern variable command is specified in the FIND rule header, it cannot be respecified in the MATCH alternative:

   FIND "\" [LETTER | DIGIT]+ => command
     DO SCAN "%x(command)"
       MATCH LETTER+ => command
         OUTPUT ""
     DONE

When the value of a pattern variable is used in a pattern, the keyword ANOTHER can also be used instead of PATTERN as an optional herald. The following are equivalent, and match the input "redundantredundant":

Example A

   FIND "redundant" => p p

Example B

   FIND "redundant" => p PATTERN p

Example C

   FIND "redundant" => p ANOTHER p

Example D

   FIND "redundant" => p "%x(p)"
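Readers familiar with regular expressions may recognize this as a backreference. As a loose analogy, and not OmniMark syntax, the same match can be written with a named group in Python's re module; the variable name doubled is hypothetical.

```python
import re

# Analogue of:  "redundant" => p ANOTHER p
# (?P<p>...) saves the matched text; (?P=p) requires the same text again.
doubled = re.compile(r"(?P<p>redundant)(?P=p)")
```

This expression accepts "redundantredundant" in full but rejects a single "redundant".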

The maximum size of a matched pattern is 32,767 characters. The maximum number of pattern variables visible in a single local scope is 32.

3.3.8.3 Formatting a Pattern Variable

Syntax

   % format-modifier* x( pattern-variable-name )

The "%x" format item represents the contents of the pattern variable named pattern-variable-name (see Section 3.3, "Pattern Recognition"). The following format modifiers are allowed:

3.3.8.4 Testing Whether a Pattern Variable Has Been Initialized

Syntax

   PATTERN? pattern-variable-name (IS|ISNT)  SPECIFIED

Example

   CROSS-TRANSLATE

   FIND ("-" => sign)? DIGIT+ => value
      DO WHEN sign IS SPECIFIED
         OUTPUT "The value is negative.%n"
      DONE

The "IS SPECIFIED" operator tests whether a pattern has been saved in the named pattern-variable. The keyword PATTERN is an optional herald.

This test can also be used to investigate whether a particular part of a pattern was recognized. If a pattern-variable is defined inside of an optional pattern or in a pattern which is part of an alternative in a compound pattern, it is possible that the part of the pattern containing the assignment was never matched (or possibly even tried), and in that case, the pattern-variable will not be defined.

Note that if the pattern variable assignment is not inside the conditional pattern, then an assignment always takes place, even though it may consist of zero characters. If the FIND rule header had been written with the assignment outside of the subpattern affected by the optionality occurrence indicator:

   FIND "-"? => sign DIGIT+ => value

the pattern assignment will always take place. The pattern variable sign will contain either zero or one negative sign, depending on whether the negative sign was present or not.

In the original example, sign would either contain one negative sign or be unspecified.
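Capture groups in Python's re module show a comparable distinction between "not set" and "set to an empty string", which may help in picturing the two cases. This is an analogy under that assumption only; the function names are hypothetical and None merely stands in for an unspecified pattern variable.

```python
import re

def sign_assigned_inside(text):
    """Analogue of ("-" => sign)? DIGIT+ : the capture sits inside the optional part."""
    return re.match(r"(?P<sign>-)?(?P<value>[0-9]+)", text).group("sign")

def sign_assigned_outside(text):
    """Analogue of "-"? => sign DIGIT+ : the capture wraps the optional part."""
    return re.match(r"(?P<sign>-?)(?P<value>[0-9]+)", text).group("sign")
```

For the input "42", the first function returns None (nothing was captured), while the second returns an empty string (a capture of zero characters took place). For "-42", both return "-".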

3.3.9 Compound Patterns

As some of the above examples suggest, compound patterns can simply consist of adjacent subpatterns. Simple equations, for instance, are matched by the following:

   DIGIT+ SPACE* ["+-*/"] SPACE* DIGIT+ SPACE* "=" SPACE* "-"? DIGIT+

The operator OR or "|" separates alternatives. For example, the pattern

   "time" | "date"

matches either of the two words.

Alternatives consist of the longest possible sequence of subpatterns. For example, the pattern

   LETTER+ | DIGIT+ WHITE-SPACE+

either matches a word, or it matches a number (i.e. a sequence of digits) followed by white space. The scope of an alternative can be modified with parentheses. The following variation:

   (LETTER+ | DIGIT+) WHITE-SPACE+

matches a word or a number and following white space.

3.3.9.1 Factoring Out UL

If UL is used in every component of a compound pattern, it can be "factored out". The following two examples are equivalent:

Example A

   FIND UL ("chapter" | "section" | "part")
      . . .

Example B

   FIND UL "chapter" | UL "section" | UL "part"
      . . .

If the compound pattern to which the UL is applied contains a condition, any patterns in that condition are unaffected by the UL. For example, in the following, the UL applies to the "abc", but not to the "def". The "def" is part of a different pattern -- one that is part of the condition.

   GLOBAL STREAM x
   ...
   FIND UL ("abc" WHEN x MATCHES "def")

A UL can precede:

When UL precedes a pattern component in parentheses, it applies to each part of the pattern component.

UL has no effect when prefixed to:

3.3.10 Conditions Inside Patterns

OmniMark allows conditions to occur inside patterns. Every pattern inside parentheses may be followed by a condition. A condition doesn't always have to be preceded by a pattern, and may appear in parentheses by itself. The general form is:

Syntax

   (( pattern? condition? ))+

where pattern is a sequence of OmniMark patterns, and condition is a test expression preceded by either WHEN or UNLESS. A pattern or a condition is required between every pair of parentheses.

For example, the following matches a letter X when the switch include-xes is active and matches the letter Y whether or not the switch is active:

   FIND ("X" WHEN  include-xes) | "Y"
      ...

A condition can also appear by itself, surrounded by parentheses, as all or part of a pattern. The above example could have been written as the following equivalent:

   FIND "X" (WHEN  include-xes) | "Y"
      ...

When conditions appear inside parentheses after a pattern, first the pattern is tested, and, if it succeeds, the condition is attempted. If the condition is false, any pattern variables that were specified in the preceding pattern are unspecified.

When a condition follows a pattern without surrounding parentheses, such as at the head of a rule or in an action, the condition is evaluated before the pattern. In this case the condition is used to determine whether or not the rule or action containing the pattern (and therefore the pattern itself) is to be evaluated. Because conditions in these circumstances are evaluated first, they may not contain references to any pattern variables specified in the pattern.

The following example would be invalid:

   FIND DIGIT (LETTER+ => word)? WHITE-SPACE
                   WHEN ELEMENT IS term & word IS SPECIFIED

The problem here is that the test to see if pattern word was specified would occur before the pattern was matched. This condition should be rewritten to be:

   FIND (DIGIT (LETTER+ => word)? WHITE-SPACE WHEN word
         IS SPECIFIED)
                   WHEN ELEMENT IS term

First the "ELEMENT IS" test is made. Then if it is true, the pattern is evaluated, and then the "IS SPECIFIED" test is made.

The following example shows where a condition without a pattern would be used. It uses a FIND rule to find a number in parentheses followed by:

Example

   GLOBAL SWITCH in-codes
   ...
   FIND "(" DIGIT+ => number ")"
        ((WHEN number = 0) |
         "(" ANY {number} => text ")") WHEN in-codes
      ...

In the example, if the condition succeeds (i.e. number is zero), then the first part of the "|" (OR) succeeds, and the following input is not examined. The condition that tests in-codes is not part of the pattern, but determines whether the FIND rule is to be used.
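The control flow of this pattern can be modelled in ordinary code. The Python sketch below is a hypothetical illustration, not a translation produced by OmniMark: the function name parse_code is invented, the in-codes switch is left out, and None stands in for an unspecified pattern variable.

```python
import re

def parse_code(text):
    """Model of: "(" DIGIT+ => number ")" ((WHEN number = 0) | "(" ANY {number} => text ")")

    Returns (number, captured_text), or None if the pattern does not match.
    """
    head = re.match(r"\(([0-9]+)\)", text)
    if head is None:
        return None
    number = int(head.group(1))
    if number == 0:
        # The condition alone succeeds and matches zero characters;
        # the following input is not examined.
        return (0, None)
    rest = text[head.end():]
    # Otherwise require "(", exactly `number` characters of any kind, then ")".
    if len(rest) >= number + 2 and rest[0] == "(" and rest[number + 1] == ")":
        return (number, rest[1:number + 1])
    return None
```

For instance, parse_code("(3)(abc)") yields (3, "abc"), parse_code("(0)anything") yields (0, None), and parse_code("(3)(ab)") yields None because the second group is too short.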

A condition by itself, as in the above example, can be considered to match zero characters. For example, the following MATCH alternative matches no characters, but succeeds if the counter limit is zero:

   FIND ANY
      LOCAL COUNTER limit

      DO SCAN ...
         MATCH ...
            ...

         MATCH (WHEN limit = 0)
            ...

         MATCH ...
            ...
      DONE

Patterns which must match at least one character or one positional pattern cannot contain just a condition. (See Section 3.3.7, "Avoiding Patterns that Loop".)


Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.
