Customizing OmniMark Behaviour


OmniMark® Programmer's Guide Version 3

19. Customizing OmniMark Behaviour


Previous chapter is Chapter 18, "How Asynchronous Concurrent Context Translations Work".

Next chapter is Chapter 20, "Macros".

This section describes how to customize OmniMark's behaviour through the use of various declarations, and the SGML-IN and SGML-OUT actions.

19.1 Declarations

Declarations modify the default behaviour of an OmniMark program or provide the program with supplementary information for resolving such things as external file names.

Some declarations must occur before the first rule or function definition in an OmniMark program; others may occur anywhere. The part of an OmniMark program that precedes any rules is called the preamble.

This section describes the function of each declaration, and the places where those declarations may occur.

19.1.1 The Escape Declaration

Syntax

   ESCAPE quoted-character

In OmniMark, the percent sign ("%") is normally used in a quoted string to indicate a special character or a format item. The OmniMark program can designate another character to perform this function if the percent sign is deemed inappropriate. The new character is specified using the ESCAPE declaration.

The quoted-character must be a single character in quotation marks.

This declaration is deprecated in general, because it leads to non-standard OmniMark programs that can be difficult to understand.

An example is:

   DOWN-TRANSLATE
   ESCAPE "#"
   TRANSLATE "%"           ; This would normally be "%%"
      LOCAL COUNTER n
      SET n TO 3
      OUTPUT "#"#d(n)#""   ; This would normally be "%"%d(n)%""

There can only be one ESCAPE declaration, and it must be at the start of a program, immediately following the translation type (and the "DECLARE HERALDED-NAMES" declaration, if any). If there is no translation type (and no "DECLARE HERALDED-NAMES" declaration), the ESCAPE declaration must be the first declaration in the program.

The escape character can always be entered in a string by putting two escape characters in a row. For example, the OUTPUT statement in the following example outputs the text "#":

   CROSS-TRANSLATE
   ESCAPE "#"
   FIND-START
      OUTPUT "##"
      HALT

Care should be taken in choosing the escape character, to ensure that it is not misinterpreted as something else by someone reading an OmniMark program.

As well, some characters are poor choices for the escape character because they prevent certain format items or characters from being entered. For example, if the apostrophe (') were the escape character, a string could not be entered bounded by apostrophes, because within the string, apostrophes would be interpreted as escapes or (if doubled) as apostrophe characters, and there would be no way of ending the string.

19.1.2 Naming Conventions

In OmniMark, names can consist of the 26 letters of the Roman alphabet ("a" to "z"), the 10 digits ("0" to "9"), the punctuation characters hyphen ("-"), period ("."), and underscore ("_"), and all the characters with numeric values in the range "%128#" to "%255#" (i.e. the "ASCII eight-bit characters"). Names must always start with a letter (or one of the characters from "%128#" to "%255#").

Names of all OmniMark-defined objects are case-insensitive. That is, names that contain lower-case characters are treated as if they were entered in upper-case.

Names of all SGML objects, except for entities, are treated as case-insensitive by default. These names are also treated as if they were entered in upper-case.

The names of entities are always case-sensitive. Names which contain the same characters in different cases are treated as different names.

Finally, all keywords in OmniMark are case-insensitive as well.

This treatment of SGML names is the one specified by the Reference Concrete Syntax. This allows programmers to use the same naming conventions in their program that are used in the SGML documents being processed.

The treatment of SGML names by OmniMark's built-in SGML parser can be changed by providing an SGML Declaration that specifies different naming conventions. The treatment of SGML names (and OmniMark names) in the program can be changed by adding OmniMark declarations.

The reason that two different mechanisms are required is that there are two different parts of OmniMark involved:

The built-in SGML parser is responsible for processing SGML input and communicating events to OmniMark. In order to function correctly, it must process the SGML Declaration and behave accordingly.
The OmniMark language processor is responsible for executing the OmniMark program. However, before the program even begins to execute, the rules in the program are pre-loaded and arranged for efficient execution. In order to do this, the naming conventions must be understood before the program runs. By the time the SGML Declaration is processed, it is too late.

This subsection details the declarations that affect the naming conventions. Naming declarations are all optional, but if used, must appear after the translation type and ESCAPE declaration and before any other declarations and rules. For example, if entity names are to be converted to upper-case, this information must be given before any entity names actually appear in the program.

The naming declarations may be entered in any order within a program, but no more than one declaration of each form is permitted, and if given, they must appear before any other declarations, rules or function definitions.

19.1.2.1 Capitalization in General SGML Names

Syntax

   NAMECASE GENERAL (YES | NO)

By default, OmniMark treats SGML names other than entity names, name tokens, and number tokens that appear in an OmniMark program according to the SGML Reference Concrete Syntax. Thus, in these SGML names, lower-case letters are interpreted as though the corresponding upper-case letter had been entered instead.

In SGML, the characteristics specified by the Reference Concrete Syntax can be changed in the SGML Declaration. If NAMECASE GENERAL NO is specified in the SGML Declaration, lower-case letters in SGML names will not automatically be translated to upper-case.

To allow OmniMark programs to more closely match the SGML documents which they process, OmniMark also provides a "NAMECASE GENERAL" declaration.

To avoid confusion, the OmniMark programmer should make the OmniMark "NAMECASE GENERAL" declaration agree with that in the SGML Declaration.

When "NAMECASE GENERAL YES" is specified, lower-case characters will be interpreted as if they were upper-case. When "NAMECASE GENERAL NO" is specified, lower-case characters are interpreted as different characters. If no "NAMECASE GENERAL" specification is given, it defaults to YES.

The rule headers in the following example are equivalent:

Example A

   DOWN-TRANSLATE
   NAMECASE GENERAL YES

   ELEMENT (CHAPTER | SECTION | ANNEX)
      ...

Example B

   DOWN-TRANSLATE
   NAMECASE GENERAL YES

   ELEMENT (Chapter | Section | Annex)
      ...

Example C

   DOWN-TRANSLATE
   NAMECASE GENERAL YES

   ELEMENT (chapter | section | annex)
      ...

If the "NAMECASE GENERAL" declaration had been

   NAMECASE GENERAL NO

the element names would be processed exactly as entered, and the three rule headers shown above would be distinct.

This declaration only pertains to letters within SGML names and tokens in the OmniMark program and does not affect the parser's interpretation of letters in data content within the SGML document.

Non-CDATA attribute values are returned to the OmniMark program all in upper-case when "NAMECASE GENERAL YES" (the default) is specified in the SGML Declaration. However, string expressions being compared to attribute values are not affected by the "NAMECASE GENERAL" declaration in OmniMark. It is the OmniMark programmer's responsibility to ensure that values compared to non-CDATA attribute values are entered in upper-case when appropriate.

The "NAMECASE GENERAL" declaration in the OmniMark program does not affect OmniMark's treatment of the names of OmniMark objects such as counters or switches, or of OmniMark keywords. Capitalization is never significant in OmniMark-specific names; they are always treated as if the upper-case version were entered. The "NAMECASE GENERAL" declaration must follow the translation type and precede all other declarations and rules in the program.

The namecase declarations can only occur in the preamble.

19.1.2.2 Capitalization in SGML Entity Names

Syntax

   NAMECASE ENTITY (YES | NO)

In SGML, the Reference Concrete Syntax specifies that entity names are treated differently than the names of other SGML objects. By default, lower-case letters in entity names are not converted to upper-case. In other words, capitalization is usually significant for entity names.

The SGML parser's treatment of entity names can be modified by a NAMECASE ENTITY declaration in the SGML Declaration. NAMECASE ENTITY YES means that lower-case letters should be mapped to their corresponding upper-case letters, and NAMECASE ENTITY NO means that they should not.

Similarly, OmniMark assumes capitalization is significant within SGML entity names that are used in an OmniMark program. This treatment again parallels the SGML Declaration of the Reference Concrete Syntax.

OmniMark's treatment of entity names can also be modified with the "NAMECASE ENTITY" declaration. YES means that lower-case letters should be mapped to upper-case letters, and NO means that they should not.

Like the "NAMECASE GENERAL" declaration, this declaration only affects the processing of names in the OmniMark program, not parsing of the SGML document. The "NAMECASE ENTITY" declaration must follow the translation type and precede all other declarations and rules in the program.
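The effect of the two namecase declarations on names in a program can be pictured with a small model. The following Python sketch is not OmniMark code; the function fold_name and its parameters are hypothetical names chosen for illustration, and it assumes names that contain only English letters (case mapping for other letters depends on the "DECLARE NAME-LETTERS" declaration described next).

```python
def fold_name(name, kind, namecase_general=True, namecase_entity=False):
    """Return the form in which an SGML name in the program is treated.

    kind is "entity" for entity names and "general" for other SGML names.
    The defaults mirror the Reference Concrete Syntax: general names are
    folded to upper-case, entity names are left exactly as entered.
    """
    if kind == "entity":
        fold = namecase_entity
    elif kind == "general":
        fold = namecase_general
    else:
        raise ValueError("kind must be 'general' or 'entity'")
    # Assumption: only English letters are present, so str.upper() is adequate.
    return name.upper() if fold else name
```

With the defaults, fold_name("Chapter", "general") and fold_name("chapter", "general") give the same result, while entity names that differ only in case stay distinct.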

19.1.2.3 Letter Characters in OmniMark Names

Syntax

   DECLARE NAME-LETTERS string string

Although OmniMark allows the characters above ASCII 127 to be used in unquoted names, these characters have no intrinsic upper-case/lower-case relationship. The "DECLARE NAME-LETTERS" declaration is used to specify this relationship.

As an example, the following defines the upper/lower-case relationship between all the "accented" letters in the Latin 1 character set:

   DECLARE NAME-LETTERS
       "%10r{224,225,226,227,228,229,230,231,232,233}" _
       "%10r{234,235,236,237,238,239,240,241,242,243}" _
       "%10r{244,245,246,248,249,250,251,252,253,254}"

       "%10r{192,193,194,195,196,197,198,199,200,201}" _
       "%10r{202,203,204,205,206,207,208,209,210,211}" _
       "%10r{212,213,214,216,217,218,219,220,221,222}"

There are a number of constraints on the "DECLARE NAME-LETTERS" declaration:

The two strings which are the arguments of the "DECLARE NAME-LETTERS" declaration have to have the same number of characters.
The characters in the first string must all be in the range "%128#" to "%255#".
No character can appear twice in the first string: there can be no duplicates.
The characters in the second string must all be valid name characters: English letters, digits, ".", "-", "_" or "%128#" to "%255#".
No character in the second string can be an English lower-case letter or a character that also appears in the first string.

These provisions have the following consequences:

Every name character is either lower-case or upper-case, but not both. A name character is lower-case if it is an English lower-case letter or if it appears in the first string in the "DECLARE NAME-LETTERS" declaration. A name character is upper-case if it is not lower-case.
Every lower-case name character has a unique corresponding upper-case name character that is a different character from the lower-case name character.
Every upper-case English letter, every digit, ".", "-", "_", the characters from the range "%128#" to "%255#" in the second argument of the "DECLARE NAME-LETTERS" declaration, and the characters from the range "%128#" to "%255#" in neither the first nor the second argument of the "DECLARE NAME-LETTERS" declaration, are all upper-case name characters.
An upper-case name character has zero, one, or more than one corresponding lower-case name characters:
- There is no requirement that an upper-case name character have a corresponding lower-case name character. This is typically the case for the digits, ".", "-" and "_", although the "DECLARE NAME-LETTERS" declaration can change this.
- An upper-case name character can have a single corresponding lower-case name character. This is typically the case for the English letters, although, again, the "DECLARE NAME-LETTERS" declaration can change this.
- An upper-case name character can be the upper-case of more than one lower-case name character. This typically occurs in languages that have more forms of a lower-case letter (by there being more accented variants) than of its corresponding upper-case letter.
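The constraints and consequences listed above can be checked mechanically. The following Python sketch is an illustrative model only, not OmniMark code: name_letter_map and upper_name are hypothetical helper names, and characters are represented as Python characters with code points 0 to 255.

```python
import string

ENGLISH_LOWER = set(string.ascii_lowercase)
HIGH = {chr(code) for code in range(128, 256)}          # "%128#" to "%255#"
NAME_CHARS = set(string.ascii_letters) | set(string.digits) | set(".-_") | HIGH


def name_letter_map(first, second):
    """Validate the two strings and return a lower-case to upper-case map."""
    if len(first) != len(second):
        raise ValueError("both strings must have the same number of characters")
    if not set(first) <= HIGH:
        raise ValueError("first string may only use characters 128 to 255")
    if len(set(first)) != len(first):
        raise ValueError("first string must not contain duplicates")
    if not set(second) <= NAME_CHARS:
        raise ValueError("second string may only contain name characters")
    if set(second) & (ENGLISH_LOWER | set(first)):
        raise ValueError("second string contains a lower-case name character")
    # English letters keep their usual relationship; declared pairs are added.
    mapping = {ch: ch.upper() for ch in ENGLISH_LOWER}
    mapping.update(zip(first, second))
    return mapping


def upper_name(name, mapping):
    """Fold a name to upper-case; characters not in the map are upper-case already."""
    return "".join(mapping.get(ch, ch) for ch in name)
```

Because the second string may repeat a character, several lower-case characters can share one upper-case form, which is the "more than one" case in the list above.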

Versions of OmniMark prior to V3 only permitted English letters to be used as letters in names. Digits, "-", and "." were also permitted in names as long as they were not the first character of the name.

OmniMark V3 extends the set of letters permitted in unquoted names to include all of the characters with a value greater than 127 in the ASCII character set. This allows accented European letters (for example) to be used in unquoted names. OmniMark V3 also allows the underscore ("_") in names, provided that it is not the first character.

19.1.2.4 Letter Characters in Data

Syntax

   DECLARE DATA-LETTERS string string?

The "DECLARE DATA-LETTERS" declaration specifies additional characters that should be considered as letters when encountered in input data. It must appear in the preamble.

The first string specifies a new set of lower case letters. The characters in the second string are the corresponding upper-case letters. If both strings are given, they must be the same length. If the second string is omitted, then all of the characters in the first string are considered to be both upper-case and lower-case characters.

For example, the declaration:

   DECLARE DATA-LETTERS "*"

adds the asterisk "*" to the characters that are recognized as letters for the purposes of pattern-matching.

The new letters declared in a "DECLARE DATA-LETTERS" declaration do not affect the letters that can be used in unquoted names in OmniMark.

OmniMark uses the definition of letters in the "DECLARE DATA-LETTERS" declaration in the following situations:

Pattern-matching with the LETTER, UC, or LC patterns.
When determining the upper-case and lower-case equivalents of the letters specified in patterns preceded by the UL prefix.
Using the format modifiers "u" or "l" to convert data into upper-case or lower-case.

A character can appear more than once in either of the strings in the "DECLARE DATA-LETTERS" declaration. Letters from the Roman alphabet can also appear in either string. For example, suppose a document is being created that will eventually be used by a Macintosh application, and that both é (%142# on the Macintosh) and è (%143#) are to be recognized as lower-case letters, but that the accents are to be dropped in the upper-case form. The following declaration might be used:

   DECLARE DATA-LETTERS "%142#%143#" "EE"

This form succeeds in mapping both accented letters to the capital E when the upper-case form is needed.

In the above example, OmniMark has three choices for mapping an upper-case E to a lower-case form:

"e"
"%142#"
"%143#"

OmniMark chooses the first mapping specified in the "DECLARE DATA-LETTERS" declaration, if there is one. If not, it uses the defaults. So, in the above example, upper-case "E" would be mapped onto "%142#".

It is more likely that the unaccented "e" should be produced. For this reason, the declaration below is preferable to the previous one:

   DECLARE DATA-LETTERS "e%142#%143#" "EEE"
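The difference between the two declarations can be seen by building the case-mapping tables they imply. The following Python sketch is a model under stated assumptions, not OmniMark code: data_letter_maps is a hypothetical name, the default tables cover only the English letters, and where a lower-case character is declared more than once the first pairing is assumed to win (the text above states this rule only for the upper-to-lower direction).

```python
import string


def data_letter_maps(lower, upper=None):
    """Return (to_upper, to_lower) tables for a DECLARE DATA-LETTERS pair."""
    if upper is None:
        upper = lower  # each character is both its upper- and lower-case form
    if len(lower) != len(upper):
        raise ValueError("both strings must be the same length")
    declared_upper, declared_lower = {}, {}
    for lo, up in zip(lower, upper):
        declared_upper.setdefault(lo, up)   # assumption: first pairing wins
        declared_lower.setdefault(up, lo)   # first mapping specified is chosen
    # Start from the defaults, then let the declared mappings take priority.
    to_upper = dict(zip(string.ascii_lowercase, string.ascii_uppercase))
    to_lower = dict(zip(string.ascii_uppercase, string.ascii_lowercase))
    to_upper.update(declared_upper)
    to_lower.update(declared_lower)
    return to_upper, to_lower


E_ACUTE, E_GRAVE = chr(142), chr(143)       # Macintosh codes used in the text

_, lower_a = data_letter_maps(E_ACUTE + E_GRAVE, "EE")
_, lower_b = data_letter_maps("e" + E_ACUTE + E_GRAVE, "EEE")
```

In this model lower_a["E"] is the accented character %142#, while lower_b["E"] is the unaccented "e", which is why listing "e" first is preferable.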

The format item "%n" cannot appear in the "DECLARE DATA-LETTERS" declaration.

The definition of OmniMark letters does not affect recognition of name characters by the SGML parser. (The way that the SGML parser treats letters is governed by the SGML Declaration.)

The "DECLARE DATA-LETTERS" declaration must follow the translation type and precede all other declarations and rules in the program.

19.1.2.5 The LETTERS Declaration

Older OmniMark programs may use the LETTERS declaration instead of the "DECLARE DATA-LETTERS" declaration. The LETTERS declaration is deprecated because it can be confused with the "DECLARE NAME-LETTERS" declaration. "DECLARE DATA-LETTERS" should always be used instead of LETTERS.

19.1.3 Setting Open Modifiers On Built-In Streams

The declarations described in this section are used to set the open modifiers on the built-in streams.

19.1.3.1 Specifying Referent Processing for #MAIN-OUTPUT

Syntax

   DECLARE #MAIN-OUTPUT HAS REFERENTS-ALLOWED
      (DEFAULTING { string-expression
         (, string-expression)? })?

Syntax

   DECLARE #MAIN-OUTPUT HAS REFERENTS-NOT-ALLOWED

Syntax

   DECLARE #MAIN-OUTPUT HAS REFERENTS-DISPLAYED

Normally, when referents are used anywhere in an OmniMark program, the stream #MAIN-OUTPUT is automatically treated as a REFERENTS-ALLOWED stream. For programs which perform a single translation, this is usually very desirable.

For programs which may perform a number of translations in sequence, it may not be desirable to do this because it means that all the output written to this stream will be buffered until the last translation is done.

"DECLARE #MAIN-OUTPUT HAS REFERENTS-NOT-ALLOWED" can be used to prohibit referents from being written to #MAIN-OUTPUT. Consequently, #MAIN-OUTPUT will not be buffered.

The "DECLARE #MAIN-OUTPUT" referents declaration can also be used to set REFERENTS-DISPLAYED. This can be useful when referents are erroneously written to the #MAIN-OUTPUT and the programmer needs to be able to examine the output to determine where the error is occurring. It can also be useful when the #MAIN-OUTPUT is being used as a logging mechanism.

The DEFAULTING phrase may be used to specify default values for referents which are never defined. See Section 11.5.1, "Specifying Default Referent Definitions".

19.1.3.2 Setting the Mode to Text or Binary for Built-In Streams

Syntax

   DECLARE #MAIN-OUTPUT HAS (BINARY-MODE | TEXT-MODE)

Syntax

   DECLARE #PROCESS-OUTPUT HAS (BINARY-MODE | TEXT-MODE)

Syntax

   DECLARE #MAIN-INPUT HAS (BINARY-MODE | TEXT-MODE)

Syntax

   DECLARE #PROCESS-INPUT HAS (BINARY-MODE | TEXT-MODE)

Normally, both #MAIN-OUTPUT and #PROCESS-OUTPUT are written to in TEXT-MODE and both #MAIN-INPUT and #PROCESS-INPUT are read from in TEXT-MODE. This can be made explicit or changed with these declarations.

It is rare that #PROCESS-INPUT and #PROCESS-OUTPUT will need to be processed in BINARY-MODE even when #MAIN-INPUT and #MAIN-OUTPUT are.

These declarations should always be used in preference to the NEWLINE declaration for performing binary I/O.

19.1.4 Other Declarations

19.1.4.1 Mapping Public Ids To System Ids

Syntax

   LIBRARY (public-identifier system-identifier)+

When processing a reference to an external entity other than a data or subdocument entity, the SGML parser must process the replacement text of the entity. If the external identifier in the declaration of such an entity contains a public identifier but no system identifier, OmniMark must be told how to locate the replacement text. The LIBRARY declaration performs this function.

In the LIBRARY declaration, the public-identifiers and the system-identifiers are quoted strings. The system-identifier is often a file name on the host computer system.

For example, the entity sets for alphabetic characters defined in Annex D of ISO 8879 might be located through the following LIBRARY declaration:

   LIBRARY "ISO 8879-1986//ENTITIES Added Latin 1//EN"
                "iso-lat1.gml"
           "ISO 8879-1986//ENTITIES Added Latin 2//EN"
                "iso-lat2.gml"
           "ISO 8879-1986//ENTITIES Greek Letters//EN"
                "iso-grk1.gml"
           "ISO 8879-1986//ENTITIES Monotoniko Greek//EN"
                "iso-grk2.gml"
           "ISO 8879-1986//ENTITIES Russian Cyrillic//EN"
                "iso-cyr1.gml"
           "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN"
                "iso-cyr2.gml"

Any number of LIBRARY declarations can appear in a program, but each public-identifier can only have one definition.

LIBRARY declarations can also be used by OmniMark in a "library" file specified on the command line by using the -library control argument. A library file can be used either while an OmniMark program is being run or while a DTD is being compiled. It gives public/system identifier bindings that are only used during that run.

For example, if a DTD is being compiled with the OmniMark program on one computer, and if the result will be used on another computer (that may have a different directory structure), it is usually appropriate to put the machine-specific LIBRARY declarations in library files during each of the runs.

Entries in the "library" file always take precedence over LIBRARY declarations in the OmniMark program.
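The precedence rule can be modelled as a two-level lookup. The following Python sketch is illustrative only and is not OmniMark code: resolve_public_id is a hypothetical name, the two tables are plain dictionaries standing in for the program's LIBRARY declarations and the library file, and the path in the usage note is invented.

```python
def resolve_public_id(public_id, program_library, library_file=None):
    """Return the system identifier bound to public_id, or None if unbound.

    Entries from the library file are consulted first, so they take
    precedence over LIBRARY declarations in the program itself.
    """
    if library_file and public_id in library_file:
        return library_file[public_id]
    # Fall back to the bindings declared in the OmniMark program.
    return program_library.get(public_id)
```

For instance, if the program binds "ISO 8879-1986//ENTITIES Added Latin 1//EN" to "iso-lat1.gml" and the library file for a particular run binds the same public identifier to a machine-specific path, the path from the library file is the one returned for that run.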

19.1.4.2 Separating Attribute Tokens

Syntax

   DELIMITER string

The DELIMITER declaration specifies how tokens in a list-valued attribute are separated when output by the "%v" format. The declaration applies to list-valued attributes: those whose declared type is NAMES, NUMBERS, NMTOKENS, NUTOKENS, IDREFS, or ENTITIES.

The DELIMITER declaration specifies that the given quoted string appear between every pair of tokens when such a value is output. No more than one DELIMITER declaration can appear in an OmniMark program. If there is no DELIMITER declaration, tokens are separated by a single space.

For example, if the following are in an OmniMark program:

   DOWN-TRANSLATE
   DELIMITER '";"'

   ELEMENT list
     OUTPUT '("%v(values)")' _
            "%c"
   ...

Then, if the start tag of a list element has a NAMES attribute called values whose value is a b c, the above OUTPUT action will output ("A";"B";"C") followed by the content of the list element.
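The way the delimiter is placed between tokens can be reproduced with a short model. The following Python sketch is not OmniMark code; format_list_attribute is a hypothetical name, and the upper-casing step assumes "NAMECASE GENERAL YES" is in effect in the SGML Declaration.

```python
def format_list_attribute(value, delimiter=" ", namecase_general=True):
    """Model of "%v" applied to a list-valued attribute.

    The attribute value is split into tokens, and the delimiter string is
    placed between every pair of tokens (a single space by default).
    """
    tokens = value.split()
    if namecase_general:
        # Non-CDATA attribute values arrive in upper-case in this case.
        tokens = [token.upper() for token in tokens]
    return delimiter.join(tokens)


# The surrounding '("' and '")' come from the OUTPUT string, not the delimiter.
result = '("' + format_list_attribute("a b c", '";"') + '")'
```

Here result holds ("A";"B";"C"), matching the output described above. Note that the delimiter appears only between tokens, never before the first or after the last.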

The "REPEAT SCAN" or "REPEAT OVER" compound actions provide more control and specificity than the DELIMITER declaration.

19.1.4.3 The Symbol Declaration

Syntax

   SYMBOL string+

Footnote indicators are sometimes symbols (such as * or §) rather than numbers. As described in Section 19.1.4.4, "The Symbol Format", the "%y" format allows counters to be represented by strings so that successive symbols can be chosen for such footnote indicators. The SYMBOL declaration defines the strings that the "%y" format uses.

These quoted strings usually consist of either a single character to be used as a footnote indicator or the formatter instruction representing such a character.

Only one SYMBOL declaration can appear in a program. It is an error to use the "%y" format in a program that does not have a SYMBOL declaration.

19.1.4.4 The Symbol Format

Syntax

   %y( counter-name )

The "%y" format item is used to output the symbols defined in the SYMBOL declaration. No modifiers are used with this format.

The "%y" format is replaced by the string whose position in the SYMBOL declaration corresponds to the value of the output counter. Thus, the first string is output if the counter value is 1, the second string if the value is 2, and so on. If the counter value is greater than the number of strings in the declaration, the strings are reused, with each string repeated one more time for each additional pass through the list of strings. Thus, if the declaration is

   SYMBOL "*" "%160#"

a counter value of 3 results in "**" and 6 is replaced by "§§§". (Different applications may use different codes to obtain the "§" character.) Counter values less than one are output in the same manner, but start with the last string in the declaration. Thus, zero is converted to the last string in the list, -1 to the second last string, and so on.
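The selection rule can be written out as a small function. The following Python sketch is a model, not OmniMark code: symbol_format is a hypothetical name, and the treatment of values below -1 (repeating the strings once more on each backward pass) is an assumption extrapolated from "in the same manner" rather than something stated above.

```python
def symbol_format(value, symbols):
    """Model of "%y": pick a symbol string for a counter value."""
    if not symbols:
        raise ValueError("a SYMBOL declaration with at least one string is needed")
    count = len(symbols)
    if value >= 1:
        step = value - 1                 # 1 -> first string, 2 -> second, ...
        index = step % count
    else:
        step = -value                    # 0 -> last string, -1 -> second last, ...
        index = count - 1 - (step % count)
    passes = step // count + 1           # one extra repetition per pass
    return symbols[index] * passes
```

With the declaration above modelled as the list ["*", "§"], the values 1 to 6 give "*", "§", "**", "§§", "***" and "§§§".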

Counters can also be represented as letters in an alphabetic sequence (Section 6.3.4.2, "Alphabetic Representations of a Counter Value"), with their underlying binary representations (Section 6.3.4.4, "Binary Representations"), or as decimal (Section 6.3.4.1, "Arabic Numerals") or Roman numerals (Section 6.3.4.3, "Roman Numeral Representations").

19.1.4.5 Including OmniMark Code from Other Files

Syntax

   INCLUDE file-name

It is sometimes convenient to divide an OmniMark program among different source files. The INCLUDE declaration tells OmniMark to read an auxiliary input file as part of the program.

The file-name is a quoted string containing a system-specific file name. The content of the indicated file is processed as if it appeared in place of the INCLUDE declaration.

Rules cannot be split by an INCLUDE declaration. An INCLUDE declaration always ends the rule that appears before it, so part of a rule cannot be defined by including a file containing only actions. Similarly, if a rule's definition starts inside an included file, the definition ends with the end of the included file.

An INCLUDE declaration can appear anywhere in an OmniMark program. Files inserted by INCLUDE declarations can themselves contain INCLUDE declarations. The maximum nesting depth of INCLUDE declarations is 100.

An included file must not include itself.
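The textual-substitution behaviour and the two restrictions on nesting can be illustrated with a model that works on files held in memory. The following Python sketch is not OmniMark code and does not model the rule-ending behaviour: expand_includes is a hypothetical name, files are represented as a dictionary of line lists, an INCLUDE is assumed to occupy a line of its own, and the depth limit is assumed to count levels of INCLUDE below the main program.

```python
MAX_INCLUDE_DEPTH = 100


def expand_includes(name, files, _stack=()):
    """Return the lines of `name` with every INCLUDE replaced by its content."""
    if name in _stack:
        raise ValueError(name + " includes itself")
    if len(_stack) > MAX_INCLUDE_DEPTH:
        raise ValueError("INCLUDE declarations are nested too deeply")
    expanded = []
    for line in files[name]:
        words = line.split(None, 1)
        if len(words) == 2 and words[0].upper() == "INCLUDE":
            # Strip the quotation marks around the file name.
            target = words[1].strip().strip("\"'")
            expanded.extend(expand_includes(target, files, _stack + (name,)))
        else:
            expanded.append(line)
    return expanded
```

Tracking the chain of files currently being read, rather than every file seen so far, lets the same file be included from two different places while still rejecting a file that includes itself directly or indirectly.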

19.1.4.6 The Byte Order in the Input

Syntax

   BINARY-INPUT constant-numeric-expression

The BINARY-INPUT declaration provides a default byte-ordering code that is used by the BINARY operator when the monadic form is used. If no BINARY-INPUT declaration is given, the default byte-ordering code is 0.

The BINARY-INPUT declaration can appear anywhere in an OmniMark program, any number of times. If more than one BINARY-INPUT declaration is given, each one must specify the same constant-numeric-value. If a BINARY-INPUT declaration does not appear before the first rule or function definition, zero (0) is assumed, and any BINARY-INPUT declarations given later in the program must specify zero.
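The consistency rules in the preceding paragraph can be expressed as a check over all the declarations in a program. The following Python sketch is a model only, not OmniMark code: effective_byte_order is a hypothetical name, and each declaration is represented as a pair of its value and a flag saying whether it appears before the first rule or function definition.

```python
def effective_byte_order(declarations):
    """Return the default byte-ordering code implied by the declarations.

    declarations is a list of (value, before_first_rule) pairs.
    """
    if not declarations:
        return 0                          # no declaration: the default is 0
    values = {value for value, _ in declarations}
    if len(values) > 1:
        raise ValueError("all declarations must specify the same value")
    value = values.pop()
    declared_early = any(early for _, early in declarations)
    if not declared_early and value != 0:
        # Zero was already assumed by the time the first rule was read.
        raise ValueError("declarations after the first rule must specify 0")
    return value
```

For example, effective_byte_order([(1, True), (1, False)]) returns 1, whereas a single declaration of 1 that appears only after the first rule is rejected.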

The code value may be specified as a constant-numeric-expression. This is useful when a constant value can be represented by a macro-input (see Chapter 20, "Macros").

For example, the following declaration is valid because the value can be determined at compile-time:

   MACRO base-code IS 1 MACRO-END
   ...
   BINARY-INPUT base-code + 1

19.1.4.7 Ordering Bytes When Formatting the Binary Representation of Numbers

Syntax

   BINARY-OUTPUT constant-numeric-expression

The BINARY-OUTPUT declaration provides the default byte-ordering for the "%b" format item. OmniMark selects the byte-ordering according to the following priority:

the numeric format modifier given in the "%b" format item itself
the value specified in the BINARY open modifier applied to the stream to which the "%b" format item is being written
the value specified in the BINARY-OUTPUT declaration
zero (0)
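The priority order above amounts to taking the first value that has actually been supplied. The following Python sketch is a model, not OmniMark code; select_byte_order and its parameter names are hypothetical, and None stands for "not specified".

```python
def select_byte_order(format_modifier=None, stream_modifier=None, declaration=None):
    """Pick the byte-ordering code for a "%b" format item."""
    # Candidates are listed from highest to lowest priority.
    for candidate in (format_modifier, stream_modifier, declaration):
        if candidate is not None:
            return candidate
    return 0                              # nothing specified anywhere
```

A value of 0 given explicitly at a higher level still wins over a non-zero value at a lower level, because only an absent specification is skipped.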

The BINARY-OUTPUT declaration can appear anywhere in an OmniMark program, and any number of them may appear. When more than one is given, the same value must be specified in each. If all of the BINARY-OUTPUT declarations appear after the first rule or function definition, they all must specify zero (0).

19.1.4.8 Version 2 Compatibility

Syntax

   DECLARE HERALDED-NAMES

The "DECLARE HERALDED-NAMES" declaration is provided:

to support rapid prototyping, and
in the event that there are programs which do not run under OmniMark V3 and cannot be converted easily.

"DECLARE HERALDED-NAMES" allows:

programs to be written with no declarations (this is useful for prototyping),
the same tokens to serve as both OmniMark keywords and programmer-defined names (OmniMark V3 prohibits using a token as a keyword in any scope where it has been declared as a programmer-defined name), and
the backquote character in macro definitions.

The -herald command-line option can be used when the program cannot be modified. It has almost the same effect as the "DECLARE HERALDED-NAMES" declaration, except that programs without a translation type are considered to be down-translations and not process programs. (This is to provide compatibility with pre-V3 OmniMark programs.)

19.1.5 Line Breaking

Many text processing systems limit the length of an input line they can process. So that OmniMark output will not exceed any such limits, OmniMark provides mechanisms for breaking what would otherwise be long lines into smaller pieces.

The OmniMark programmer can specify:

what character sequence is used to break the line,
the length at which lines are to be broken,
which character sequences are to be replaced with a newline sequence when the line must be broken, and
where the line breaks are not allowed.

By default, line breaking only applies to data content written to streams for which line breaking is being done. Hard-coded text is not broken unless the line break format item is used. (See Section 19.1.5.4, "The Line Break Format".)

19.1.5.1 Specifying Line Lengths

Syntax

   BREAK-WIDTH preferred-width (TO maximum-width)?

The BREAK-WIDTH declaration defines acceptable line widths for the #MAIN-OUTPUT stream. (See Section 6.4.3.1.1, "Open Modifiers" for how to apply break widths to other streams.) Both preferred-width and maximum-width (if given) are positive integers.

The first value, preferred-width, is the preferred output line width, expressed as a character count. OmniMark will try to break lines that have more than this number of characters. The second value, maximum-width, which can be omitted, gives the maximum acceptable output line width. If this number is given, it is an error if a line longer than maximum-width characters occurs and OmniMark cannot find an acceptable place to break the line. Also, maximum-width must be greater than or equal to preferred-width.

There can be at most one BREAK-WIDTH declaration in a program. If no BREAK-WIDTH declaration appears, lines in the #MAIN-OUTPUT stream can be of any width.

Older versions of OmniMark allowed the keyword TO to be omitted when specifying the maximum-width. This is deprecated in modern programs.

19.1.5.2 Inserting Line Breaks

Syntax

   INSERTION-BREAK string condition?

The INSERTION-BREAK declaration is used to break data content regardless of the character following the break point.

The string contains the text to be inserted to break the line. The string can contain only static text (i.e., it may not contain any dynamic format items such as %d that cannot be evaluated by the OmniMark compiler).

When a long line is encountered, OmniMark can insert the string specified in the applicable INSERTION-BREAK declaration at any "break point" that will keep the line within the desired length. A break point is any point marked in a format by a "%/" format item, and any point in text copied from the input to a stream that is not under the control of the h modifier or break suppression. So that the result will be distributed over two output lines, this string must contain at least one end-of-line sequence indicated by "%n".

Often, the end-of-line sequence is preceded by an indication that the line break results from the line-length limitation. In such cases, the line break may not occur at a word boundary. The form of this indication of course depends on the program that will process the resulting file.

For example, the TeX formatter uses a final percent sign on the line to indicate that the line-end is not important. An OmniMark program that prepares TeX source files might include the following declaration:

   INSERTION-BREAK "%%%n"

If there is more than one INSERTION-BREAK declaration in a program, their conditions must ensure that at most one declaration applies at any one time.

When there is no applicable INSERTION-BREAK declaration in a program, "%/" format items are ignored unless a REPLACEMENT-BREAK declaration can be applied.

19.1.5.3 Replacing A Sequence With A Line-Break

Syntax

   REPLACEMENT-BREAK character string condition?

The REPLACEMENT-BREAK declaration defines a string that replaces a specified character in data content when OmniMark decides to break a line at that character. The break can occur at a "%/" format item that immediately precedes the character, or at an occurrence of the character in text copied from the input to a stream that is not under the control of the h modifier. Typically, for instance, space or tab characters are replaced by end-of-line sequences.

The character is a single character expressed as a quoted string, string is the replacing string, and condition is an optional condition.

As with the INSERTION-BREAK declaration, the replacement string can contain only static text. Thus, the typical conventions described above are implemented through the declarations:

   REPLACEMENT-BREAK "%_" "%n" ;%_ is space character
   REPLACEMENT-BREAK "%t" "%n" ;%t is tab character

Like the string in an INSERTION-BREAK declaration, the string in a REPLACEMENT-BREAK declaration must contain at least one end-of-line sequence.

An OmniMark program can have any number of REPLACEMENT-BREAK declarations. However, the condition must ensure that not more than one declaration for a given character can be met at any one time.
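The replacement of a space or tab by a line end can be modelled in the same spirit. The Python sketch below is an illustration only, not OmniMark's algorithm: it assumes that OmniMark would pick the last replaceable character that keeps the line within the width, which this section does not state. The function name and the `replacements` mapping are hypothetical.

```python
def replace_breaks(text, width, replacements):
    """Model of REPLACEMENT-BREAK: break lines by replacing a character.

    `replacements` maps a single character to its replacing string,
    for example {" ": "\n", "\t": "\n"}.
    """
    out = []
    line = text
    while len(line) > width:
        # Last replaceable character at or before the width limit.
        cut = max((line.rfind(ch, 0, width + 1) for ch in replacements),
                  default=-1)
        if cut < 0:
            break  # nothing replaceable: the line is left unbroken
        out.append(line[:cut] + replacements[line[cut]])
        line = line[cut + 1:]
    out.append(line)
    return "".join(out)


wrapped = replace_breaks("the quick brown fox", 10, {" ": "\n", "\t": "\n"})
print(wrapped)
```

A line with no replaceable character is left as it is, which corresponds to the case where only an INSERTION-BREAK declaration could break it.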

Referents may interact with the line-breaking mechanism to prevent optimal results. See Section 11.6, "Referents and Line-Breaking" for more details.

19.1.5.4 The Line Break Format

Syntax

   %/

The format "%/" in an OmniMark string indicates that the next character is to be considered as replaceable data content. It allows the following character to be replaced by a line-end sequence if line breaking is being performed.

This format has no modifiers.

Breaks can be inserted only in data content copied from the input to an output stream and at points in OmniMark strings where the "%/" format item appears. Breaks will not be inserted under the following four conditions:

The h modifier prevents line breaking in text copied from the input to a stream and overrides the "%/" format item.
Text copied from a buffer using the "%g" format item or using "OUTPUT FILE" or "PUT FILE" cannot be broken. Such text should be prepared so that it already contains breaks at appropriate points.
Break suppression has the same effect as the h modifier in that it always prevents line breaking.
The content of referents is inserted in lines after the line lengths have been calculated and can make lines longer than intended.

OmniMark will never break an OmniMark string when no "%/" format item appears in the string.

When both INSERTION-BREAK and REPLACEMENT-BREAK declarations apply, preference is given to the applicable REPLACEMENT-BREAK (on the theory that breaking between words is preferable to breaking within words even if the latter can be done legally).

19.1.5.5 The Break Suppression Format Items

Syntax

   %[

Syntax

   %]

Any text between "%[" and "%]" format items is protected from being subject to any REPLACEMENT-BREAK or INSERTION-BREAK rules: no insertions or replacements will be made in the text between these format items.

In addition, the text between "%[" and "%]" is not counted towards the preferred break width. It is only counted towards the maximum break width.

This facility can be used when there is "hidden" text in the output that does not affect the length of displayed lines, and BREAK-WIDTH is being used in formatting the output. It is deprecated in all other circumstances.

A "%[" and the matching "%]" can be written by separate OUTPUT or PUT actions, as in the following example. As well, a "%[" encountered when an earlier "%[" has not yet been matched by a "%]" must itself be matched before the earlier one is. In other words, they nest. A "%]" that does not match an earlier "%[" is in error.

An example of correctly matched break suppression is:

   OUTPUT "%["
   OUTPUT "\command1{"
   OUTPUT "%[argument text%]"
   OUTPUT "}%]"

Any "%/" encountered while a matching "%]" is being looked for is ignored. For example, in the following, the "%/" is ignored for stream "s", but not for stream "t":

   PUT s "%["
   PUT (s & t) "ab%/cd"
   PUT s "%]"
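The nesting rule and its effect on "%/" can be expressed as a depth counter kept separately for each stream. The Python sketch below is a model under stated assumptions, not OmniMark code: each stream is represented as a list of the items written to it, and the function name is hypothetical.

```python
def active_break_points(items):
    """Return the indexes of "%/" items that are not suppressed.

    `items` is the sequence of format items and text written to one
    stream. A "%]" with no open "%[" is reported as an error.
    """
    depth = 0          # number of "%[" items not yet matched
    active = []
    for index, item in enumerate(items):
        if item == "%[":
            depth += 1
        elif item == "%]":
            if depth == 0:
                raise ValueError('"%]" without a matching "%["')
            depth -= 1
        elif item == "%/" and depth == 0:
            active.append(index)
    return active


# What streams s and t receive in the PUT example above.
stream_s = ["%[", "ab", "%/", "cd", "%]"]
stream_t = ["ab", "%/", "cd"]
print(active_break_points(stream_s), active_break_points(stream_t))
```

For stream s the list of active break points is empty, while for stream t the "%/" at index 1 remains available for line breaking.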

19.1.5.6 Line Breaking, Conditions and Functions

The condition on an INSERTION-BREAK or REPLACEMENT-BREAK declaration is not allowed to contain function calls.

This rather extreme provision has been made because of the consequences that would otherwise arise from side-effects within such functions. Two types of side-effects are especially troublesome:

The synchronization of output resulting from OUTPUT or PUT actions in a function called from within such a condition would be highly non-deterministic.
It is not clear what the effect of an OUTPUT-TO in such a condition should be.

19.1.6 The `NEWLINE` Declaration and Binary I/O

The NEWLINE declaration changes the newline sequence for all data input or output by the program. As a side effect, it also causes all input and output to be performed in binary mode. If the programmer wishes to write text files in a program that has a NEWLINE declaration, the system-specific newline sequence must be written out explicitly. Such programs are less portable.

The NEWLINE declaration is deprecated. Programmers are strongly advised to use TEXT-MODE and BINARY-MODE to individually specify whether streams or input files contain binary data or text. The "DECLARE #MAIN-OUTPUT" and "DECLARE #MAIN-INPUT" declarations are the recommended way to set the mode for the main input and output. See Section 19.1.3, "Setting Open Modifiers On Built-In Streams".

For example:

   DECLARE #MAIN-OUTPUT HAS BINARY-MODE

   PROCESS
      LOCAL STREAM f
      ...
      OPEN f WITH BINARY-MODE AS FILE "f.txt"

When the NEWLINE declaration is used in a program, the TEXT-MODE and BINARY-MODE open modifiers are prohibited.

The default mode for all streams is TEXT-MODE unless the deprecated NEWLINE declaration is used in the program.

19.2 Manipulating Record Boundaries

In SGML, a document's text consists of records which are surrounded by SGML RS (record-start) and RE (record-end) characters. In general, OmniMark prepares the text directed to the output processor so that it will be suitable for the SGML parser, and text returned by the parser is similarly treated for the output processor. OmniMark programmers and users rarely need to be aware of these two operations. However, exceptions can arise. This section discusses when record boundaries need to be explicitly manipulated, and covers the OmniMark constructs used to do this.

The vast majority of applications on all systems use the line feed and carriage return character values for the record-start and record-end characters. As a consequence, very few applications will be affected by this behaviour. To be affected, an application must use an SGML Declaration that specifies RE and/or RS function character values other than those normally used by the system on which OmniMark is running.

19.2.1 Why These Actions Are Needed

OmniMark uses the system-defined values of line feed and carriage return for record-start and record-end, respectively.

By default, OmniMark supports the SGML form of line representation in the following two ways:

In text written to the #SGML stream in the input processor, each instance of the newline sequence, "%n", is converted to the two-character sequence: RE, RS. The effect of this conversion is that each newline sequence becomes the record-end mark for the line the newline ends as well as the record-start mark for the following line.
In text provided to the output processor, each instance of the RE character is converted to the newline sequence. Most record-start characters are discarded by the SGML parser once they are recognized and markup is processed, and don't need to be handled in the output processor.

This behavior can be overridden with the SGML-IN and SGML-OUT actions. These actions are intended to be used when the application's view of record boundaries is different from that specified in the SGML Declaration.

Additionally, these two actions can be used to suppress record boundary conversion.

19.2.2 Record Boundaries in the `#SGML` Stream

Syntax

   SGML-IN (string-expression | #NONE)

Record boundaries passed to the #SGML stream are manipulated with the SGML-IN action.

In the first form, all newline sequences written to the #SGML stream are converted to the value of string-expression. The default conversion can be explicitly represented by the action:

   SGML-IN "%13#%10#"

The second form is used to suppress conversion of the newline sequence, and is appropriate in several circumstances, including:

when the input newline sequences already contain the appropriate record boundary characters,
when the input processor creates the appropriate record boundary characters,
when the record boundaries recognized by the SGML markup language used occur in different locations than those recognized by other system software (i.e. when a significantly variant SGML Declaration is used that changes the record-start and record-end characters to characters normally considered part of textual data), or
when the SGML document does not contain record boundaries, or where they are not significant to processing.

The SGML-IN action may not be used in cross-translations. In down-translations, it can be used only in DOCUMENT-START rules. In up-translations and context-translations, it may be used only in DOCUMENT-START rules and any rules in the input processor (FIND, FIND-START, and FIND-END). It may never be used in SGML-ERROR rules.

Any change in the conversion specified by an SGML-IN action takes effect immediately for subsequent characters that are written to the #SGML stream.
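The two forms of SGML-IN amount to a simple substitution on the text written to the #SGML stream. The Python sketch below models that substitution. It is not OmniMark code, and it assumes that the newline sequence "%n" is represented by the single character "\n". The function name and parameters are hypothetical.

```python
RE = "\r"  # record-end, character 13
RS = "\n"  # record-start, character 10


def sgml_in(text, conversion=RE + RS, newline="\n"):
    """Model of SGML-IN applied to text written to the #SGML stream.

    `conversion` is the replacement for each newline sequence.
    Passing None models SGML-IN #NONE, which suppresses conversion.
    """
    if conversion is None:
        return text
    return text.replace(newline, conversion)


print(repr(sgml_in("first\nsecond\n")))
print(repr(sgml_in("first\nsecond\n", None)))
```

With the default conversion each newline becomes the pair RE, RS, so it ends one record and starts the next. With None the text reaches the parser unchanged.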

19.2.3 Record Boundaries in the Output Processor

Syntax

   SGML-OUT (string-expression | #NONE)

Record boundaries in text passed from the SGML parser to the output processor can be manipulated with the SGML-OUT action.

In the first form, each instance of the SGML record-end character (RE) is converted to the value of string-expression and passed to the output processor for processing as part of the enclosing element's data content. The default conversion can be explicitly represented by the action:

   SGML-OUT "%n"

Note that this use of the action applies to all computer systems.

The second form is used to suppress conversion of the RE character, and is appropriate in several circumstances, including:

when the record-end character defined by the SGML document (either in the SGML Declaration or by the Reference Concrete Syntax) is an appropriate line-end character,
when the output processor creates appropriate line-boundary characters by using the TRANSLATE rule to convert the record-end character, or
when the SGML document does not contain record boundaries.

The SGML-OUT action may not be used in cross-translations. In other translations, it may be used only in output processor rules, and is not allowed in SGML-ERROR rules.

SGML-OUT actions should not be used in TRANSLATE rules. This is because any change in the conversion specified by an SGML-OUT action takes effect immediately. All data content output from that point on by the SGML parser is subject to the new conversion (or #NONE), until another SGML-OUT action is encountered (if any).

If it is used in a TRANSLATE rule, the SGML-OUT action may or may not affect the processing of data content immediately following the data matched by the translation rule if that data and the data matched were delivered from the built-in SGML parser together.

An SGML-OUT action in any other rule in the output processor does not have this problem, because in all other rules it is well defined what data content has been processed prior to the rule and what data content is to be processed after the rule.

The following examples show how the record boundary conversion takes effect before an element's content is processed, and after. In this first example, element before's content is processed with the specified record boundary. All record-ends in the text emitted from the SGML parser are replaced with the sequence "]%13#%10#[".

   ELEMENT before
   ; Surround each line with square brackets.
     SGML-OUT "]%13#%10#["
     OUTPUT "%c"

In this example, the contents of element after are processed with the current record boundaries. All elements processed after element after has been processed will have the sequence of characters "%13#%10#" replacing the record-ends.

   ELEMENT after
     OUTPUT "%c"
   ; Reinstate a more "normal" boundary.
     SGML-OUT "%13#%10#"

The next example is a special case of the above example with element after. In effect, element conv contains a processing instruction. Element conv will be processed with the current SGML-OUT sequence. After it has been processed, its contents are evaluated, and give the new SGML-OUT sequence.

   ELEMENT conv
     SGML-OUT "%sc"

19.2.4 Default Record Boundary Handling

There are some interdependencies between the value given in the NEWLINE declaration and the default record boundary conversions that the OmniMark programmer should be aware of.

If no SGML-IN action is encountered prior to the output of (some) data to the #SGML stream, then the default conversion depends on the value of the newline sequence, as follows:

If the newline sequence is a single character or if there is no NEWLINE declaration in the OmniMark program, then all newline sequence characters in data output to the #SGML stream are converted to the sequence: carriage return followed by line feed. For systems that use the ASCII character set, this is equivalent to:
```
   SGML-IN "%13#%10#"
```
If the newline sequence has two or more characters, then newline sequences output to the #SGML stream are not converted. This is equivalent to (for all systems):
```
   SGML-IN #NONE
```

These defaults are in effect until an SGML-IN action is encountered.
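The choice of default SGML-IN conversion described above can be summarized as a small decision function. The Python sketch below is a model only. The function name is hypothetical, None stands for both "no NEWLINE declaration" on input and "SGML-IN #NONE" on output, and the rejection of an empty newline sequence is an assumption, since that case is not covered here.

```python
def default_sgml_in(newline_sequence=None):
    """Return the default SGML-IN conversion for a given newline sequence.

    newline_sequence is None when the program has no NEWLINE declaration.
    The result is the replacement string, or None for SGML-IN #NONE.
    """
    if newline_sequence is None or len(newline_sequence) == 1:
        return "\r\n"   # carriage return followed by line feed
    if len(newline_sequence) >= 2:
        return None     # newline sequences are not converted
    raise ValueError("empty newline sequence is not covered by this model")


print(repr(default_sgml_in()))
print(repr(default_sgml_in("\r\n")))
```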

If no SGML-OUT action is encountered prior to the processing of data content, then all record-end characters in data content are converted to the newline sequence prior to their being provided to output processor rules. In other words, the default SGML-OUT action is (for all systems):

   SGML-OUT "%n"

19.2.5 Record Ends in SGML Comments, Marked Sections and PIs

In the same manner as for data content, record-ends in processing instruction text, IGNORE marked section text, and the text of SGML comments provided to the output processor by OmniMark's built-in SGML parser are converted to the sequence of characters specified by the SGML-OUT action (if any). If the SGML-OUT action specifies #NONE, record-ends are provided to the output processor in the form in which they come from the SGML parser.

The SGML standard (ISO 8879) doesn't address the processing of text in processing instructions, IGNORE marked sections or SGML comments, as it does for data content. As a consequence, in these types of text, OmniMark's built-in SGML parser does not discard record-start characters, as it usually does in data content and attribute value text. When the SGML-OUT action specifies #NONE, record-start characters will be present in the text.

When the SGML-OUT action specifies a string,

any sequence of a record-end character followed immediately by a record-start character (RE, RS) in the text of a processing instruction, IGNORE marked section or SGML comment is replaced by that string prior to the text being made available to the OmniMark program, and
all other record-end or record-start characters will be unchanged.

This processing is different from that for data content and attribute value text, in which:

each record-end is replaced by the SGML-OUT string, and
all record-starts are left alone.

This processing ensures that, unless character references are used in strange ways, all "newlines" come out the same.
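The difference between the two kinds of processing can be made concrete with a pair of substitution functions. The Python sketch below is a model of the rules just listed, not OmniMark code. It takes RE as character 13 and RS as character 10, and the function names are hypothetical.

```python
RE = "\r"  # record-end, character 13
RS = "\n"  # record-start, character 10


def convert_data_content(text, sgml_out):
    """Data content and attribute values: each RE is replaced,
    record-starts are left alone. None models SGML-OUT #NONE."""
    if sgml_out is None:
        return text
    return text.replace(RE, sgml_out)


def convert_pi_text(text, sgml_out):
    """Processing instructions, IGNORE marked sections and comments:
    only an RE immediately followed by an RS is replaced."""
    if sgml_out is None:
        return text
    return text.replace(RE + RS, sgml_out)


print(repr(convert_data_content("a" + RE + "b", "%")))
print(repr(convert_pi_text("a" + RE + RS + "b" + RE + "c", "%")))
```

In the second call the RE, RS pair becomes the SGML-OUT string, while the isolated RE before "c" is passed through unchanged.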

The conversion of the record-end/record-start sequence to the SGML-OUT string occurs when the "%c" format item is processed in a "MARKED-SECTION IGNORE" rule or SGML-COMMENT rule, just as in a DATA-CONTENT rule. For a PROCESSING-INSTRUCTION rule, the conversion occurs prior to the text of the processing instruction being matched to the pattern at the head of the rule.

The processing of record-starts and record-ends in the text of processing instructions differs between versions of OmniMark:

In versions of OmniMark prior to Version 2, record-starts were removed from the text and record-ends were converted to the system-standard line-end character or sequence.
In versions of OmniMark starting with Version 2 but prior to V2R4, no conversion of record-ends or record-starts was ever done.
In versions of OmniMark starting with V2R4, the "default" behavior has been made compatible with that of OmniMark prior to Version 2, but the OmniMark programmer has been given control over the processing with the SGML-OUT action.
