Processing SGML Documents

OmniMark® Programmer's Guide Version 3

4. Processing SGML Documents

Previous chapter is Chapter 3, "Generalized Document Processing".

Next chapter is Chapter 5, "Organizing Rules With Groups".

The ELEMENT rules provide the backbone of a down-translation. This chapter describes ELEMENT rules and other OmniMark constructs particular to SGML documents. The features described here are allowed whenever an SGML document is being parsed. In particular, they are permitted in:

down-translations
context-translations
up-translations unless specifically forbidden
process programs

However, since they are not relevant outside the scope of SGML, it is an error for them to occur in cross-translations.

4.1 Rules That Process Content

SGML documents have a hierarchical element structure. Other structural components within the SGML document nest within the element structure as well. These include RCDATA, CDATA, and IGNORE marked sections, processing instructions, non-text entities, and SGML comments.

The OmniMark rules that process these components have the following characteristics in common:

They are triggered when the beginning of the component is encountered.
The body of the rule must contain an action which explicitly processes the content of the structural component.
When this action is executed, the current OmniMark rule is suspended.
Other OmniMark rules may fire during the processing of the content as nested structural components are encountered.
Once all of the content has been processed, the current rule resumes, and the rest of the actions are executed.

OmniMark requires that the actions for the rule process the content of these components. The content must be processed exactly once, or explicitly discarded.

The content of the component can be processed by:

executing a SUPPRESS action
processing a "%c" format item in a quoted string

The content of a component can only be processed once: it is an error for a component's actions to contain multiple references to its content unless they are conditioned in such a way that only one of the references is selected.
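
As a minimal sketch of such conditioning, the following rule contains two references to the content, but the complementary conditions ensure that exactly one of them is selected for any given element. The element name note and its status attribute are hypothetical and serve only as an illustration.

   ; "note" and "status" are assumed names, not taken from a DTD in this guide
   ELEMENT note
      SUPPRESS WHEN ATTRIBUTE status = "DRAFT"
      OUTPUT "%c%n" UNLESS ATTRIBUTE status = "DRAFT"

Draft notes are discarded, and all other notes are copied to the current output.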

4.1.1 Suppressing Content

Syntax

   SUPPRESS condition?

SUPPRESS causes the content recognized by the rule to be written to the #SUPPRESS stream. It is equivalent to:

   ...
      PUT #SUPPRESS "%zhc"

The following example shows a simple SGML program that prints all of the titles in a document whose DOCTYPE element is doc:

   DOWN-TRANSLATE

   ELEMENT doc
      SUPPRESS

   ELEMENT title
      PUT #MAIN-OUTPUT "%c%n"

   ELEMENT #IMPLIED
      OUTPUT "%c"

If there are subelements in the title, they will be processed by the "ELEMENT #IMPLIED" rule. (The "ELEMENT #IMPLIED" rule processes every element except for those handled by other ELEMENT rules. See Section 4.2.1.1, "Default Element Processing".)

OUTPUT "%c" is used to process the content of these subelements instead of SUPPRESS. This ensures that the content of the subelement "goes to the same place" as the content of the parent element. If the subelement is within a title, its content will be sent to #MAIN-OUTPUT in its proper place within the title. Otherwise it will be suppressed.

4.1.2 Processing Content

Content is processed when a "%c" format item is encountered. Only one "%c" format item or one SUPPRESS action may be evaluated in a rule.

Syntax

   % format-modifier* c

The "%c" format item can be recursive when used in some rules: its result includes the content of any subcomponents, processed according to the rules that apply to them.

The "%c" format item may have modifiers, called element content format modifiers. These modifiers apply to data characters within the processed content. They can be overridden by modifiers on a "%c" format in a rule for a subcomponent.

The element content format modifiers are:

h
The "h" format modifier prevents line-breaking rules like INSERTION-BREAK and REPLACEMENT-BREAK from applying to the content of the current component. (See Section 19.1.5, "Line Breaking").
l
The "l" modifier converts all of the text to lower-case. It applies only to letters in the processed document (data characters in content and attribute values) that are copied from the input to output. It does not apply to letters in quoted strings in the OmniMark program.
The "l" modifier cannot be used with the "u" modifier.
u
The "u" modifier converts all of the text to upper-case. It applies only to letters in the processed document (data characters in content and attribute values) that are copied from the input to output. It does not apply to letters in quoted strings in the OmniMark program.
The "u" modifier cannot be used with the "l" modifier.
s
The "s" modifier strips white space from the processed content as follows:
1. Leading and trailing spaces and line-ends are removed from components.
2. Sequences of tabs and spaces are condensed to a single space.
3. Sequences of line-ends, together with any intervening, leading or trailing tabs and spaces, are condensed to a single line-end.
The "s" modifier affects only text received directly from the SGML parser, or from characters specified with format items that explicitly allow stripping:
- "%s_"
- "%st"
- "%sr"
- "%sn"
z
The "z" format modifier turns off TRANSLATE rules that would otherwise apply to all or part of the content. (See Section 4.2.2.2, "Translating Patterns in Data Content").
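
Modifiers can be combined, as in the "%zhc" format shown in Section 4.1.1. As a sketch, assume a hypothetical listing element whose text should reach the formatter exactly as entered. The rule below turns off both TRANSLATE rules and line-breaking for the content of that element. The formatter command \verbatim is likewise an assumption.

   ; "listing" and "\verbatim" are assumed names used only for illustration
   ELEMENT listing
      OUTPUT "\verbatim{%zhc}%n"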

4.1.3 An Example of "%c" Format Modifiers

The example below uses the above modifiers to show how a simple document can be translated into a TeX-like markup language. Note the TRANSLATE rule used with the BREAK-WIDTH and REPLACEMENT-BREAK declarations. This is a standard way of processing data content without worrying about how the lines are broken originally.

Old part numbers are printed in lower-case only. New part numbers are printed with upper-case letters. The title isn't broken or stripped. Paragraph text is stripped, and this stripping is inherited by the part and old-part elements.

A typical document may look like the following:

   <!doctype doc [
   <!element doc        o o (title, para+)>
   <!element title      o o (#pcdata)>
   <!element para       - o (#pcdata|part|old-part)*>
   <!element part       - - (#pcdata)>
   <!element old-part   - - (#pcdata)>
   ]>
   Acme Llama and Haggis Supply Parts Catalogue,
   Fall, 1973
   <para>
   Our new stock includes three new Peruvian llamas (ask for
   <part/lL-33-864/).  We have also located a new haggis supplier in
   Singapore (<part/gG-33-865/), and are no longer carrying
   <old-part/Yh5-33-863A/, as our supplier in the Maldives is no longer in
   business.  This change should handle some of your requests.
   <para>
   As usual,
   we at Acme are looking forward to meeting your needs this fall.

The following OmniMark program can be used to transform the above SGML document into a format suitable for our target formatter.

   ELEMENT doc
     OUTPUT "%c"

   ELEMENT para
     OUTPUT "%n" when previous is para
     OUTPUT "%_%_%_%_%sc%n"

   ELEMENT part
     OUTPUT "\part{%uc}"

   ELEMENT old-part
     OUTPUT "\part{%lc}"

   ELEMENT title
     OUTPUT "\title{%hc}%n%n"

   TRANSLATE "%n"
     OUTPUT "%/%s_"

   BREAK-WIDTH 40

   REPLACEMENT-BREAK "%_" "%n"

OmniMark output, which can then be sent to a formatter, appears as follows:

   \title{Acme Llama and Haggis Supply Parts Catalogue, Fall, 1973}

       Our new stock includes three new
   Peruvian llamas (ask for
   \part{LL-33-864}). We have also located
   a new haggis supplier in Singapore
   (\part{GG-33-865}), and are no longer
   carrying \part{yh5-33-863a}, as our
   supplier in the Maldives is no longer in
   business. This change should handle some
   of your requests.

       As usual, we at Acme are looking
   forward to meeting your needs this fall.

The formatter instruction \part appears in lower-case despite the "u" modifier on the "%c" format in the part ELEMENT rule. This is because it is part of a format string and not copied data content. The "%c" modifiers do not usually apply to text explicitly output by the OmniMark program. The exception is that "%sn", "%st" and "%s_" format items are subject to the s modifier in "%c" format items in enclosing elements because their "s" modifiers explicitly request this.

Within an OmniMark program, every ELEMENT and DATA-CONTENT rule must account for the content of the associated structure. Either a "%c" format in a string or a SUPPRESS action (see Section 4.1.1, "Suppressing Content") must be processed. Since the "%c" format or SUPPRESS actually causes OmniMark to process the content, including any subelements, it is an error for more than one of them to be processed in a rule. They can appear in more than one action, as long as conditions ensure that only one such action is performed.

4.2 Basic SGML Rules

The rules in this section form the core of SGML processing. They are typically the most frequently used rules in an OmniMark program.

4.2.1 Processing Element Content

Syntax

   ELEMENT element-name (| element-name)* condition?
      local-declaration*
      action+

Translations from SGML are controlled by ELEMENT rules. When processing an SGML source document, OmniMark performs the actions in the applicable ELEMENT rule as each element is encountered.

The element-name is also referred to in the SGML literature as a generic identifier. This document will generally use the term element name.

When more than one element-name is given, the rule will apply to any of the elements named in the rule header.

An ELEMENT rule beginning as follows, for instance, would be invoked for every chapter as well as every appendix:

   ELEMENT chapter | appendix

The following syntactic variations are permitted:

The keyword OR can be used in place of "|".
The list of element-names may be enclosed in parentheses to improve readability.
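
For instance, the rule header for chapters and appendices could equally be written in either of the following ways. This is only a sketch of the two variations, and the actions are omitted.

   ; the keyword OR in place of "|"
   ELEMENT chapter OR appendix

   ; the list of element names enclosed in parentheses
   ELEMENT (chapter | appendix)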

4.2.1.1 Default Element Processing

Syntax

   ELEMENT #IMPLIED condition?
      local-declaration*
      action+

In some applications, the same actions may be appropriate to so many different element types that the programmer would prefer not to list the relevant element names. When #IMPLIED is used instead of an element name in an ELEMENT rule, the rule applies to all elements not accounted for by other rules.

A single OmniMark program can be applied to SGML documents with different Document Type Definitions. Thus, the programmer may not know the names of all the elements that will appear. #IMPLIED is also used in this situation.

4.2.1.2 Conditions On Element Rules

The programmer may wish to perform different actions for different elements of the same type. A condition can be specified before the actions in an ELEMENT rule to indicate that the rule is triggered only when the condition is met.

Some examples of conditions in the header of an ELEMENT rule are shown below:

Example A

   ELEMENT example WHEN ATTRIBUTE type = "COMPUTER"

Example B

   GLOBAL COUNTER list-depth
   ...
   ELEMENT (blist | nlist) UNLESS list-depth > 4

Example C

   ELEMENT #IMPLIED WHEN PARENT IS par

The first condition above introduces actions to be taken for examples of computer input. The second is used for bulleted or numbered lists in a system that precludes more than four levels of nested lists. The final example is used for all subelements of a paragraph.

4.2.1.3 Element Rule Uniqueness

The same element name may appear in more than one ELEMENT rule. In this case, every ELEMENT rule in which the element name appears must have a condition which ensures that only one of the rules will be selected.

An OmniMark program must uniquely account for all elements in a processed SGML document. It is an error if more than one ELEMENT rule applies to a single element in the document. Similarly, it is usually an error if there is no pertinent ELEMENT rule.

OmniMark requires that an ELEMENT rule be selected for every element that occurs in a document instance.
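
A minimal sketch of two rules for the same element name is shown below. It reuses the type attribute from Example A in Section 4.2.1.2, and the formatter command \computer is hypothetical. Because the two conditions are complementary, exactly one rule is selected for every example element.

   ; selected for examples of computer input
   ELEMENT example WHEN ATTRIBUTE type = "COMPUTER"
      OUTPUT "\computer{%c}"

   ; selected for every other example
   ELEMENT example UNLESS ATTRIBUTE type = "COMPUTER"
      OUTPUT "%c"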

4.2.1.4 Empty Elements

Even the content of EMPTY elements must be processed explicitly. From the point of view of OmniMark, EMPTY elements are no different from any other: it is just that when processed they are found to have no subelements or data.
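
As a sketch, assume a hypothetical element pagebreak that is declared EMPTY in the DTD. Its rule still has to account for the content, even though the "%c" format item contributes no text to the output:

   ; "pagebreak" is an assumed EMPTY element
   ELEMENT pagebreak
      OUTPUT "\newpage{}%c"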

4.2.2 Processing Data Characters

In many applications, data characters that occur in an input document are simply copied into the translated output. OmniMark's output actions make it simple to copy characters. As described in Section 4.1.2, "Processing Content", during the copying, letters can be forced to upper-case or to lower-case, and excess white space can be deleted. In addition, as addressed in Section 19.1.5, "Line Breaking", long lines can be split into several shorter ones.

There are situations, however, in which data characters are processed in other ways. OmniMark provides two types of rules (DATA-CONTENT and TRANSLATE) for specifying special treatment of data characters that occur in an SGML document.

4.2.2.1 The Data Content Rule

Syntax

   DATA-CONTENT condition?
      local-declaration*
      action+

DATA-CONTENT rules process strings of data characters within an SGML document.

A DATA-CONTENT rule is invoked once for every unbroken string of data content (consisting of data characters and entity references). If it has a condition, then the DATA-CONTENT rule will only fire if the condition is satisfied.

Data content is deemed to occur whenever the SGML parser encounters text characters or expansions of CDATA or SDATA entity references. #PCDATA content that matches zero characters does not count, whereas a CDATA or SDATA entity expansion that contains zero characters does count. DATA-CONTENT rules can process data characters within CDATA and RCDATA elements as well.

The following things in the input will always break up a string of data content:

a start-tag
an end-tag
a processing instruction
an external data entity reference

Some things will only break up a string of data content if the OmniMark program contains the type of rule that processes those things. For instance, if the OmniMark program contains any SGML-COMMENT rules, then data content is always broken by SGML comments. The data content is broken up even if the SGML comment fails to match any of the SGML-COMMENT rules in the program. If there are no SGML-COMMENT rules, then the data content is never broken by SGML comments. The presence of each of the following rules causes the corresponding things to break up data content:

SGML-COMMENT: SGML comments
"MARKED-SECTION IGNORE": IGNORE marked sections
"MARKED-SECTION RCDATA": the start and end of RCDATA marked sections
"MARKED-SECTION CDATA": the start and end of CDATA marked sections
"MARKED-SECTION INCLUDE-START": the start and end of INCLUDE marked sections
"MARKED-SECTION INCLUDE-END": the start and end of INCLUDE marked sections

While it is not necessary that a DATA-CONTENT rule apply to a fragment of text, it is an error if more than one DATA-CONTENT rule is selected.

The actions for the rule must specify how the triggering text string is to be processed. As in output actions within an ELEMENT rule, "%c" refers to the content that triggered the rule. A possible DATA-CONTENT rule is shown below:

   GLOBAL SWITCH title-has-content
   ...

   DATA-CONTENT WHEN ELEMENT IS title
     SET title-has-content TO TRUE
     OUTPUT "%c"

   ELEMENT title
      SET title-has-content TO FALSE
      OUTPUT "%c"
      DO WHEN ! title-has-content
         PUT #ERROR "Error: title has no content!"
      DONE

Since SGML permits the content model token #PCDATA to be matched by the empty string, this rule is used to verify that a title actually contains some data. Like the ELEMENT rule, the DATA-CONTENT rule must process its data content exactly once, either using a "%c" format item, or the SUPPRESS action described in Section 4.1.1, "Suppressing Content".

DATA-CONTENT rules are not permitted in cross-translations.

4.2.2.2 Translating Patterns in Data Content

Syntax

   TRANSLATE pattern condition?
      local-declaration*
      action*

The TRANSLATE rule is the other type of rule useful for processing data characters in an SGML document. A TRANSLATE rule is triggered when data matching a specified pattern occurs. The matched text must be contained in a single element.

A frequent use of TRANSLATE rules is processing the delimiter characters used by the document formatter that will process OmniMark output. For example, many text-processing systems use the backslash character "\" to start a command. To emit the backslash as data, OmniMark must output the formatter's instruction to generate a backslash instead of the character itself. Often, the instruction consists of a pair of backslashes. The following TRANSLATE rule performs the substitution:

   TRANSLATE "\"
     OUTPUT "\\"

As another example, a TRANSLATE rule can be used to enforce the convention that closing punctuation should be typeset within quotation marks. The desired output can be produced without forcing the document's author to remember the convention. The following rule reverses the two characters when a period or comma follows a quotation mark:

   TRANSLATE '"' ('.' | ',') => punctuation
     OUTPUT '%x(punctuation)"'

The pattern consists of a double-quote character followed by either a period or a comma. The period or the comma is saved in the pattern variable punctuation. The value of the punctuation pattern variable is accessed with the "%x" format described in Section 3.3.8.3, "Formatting a Pattern Variable".

TRANSLATE rules apply to data characters in the content of every element. They also apply to values of CDATA attributes that are copied to the output. Finally, they apply to characters in referenced internal CDATA and SDATA entities. TRANSLATE rules only apply to characters copied from the input, to values of attributes copied directly to a stream, and to references of "internal" CDATA and SDATA entities that are expanded by the SGML parser.

When the conditions and patterns of more than one TRANSLATE rule apply to a text string, OmniMark performs the actions associated with the first such rule to appear in the program. The result is used by the enclosing ELEMENT or DATA-CONTENT rules. When a data character is not replaced by a TRANSLATE rule, it is passed unchanged to the enclosing rule. It may be altered by modifiers placed on the "%c" format item in the enclosing rule.
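
Because the first applicable rule in the program is the one selected, the order of TRANSLATE rules matters when their patterns can match at the same point in the text. In the following sketch, the formatter commands \dash and \hyphen are hypothetical. The rule for two hyphens is placed first. If the rules were reversed, the single-hyphen rule would be selected wherever both apply, and the two-hyphen rule would not be used.

   ; the longer pattern comes first so that it takes precedence
   TRANSLATE "--"
      OUTPUT "\dash{}"

   ; any remaining single hyphen
   TRANSLATE "-"
      OUTPUT "\hyphen{}"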

As discussed in Section 4.1.2, "Processing Content" and Section 14.4.4, "Attribute Format Items", actions in other rules can suppress character translation for selected parts of the text. In particular, the z format modifier prohibits the actions of a TRANSLATE rule, even if its pattern is found and its condition is met.

TRANSLATE rules are not permitted in cross-translations.

4.2.2.3 Patterns for CDATA and SDATA Entities

OmniMark provides patterns specifically designed to match CDATA and SDATA entities in TRANSLATE rules.

4.2.2.3.1 Matching Internal Text Entities

Internal text entities cannot be matched because the ISO 8879 standard mandates that they be indistinguishable from ordinary text. This is because the replacement text of text entities can contain markup characters that could straddle element boundaries.

In practice this is not a serious restriction, since entities which are used to represent special characters should always be coded as SDATA entities. Annex D.4 of ISO 8879 defines many such entities.

4.2.2.3.2 Matching Entities Based On Replacement Text

When processing SGML input, some patterns distinguish data characters occurring in parsed character data (or the content of CDATA and RCDATA elements) from characters in referenced data entities. These patterns do not apply to external data entities which are addressed in EXTERNAL-ENTITY rules. Thus, they pertain only to CDATA and SDATA entities whose replacement text appears in their declarations. Since these patterns only apply to SGML documents, they can only be used in TRANSLATE rules.

Any pattern with an occurrence indicator, or any pattern that could have an occurrence indicator, can be restricted within or outside replacement text for such entities. To match the expansion of a CDATA or SDATA entity use one of the following keywords:

CDATA: prefixes a pattern that only matches the replacement text of a CDATA entity.
SDATA: prefixes a pattern that only matches the replacement text of an SDATA entity.
ENTITY: prefixes a pattern that matches the replacement text of either a CDATA or SDATA entity.

A pattern prefixed by any of the above keywords must match the complete replacement text of a single referenced entity.

To prevent a pattern from matching with all or part of an entity expansion use one of the following keywords:

PCDATA: prefixes a pattern that only matches data characters that do not include replacement text for such entities.
TEXT: prefixes a pattern that is unrestricted in this regard. Matched text can overlap the replacement text of one or more such entities.
NON-CDATA: prefixes a pattern that cannot include replacement text of a referenced CDATA entity. The matched text can include all or part of the replacement text for one or more SDATA entities.
NON-SDATA: prefixes a pattern that cannot include replacement text of a referenced SDATA entity. The matched text can include all or part of the replacement text for one or more CDATA entities.

For example, suppose an SGML document contains the following entity declaration:

   <!ENTITY sect SDATA "[sect]">

This entity represents the section character §. Its replacement text is the specific data for a particular computer system to print the character "§". A TRANSLATE rule in the OmniMark program that prepares input for a particular formatter can replace the generic form with the one appropriate to the formatter. Assuming the appropriate instruction is \'a0, a possible rule is:

   TRANSLATE SDATA "[sect]"
     OUTPUT "\'a0"
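
The restricting keywords are used in the same position. As a sketch, assume that the replacement text of the document's CDATA and SDATA entities already contains formatter commands and must not be escaped. Under that assumption, the backslash rule from Section 4.2.2.2 could be limited to backslashes that occur outside such replacement text:

   ; matches only backslashes that are not part of an entity expansion
   TRANSLATE PCDATA "\"
      OUTPUT "\\"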

4.2.2.3.3 Matching Entities Based On Names

Matching can also be done on entity names, and whether or not a match succeeds is based on matching the name. For example, the following TRANSLATE rule succeeds for any internal CDATA or SDATA entity whose name consists of a single letter, and simply outputs the name of the entity in parentheses (the replacement text of the entity is ignored):

   TRANSLATE ENTITY NAMED LETTER => name
      OUTPUT "(%x(name))"

A common use of the NAMED option in internal entity matches is to identify an SDATA entity by name. For example:

   TRANSLATE SDATA NAMED "amp"
      OUTPUT "&"

As in the case for matching the replacement text of an internal CDATA or SDATA entity, the pattern that follows the keyword NAMED must match the whole of an entity's name.

The following subsections contain examples of the flexible approach to capturing and processing internal entities supported by OmniMark. It will be very rare that one OmniMark program will use all of these techniques, but OmniMark programmers should be familiar with them, so that for a given application, the appropriate technique can be chosen.

4.2.2.3.4 Matching On Both An Entity's Name And Replacement Text

A match can be based on both the name and the value of an internal entity.

During the development of an application, it is a convenient and commonly used convention to define character-representing internal SDATA entities with a fixed value, such as "TBD" or "[default]". A rule that matches all such entities and only those (and which extracts the selected entities' names) is:

   TRANSLATE SDATA VALUED "TBD" NAMED ANY+ => entity-name
      OUTPUT "{\ul %x(entity-name)}"

If both an entity's name and replacement text are to be matched, then the patterns for both the value and the name must match. The NAMED and VALUED keywords (and the patterns that follow them) can be used in either order (either NAMED or VALUED can be used first).
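
For instance, the rule above could also be written with the NAMED pattern first. This is a sketch only, and the parentheses are added here simply to make the extent of the NAMED pattern explicit:

   ; same match as before, with NAMED and VALUED in the opposite order
   TRANSLATE SDATA NAMED (ANY+ => entity-name) VALUED "TBD"
      OUTPUT "{\ul %x(entity-name)}"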

Matching both a name and value is especially convenient when the "default" SDATA value is associated with the default general entity, so that all "undefined" entities and their names are captured by a rule such as the one above. This avoids the requirement to anticipate all entities that a user may need during the development of an OmniMark program -- specific processing can be added at a later time.

4.2.2.3.5 Combining Internal Entity and Plain-Text Matching

Internal SDATA entities are often used to represent characters that are not directly available in the character set being used, either at a particular location, or in a "lowest-common-denominator" interchange file. SDATA and CDATA entities can be matched as part of a larger pattern, as in the following example:

   TRANSLATE "AT" SDATA NAMED "amp" "T"
      OUTPUT "\ITALIC(AT&T)"

A multitude of SDATA entities that represent individual characters is defined in Annex D of ISO 8879. Combining entity and other matches in a TRANSLATE rule allows an entity to be treated as just another character.

Care must be taken in composing patterns that include entity matching. In the preceding example, the letter "T" is matched following the SDATA entity -- the "T" is not part of what is matched as the entity's name. Parentheses can be used to modify this behaviour. If the pattern were the following, the entity name would have to be "ampT":

   TRANSLATE "AT" SDATA NAMED ("amp" "T")
      ...

Any form of entity match can be combined with other text matching. If, for example, the "ampersand" character were matched based on its replacement text rather than its name, the following TRANSLATE rule could be used instead of that in the previous example:

   TRANSLATE "AT" SDATA "[amp   ]" "T"
      OUTPUT "\ITALIC(AT&T)"

4.2.2.3.6 Pattern Matching Internal Entity Names

There are many hundreds of character-representing SDATA entities defined in Annex D.4 of ISO 8879, the SGML standard, and many more are in use. There is usually a convention for constructing their names. The use of patterns to match their names allows whole classes of entities to be processed by a single TRANSLATE rule. For example, all accented forms of common European characters could be processed in the same manner, with a rule similar to the following:

   GLOBAL STREAM accent-representation VARIABLE
   GLOBAL STREAM backspace-command     SIZE 1
   ...

   TRANSLATE SDATA
             NAMED (["AaEeIiOoUu"] => vowel
                    ("grave" | "acute" | "circ" | "uml") => accent)
      OUTPUT vowel || backspace-command || accent-representation ^ accent

In the example, accent-representation is a keyed shelf, initialized elsewhere in the program. accent-representation ^ accent retrieves the text sequence that represents the specified accent. (The "^" operator is discussed in Chapter 7, "Shelves".)

(It is also assumed that the text formatter supports backspacing of "floating" accents. Note that the example does not cover all cases of accents in European languages.)

Alternatively, a user may choose to give all entities in a certain class a common prefix. For example, if a set of mathematical symbols all start with a capital M, and the letters following the M correspond to codes used by a text formatter, the following rule can be used:

   TRANSLATE SDATA NAMED ("M" ANY+ => id)
      OUTPUT "\MATH{(%x(id))}"

Patterns already provide mechanisms for alternation ("|" or OR) and for capturing matched text. This allows more than one name to be matched, as in the accented letter example. With the very large number of different characters in use, and the general use of SDATA entities to represent them, some way of managing large sets of names is required. This is provided by the TRANSLATE rule.

4.2.2.4 Writing Data Content to Multiple Streams

When data content (using a "%c" format item), or a CDATA attribute value (using a "%v" format item) is written to the current set of output streams, then the data content or attribute text is first passed to any applicable TRANSLATE rules, and the result of their processing is what is written to those streams.

The "z" format modifier can be used to bypass the TRANSLATE rules.

As a consequence of this processing, any side effect of an action in a fired TRANSLATE rule or in any function called within the pattern at the head of such a rule, or in the body of such a rule, occurs only once, even though the side effect may affect more than one stream.

4.2.3 Processing Instructions

Processing instructions can be entered directly into an SGML document or entered through PI entities. In either case, OmniMark ignores them unless a PROCESSING-INSTRUCTION rule applies.

Syntax

   PROCESSING-INSTRUCTION pattern condition?
      local-declaration*
      action*

The PROCESSING-INSTRUCTION rule is selected when a processing instruction occurs whose entire text matches the pattern, and the condition is satisfied. As with other pattern-based rules, if more than one could be selected, OmniMark performs the actions defined in the rule that first appears in the OmniMark program. If a processing instruction occurs in the document but no OmniMark PROCESSING-INSTRUCTION rule is selected, OmniMark simply discards the processing instruction. In this way, processing instructions pertinent to one application can occur in the SGML document without affecting the way a different application is processed.

For example, suppose a document contains "<?newpage>" processing instructions. An OmniMark down-translation that pays attention to these processing instructions could contain the following PROCESSING-INSTRUCTION rule:

   PROCESSING-INSTRUCTION "newpage"
     OUTPUT "\newpage{}"

A program that is translating SGML into the language of a text formatter that does a good job of determining where pages should be broken could simply omit such a PROCESSING-INSTRUCTION rule, so that the "<?newpage>" processing instructions are discarded.
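
The rule ordering described above can also be used to report processing instructions that a program does not recognize, rather than letting them be discarded silently. The following is a sketch only, and the wording of the message is an assumption. The specific rule appears first, so the general rule is selected only for processing instructions that the first rule does not match.

   PROCESSING-INSTRUCTION "newpage"
      OUTPUT "\newpage{}"

   ; catch-all: selected only when no earlier rule applies
   PROCESSING-INSTRUCTION ANY* => pi-text
      PUT #ERROR "Unrecognized processing instruction: %x(pi-text)%n"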

PROCESSING-INSTRUCTION rules are not permitted in cross-translations.

4.2.3.1 Processing Instruction Entities

Processing instructions can be the replacement text of "processing instruction entities". These entities differ from other entities whose replacement text is a fully-formed processing instruction in that the text of a processing instruction entity can include non-SGML characters and the string chosen for the PIC delimiter. (The PIC delimiter closes a processing instruction. It is ">" by default).

A PROCESSING-INSTRUCTION rule allows the OmniMark programmer:

to constrain selection of the rule based on the name of the PI entity whose reference was interpreted as a processing instruction, and
to determine the entity name of a matched processing instruction.

A PROCESSING-INSTRUCTION rule can use the keywords NAMED and VALUED in the same way as entity matches in a TRANSLATE rule. The following example illustrates recreating the original PI entity reference if a processing instruction was entered with such a reference, and recreating the processing instruction itself in all other cases.

   PROCESSING-INSTRUCTION NAMED ANY* => pi-entity-name
      OUTPUT "&%x(pi-entity-name);"
   PROCESSING-INSTRUCTION VALUED ANY* => pi-text
      OUTPUT "<?%x(pi-text)>"

A PROCESSING-INSTRUCTION rule using the NAMED keyword will only match a processing instruction that is the replacement for a PI entity reference. It will not match if the processing instruction is entered directly in an SGML document or if the processing instruction is the replacement text of an entity that is not a PI entity. If NAMED is not used (i.e. only VALUED is used or neither NAMED nor VALUED) then any processing instruction can be matched by the rule, whether entered directly or by a reference to a PI entity.

NAMED and VALUED can be used in a PROCESSING-INSTRUCTION rule individually or together and in either order. Unlike the case for entity matching in a TRANSLATE rule, a processing instruction is not matched in the context of surrounding characters. Therefore the pattern following NAMED or VALUED in a PROCESSING-INSTRUCTION rule can contain multiple parts (even joined with "|" (OR)) without the use of parenthesization. However, parentheses can be used in PROCESSING-INSTRUCTION rules for consistency.
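
As a sketch, assume two hypothetical PI entities named newpage and pagebreak that should be treated alike. Following the description above, the alternatives after NAMED are written without parentheses:

   ; "newpage" and "pagebreak" are assumed PI entity names
   PROCESSING-INSTRUCTION NAMED "newpage" | "pagebreak"
      OUTPUT "\newpage{}"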

4.2.3.2 Record Ends in Processing Instructions

Record ends in processing instructions are not subject to the same rules as record ends in data content and attribute value text. The text in processing instructions is subject to the same processing as the text in SGML comments and IGNORE marked sections: any record-end/record-start sequence is replaced by the string specified by the SGML-OUT action. This processing is described in more detail in Section 19.2.5, "Record Ends in SGML Comments, Marked Sections and PIs".

4.3 SGML Comments and Marked Sections

SGML comments are typically intended for the human reader of a document. This means that programs are not, in general, interested in SGML comments. The exception is when the output of a program is itself intended to be read by humans, and it is therefore appropriate to copy the comments from the input SGML document to the output. An important example of the latter type of processing is converting SGML documents to SGML -- from one DTD to another, say, or enhancing a document with further elements, data content and attributes. Copying the SGML comments over is most important when doing a "near identity transformation", when the "converted" document is identical to the source document, with only some parts or some aspects changed.

SGML marked sections are used for a variety of purposes:

INCLUDE and IGNORE marked sections are typically used for determining which parts of the text of an SGML document instance are to be passed to an SGML parser for processing and which parts are to be ignored (and not passed on to the processing programs) by the SGML parser.
CDATA and RCDATA marked sections are used to indicate that part of the text of an SGML document is just text -- apart from the markup that ends the marked section and entity references in RCDATA marked sections, everything is treated as text.

Most programs that process SGML documents are interested in the text and the element structure of the documents, and are not interested in how the SGML parser decided what is what. For example, most processing programs are quite content that IGNORE marked sections are ignored. However, as in the case of SGML comments, programs whose output is an SGML document, especially one that is close to the input SGML document, will often want to preserve the marked section information, and will want to preserve the text inside IGNORE marked sections (treating it like an SGML comment, in effect).

OmniMark allows the OmniMark programmer to identify and process SGML comments and marked sections, including the text of comments and IGNORE marked sections. The OmniMark programmer can select which types of marked sections are to be specially processed, and whether or not SGML comments are to be processed.

Depending on the type of marked section, either the marked section and the text it contains are processed by a single OmniMark rule, or, as in the case of INCLUDE marked sections, the start and end of the marked section are processed by separate rules.

4.3.1 Processing SGML Comments

Syntax

   SGML-COMMENT condition?
      local-declaration*
      action+

An SGML comment appears between what SGML terminology calls COM delimiters (usually "--" (double dash)). SGML comments can appear in any declaration, including the USEMAP and marked section declarations that can be used in a document instance. They can also occur in declarations of their own, called SGML comment declarations.

The distinction between SGML comments and comment declarations is not always made clear when talking about comments, and causes some confusion, particularly with respect to markup like "<!>", which is a comment declaration without a comment in it. On the other hand, any declaration, including an SGML comment declaration, can contain more than one comment, as in the following comment declaration, which contains two comments:

   <!--first comment-- --second comment-->

In the following, each "comment" is the text of a comment, whereas "not a comment" isn't (it is part of the text ignored inside the marked section). The last comment declaration does not contain a comment:

   <!USEMAP --comment-- my-map --comment-->
   <![--comment-- IGNORE --comment--[
   <!--not a comment-->
   ]]>
   <!--comment-- --comment-->
   <!>

An SGML-COMMENT rule is performed whenever an SGML comment is found in a document and the condition at the start of the SGML-COMMENT rule, if any, succeeds. For example, the following rule outputs the text of any SGML comment in a document, on a line by itself, surrounded by braces:

   SGML-COMMENT
      OUTPUT "{%c}%n"

The output from processing the sample input above with this SGML-COMMENT rule would be:

   {comment}
   {comment}
   {comment}
   {comment}
   {comment}
   {comment}

In other words, it would capture each of the comments.

The following statements apply to SGML-COMMENT rules:

If no SGML-COMMENT rule is performed for an SGML comment, then the comment text is discarded.
If an OmniMark program contains no SGML-COMMENT rules, then all comments are discarded.
Only one SGML-COMMENT rule may be selected for an SGML comment. That is, either there must only be one SGML-COMMENT rule or, if there is more than one SGML-COMMENT rule, each one of them must have a condition, as in the following example:
```
   SGML-COMMENT WHEN ELEMENT IS p
      OUTPUT " (NOTE: %c)"
   SGML-COMMENT WHEN ELEMENT ISNT p
      OUTPUT "   NOTE: %c%n"
```
It is an error for more than one SGML-COMMENT rule to be selected for an SGML comment.
There may be zero, one or more than one SGML comment in any declaration in an SGML document, including in a comment declaration.
The "%c" format item captures the text of a comment, as in the example above. Either "%c" or SUPPRESS must be used exactly once in an SGML-COMMENT rule.
The "u", "l", "s", "h", and "z" format modifiers can be used on a "%c" format item in an SGML-COMMENT rule.
The text of a comment consists of all the characters between the two COM delimiters ("--" and "--"), not including the COM delimiters, but including any record ends or white space within the comment. For example, assuming the first sample SGML-COMMENT rule in this section, and the following input:
```
   <!--first comment-->
   <!--  second comment
   -->
```
the output would be:
```
   {first comment}
   {  second comment
   }
```
SGML comments in the SGML Declaration (i.e. "<!SGML ...>") are always ignored, whether or not there is any SGML-COMMENT in the OmniMark program. All comments in the document prolog (containing the DTD) and document instance are available for processing.
The setting of the SGML-OUT action determines what happens to record ends in comment text. See Section 19.2.5, "Record Ends in SGML Comments, Marked Sections and PIs".
The presence of SGML-COMMENT rules affects how TRANSLATE rules match text around a comment. See Section 4.3.6.3, "TRANSLATE Rule Boundaries".
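
For the "near identity transformation" case described at the start of Section 4.3, a minimal sketch of a rule that copies comments to the output might look like the following. It assumes the default COM delimiter ("--") and writes every comment in a comment declaration of its own. A declaration that contained two comments in the input therefore becomes two declarations in the output, and a comment found inside another declaration (such as USEMAP) is emitted the same way.

```
   ; re-create each comment as a comment declaration of its own
   SGML-COMMENT
      OUTPUT "<!--%c-->"
```

Because only one SGML-COMMENT rule may be selected for a comment, this unconditional rule would be the only SGML-COMMENT rule in the program.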

4.3.2 IGNORE Marked Sections

Syntax

   MARKED-SECTION IGNORE condition?
      local-declaration*
      action+

IGNORE marked sections appear to an OmniMark program in just the same way as SGML comments, except that they are processed using a "MARKED-SECTION IGNORE" rule rather than an SGML-COMMENT rule. As an example, given the following marked sections:

   <![IGNORE[ignored text]]>
   <![IGNORE[]]>
   <![IGNORE[<![IGNORE[nested ignored text]]>]]>

and the following "MARKED-SECTION IGNORE" rule:

   MARKED-SECTION IGNORE
      OUTPUT "(%c)%n"

the following output would be produced:

   (ignored text)
   ()
   (<![IGNORE[nested ignored text]]>)

OmniMark programmers should note that, in keeping with the provisions of clause 10.4.1 of the SGML standard (ISO 8879:1986), all pairs of "<![" and "]]>" within an IGNORE marked section are matched and treated as text. This means that any marked sections nested within an IGNORE marked section, including the opening and closing delimiters, are treated as part of the text of the IGNORE marked section, as illustrated by the third marked section in the sample above -- the inner

   <![IGNORE[nested ignored text]]>

is text within the outer marked section.

The following statements apply to "MARKED-SECTION IGNORE" rules. They are very similar to those for SGML-COMMENT rules.

If no "MARKED-SECTION IGNORE" rule is performed for an IGNORE marked section, then the text in the marked section is discarded.
If an OmniMark program contains no "MARKED-SECTION IGNORE" rules, then all IGNORE marked sections are discarded.
Only one "MARKED-SECTION IGNORE" rule may be selected for an IGNORE marked section. That is, either there must only be one "MARKED-SECTION IGNORE" rule or, if there is more than one "MARKED-SECTION IGNORE" rule, each one of them must have a condition, as in the following example:
```
   MARKED-SECTION IGNORE WHEN ELEMENT IS p
      OUTPUT " (THE OTHER VERSION SAYS: %c)"
   MARKED-SECTION IGNORE WHEN ELEMENT ISNT p
      OUTPUT "   NOTE -- THE OTHER VERSION SAYS: %c%n"
```
It is an error for more than one "MARKED-SECTION IGNORE" rule to be selected for an IGNORE marked section.
The "%c" format item captures the text of an IGNORE marked section, as in the example above. Either "%c" or SUPPRESS must be used exactly once in a "MARKED-SECTION IGNORE" rule.
The "u", "l", "s", "h" and "z" modifiers can be used on a "%c" format item in a "MARKED-SECTION IGNORE" rule.
The text of an IGNORE marked section consists of all the characters between the DSO delimiter following the status keyword specification, and the marked section end (i.e. between the "[" following the keyword IGNORE and the "]]>"). The text does not include the surrounding delimiters, but does include any record ends or white space within the marked section. For example, assuming the "MARKED-SECTION IGNORE" rule given at the start of this section, and the following input:
```
   <![IGNORE[marked
   section text
     ]]>
```
the output would be:
```
   (marked
   section text
     )
```
Any SGML comment in the header of an IGNORE marked section is processed prior to the processing of the IGNORE marked section. So, for example, assuming the following SGML-COMMENT rule (from the previous section):
```
   SGML-COMMENT
      OUTPUT "{%c}%n"
```
and the "MARKED-SECTION IGNORE" rule at the start of this section, the following marked section:
```
   <![--first comment-- IGNORE -- second comment -- [
   <![--this is part of the marked section-- IGNORE [text]]>
   ]]>
```
would produce the following output:
```
   {first comment}
   { second comment }
   (<![--this is part of the marked section-- IGNORE [text]]>
   )
```
Only marked sections in the document instance are available for processing by an OmniMark program. Marked sections in the DTD are always ignored, whether or not there is any MARKED-SECTION rule in the OmniMark program.
The setting of the SGML-OUT action determines what happens to record ends in the text of an IGNORE marked section. See Section 19.2.5, "Record Ends in SGML Comments, Marked Sections and PIs".
The presence of "MARKED-SECTION IGNORE" rules affects how TRANSLATE rules match text in and around an IGNORE marked section. See Section 4.3.6.3, "TRANSLATE Rule Boundaries".
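
When the output is itself an SGML document, the text of an IGNORE marked section can be preserved by re-creating the marked section around it. The following is a sketch only. It assumes the default delimiters and always writes the literal keyword IGNORE, so a status keyword that was supplied in the input through a parameter entity reference is not reproduced. Any comment in the header is handled separately, by an SGML-COMMENT rule if there is one.

```
   ; re-create the marked section around its original text
   MARKED-SECTION IGNORE
      OUTPUT "<![IGNORE[%c]]>"
```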

4.3.3 Processing CDATA and RCDATA Marked Sections

CDATA and RCDATA marked sections serve to protect text from being misinterpreted as markup (start tags, end tags, entity references or declarations). These marked sections affect how the data is parsed by the SGML parser, but they do not normally affect the way that OmniMark processes the resulting data content.

"MARKED-SECTION CDATA" and "MARKED-SECTION RCDATA" rules can be used to identify content that was wrapped in a CDATA or RCDATA marked section.

Syntax

   MARKED-SECTION CDATA condition?
      local-declaration*
      action+

Syntax

   MARKED-SECTION RCDATA condition?
      local-declaration*
      action+

If there is no "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule to process a CDATA or RCDATA marked section, the resulting text of the marked section is treated the same way as ordinary data content. If there is an applicable "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule, then that rule determines how the OmniMark rules process the text within the marked section.

It is very important to understand that the presence or absence of MARKED-SECTION rules does not affect how marked sections are treated by the SGML parser. It only determines how the SGML parser presents the resulting text to OmniMark.

A similar set of statements applies to CDATA and RCDATA marked sections as applies to IGNORE marked sections. The major difference is that the "default" processing for these two types of marked section is to treat their text content as data content, and not to discard it.

If an OmniMark program contains no "MARKED-SECTION CDATA" rules, then OmniMark treats the text resulting from the CDATA marked sections as if the text resulted from ordinary data content. In other words, OmniMark does not detect the boundaries between the text originating from inside the marked section and the text originating from outside the marked section.
Similarly, if an OmniMark program contains no "MARKED-SECTION RCDATA" rules, then the text within all RCDATA marked sections is treated as if it were produced by ordinary data content as well.
Only one "MARKED-SECTION CDATA" rule may be selected for a CDATA marked section. That is, either there must only be one "MARKED-SECTION CDATA" rule or, if there is more than one "MARKED-SECTION CDATA" rule, each one of them must have a condition. Similarly, only one "MARKED-SECTION RCDATA" rule may be selected for an RCDATA marked section.
It is an error for more than one "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule to be selected for a CDATA or an RCDATA marked section.
The "%c" format item captures the text of a CDATA or RCDATA marked section. Either "%c" or SUPPRESS must be used exactly once in a "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule.
All modifiers supported by "%c" can be used on a "%c" format item in a "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule.
The text of a CDATA or RCDATA marked section consists of all the characters between the "[" following the keyword CDATA or RCDATA and the "]]>", not including the surrounding delimiters, but including any record ends or white space within the marked section.
Any SGML comment in the header of a CDATA or RCDATA marked section is processed prior to the processing of the marked section.
Only marked sections in the document instance are available for processing by an OmniMark program. Marked sections in the DTD are always ignored, whether or not there is any MARKED-SECTION rule in the OmniMark program.
The setting of the SGML-OUT action determines what happens to record ends in the text of CDATA and RCDATA marked sections. See Section 19.2.5, "Record Ends in SGML Comments, Marked Sections and PIs".
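
As an illustration of the rule-uniqueness requirement, the following pair of conditional rules re-creates the CDATA marked section only inside one element and passes the text through unchanged everywhere else. This is a sketch: the element name "example" is hypothetical, and the default marked section delimiters are assumed.

```
   ; "example" is a hypothetical element name
   MARKED-SECTION CDATA WHEN ELEMENT IS example
      OUTPUT "<![CDATA[%c]]>"
   MARKED-SECTION CDATA WHEN ELEMENT ISNT example
      OUTPUT "%c"
```

The two conditions are complementary, so exactly one of the rules is selected for any CDATA marked section.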

4.3.4 INCLUDE Marked Sections

Syntax

   MARKED-SECTION INCLUDE-START condition?
      local-declaration*
      action*

Syntax

   MARKED-SECTION INCLUDE-END condition?
      local-declaration*
      action*

SGML comments, and IGNORE, CDATA and RCDATA marked sections are all processed similarly. INCLUDE marked sections, however, require quite a different approach. Instead of one rule to process an INCLUDE marked section, OmniMark provides two: one for processing the start of a marked section and one for the end. This split is necessitated by the fact that, unlike the other types of marked section, an INCLUDE marked section can start in the context of one element and end in another, and so can overlap the hierarchical structure that ties the components of a parsed SGML document together. An example of an INCLUDE marked section overlapping the element structure of a document is the following:

   <title>Part of the title.
   <![INCLUDE[More of the title.
   <p>The first paragraph.
   <p>Part of the second paragraph.
   ]]>More of the second paragraph.

This kind of overlapping cannot happen with IGNORE, CDATA or RCDATA marked sections because those types of marked sections inhibit the recognition of other markup, including start and end tags, within their text. An important consequence of this is that the whole of the text of an IGNORE, CDATA or RCDATA marked section is processed with one set of output streams (as used by the OUTPUT action and as available using the #CURRENT-OUTPUT stream set) and inherits the stream destinations and stream modifiers from the ELEMENT or DATA-CONTENT rule that processes the surrounding content.

The contents of an INCLUDE marked section, as in the example, can be part of one or more elements, the ELEMENT and DATA-CONTENT rules for which may each specify different output destinations and stream modifiers. To avoid all the complexity and user confusion that could result from trying to "merge" the specifications of the rules for INCLUDE marked sections and the applicable ELEMENT and DATA-CONTENT rules, the INCLUDE marked section rules only apply to the start and end of an INCLUDE marked section. The INCLUDE marked section's rules have no direct influence on the processing of the marked section's content. The two rules are the "MARKED-SECTION INCLUDE-START" and "MARKED-SECTION INCLUDE-END" rules, as in the following example:

   MARKED-SECTION INCLUDE-START
      DO WHEN ELEMENT IS (p | title)
         OUTPUT " (Start of bracketed text)"
      ELSE
         OUTPUT "(Start of bracketed text)%n"
      DONE

   MARKED-SECTION INCLUDE-END
      DO WHEN ELEMENT IS (p | title)
         OUTPUT " (End of bracketed text)"
      ELSE
         OUTPUT "(End of bracketed text)%n"
      DONE

The OmniMark program can influence the processing of the content of an INCLUDE marked section by setting global variables and testing them in ELEMENT and DATA-CONTENT rules, so that those rules can detect when they occur in an INCLUDE marked section.

The following statements apply to "MARKED-SECTION INCLUDE-START" and "MARKED-SECTION INCLUDE-END" rules:

If no "MARKED-SECTION INCLUDE-START" rule is performed for the start of an INCLUDE marked section, then the starting markup is ignored by the OmniMark program (though not by the SGML parser). If no "MARKED-SECTION INCLUDE-END" rule is performed for the end of an INCLUDE marked section, then the ending markup is similarly ignored.
If an OmniMark program contains no "MARKED-SECTION INCLUDE-START" rules, then all INCLUDE marked section starts are ignored by the OmniMark program. Similarly, if an OmniMark program contains no "MARKED-SECTION INCLUDE-END" rules, then all INCLUDE marked section ends are ignored.
Only one "MARKED-SECTION INCLUDE-START" rule may be selected for an INCLUDE marked section. That is, either there must only be one "MARKED-SECTION INCLUDE-START" rule or, if there is more than one "MARKED-SECTION INCLUDE-START" rule, each one of them must have a condition.
Similarly, only one "MARKED-SECTION INCLUDE-END" rule may be selected for an INCLUDE marked section.
Only marked sections in the document instance are available for processing by an OmniMark program. Marked sections in the DTD are always ignored, whether or not there is any MARKED-SECTION rule in the OmniMark program.
Neither the "%c" format item nor the SUPPRESS action can be used in a "MARKED-SECTION INCLUDE-START" or "MARKED-SECTION INCLUDE-END" rule.
The "MARKED-SECTION INCLUDE-START" rule is performed when the "[" at the end of the header of an INCLUDE marked section is encountered. Any comment in the header of an INCLUDE marked section is processed prior to the processing of the "MARKED-SECTION INCLUDE-START" rule.
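
The global-variable technique mentioned above can be sketched as follows. The switch name "in-include" is hypothetical, and the sketch assumes the GLOBAL SWITCH declaration, the ACTIVATE and DEACTIVATE actions and the ACTIVE test behave as described in the chapters of this guide that cover variables. It is a fragment rather than a complete program.

```
   ; "in-include" is a hypothetical switch name
   GLOBAL SWITCH in-include INITIAL {FALSE}

   MARKED-SECTION INCLUDE-START
      ACTIVATE in-include

   MARKED-SECTION INCLUDE-END
      DEACTIVATE in-include

   ; flag paragraphs that start inside an INCLUDE marked section
   ELEMENT p
      OUTPUT "(bracketed) " WHEN ACTIVE in-include
      OUTPUT "%c%n"
```

A single switch cannot tell when nested INCLUDE marked sections have all ended. A program that must handle nesting would need to count starts and ends instead.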

4.3.5 Trapping Illegal Input

Syntax

   INVALID-DATA condition?
      local-declaration*
      action*

The INVALID-DATA rule is used to process erroneous input. The DTD restricts where data content is permitted to occur in an SGML document instance. The INVALID-DATA rule is intended to process data content which violates these restrictions.

The selection of an invalid data rule to perform is determined by the currently active groups and the condition, if any, on each invalid data rule in an active group, like any other output processor rule. The "%c" format item is used in the body of the INVALID-DATA rule to capture the data in question, and either it or the SUPPRESS action must be used (and only once).

If there are no INVALID-DATA rules in an OmniMark program, and invalid data is encountered, then the "MARKED-SECTION IGNORE" rules are examined as if the invalid data were the text of an IGNORE marked section.

The procedure that OmniMark follows when invalid data is found is as follows:

An error message is issued indicating that invalid data was found.
The invalid data is either processed or discarded:
- If there are INVALID-DATA rules:
  If there is an INVALID-DATA rule which can be performed, then it processes the invalid data. Otherwise the invalid data is discarded.
- If there are no INVALID-DATA rules:
  If there is a "MARKED-SECTION IGNORE" rule which can be performed, then it processes the invalid data. Otherwise the invalid data is discarded.

An example of an INVALID-DATA rule is:

   INVALID-DATA
      PUT #ERROR "Trashed: %"%c%".%n"

4.3.6 Text in Comments, Marked Sections and Processing Instructions

SGML comments, marked sections and processing instructions all affect the processing of text in a variety of ways. The following subsections each discuss one of the ways in which these forms of markup, and the processing applied to them, can affect the output of an OmniMark program.

4.3.6.1 Dividing Up Data Content

A DATA-CONTENT rule processes a "contiguous" sequence of text characters. A contiguous sequence of text characters is bounded by:

the start of an element,
the end of an element,
a processing instruction, or
an external CDATA, SDATA, NDATA or SUBDOC entity reference.

For example, in the following,

   <!DOCTYPE test1 [
   <!ELEMENT test1 - - (#PCDATA | x)*>
   <!ELEMENT x - - (#PCDATA)>
   <!NOTATION n SYSTEM>
   <!ENTITY y SYSTEM NDATA n>
   ]>
   <test1>
   aaa<x>bbb<?1>ccc
   </test1>

each sequence of three letters is a contiguous sequence of text characters.

4.3.6.2 SGML Comments and Marked Section Boundaries

SGML comments and marked section boundaries do not break up contiguous sequences of text characters. In the following, all occurrences of the letter "a" and the record ends between them form a single sequence of text characters:

   <!DOCTYPE test2 [
   <!ELEMENT test2 - - (#PCDATA)>
   ]>
   <test2>
   <!--first comment-->aaa
   aaa<!--second comment-->aaa
   aaa<![INCLUDE[aaa]]>aaa
   aaa<![CDATA[aaa]]>aaa
   aaa<![IGNORE[not data content]]>aaa
   aaa<!--third comment-->

The text of the comments ("first comment", "second comment" and "third comment") and the contents of the IGNORE marked section ("not data content") are not part of the text content.

If SGML-COMMENT or MARKED-SECTION rules process some or all of the comments and marked sections in an element, they occur "within" the data content, if data content has already started, and outside of the data content otherwise. For example, if the following OmniMark program were run on the "test2" example above

   DOWN-TRANSLATE
   ELEMENT #IMPLIED
      OUTPUT "%c"
   DATA-CONTENT
      OUTPUT "{%c}%n"
   SGML-COMMENT
      OUTPUT "(%c)%n"

the following output would result:

   (first comment)
   {aaa
   aaa(second comment)
   aaa
   aaaaaaaaa
   aaaaaaaaa
   aaaaaa
   aaa(third comment)
   }

In the output:

The first comment occurs outside any data content, because no data content has started when it is encountered.
The second and third comments occur inside of data content (the "(" and ")" are nested inside of the "{" and "}"), because data content has started when they are encountered. Even though the third comment is at the end of data content it is "inside" the data content.
The text in the INCLUDE and CDATA marked sections is treated like any other text in the "test2" element, and the text in the IGNORE marked section is ignored, because there are no MARKED-SECTION rules.

4.3.6.3 TRANSLATE Rule Boundaries

In the absence of SGML-COMMENT and MARKED-SECTION rules, TRANSLATE rules can match text on either side of SGML comments, IGNORE marked sections and the starts and ends of INCLUDE, CDATA and RCDATA marked sections as if the intervening markup were not there. For example, the following TRANSLATE rule

   TRANSLATE "hello"
      OUTPUT "howdy"

will find the "hello" in

   oh hel<!--comment-->lo there

However, if there is an SGML-COMMENT rule in the OmniMark program, the TRANSLATE rule will not match. In such a case, OmniMark:

suspends TRANSLATE rule processing when it sees the comment, terminating any TRANSLATE rule in progress,
processes the comment if one of the SGML-COMMENT rules applies to the comment (i.e. there is a rule in the currently active groups either with no condition or with a condition that succeeds), or
ignores the comment if no SGML-COMMENT rule applies (i.e. there are no SGML-COMMENT rules in the currently active groups or all the rules in the currently active groups have conditions that fail), and then
resumes TRANSLATE rule processing.

In this case, the SGML comment forms a "boundary" to TRANSLATE rule matching, over which a pattern cannot match.

In general, if an OmniMark program contains a rule for processing SGML comments or for processing a particular type of marked section, then SGML comments or that type of marked section cause TRANSLATE rule boundaries in text. In particular:

If an OmniMark program contains one or more SGML-COMMENT rules, then any SGML comment in the document instance forms a TRANSLATE rule matching boundary.
If an OmniMark program contains one or more "MARKED-SECTION IGNORE" rules, then any IGNORE marked section in the document instance forms a TRANSLATE rule matching boundary.
If an OmniMark program contains one or more "MARKED-SECTION CDATA" rules, then both the start and end of any CDATA marked section in the document instance form TRANSLATE rule matching boundaries. For example, both the patterns "ab" and "bc" will match in the following "p" element if there is no "MARKED-SECTION CDATA" rule in the OmniMark program, but neither will match if there is a "MARKED-SECTION CDATA" rule:
```
   <p>aaa<![CDATA[bbb]]>ccc
```
If an OmniMark program contains one or more "MARKED-SECTION RCDATA" rules, then both the start and end of any RCDATA marked section in the document instance form TRANSLATE rule matching boundaries.
If an OmniMark program contains one or more of either "MARKED-SECTION INCLUDE-START" or "MARKED-SECTION INCLUDE-END" rules, then both the start and end of any INCLUDE marked section in the document instance form TRANSLATE rule matching boundaries.

4.4 SGML Document Regions

An SGML document can be thought of as consisting of "regions": the SGML Declaration, the DTD, the document instance, and the areas in between and around them. Most of the work done in converting an SGML document is done while in the document instance, but some processing, especially of processing instructions and SGML comments, is done while in other regions.

OmniMark has a set of output processor rules that are performed at the boundaries between these regions, so that a program can tell which region it is processing. Any of these rules can have a condition and local-declarations. Each is performed at the appropriate point in parsing an SGML document if, at that point, it is a member of an active group and its condition, if any, succeeds.

The rules are:

SGML-DECLARATION-END
DTD-START
DTD-END
PROLOG-END
EPILOG-START
DOCUMENT-START
DOCUMENT-END

A note on terminology: In an SGML document, the prolog ends and the document starts immediately prior to the start of the document element (i.e. the topmost element in the instance). The document instance continues to the end of the SGML document. As a consequence, any processing instructions and SGML comments between the DTD and start of the first element in the instance are officially part of the prolog, but any processing instructions or SGML comments following the end of that element are part of the instance.

4.4.1 Processing SGML Declaration Information

Syntax

   SGML-DECLARATION-END condition?
      local-declaration*
      action*

The SGML-DECLARATION-END rule is performed at the end of the SGML Declaration. This rule can be used to access the #APPINFO information. No output processor rules are performed prior to the SGML-DECLARATION-END rule other than possible EXTERNAL-TEXT-ENTITY rules for the #CHARSET, #CAPACITY and #SYNTAX entities in the SGML Declaration and SGML-ERROR rules for warnings in the SGML Declaration.

If there are errors, OmniMark will terminate processing at the end of the SGML Declaration.

4.4.2 Processing The Document Element Name

Syntax

   DTD-START condition?
      local-declaration*
      action*

The DTD-START rule is performed at the start of a DTD. It is performed immediately after the document element name is determined, so that the DTD's "name" is known. (The #DOCTYPE built-in stream can be used to return the DTD's name.)

The DTD-START rule allows comments and processing instructions prior to the DTD to be distinguished from those within the DTD. The fact that the document element name is known means that any comment between the DOCTYPE keyword and the document element name cannot be distinguished from those prior to the DTD.

4.4.3 Processing At The End Of The DTD

Syntax

   DTD-END condition?
      local-declaration*
      action*

The DTD-END rule is performed at the end of the DTD. This rule can be used to access the #DOCTYPE document element name. It also separates the SGML comments and processing instructions in and prior to the DTD from those in the remainder of the document prolog, between the DTD and the document instance.

4.4.4 Processing Just Before The Instance

Syntax

   PROLOG-END condition?
      local-declaration*
      action*

PROLOG-END is performed at the end of the document prolog, immediately prior to the start of the document instance and the document element.

4.4.5 Processing After The Instance

Syntax

   EPILOG-START condition?
      local-declaration*
      action*

EPILOG-START is performed immediately following the end of the document element. It separates the document element from those SGML comments and processing instructions that follow the document element in the document instance.

When the document instance has SGML errors, the programmer can still obtain control in the EPILOG-START rule to determine how to complete the processing. This allows the program to clean up properly while still reporting as many errors as possible before terminating the parsing of the current document.
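
One way to see where comments and processing instructions fall relative to these boundaries is to trace each boundary as it is reached. The following sketch writes a line to #ERROR at four of the boundaries. The message text is arbitrary, and the rules are assumed to be in an active group.

```
   ; trace region boundaries; message text is arbitrary
   DTD-START
      PUT #ERROR "start of DTD%n"
   DTD-END
      PUT #ERROR "end of DTD%n"
   PROLOG-END
      PUT #ERROR "end of prolog%n"
   EPILOG-START
      PUT #ERROR "end of document element%n"
```

If an SGML-COMMENT or PROCESSING-INSTRUCTION rule in the same program also writes to #ERROR, its messages appear between these lines in the order in which the parser reaches the corresponding markup.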

4.5 Initializing and Terminating SGML Processing

OmniMark provides two rules, DOCUMENT-START and DOCUMENT-END, for initialization and termination in the output processor. These rules always execute in the output processor, and cannot appear in cross-translations or process programs.

4.5.1 Initializing SGML Processing

Syntax

   DOCUMENT-START condition?
      local-declaration*
      action+

The DOCUMENT-START rule allows the OmniMark programmer to do processing and produce output before the start of the SGML document (including the SGML Declaration if any).

For example:

   DOCUMENT-START
     OUTPUT "{\rtf1\mac %n"

   DOCUMENT-START WHEN index ISNT OPEN
     OPEN index AS FILE "index.doc"

DOCUMENT-START rules are performed, in the order they appear in the OmniMark program, after all PROCESS-START rules (if any) have been performed, but before any other processing is done (including PROCESSING-INSTRUCTION rules).

The conditions on DOCUMENT-START rules can be controlled from variables set on the command-line, or from variables that have been set in preceding DOCUMENT-START or PROCESS-START rules.

DOCUMENT-START rules are not permitted in cross-translations or process programs.

Standard "setup" actions that make sense only for the output processor can be placed in a DOCUMENT-START rule in a separate file, which can then be incorporated into many different programs using the INCLUDE declaration.

This approach is appropriate only if the rules are never needed in a cross-translation or a process program. If the setup actions are also useful for cross-translations, place them in a PROCESS-START rule instead.
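A shared setup file of this kind might look like the sketch below. The file name rtfsetup.xin is hypothetical, and the RTF header is borrowed from the earlier example.

   ; Contents of the shared file "rtfsetup.xin" (hypothetical name).
   DOCUMENT-START
     OUTPUT "{\rtf1\mac %n"

A down-translation that needs the same setup would then pull the file in with a single declaration:

   DOWN-TRANSLATE

   ; Incorporates the DOCUMENT-START rule from the shared file.
   INCLUDE "rtfsetup.xin"

Since the included file contains a DOCUMENT-START rule, it should be included only by programs in which that rule is permitted.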

4.5.2 Terminating SGML Processing

Syntax

   DOCUMENT-END condition?
      local-declaration*
      action+

The DOCUMENT-END rule allows the OmniMark programmer to do processing and produce output after the end of the SGML document element.

For example:

   DOCUMENT-END
     OUTPUT "}%n"

DOCUMENT-END rules are performed, in the order they appear in the OmniMark program, before any PROCESS-END rules, but after any other processing is done (including PROCESSING-INSTRUCTION rules).
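As a sketch of typical termination processing, the rule below closes the index stream that the DOCUMENT-START example in section 4.5.1 opens. It assumes index is a stream declared elsewhere in the program, and guards the action with a condition so that it is skipped when the stream was never opened.

   ; Counterpart to the DOCUMENT-START rule that opens "index".
   DOCUMENT-END WHEN index IS OPEN
     CLOSE index

Because DOCUMENT-END rules are performed in program order, a rule like this can be placed after the rule that writes the closing brace when the two outputs need to finish in that sequence.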

DOCUMENT-END rules are not permitted in cross-translations or process programs.

Standard "tear-down" actions can be placed in a DOCUMENT-END rule in a separate file, which is incorporated into the program using an INCLUDE declaration. Use a DOCUMENT-END rule only if the actions make sense solely for the output processor and are never needed in a cross-translation or a process program. If the actions also make sense for cross-translations, place them in PROCESS-END rules instead.


4. Processing SGML Documents

4.1 Rules That Process Content

4.1.1 Suppressing Content

4.1.2 Processing Content

4.1.3 An Example of "%c" Format Modifiers

4.2 Basic SGML Rules

4.2.1 Processing Element Content

4.2.1.1 Default Element Processing

4.2.1.2 Conditions On Element Rules

4.2.1.3 Element Rule Uniqueness

4.2.1.4 Empty Elements

4.2.2 Processing Data Characters

4.2.2.1 The Data Content Rule

4.2.2.2 Translating Patterns in Data Content

4.2.2.3 Patterns for CDATA and SDATA Entities

4.2.2.3.1 Matching Internal Text Entities

4.2.2.3.2 Matching Entities Based On Replacement Text

4.2.2.3.3 Matching Entities Based On Names

4.2.2.3.4 Matching On Both An Entity's Name And Replacement Text

4.2.2.3.5 Combining Internal Entity and Plain-Text Matching

4.2.2.3.6 Pattern Matching Internal Entity Names

4.2.2.4 Writing Data Content to Multiple Streams

4.2.3 Processing Instructions

4.2.3.1 Processing Instruction Entities

4.2.3.2 Record Ends in Processing Instructions

4.3 SGML Comments and Marked Sections

4.3.1 Processing SGML Comments

4.3.2 IGNORE Marked Sections

4.3.3 Processing CDATA and RCDATA Marked Sections

4.3.4 INCLUDE Marked Sections

4.3.5 Trapping Illegal Input

4.3.6 Text in Comments, Marked Sections and Processing Instructions

4.3.6.1 Dividing Up Data Content

4.3.6.2 SGML Comments and Marked Section Boundaries

4.3.6.3 TRANSLATE Rule Boundaries

4.4 SGML Document Regions

4.4.1 Processing SGML Declaration Information

4.4.2 Processing The Document Element Name

4.4.3 Processing At The End Of The DTD

4.4.4 Processing Just Before The Instance

4.4.5 Processing After The Instance

4.5 Initializing and Terminating SGML Processing

4.5.1 Initializing SGML Processing

4.5.2 Terminating SGML Processing
