"The Official Guide to Programming with OmniMark"
International Edition
Previous chapter is Chapter 3, "Generalized Document Processing".
Next chapter is Chapter 5, "Organizing Rules With Groups".
The ELEMENT rules provide the backbone of a down-translation. This chapter describes ELEMENT rules and other OmniMark constructs particular to SGML documents. The features described here are allowed whenever an SGML document is being parsed. In particular, they are permitted in:
However, since they are not relevant outside the scope of SGML, it is an error for them to occur in cross-translations.
SGML documents have a hierarchical element structure. Other structural components within the SGML document nest within the element structure as well. These include RCDATA, CDATA, and IGNORE marked sections, processing instructions, non-text entities, and SGML comments.
The OmniMark rules that process these components have the following characteristics in common:
OmniMark requires that the actions for the rule process the content of these components. The content must be processed exactly once, or explicitly discarded.
The content of the component can be processed by:
The content of a component can only be processed once: it is an error for a component's actions to contain multiple references to its content unless they are conditioned in such a way that only one of the references is selected.
SUPPRESS condition?
SUPPRESS causes the content recognized by the rule to be written to the #SUPPRESS stream. It is equivalent to:
... PUT #SUPPRESS "%zhc"
The following example shows a simple SGML program that prints all of the titles in a document whose DOCTYPE element is doc:
    DOWN-TRANSLATE

    ELEMENT doc
       SUPPRESS

    ELEMENT title
       PUT #MAIN-OUTPUT "%c%n"

    ELEMENT #IMPLIED
       OUTPUT "%c"
If there are subelements in the title, they will be processed by the "ELEMENT #IMPLIED" rule. (The "ELEMENT #IMPLIED" rule processes every element except for those handled by other ELEMENT rules. See Section 4.2.1.1, "Default Element Processing".)
OUTPUT "%c" is used to process the content of these subelements instead of SUPPRESS. This ensures that the content of the subelement "goes to the same place" as the content of the parent element. If the subelement is within a title, its content will be sent to #MAIN-OUTPUT in its proper place within the title. Otherwise it will be suppressed.
Content is processed when a "%c" format item is encountered. Only one "%c" format item or one SUPPRESS action may be evaluated in a rule.
% format-modifier* c
The "%c" format item can be recursive when used in some rules: its result includes the content of any subcomponents, processed according to the rules that apply to them.
The "%c" format item may have modifiers, called element content format modifiers. These modifiers apply to data characters within the processed content. They can be overridden by modifiers on a "%c" format in a rule for a subcomponent.
The element content format modifiers are:
The "h" format modifier prevents line-breaking rules like INSERTION-BREAK and REPLACEMENT-BREAK from applying to the content of the current component. (See Section 19.1.5, "Line Breaking").
The "l" modifier converts all of the text to lower-case. It applies only to letters in the processed document (data characters in content and attribute values) that are copied from the input to output. It does not apply to letters in quoted strings in the OmniMark program.
The "l" modifier cannot be used with the "u" modifier.
The "u" modifier converts all of the text to upper-case. It applies only to letters in the processed document (data characters in content and attribute values) that are copied from the input to output. It does not apply to letters in quoted strings in the OmniMark program.
The "u" modifier cannot be used with the "l" modifier.
White space is stripped in the processed content as follows:
The "s" modifier affects only text received directly from the SGML parser, or from characters specified with format items that explicitly allow stripping:
The "z" format modifier turns off TRANSLATE rules that would otherwise apply to all or part of the content. (See Section 4.2.2.2, "Translating Patterns in Data Content").
The example below uses the above modifiers to show how a simple document can be translated into a TeX-like markup language. Note the TRANSLATE rule used with the BREAK-WIDTH and REPLACEMENT-BREAK declarations. This is a standard way of processing data content without worrying about how the lines are broken originally.
Old part numbers are printed in lower-case only. New part numbers are printed with upper-case letters. The title isn't broken or stripped. Paragraph text is stripped, and this stripping is inherited by the part and old-part elements.
A typical document may look like the following:
    <!doctype doc [
    <!element doc o o (title, para+)>
    <!element title o o (#pcdata)>
    <!element para - o (#pcdata|part|old-part)*>
    <!element part - - (#pcdata)>
    <!element old-part - - (#pcdata)>
    ]>
    Acme Llama and Haggis Supply Parts Catalogue, Fall, 1973
    <para>
    Our new stock includes three new Peruvian llamas
    (ask for <part/lL-33-864/). We have also located a new
    haggis supplier in Singapore (<part/gG-33-865/), and are
    no longer carrying <old-part/Yh5-33-863A/, as our supplier
    in the Maldives is no longer in business. This change
    should handle some of your requests.
    <para>
    As usual, we at Acme are looking forward to meeting your
    needs this fall.
The following OmniMark program can be used to transform the above SGML document into a format suitable for our target formatter.
    ELEMENT doc
       OUTPUT "%c"

    ELEMENT para
       OUTPUT "%n" when previous is para
       OUTPUT "%_%_%_%_%sc%n"

    ELEMENT part
       OUTPUT "\part{%uc}"

    ELEMENT old-part
       OUTPUT "\part{%lc}"

    ELEMENT title
       OUTPUT "\title{%hc}%n%n"

    TRANSLATE "%n"
       OUTPUT "%/%s_"

    BREAK-WIDTH 40
    REPLACEMENT-BREAK "%_" "%n"
OmniMark output, which can then be sent to a formatter, appears as follows:
\title{Acme Llama and Haggis Supply Parts Catalogue, Fall, 1973} Our new stock includes three new Peruvian llamas (ask for \part{LL-33-864}). We have also located a new haggis supplier in Singapore (\part{GG-33-865}), and are no longer carrying \part{yh5-33-863a}, as our supplier in the Maldives is no longer in business. This change should handle some of your requests. As usual, we at Acme are looking forward to meeting your needs this fall.
The formatter instruction \part appears in lower-case despite the "u" modifier on the "%c" format in the part ELEMENT rule. This is because it is part of a format string and not copied data content. The "%c" modifiers do not usually apply to text explicitly output by the OmniMark program. The exception is that "%sn", "%st" and "%s_" format items are subject to the s modifier in "%c" format items in enclosing elements because their "s" modifiers explicitly request this.
Within an OmniMark program, every ELEMENT and DATA-CONTENT rule must account for the content of the associated structure. Either a "%c" format in a string or a SUPPRESS action (see Section 4.1.1, "Suppressing Content") must be processed. Since the "%c" format or SUPPRESS actually causes OmniMark to process the content, including any subelements, it is an error for more than one of them to be processed in a rule. They can appear in more than one action, as long as conditions ensure that only one such action is performed.
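As a hedged illustration of the last point, the sketch below shows a single rule whose two content-processing actions are guarded by complementary conditions, so that exactly one of them is performed for any given element. The element name note, its type attribute (assumed here to be declared with a default value, so that it is always specified) and the value "VISIBLE" are hypothetical and are not taken from the examples in this chapter.

```omnimark
; Hypothetical element "note" with a hypothetical "type" attribute.
; Exactly one of the two actions below processes the content.
ELEMENT note
   OUTPUT "NOTE: %c%n" WHEN ATTRIBUTE type = "VISIBLE"
   SUPPRESS UNLESS ATTRIBUTE type = "VISIBLE"
```

Because the WHEN and UNLESS conditions test the same expression, one action is always selected and the other is always skipped.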
The rules in this section form the core of SGML processing. They will often be the most frequently-used rules.
ELEMENT element-name (| element-name)* condition? local-declaration* action+
Translations from SGML are controlled by ELEMENT rules. When processing an SGML source document, OmniMark performs the actions in the applicable ELEMENT rule as each element is encountered.
The element-name is also referred to in the SGML literature as a generic identifier. This document will generally use the term element name.
When more than one element-name is given, the rule will apply to any of the elements named in the rule header.
An ELEMENT rule beginning as follows, for instance, would be invoked for every chapter as well as every appendix:
ELEMENT chapter | appendix
The following syntactic variations are permitted:
ELEMENT #IMPLIED condition? local-declaration* action+
In some applications, the same actions may be appropriate to so many different element types that the programmer would prefer not to list the relevant element names. When #IMPLIED is used instead of an element name in an ELEMENT rule, the rule applies to all elements not accounted for by other rules.
A single OmniMark program can be applied to SGML documents with different Document Type Definitions. Thus, the programmer may not know the names of all the elements that will appear. #IMPLIED is also used in this situation.
The programmer may wish to perform different actions for different elements of the same type. A condition can be specified before the actions in an ELEMENT rule to indicate that the rule is triggered only when the condition is met.
Some examples of conditions in the header of an ELEMENT rule are shown below:
Example A
ELEMENT example WHEN ATTRIBUTE type = "COMPUTER"
Example B
    GLOBAL COUNTER list-depth
    ...
    ELEMENT (blist | nlist) UNLESS list-depth > 4
Example C
ELEMENT #IMPLIED WHEN PARENT IS par
The first condition above introduces actions to be taken for examples of computer input. The second is used for bulleted or numbered lists in a system that precludes more than four levels of nested lists. The final example is used for all subelements of a paragraph.
The same element name may appear in more than one ELEMENT rule. In this case, every ELEMENT rule in which the element name appears must have a condition which ensures that only one of the rules will be selected.
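A minimal sketch of this arrangement, reusing the example element from Example A, might look like the following. The formatter commands \computer and \example are hypothetical, and the type attribute is assumed to be declared with a default value so that the test can always be evaluated.

```omnimark
; Two rules for the same element name. The conditions are
; complementary, so only one rule can be selected for any element.
ELEMENT example WHEN ATTRIBUTE type = "COMPUTER"
   OUTPUT "\computer{%c}"

ELEMENT example UNLESS ATTRIBUTE type = "COMPUTER"
   OUTPUT "\example{%c}"
```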
An OmniMark program must uniquely account for all elements in a processed SGML document. It is an error if more than one ELEMENT rule applies to a single element in the document. Similarly, it is usually an error if there is no pertinent ELEMENT rule.
OmniMark requires that an ELEMENT rule be selected for every element that occurs in a document instance.
Even the content of EMPTY elements must be processed explicitly. From the point of view of OmniMark, EMPTY elements are no different from any other: it is just that when processed they are found to have no subelements or data.
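For instance, a rule for an EMPTY element could be sketched as follows. The element name pagebrk is hypothetical and is assumed to be declared EMPTY in the DTD; the "%c" format item contributes nothing to the output, but it satisfies the requirement that the content be processed.

```omnimark
; "pagebrk" is a hypothetical element declared EMPTY.
; "%c" still has to appear (or SUPPRESS be used), even though
; the element has no subelements or data.
ELEMENT pagebrk
   OUTPUT "\newpage{}%c"
```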
In many applications, data characters that occur in an input document are simply copied into the translated output. OmniMark's output actions make it simple to copy characters. As described in Section 4.1.2, "Processing Content", during the copying, letters can be forced to upper-case or to lower-case, and excess white space can be deleted. In addition, as addressed in Section 19.1.5, "Line Breaking", long lines can be split into several shorter ones.
There are situations, however, in which data characters are processed in other ways. OmniMark provides two types of rules (DATA-CONTENT and TRANSLATE) for specifying special treatment of data characters that occur in an SGML document.
DATA-CONTENT condition? local-declaration* action+
DATA-CONTENT rules process strings of data characters within an SGML document.
A DATA-CONTENT rule is invoked once for every unbroken string of data content (consisting of data characters and entity references). If it has a condition, then the DATA-CONTENT rule will only fire if the condition is satisfied.
Data content is deemed to occur whenever the SGML parser encounters text characters or expansions of CDATA or SDATA entity references. #PCDATA content that matches zero characters does not count, whereas a CDATA or SDATA entity expansion that contains zero characters does count. DATA-CONTENT rules can process data characters within CDATA and RCDATA elements as well.
The following things in the event will always break up a string of data content:
Some things will only break up a string of data content if the OmniMark program contains the type of rule that processes those things. For instance, if the OmniMark program contains any SGML-COMMENT rules, then data content is always broken by SGML comments. The data content is broken up even if the SGML comment fails to match any of the SGML-COMMENT rules in the program. If there are no SGML-COMMENT rules, then the data content is never broken by SGML comments. The presence of the following rules causes the following things to break up data content:
While it is not necessary that a DATA-CONTENT rule apply to a fragment of text, it is an error if more than one DATA-CONTENT rule is selected.
The actions for the rule must specify how the triggering text string is to be processed. As in output actions within an ELEMENT rule, "%c" refers to the content that triggered the rule. A possible DATA-CONTENT rule is shown below:
    GLOBAL SWITCH title-has-content
    ...
    DATA-CONTENT WHEN ELEMENT IS title
       SET title-has-content TO TRUE
       OUTPUT "%c"

    ELEMENT title
       SET title-has-content TO FALSE
       OUTPUT "%c"
       DO WHEN ! title-has-content
          PUT #ERROR "Error: title has no content!"
       DONE
Since SGML permits the content model token #PCDATA to be matched by the empty string, this rule is used to verify that a title actually contains some data. Like the ELEMENT rule, the DATA-CONTENT rule must process its data content exactly once, either using a "%c" format item, or the SUPPRESS action described in Section 4.1.1, "Suppressing Content".
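As a further sketch, separate from the program above, a pair of DATA-CONTENT rules with complementary conditions shows both ways of accounting for the content while ensuring that no more than one rule is selected for any string of data content. The choice to discard all data content that is not directly inside a title is an assumption made purely for illustration.

```omnimark
; Data directly within a title is passed on to the enclosing rule.
DATA-CONTENT WHEN ELEMENT IS title
   OUTPUT "%c"

; All other data content is explicitly discarded.
DATA-CONTENT WHEN ELEMENT ISNT title
   SUPPRESS
```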
DATA-CONTENT rules are not permitted in cross-translations.
TRANSLATE pattern condition? local-declaration* action*
The TRANSLATE rule is the other type of rule useful for processing data characters in an SGML document. A TRANSLATE rule is triggered when data matching a specified pattern occurs. The matched text must be contained in a single element.
A frequent use of TRANSLATE rules is processing the delimiter characters used by the document formatter that will process OmniMark output. For example, many text-processing systems use the backslash character "\" to start a command. To emit the backslash as data, OmniMark must output the formatter's instruction to generate a backslash instead of the character itself. Often, the instruction consists of a pair of backslashes. The following TRANSLATE rule performs the substitution:
TRANSLATE "\" OUTPUT "\\"
As another example, a TRANSLATE rule can be used to enforce the convention that closing punctuation should be typeset within quotation marks. The desired output can be produced without forcing the document's author to remember the convention. The following rule reverses the two characters when a period or comma follows a quotation mark:
TRANSLATE '"' ('.' | ',') => punctuation OUTPUT '%x(punctuation)"'
The pattern consists of a double-quote character followed by either a period or a comma. The period or the comma is saved in the pattern variable punctuation. The value of the punctuation pattern variable is accessed with the "%x" format described in Section 3.3.8.3, "Formatting a Pattern Variable".
TRANSLATE rules apply only to text that originates in the input document: data characters in the content of every element, values of CDATA attributes that are copied directly to a stream, and references to internal CDATA and SDATA entities that are expanded by the SGML parser.
When the conditions and patterns of more than one TRANSLATE rule apply to a text string, OmniMark performs the actions associated with the first such rule to appear in the program. The result is used by the enclosing ELEMENT or DATA-CONTENT rules. When a data character is not replaced by a translation rule, it is passed unchanged to the enclosing rule. It may be altered by modifiers placed on the "%c" format item in the enclosing rule.
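Because the first applicable rule in the program is the one performed, the order of TRANSLATE rules matters when their patterns overlap. The following sketch assumes a formatter with hypothetical \emdash and \hyphen commands. Wherever two hyphens occur together both patterns apply, so the rule for the longer pattern is placed first.

```omnimark
; Placed first so that it is chosen wherever "--" occurs.
TRANSLATE "--"
   OUTPUT "\emdash{}"

; Chosen only for a hyphen that does not begin a "--" pair.
TRANSLATE "-"
   OUTPUT "\hyphen{}"
```

If the two rules were written in the opposite order, the rule for a single hyphen would be performed at every hyphen and the "--" rule would never be reached.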
As discussed in Section 4.1.2, "Processing Content" and Section 14.4.4, "Attribute Format Items", actions in other rules can suppress character translation for selected parts of the text. In particular, the z format modifier prohibits the actions of a TRANSLATE rule, even if its pattern is found and its condition is met.
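A short sketch of this interaction, using the backslash rule from earlier in this section together with a hypothetical element named literal whose content is meant to reach the output unchanged:

```omnimark
; Applies to data content everywhere it is not switched off.
TRANSLATE "\"
   OUTPUT "\\"

; "literal" is a hypothetical element. The "z" modifier on "%c"
; turns off the TRANSLATE rule above for this element's content.
ELEMENT literal
   OUTPUT "%zc"
```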
TRANSLATE rules are not permitted in cross-translations.
OmniMark provides patterns specifically designed to match CDATA and SDATA entities in TRANSLATE rules.
Internal text entities cannot be matched because the ISO 8879 standard mandates that they be indistinguishable from ordinary text. This is because the replacement text of text entities can contain markup characters that could straddle element boundaries.
In practice this is not a serious restriction, since entities which are used to represent special characters should always be coded as SDATA entities. Annex D.4 of ISO 8879 defines many such entities.
When processing SGML input, some patterns distinguish data characters occurring in parsed character data (or the content of CDATA and RCDATA elements) from characters in referenced data entities. These patterns do not apply to external data entities which are addressed in EXTERNAL-ENTITY rules. Thus, they pertain only to CDATA and SDATA entities whose replacement text appears in their declarations. Since these patterns only apply to SGML documents, they can only be used in TRANSLATE rules.
Any pattern with an occurrence indicator, or any pattern that could have an occurrence indicator, can be restricted within or outside replacement text for such entities. To match the expansion of a CDATA or SDATA entity use one of the following keywords:
A pattern prefixed by any of the above keywords must match the complete replacement text of a single referenced entity.
To prevent a pattern from matching with all or part of an entity expansion use one of the following keywords:
For example, suppose an SGML document contains the following entity declaration:
<!ENTITY sect SDATA "[sect]">
This entity represents the section character §. Its replacement text is the specific data for a particular computer system to print the character "§". A TRANSLATE rule in the OmniMark program that prepares input for a particular formatter can replace the generic form with the one appropriate to the formatter. Assuming the appropriate instruction is \'a0, a possible rule is:
TRANSLATE SDATA "[sect]" OUTPUT "\'a0"
Matching can also be done on entity names, and whether or not a match succeeds is based on matching the name. For example, the following TRANSLATE rule succeeds for any internal CDATA or SDATA entity whose name consists of a single letter, and simply outputs the name of the entity in parentheses (the replacement text of the entity is ignored):
TRANSLATE ENTITY NAMED LETTER => name OUTPUT "(%x(name))"
A common use of the NAMED option in internal entity matches is to identify an SDATA entity by name. For example:
TRANSLATE SDATA NAMED "amp" OUTPUT "&"
As in the case for matching the replacement text of an internal CDATA or SDATA entity, the pattern that follows the keyword NAMED must match the whole of an entity's name.
The following subsections contain examples of the flexible approach to capturing and processing internal entities supported by OmniMark. It will be very rare that one OmniMark program will use all of these techniques, but OmniMark programmers should be familiar with them, so that for a given application, the appropriate technique can be chosen.
A match can be based on both the name and the value of an internal entity.
During the development of an application, it is a convenient and commonly used convention to define character-representing internal SDATA entities with a fixed value, such as "TBD" or "[default]". A rule that matches all such entities and only those (and which extracts the selected entities' names) is:
TRANSLATE SDATA VALUED "TBD" NAMED ANY+ => entity-name OUTPUT "{\ul %x(entity-name)}"
If both an entity's name and replacement text are to be matched, then the patterns for both the value and the name must match. The NAMED and VALUED keywords, together with the patterns that follow them, can be used in either order.
Matching both a name and value is especially convenient when the "default" SDATA value is associated with the default general entity, so that all "undefined" entities, and their names are captured by a rule such as the above. This avoids the requirement to anticipate all entities that a user may need during the development of an OmniMark program -- specific processing can be added at a later time.
Internal SDATA entities are often used to represent characters that are not directly available in the character set being used, either at a particular location, or in a "lowest-common-denominator" interchange file. SDATA and CDATA entities can be matched as part of a larger pattern, as in the following example:
TRANSLATE "AT" SDATA NAMED "amp" "T" OUTPUT "\ITALIC(AT&T)"
A multitude of SDATA entities that represent individual characters is defined in Annex D of ISO 8879. Combining entity and other matches in a TRANSLATE rule allows an entity to be treated as just another character.
Care must be taken in composing patterns that include entity matching. In the preceding example, the letter "T" is matched following the SDATA entity -- the "T" is not part of what is matched as the entity's name. Parentheses can be used to modify this behaviour. If the pattern were the following, the entity name would have to be "ampT":
TRANSLATE "AT" SDATA NAMED ("amp" "T") ...
Any form of entity match can be combined with other text matching. If, for example, the "ampersand" character were matched based on its replacement text rather than its name, the following TRANSLATE rule could be used instead of that in the previous example:
TRANSLATE "AT" SDATA "[amp ]" "T" OUTPUT "\ITALIC(AT&T)"
There are many hundreds of character-representing SDATA entities defined in Annex D.4 of ISO 8879, the SGML standard, and many more are in use. There is usually a convention for constructing their names. The use of patterns to match their names allows whole classes of entities to be processed by a single TRANSLATE rule. For example, all accented forms of common European characters could be processed in the same manner, with a rule similar to the following:
    GLOBAL STREAM accent-representation VARIABLE
    GLOBAL STREAM backspace-command SIZE 1
    ...
    TRANSLATE SDATA NAMED (["AaEeIiOoUu"] => vowel
                           ("grave" | "acute" | "circ" | "uml") => accent)
       OUTPUT vowel || backspace-command || accent-representation ^ accent

In the example, accent-representation is a keyed shelf, initialized elsewhere in the program. accent-representation ^ accent retrieves the text sequence that represents the specified accent. (The "^" operator is discussed in Chapter 7, "Shelves".)
(It is also assumed that the text formatter supports backspacing of "floating" accents. Note that the example does not cover all cases of accents in European languages.)
Alternatively, a user may choose to give all entities in a certain class a common prefix. For example, if a set of mathematical symbols all start with a capital M, and the letters following the M correspond to codes used by a text formatter, the following rule can be used:
TRANSLATE SDATA NAMED ("M" ANY+ => id) OUTPUT "\MATH{(%x(id))}"
Patterns already provide mechanisms for alternation ("|" or OR) and for capturing matched text. This allows more than one name to be matched, as in the accented letter example. With the very large number of different characters in use, and the general use of SDATA entities to represent them, some way of managing large sets of names is required. This is provided by the TRANSLATE rule.
When data content (using a "%c" format item), or a CDATA attribute value (using a "%v" format item) is written to the current set of output streams, then the data content or attribute text is first passed to any applicable TRANSLATE rules, and the result of their processing is what is written to those streams.
The "z" format modifier can be used to bypass the TRANSLATE rules.
As a consequence of this processing, any side effect of an action in a fired TRANSLATE rule or in any function called within the pattern at the head of such a rule, or in the body of such a rule, occurs only once, even though the side effect may affect more than one stream.
Processing instructions can be entered directly into an SGML document or entered through PI entities. In either case, OmniMark ignores them unless a PROCESSING-INSTRUCTION rule applies.
PROCESSING-INSTRUCTION pattern condition? local-declaration* action*
The PROCESSING-INSTRUCTION rule is selected when a processing instruction occurs whose entire text matches the pattern, and the condition is satisfied. As with other pattern-based rules, if more than one could be selected, OmniMark performs the actions defined in the rule that first appears in the OmniMark program. If a processing instruction occurs in the document but no OmniMark PROCESSING-INSTRUCTION rule is selected, OmniMark simply discards the processing instruction. In this way, processing instructions pertinent to one application can occur in the SGML document without affecting the way a different application is processed.
For example, suppose a document contains "<?newpage>" processing instructions. An OmniMark down-translation that pays attention to these processing instructions could contain the following PROCESSING-INSTRUCTION rule:
PROCESSING-INSTRUCTION "newpage" OUTPUT "\newpage{}"
A program that translates SGML into the language of a text formatter that does a good job of determining where pages should be broken could simply omit such a PROCESSING-INSTRUCTION rule, so that these processing instructions are discarded.
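Since the pattern must match the entire text of the processing instruction, part of that text can be captured in a pattern variable and reused. The sketch below assumes a hypothetical processing instruction of the form "<?vspace 12pt>" and a hypothetical \vspace formatter command; neither appears elsewhere in this guide.

```omnimark
; Matches processing instructions whose text is "vspace " followed
; by at least one character, and captures the rest as "amount".
PROCESSING-INSTRUCTION "vspace " ANY+ => amount
   OUTPUT "\vspace{%x(amount)}%n"
```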
PROCESSING-INSTRUCTION rules are not permitted in cross-translations.
Processing instructions can be the replacement text of "processing instruction entities". These entities differ from other entities whose replacement text is a fully-formed processing instruction in that the text of a processing instruction entity can include non-SGML characters and the string chosen for the PIC delimiter. (The PIC delimiter closes a processing instruction. It is ">" by default).
A PROCESSING-INSTRUCTION rule allows the OmniMark programmer:
A PROCESSING-INSTRUCTION rule can use the keywords NAMED and VALUED in the same way as entity matches in a TRANSLATE rule. The following example illustrates recreating the original PI entity reference if a processing instruction was entered with such a reference, and recreating the processing instruction itself in all other cases.
    PROCESSING-INSTRUCTION NAMED ANY* => pi-entity-name
       OUTPUT "&%x(pi-entity-name);"

    PROCESSING-INSTRUCTION VALUED ANY* => pi-text
       OUTPUT "<?%x(pi-text)>"
A PROCESSING-INSTRUCTION rule using the NAMED keyword will only match a processing instruction that is the replacement for a PI entity reference. It will not match if the processing instruction is entered directly in an SGML document or if the processing instruction is the replacement text of an entity that is not a PI entity. If NAMED is not used (i.e. only VALUED is used or neither NAMED nor VALUED) then any processing instruction can be matched by the rule, whether entered directly or by a reference to a PI entity.
NAMED and VALUED can be used in a PROCESSING-INSTRUCTION rule individually or together and in either order. Unlike the case for entity matching in a TRANSLATE rule, a processing instruction is not matched in the context of surrounding characters. Therefore the pattern following NAMED or VALUED in a PROCESSING-INSTRUCTION rule can contain multiple parts (even joined with "|" (OR)) without the use of parenthesization. However, parentheses can be used in PROCESSING-INSTRUCTION rules for consistency.
Record ends in processing instructions are not subject to the same rules as record ends in data content and attribute value text. The text in processing instructions is subject to the same processing as the text in SGML comments and IGNORE marked sections: any record-end/record-start sequence is replaced by the string specified by the SGML-OUT action. This processing is described in more detail in Section 19.2.5, "Record Ends in SGML Comments, Marked Sections and PIs".
SGML comments are typically intended for the human reader of a document. This means that programs are not, in general, interested in SGML comments. The exception is when the output of a program is itself intended to be read by humans, and it is therefore appropriate to copy the comments from the input SGML document to the output. An important example of the latter type of processing is converting SGML documents to SGML -- from one DTD to another, say, or enhancing a document with further elements, data content and attributes. Copying the SGML comments over is most important when doing a "near identity transformation", when the "converted" document is identical to the source document, with only some parts or some aspects changed.
SGML marked sections are used for a variety of purposes:
Most programs that process SGML documents are interested in the text and the element structure of the documents, and are not interested in how the SGML parser decided what is what. For example, most processing programs are quite content that IGNORE marked sections are ignored. However, as in the case of SGML comments, programs whose output is an SGML document, especially one that is close to the input SGML document, will often want to preserve the marked section information, and will want to preserve the text inside IGNORE marked sections (treating it like an SGML comment, in effect).
OmniMark allows the OmniMark programmer to identify and process SGML comments and marked sections, including the text of comments and IGNORE marked sections. The OmniMark programmer can select which types of marked sections are to be specially processed, and whether or not SGML comments are to be processed.
Depending on the type of marked section, either the marked section and the text it contains are processed by a single OmniMark rule, or, as in the case of INCLUDE marked sections, the start and end of the marked section are processed by separate rules.
SGML-COMMENT condition? local-declaration* action+
An SGML comment appears between what SGML terminology calls COM delimiters (usually "--", a double dash). SGML comments can appear in any declaration, including the USEMAP and marked section declarations that can be used in a document instance. They can also occur in declarations of their own, in what are called SGML comment declarations.
The distinction between SGML comments and comment declarations is not always made clear when talking about comments, and causes some confusion, particularly with respect to markup like "<!>", which is a comment declaration without a comment in it. On the other hand, any declaration, including an SGML comment declaration, can contain more than one comment, as in the following comment declaration, which contains two comments:
<!--first comment-- --second comment-->
In the following, each "comment" is the text of a comment, whereas "not a comment" isn't (it is part of the text ignored inside the marked section). The last comment declaration does not contain a comment:
    <!USEMAP --comment-- my-map --comment-->
    <![--comment-- IGNORE --comment--[
    <!--not a comment-->
    ]]>
    <!--comment-- --comment-->
    <!>
An SGML-COMMENT rule is performed whenever an SGML comment is found in a document and the condition at the start of the SGML-COMMENT rule, if any, succeeds. For example, the following rule outputs the text of any SGML comment in a document, on a line by itself, surrounded by braces:
SGML-COMMENT OUTPUT "{%c}%n"
The output from processing the sample input above, using this SGML-COMMENT rule would be:
    {comment}
    {comment}
    {comment}
    {comment}
    {comment}
    {comment}
In other words, it would capture each of the comments.
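For the SGML-to-SGML conversions mentioned at the start of this section, a rule along the following lines could copy comments into the output. This is only a sketch: it writes each comment as a comment declaration of its own, so a declaration that held several comments in the input, or a comment that appeared inside another kind of declaration, is not reproduced in its original form.

```omnimark
; Re-emit the text of each SGML comment as its own comment declaration.
SGML-COMMENT
   OUTPUT "<!--%c-->"
```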
The following statements apply to SGML-COMMENT rules:
    SGML-COMMENT WHEN ELEMENT IS p
       OUTPUT " (NOTE: %c)"

    SGML-COMMENT WHEN ELEMENT ISNT p
       OUTPUT " NOTE: %c%n"
It is an error for more than one SGML-COMMENT rule to be selected for an SGML comment.
The "u", "l", "s", "h", and "z" format modifiers can be used on a "%c" format item in an SGML-COMMENT rule.
<!--first comment--> <!-- second comment -->
the output would be:
{first comment}
{ second comment }
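As a hedged illustration of the format modifiers, the sketch below applies "u" to the "%c" format item in the brace-printing rule used earlier. It assumes that "u" forces the captured text to upper case, as it does on other format items; the effect of the other modifiers is not shown here.

```omnimark
; Sketch only: print the text of each SGML comment in upper case, in braces.
SGML-COMMENT
   OUTPUT "{%uc}%n"
```

Under that assumption, the comment "first comment" would be written as "{FIRST COMMENT}".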
MARKED-SECTION IGNORE condition? local-declaration* action+
IGNORE marked sections appear to an OmniMark program in just the same way as SGML comments, except that they are processed using a "MARKED-SECTION IGNORE" rule rather than an SGML-COMMENT rule. As an example, given the following marked sections:
<![IGNORE[ignored text]]>
<![IGNORE[]]>
<![IGNORE[<![IGNORE[nested ignored text]]>]]>
and the following "MARKED-SECTION IGNORE" rule:
MARKED-SECTION IGNORE OUTPUT "(%c)%n"
the following output would be produced:
(ignored text)
()
(<![IGNORE[nested ignored text]]>)
OmniMark programmers should note that, in keeping with the provisions of clause 10.4.1 of the SGML standard (ISO 8879:1986), all pairs of "<![" and "]]>" within an IGNORE marked section are matched and treated as text. This means that any marked sections nested within an IGNORE marked section, including the opening and closing delimiters, are treated as part of the text of the IGNORE marked section, as illustrated by the third marked section in the sample above -- the inner
<![IGNORE[nested ignored text]]>
is text within the outer marked section.
The following statements apply to "MARKED-SECTION IGNORE" rules. They are very similar to those for SGML-COMMENT rules.
MARKED-SECTION IGNORE WHEN ELEMENT IS p
   OUTPUT " (THE OTHER VERSION SAYS: %c)"

MARKED-SECTION IGNORE WHEN ELEMENT ISNT p
   OUTPUT " NOTE -- THE OTHER VERSION SAYS: %c%n"
It is an error for more than one "MARKED-SECTION IGNORE" rule to be selected for an IGNORE marked section.
The "u", "l", "s", "h" and "z" modifiers can be used on a "%c" format item in a "MARKED-SECTION IGNORE" rule.
<![IGNORE[marked section text ]]>
the output would be:
(marked section text )
Any SGML comment in the header of an IGNORE marked section is processed prior to the processing of the IGNORE marked section. So, for example, assuming the following SGML-COMMENT rule (from the previous section):
SGML-COMMENT OUTPUT "{%c}%n"
and the "MARKED-SECTION IGNORE" rule at the start of this section, the following marked section:
<![--first comment-- IGNORE -- second comment -- [ <![--this is part of the marked section-- IGNORE [text]]> ]]>
would produce the following output:
{first comment}
{ second comment }
(<![--this is part of the marked section-- IGNORE [text]]> )
CDATA and RCDATA marked sections serve to protect text from being misinterpreted as markup (start tags, end tags, entity references or declarations). These marked sections affect how the data is parsed by the SGML parser, but they do not normally affect the way that OmniMark processes the resulting data content.
"MARKED-SECTION CDATA" and "MARKED-SECTION RCDATA" rules can be used to identify content that was wrapped in a CDATA or RCDATA marked section.
MARKED-SECTION CDATA condition? local-declaration* action+
MARKED-SECTION RCDATA condition? local-declaration* action+
If there is no "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule to process a CDATA or RCDATA marked section, the resulting text of the marked section is treated the same way as ordinary data content. If there is an applicable "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule, then that rule determines how the OmniMark rules process the text within the marked section.
It is very important to understand that the presence or absence of MARKED-SECTION rules does not affect how marked sections are treated by the SGML parser. The rules only determine how the SGML parser presents the resulting text to OmniMark.
A similar set of statements applies to CDATA and RCDATA marked sections as applies to IGNORE marked sections. The major difference is that the "default" processing for these two types of marked section is to treat their text content as data content, and not to discard it.
Similarly, if an OmniMark program contains no "MARKED-SECTION RCDATA" rules, then the text within all RCDATA marked sections is treated as if it were produced by ordinary data content as well.
It is an error for more than one "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule to be selected for a CDATA or an RCDATA marked section.
All modifiers supported by "%c" can be used on a "%c" format item in a "MARKED-SECTION CDATA" or "MARKED-SECTION RCDATA" rule.
Any SGML comment in the header of a CDATA or RCDATA marked section is processed prior to the processing of the marked section.
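The following is a minimal sketch of a rule that identifies text which came from a CDATA marked section. The bracket characters are arbitrary and chosen only for illustration; without the "MARKED-SECTION CDATA" rule, the same text would be treated as ordinary data content.

```omnimark
DOWN-TRANSLATE

ELEMENT #IMPLIED
   OUTPUT "%c"

; Wrap text that was protected by a CDATA marked section in brackets.
MARKED-SECTION CDATA
   OUTPUT "[%c]"
```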
MARKED-SECTION INCLUDE-START condition? local-declaration* action*
MARKED-SECTION INCLUDE-END condition? local-declaration* action*
SGML comments, and IGNORE, CDATA and RCDATA marked sections are all processed similarly. INCLUDE marked sections, however, require quite a different approach. Instead of one rule to process an INCLUDE marked section, OmniMark provides two: one for processing the start of a marked section and one for the end. This split is necessitated by the fact that, unlike the other types of marked section, an INCLUDE marked section can start in the context of one element and end in another, and so can overlap the hierarchical structure that ties the components of a parsed SGML document together. An example of an INCLUDE marked section overlapping the element structure of a document is the following:
<title>Part of the title.
<![INCLUDE[More of the title.
<p>The first paragraph.
<p>Part of the second paragraph.
]]>More of the second paragraph.
This kind of overlapping cannot happen with IGNORE, CDATA or RCDATA marked sections because those types of marked sections inhibit the recognition of other markup, including start and end tags, within their text. An important consequence of this is that the whole of the text of an IGNORE, CDATA or RCDATA marked section is processed with one set of output streams (as used by the OUTPUT action and as available using the #CURRENT-OUTPUT stream set) and inherits the stream destinations and stream modifiers from the ELEMENT or DATA-CONTENT rule that processes the surrounding content.
The contents of an INCLUDE marked section, as in the example, can be part of one or more elements, the ELEMENT and DATA-CONTENT rules for which may each specify different output destinations and stream modifiers. To avoid the complexity and user confusion that could result from trying to "merge" the specifications of the rules for INCLUDE marked sections and the applicable ELEMENT and DATA-CONTENT rules, the INCLUDE marked section rules only apply to the start and end of an INCLUDE marked section. They have no direct influence on the processing of the marked section's content. The two rules are "MARKED-SECTION INCLUDE-START" and "MARKED-SECTION INCLUDE-END", as in the following example:
MARKED-SECTION INCLUDE-START
   DO WHEN ELEMENT IS (p | title)
      OUTPUT " (Start of bracketed text)"
   ELSE
      OUTPUT "(Start of bracketed text)%n"
   DONE

MARKED-SECTION INCLUDE-END
   DO WHEN ELEMENT IS (p | title)
      OUTPUT " (End of bracketed text)"
   ELSE
      OUTPUT "(End of bracketed text)%n"
   DONE
The OmniMark program can influence the processing of the content of an INCLUDE marked section by setting global variables and testing them in ELEMENT and DATA-CONTENT rules, so that those rules can detect when they occur in an INCLUDE marked section.
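The global-variable technique might be sketched as follows. This is an illustration under stated assumptions, not a definitive program: the counter name include-depth is hypothetical, and the GLOBAL COUNTER declaration and the INCREMENT and DECREMENT actions are assumed from general OmniMark usage rather than taken from this chapter. A counter is used instead of a simple flag because INCLUDE marked sections can nest.

```omnimark
DOWN-TRANSLATE

; Hypothetical counter: number of INCLUDE marked sections currently open.
GLOBAL COUNTER include-depth

MARKED-SECTION INCLUDE-START
   INCREMENT include-depth

MARKED-SECTION INCLUDE-END
   DECREMENT include-depth

; Flag paragraphs that start inside an INCLUDE marked section.
ELEMENT p
   DO WHEN include-depth > 0
      OUTPUT "[included] %c%n"
   ELSE
      OUTPUT "%c%n"
   DONE

ELEMENT #IMPLIED
   OUTPUT "%c"
```

The test is made when the ELEMENT rule starts, so it reflects whether the element began inside an INCLUDE marked section, not whether a marked section opens or closes part-way through its content.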
The following statements apply to "MARKED-SECTION INCLUDE-START" and "MARKED-SECTION INCLUDE-END" rules:
Similarly, only one "MARKED-SECTION INCLUDE-END" rule may be selected for an INCLUDE marked section.
INVALID-DATA condition? local-declaration* action*
The INVALID-DATA rule is used to process erroneous input. The DTD restricts where data content is permitted to occur in an SGML document instance. The INVALID-DATA rule is intended to process data content which violates these restrictions.
The selection of an invalid data rule to perform is determined by the currently active groups and the condition, if any, on each invalid data rule in an active group, like any other output processor rule. The "%c" format item is used in the body of the INVALID-DATA rule to capture the data in question, and either it or the SUPPRESS action must be used (and only once).
If there are no INVALID-DATA rules in an OmniMark program, and invalid data is encountered, then the "MARKED-SECTION IGNORE" rules are examined as if the invalid data were the text of an IGNORE marked section.
The procedure that OmniMark follows when invalid data is found is as follows:
If the program contains INVALID-DATA rules: if there is an INVALID-DATA rule which can be performed, then it processes the invalid data. Otherwise the invalid data is discarded.
If the program contains no INVALID-DATA rules: if there is a "MARKED-SECTION IGNORE" rule which can be performed, then it processes the invalid data. Otherwise the invalid data is discarded.
An example of an INVALID-DATA rule is:
INVALID-DATA PUT #ERROR "Trashed: %"%c%".%n"
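The fallback described above can be illustrated with a program that has no INVALID-DATA rule at all. In this sketch, which reuses only constructs shown in this chapter, the single "MARKED-SECTION IGNORE" rule would be expected to report both the text of IGNORE marked sections and any invalid data; the message wording is arbitrary.

```omnimark
DOWN-TRANSLATE

ELEMENT #IMPLIED
   OUTPUT "%c"

; No INVALID-DATA rule is present, so invalid data is offered to this rule
; as if it were the text of an IGNORE marked section.
MARKED-SECTION IGNORE
   PUT #ERROR "Skipped: %"%c%".%n"
```

Adding an INVALID-DATA rule to this program would stop invalid data from reaching the "MARKED-SECTION IGNORE" rule.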
SGML comments, marked sections and processing instructions all affect the processing of text in a variety of ways. The following subsections each discuss one of the ways in which these forms of markup, and the processing applied to them, can affect the output of an OmniMark program.
A DATA-CONTENT rule processes a "contiguous" sequence of text characters. A contiguous sequence of text characters is bounded by:
For example, in the following,
<!DOCTYPE test1 [
<!ELEMENT test1 - - (#PCDATA | x)*>
<!ELEMENT x - - (#PCDATA)>
<!NOTATION n SYSTEM>
<!ENTITY y SYSTEM NDATA n>
]>
<test1>
aaa<x>bbb<?1>ccc
</test1>
each sequence of three letters is a contiguous sequence of text characters.
SGML comments and marked section boundaries do not break up contiguous sequences of text characters. In the following, all occurrences of the letter "a" and the record ends between them form a single sequence of text characters:
<!DOCTYPE test2 [
<!ELEMENT test2 - - (#PCDATA)>
]>
<test2>
<!--first comment-->aaa
aaa<!--second comment-->aaa
aaa<![INCLUDE[aaa]]>aaa
aaa<![CDATA[aaa]]>aaa
aaa<![IGNORE[not data content]]>aaa
aaa<!--third comment-->
The text of the comments ("first comment", "second comment" and "third comment") and the contents of the IGNORE marked section ("not data content") are not part of the text content.
If SGML-COMMENT or MARKED-SECTION rules process some or all of the comments and marked sections in an element, they occur "within" the data content, if data content has already started, and outside of the data content otherwise. For example, if the following OmniMark program were run on the "test2" example above
DOWN-TRANSLATE

ELEMENT #IMPLIED
   OUTPUT "%c"

DATA-CONTENT
   OUTPUT "{%c}%n"

SGML-COMMENT
   OUTPUT "(%c)%n"
the following output would result:
(first comment)
{aaa
aaa(second comment)
aaa
aaaaaaaaa
aaaaaaaaa
aaaaaa
aaa(third comment)
}
In the output:
In the absence of SGML-COMMENT and MARKED-SECTION rules, TRANSLATE rules can match text on either side of SGML comments, IGNORE marked sections and the starts and ends of INCLUDE, CDATA and RCDATA marked sections as if the intervening markup were not there. For example, the following TRANSLATE rule
TRANSLATE "hello" OUTPUT "howdy"
will find the "hello" in
oh hel<!--comment-->lo there
However, if there is an SGML-COMMENT rule in the OmniMark program, the TRANSLATE rule will not match. In such a case, OmniMark:
In this case, the SGML comment forms a "boundary" to translate rule matching, over which a pattern cannot match.
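A minimal sketch of a program in which this boundary arises, built only from rules shown in this chapter. Because the program contains an SGML-COMMENT rule, the "hello" that is split by a comment in the sample above would not be matched, while an uninterrupted "hello" elsewhere in the text would still be rewritten.

```omnimark
DOWN-TRANSLATE

ELEMENT #IMPLIED
   OUTPUT "%c"

TRANSLATE "hello"
   OUTPUT "howdy"

; The presence of this rule makes every SGML comment a TRANSLATE boundary.
SGML-COMMENT
   OUTPUT "(%c)"
```

Removing the SGML-COMMENT rule restores matching across the comment.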
In general, if an OmniMark program contains a rule for processing SGML comments or for processing a particular type of marked section, then SGML comments or that type of marked section cause TRANSLATE rule boundaries in text. In particular:
<p>aaa<![CDATA[bbb]]>ccc
An SGML document can be thought of as consisting of "regions": the SGML Declaration, the DTD, the document instance, and the areas in between and around them. Most of the work done in converting an SGML document is done while in the document instance, but some processing, especially of processing instructions and SGML comments, is done while in other regions.
OmniMark has a set of output processor rules, performed at the boundaries between these regions, that allow such distinctions to be made. Any of these rules can have a condition and local-declarations. Each is performed at the appropriate point in parsing an SGML document if, at that point, it is a member of an active group and its condition, if any, succeeds.
The rules are SGML-DECLARATION-END, DTD-START, DTD-END, PROLOG-END and EPILOG-START.
A note on terminology: In an SGML document, the prolog ends and the document instance starts immediately prior to the start of the document element (i.e., the topmost element in the instance). The document instance continues to the end of the SGML document. As a consequence, any processing instructions and SGML comments between the DTD and the start of the first element in the instance are officially part of the prolog, but any processing instructions or SGML comments following the end of that element are part of the instance.
SGML-DECLARATION-END condition? local-declaration* action*
The SGML-DECLARATION-END rule is performed at the end of the SGML Declaration. This rule can be used to access the #APPINFO information. No output processor rules are performed prior to the SGML-DECLARATION-END rule other than possible EXTERNAL-TEXT-ENTITY rules for the #CHARSET, #CAPACITY and #SYNTAX entities in the SGML Declaration and SGML-ERROR rules for warnings in the SGML Declaration.
If there are errors, OmniMark will terminate processing at the end of the SGML Declaration.
DTD-START condition? local-declaration* action*
The DTD-START rule is performed at the start of a DTD. It is performed immediately after the document element name is determined, so that the DTD's "name" is known. (The #DOCTYPE built-in stream can be used to return the DTD's name.)
The DTD-START rule allows comments and processing instructions prior to the DTD to be distinguished from those within the DTD. Because the rule is not performed until the document element name is known, comments between the DOCTYPE keyword and the document element name cannot be distinguished from those prior to the DTD.
DTD-END condition? local-declaration* action*
The DTD-END rule is performed at the end of the DTD. This rule can be used to access the #DOCTYPE document element name. It also separates the SGML comments and processing instructions in and prior to the DTD from those in the remainder of the document prolog, between the DTD and the document instance.
PROLOG-END condition? local-declaration* action*
PROLOG-END is performed at the end of the document prolog, immediately prior to the start of the document instance and the document element.
EPILOG-START condition? local-declaration* action*
EPILOG-START is performed immediately following the end of the document element. It separates the document element from those SGML comments and processing instructions that follow the document element in the document instance.
When the document instance has SGML errors, the programmer can still obtain control in the EPILOG-START rule to determine how to complete the processing. This allows the program to clean up properly while still reporting as many errors as possible before terminating the parsing of the current document.
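To make the region boundaries concrete, the sketch below writes a marker line at each boundary so that the comments printed by the SGML-COMMENT rule can be attributed to a region. The marker text is arbitrary, and the example assumes that OUTPUT writes to the main output in each of these rules.

```omnimark
DOWN-TRANSLATE

DTD-START
   OUTPUT "[DTD starts]%n"

DTD-END
   OUTPUT "[DTD ends]%n"

PROLOG-END
   OUTPUT "[document instance starts]%n"

EPILOG-START
   OUTPUT "[document element has ended]%n"

; Comments appear between the markers of the region in which they occur.
SGML-COMMENT
   OUTPUT "{%c}%n"

ELEMENT #IMPLIED
   OUTPUT "%c"
```

A comment printed between "[DTD ends]" and "[document instance starts]", for example, lies in the remainder of the prolog.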
OmniMark provides two rules, DOCUMENT-START and DOCUMENT-END, for initialization and termination in the output processor. These rules always execute in the output processor, and cannot appear in cross-translations or process programs.
DOCUMENT-START condition? local-declaration* action+
The DOCUMENT-START rule allows the OmniMark programmer to do processing and produce output before the start of the SGML document (including the SGML Declaration if any).
For example:
DOCUMENT-START
   OUTPUT "{\rtf1\mac %n"

DOCUMENT-START WHEN index ISNT OPEN
   OPEN index AS FILE "index.doc"
DOCUMENT-START rules are performed, in the order they appear in the OmniMark program, after all PROCESS-START rules (if any) have been performed, but before any other processing is done (including PROCESSING-INSTRUCTION rules).
The conditions on DOCUMENT-START rules can be controlled from variables set on the command-line, or from variables that have been set in preceding DOCUMENT-START or PROCESS-START rules.
DOCUMENT-START rules are not permitted in cross-translations or process programs.
Standard "setup" actions that only make sense for the output processor can be placed in a DOCUMENT-START rule in a separate file. This file can be incorporated in many different programs using the INCLUDE declaration.
This only makes sense if these rules are never needed in a cross-translation or a process program. However, if the setup actions are also useful for cross-translations, the actions should be placed in a PROCESS-START rule instead.
DOCUMENT-END condition? local-declaration* action+
The DOCUMENT-END rule allows the OmniMark programmer to do processing and produce output after the end of the SGML document element.
DOCUMENT-END OUTPUT "}%n"
DOCUMENT-END rules are performed, in the order they appear in the OmniMark program, before any PROCESS-END rules, but after any other processing is done (including PROCESSING-INSTRUCTION rules).
DOCUMENT-END rules are not permitted in cross-translations or process programs.
Standard "tear-down" actions can be placed in a DOCUMENT-END rule in a separate file, which is incorporated in the program using an INCLUDE declaration. A DOCUMENT-END rule should only be used if the actions only make sense for the output processor and they are never needed in a cross-translation or a process program. However, if these actions also make sense for cross-translations, they should be placed in PROCESS-END rules.
Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.