SGML record boundaries

In SGML, a document's text consists of records, each delimited by an SGML RS (record-start) character and an RE (record-end) character. In general, OmniMark converts text directed to the SGML parser into the form the parser expects, and similarly converts text returned by the parser into the form the markup rules expect. OmniMark programmers and users rarely need to be aware of these two conversions, but exceptions can arise.

The vast majority of applications on all systems use the line-feed and carriage-return character values for the record-start and record-end characters, respectively. As a consequence, very few applications are affected by these conversions. To be affected, an application must use an SGML declaration that specifies RE or RS function character values (or both) other than those usually used by the system on which OmniMark is running.

OmniMark uses the system-defined values of line feed and carriage return for record-start and record-end, respectively.

By default, OmniMark supports the SGML form of line representation in the following two ways:

  • In text written to the #markup-parser stream, each instance of the newline sequence, "%n", is converted to the two-character sequence (RE, RS). The effect of this conversion is that each newline sequence becomes the record-end mark for the line the newline ends as well as the record-start mark for the following line.
  • In text returned by the parser, each instance of the RE character is converted to the newline sequence. Most record-start characters are discarded by the SGML parser once they are recognized and markup is processed, so the markup rules do not need to handle them.
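
The two default conversions can be modelled outside OmniMark. The following Python sketch only illustrates the description above. It assumes the usual ASCII values (RS is line feed, RE is carriage return) and a one-character newline sequence, and the function names are hypothetical and not part of OmniMark.

```python
RS = "\n"       # record-start: line feed (assumed ASCII value 10)
RE = "\r"       # record-end: carriage return (assumed ASCII value 13)
NEWLINE = "\n"  # stand-in for OmniMark's "%n"; assumed to be one character


def to_parser(text: str) -> str:
    """Model the conversion applied to text written to #markup-parser."""
    # Each newline ends one record (RE) and starts the next one (RS).
    return text.replace(NEWLINE, RE + RS)


def from_parser(text: str) -> str:
    """Model the conversion applied to text the parser hands to markup rules."""
    # The parser is assumed to have discarded the record-starts already,
    # so only record-ends remain to be converted.
    return text.replace(RE, NEWLINE)


if __name__ == "__main__":
    sent = to_parser("first line\nsecond line")
    print(repr(sent))
    print(repr(from_parser("first line\rsecond line")))
```

In this model a program that writes and reads "%n" sees the same line boundaries on both sides of the parser, which is why most applications never notice the conversion.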

You can override this behavior with the sgml-in and sgml-out actions. These actions are intended for cases where the application's view of record boundaries differs from that specified in the SGML declaration.

Additionally, these two actions can be used to suppress record boundary conversion.

There are some interdependencies between the value given in the newline declaration and the default record boundary conversions that you should be aware of.

If no sgml-in action is encountered before any data is output to the #markup-parser stream, then the default conversion depends on the value of the newline sequence, as follows:

  • If the newline sequence is a single character or if there is no newline declaration in the OmniMark program, then all newline sequences in data output to the #markup-parser stream are converted to a carriage return followed by a line feed. For systems that use the ASCII character set, this is equivalent to sgml-in "%13#%10#".
  • If the newline sequence has two or more characters, then newline sequences output to the #markup-parser stream are not converted. For all systems, this is equivalent to sgml-in #none.

These defaults are in effect until an sgml-in action is encountered.
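
The choice of default can be summarised as a small decision function. This Python sketch is a model of the two cases above, not OmniMark code. The names `default_sgml_in` and `write_to_parser` are hypothetical, `None` stands in for both "no newline declaration" and `#none`, and a line feed is assumed as the newline sequence when none is declared.

```python
from typing import Optional

DEFAULT_NEWLINE = "\n"  # assumption: newline sequence used when none is declared


def default_sgml_in(declared_newline: Optional[str]) -> Optional[str]:
    """Return the default replacement string, or None to model sgml-in #none."""
    if declared_newline is None or len(declared_newline) == 1:
        # Single-character (or undeclared) newline: convert to CR followed by LF.
        return "\r\n"
    # Two or more characters: newline sequences pass through unconverted.
    return None


def write_to_parser(text: str, declared_newline: Optional[str] = None) -> str:
    """Apply the default conversion to text bound for #markup-parser."""
    newline = DEFAULT_NEWLINE if declared_newline is None else declared_newline
    replacement = default_sgml_in(declared_newline)
    if replacement is None:
        return text
    return text.replace(newline, replacement)


if __name__ == "__main__":
    print(repr(write_to_parser("a\nb")))            # undeclared newline
    print(repr(write_to_parser("a\r\nb", "\r\n")))  # two-character newline
```

An explicit sgml-in action would replace the value returned by `default_sgml_in` in this model.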

If no sgml-out action is encountered before data content is processed, then all record-end characters in data content are converted to the newline sequence before they are provided to the markup rules. In other words, for all systems, the default sgml-out action is: sgml-out "%n".

Comments, marked sections and processing instructions

Record ends occurring in processing instruction text, IGNORE marked section text, and the text of SGML comments are subject to conversion, as are record ends occurring in PCDATA, although the details differ as described later in this section. OmniMark converts the record ends to the value specified by the sgml-out action. If the sgml-out action specifies #none, record ends are provided to the markup rules in the form in which they come from the SGML parser.

The SGML standard (ISO 8879) doesn't address the processing of text in processing instructions, IGNORE marked sections, or SGML comments, as it does for data content. As a consequence, in these types of text, OmniMark's built-in SGML parser does not discard record-start characters, as it usually does in data content and attribute value text. When the sgml-out action specifies #none, record-start characters will be present in the text.

When the sgml-out action specifies a string:

  • any record-end character followed immediately by a record-start character (the sequence RE, RS) in the text of a processing instruction, IGNORE marked section, or SGML comment is replaced by that string before the text is made available to the OmniMark program, and
  • all other record-end and record-start characters will be unchanged.

This processing differs from that for data content and attribute value text, in which:

  • each record-end is replaced by the sgml-out string, and
  • all record-starts are left alone.

This processing ensures that, unless character references are used in unusual ways, all line boundaries ("newlines") reach the OmniMark program in the same form.
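
The difference between the two kinds of text can be sketched in Python. This is a model under stated assumptions, not OmniMark's implementation: RS is line feed, RE is carriage return, `None` stands in for sgml-out #none, and both function names are hypothetical.

```python
from typing import Optional

RS = "\n"  # record-start: line feed (assumed)
RE = "\r"  # record-end: carriage return (assumed)


def convert_data_content(text: str, sgml_out: Optional[str]) -> str:
    """Data content and attribute values: every RE becomes the sgml-out string."""
    if sgml_out is None:  # models sgml-out #none
        return text
    return text.replace(RE, sgml_out)


def convert_pi_or_comment(text: str, sgml_out: Optional[str]) -> str:
    """PI, IGNORE marked section, and comment text: only RE+RS pairs convert."""
    if sgml_out is None:  # models sgml-out #none
        return text
    # A lone RE or a lone RS is left unchanged.
    return text.replace(RE + RS, sgml_out)


if __name__ == "__main__":
    # In data content the parser has normally dropped the RS already.
    print(repr(convert_data_content("one" + RE + "two", "\n")))
    # In a comment the RS is still present after the RE.
    print(repr(convert_pi_or_comment("one" + RE + RS + "two", "\n")))
```

Both calls in the usage section yield the same string, which matches the statement that line boundaries come out the same in either kind of text. A lone RE inside a comment, for example one introduced by a character reference, is the kind of case where the two would differ.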

The conversion of the record-end/record-start sequence to the sgml-out string occurs when the %c operator is processed in a marked-section ignore rule or an sgml-comment rule, just as in a data-content rule. For a processing-instruction rule, the conversion occurs before the text of the processing instruction is matched against the pattern at the head of the rule.

The processing of record-starts and record-ends in the text of processing instructions varies across versions of OmniMark:

  • In versions of OmniMark prior to Version 2, record-starts were removed from the text and record-ends were converted to the system-standard line-end character or sequence.
  • In versions of OmniMark starting with Version 2 but prior to V2R4, no conversion of record-ends or record-starts was ever done.
  • In versions of OmniMark starting with V2R4, the "default" behavior has been made compatible with that of OmniMark prior to Version 2, but the OmniMark programmer has been given control over the processing with the sgml-out action.

Record ends in processing instructions are not subject to the same rules as record ends in data content and attribute value text. The text in processing instructions is subject to the same processing as the text in SGML comments and IGNORE marked sections: any record-end/record-start sequence is replaced by the string specified by the sgml-out action.
