What Could We Have Done Without SGML?

Beyond SGML

1. What Could We Have Done Without SGML?

Next chapter is Chapter 2, "A Grammar for Grammars".

What we could have done without SGML, had we thought of it, is to design a grammar definition language. The language described in this paper has all the effective functionality of SGML but is very much simpler to use, and allows the definition of grammars that better match the requirements of many applications.

It has, correctly, been pointed out that this language is not really "beyond" SGML at all. It addresses just the same class of problems in the same way. The only difference is that a much simpler solution exists than is described by the SGML Standard.

1.1 A Grammar

A syntax, markup and recognized structure are defined by a grammar. The grammar serves the function of the SGML Declaration and prolog of an SGML document entity. Subject to conversion between character sets, a grammar has a single representation: there are no "variant syntaxes" for the grammar. The grammar is a separate entity from any marked-up document. (In fact, a grammar is itself a marked-up document: the language is powerful enough that it is easy to write a grammar for grammars.)

A grammar consists of BNF-like productions augmented by a small number of declarations and annotations that aid in the interpretation of the grammar. A small sample grammar is as follows:

   document ::= document title, chapter+;
   chapter ::= "<chapter>", chapter title,
               ([#RS, " "+], paragraph | list)+;
   list ::= "<list>", ("=", list item){2}+, "</list>";
   document title, chapter title, paragraph, list item ::= #TEXT;

The right-hand side of a grammar production is written as a regular expression consisting of non-terminal syntactic names, marks, keywords, the ",", "|", "?", "*" and "+" operators, and parentheses.

Each production in the grammar defines a syntactic object, the name of which is the left-hand side of the production. A syntactic name can only be defined once in a grammar. The first syntactic object defined in a grammar is the one that matches a textual document as a whole.

In this paper, syntactic names are allowed to consist of a sequence of words, separated by spaces. The semi-colon delimiter at the end of productions would not be required if a syntactic name were a single word. Issues such as this are in the "sugar" category.
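
As an illustration of these rules, the following Python sketch splits grammar text into productions, enforces the single-definition rule, and takes the first-defined name as the one that matches the whole document. It is not part of the language described in this paper: the function names split_productions and index_grammar are hypothetical, right-hand sides are kept as unparsed strings, and literal strings are assumed to contain no escape sequences.

```python
def split_productions(grammar):
    """Split grammar text into (left-hand names, right-hand side) pairs."""
    pieces, current, in_literal = [], [], False
    for ch in grammar:
        if ch == '"':
            in_literal = not in_literal  # assumption: literals have no escapes
        if ch == ";" and not in_literal:
            pieces.append("".join(current))  # ";" outside a literal ends a production
            current = []
        else:
            current.append(ch)
    productions = []
    for piece in pieces:
        lhs, rhs = piece.split("::=", 1)
        # Names may be several words; normalize spacing and letter case.
        names = [" ".join(name.split()).lower() for name in lhs.split(",")]
        productions.append((names, rhs.strip()))
    return productions


def index_grammar(productions):
    """Return (root name, {name: right-hand side}); reject duplicate definitions."""
    defined = {}
    for names, rhs in productions:
        for name in names:
            if name in defined:
                raise ValueError(f"syntactic name defined twice: {name!r}")
            defined[name] = rhs
    root = productions[0][0][0]  # the first object defined matches the whole document
    return root, defined


SAMPLE = '''
   document ::= document title, chapter+;
   chapter ::= "<chapter>", chapter title,
               ([#RS, " "+], paragraph | list)+;
   list ::= "<list>", ("=", list item){2}+, "</list>";
   document title, chapter title, paragraph, list item ::= #TEXT;
'''
root, defined = index_grammar(split_productions(SAMPLE))
```

Applied to the sample grammar, root is "document" and seven syntactic names are defined, four of them by the single shared production for #TEXT.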

1.2 Marks, Text and Keywords

Marks are used to separate and identify the textual components of a marked-up document. The description of a mark in a grammar is a regular expression surrounded by brackets (e.g. [#RS, " "+]). If a mark consists solely of a literal string, the brackets can be omitted ("<chapter>"). Keywords are used to identify variable or system-dependent conditions, such as the start of a "line" (#RS).

The representation of lines and records and the use of control characters is normally a system-specific issue, and system-independent keywords are provided to represent each of these phenomena. Text files are typically converted with respect to these issues when they are transferred between computer systems. Grammars can be written for unconverted files by not using the system-independent keywords, but by directly coding the values of characters used in marks (e.g. ASCII escape as #27).

Textual components, meaning anything other than the text recognized as part of marks, are represented by the keyword #TEXT.

1.3 Recognizing Marks and Text

A grammar must be LL(1). At each point in the parsing process using a particular grammar there is a set of zero or more marks that can appear. Any text that does not match one of the allowed marks at any point is considered to be a textual component (matching #TEXT). Matching of marks is done on a longest-match-preferred basis.
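
The recognition rule can be sketched in Python as follows. This is a deliberately simplified illustration, not a recognizer for the language: it assumes a single fixed set of literal-string marks, whereas a real parser would recompute the set of allowed marks at each point. The function name split_marks_and_text is hypothetical.

```python
def split_marks_and_text(source, marks):
    """Split source into ("mark", m) and ("#TEXT", t) tokens.

    Simplifying assumption: the same literal marks are allowed everywhere.
    """
    if any(mark == "" for mark in marks):
        raise ValueError("a mark must match at least one character")
    tokens, pos, text_start = [], 0, 0
    while pos < len(source):
        candidates = [m for m in marks if source.startswith(m, pos)]
        if not candidates:
            pos += 1  # no mark starts here, so this character is text
            continue
        if text_start < pos:
            tokens.append(("#TEXT", source[text_start:pos]))
        mark = max(candidates, key=len)  # longest match preferred
        tokens.append(("mark", mark))
        pos += len(mark)
        text_start = pos
    if text_start < len(source):
        tokens.append(("#TEXT", source[text_start:]))
    return tokens


tokens = split_marks_and_text("<list>=a=b</list>", ["<list>", "=", "</list>"])
```

Anything that does not match one of the marks is collected into a #TEXT token, and where two marks match at the same position the longer one is taken.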

Each mark is considered separately in the recognition process. The difference between the following two productions is that in the first case, the text of the comment requires the sequence "-->" to terminate it. In the second case, the text "--" is sufficient to end the text, but the "--" must be followed by ">".

   comment ::= "<!--", #TEXT?, "-->";
   comment ::= "<!--", #TEXT?, "--", ">";
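
The difference between the two productions can be made concrete with a small Python sketch. The function read_comment is hypothetical; it takes the marks that follow #TEXT? as a list, ends the text at the first occurrence of the first mark, and then requires the remaining marks to follow immediately. White-space handling between marks is ignored here.

```python
def read_comment(source, closing_marks):
    """Return the text of a comment, given the marks that follow #TEXT?."""
    opener = "<!--"
    if not source.startswith(opener):
        raise ValueError("not a comment")
    body_start = len(opener)
    first = closing_marks[0]
    end = source.find(first, body_start)  # text ends at the first allowed mark
    if end == -1:
        raise ValueError("comment is not terminated")
    pos = end + len(first)
    for mark in closing_marks[1:]:
        # Each later mark is recognized separately and must come next.
        if not source.startswith(mark, pos):
            raise ValueError(f"expected {mark!r} at offset {pos}")
        pos += len(mark)
    return source[body_start:end]


text_one = read_comment("<!-- a -- b -->", ["-->"])
```

With the single mark "-->", the embedded "--" is just text. With the two marks "--" and ">", the same input is an error, because the first "--" ends the text and is not followed by ">".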

If only marks are allowed at a point in parsing (i.e. no #TEXT), one of those marks must be recognized in the text. Each mark can only appear once in the set of marks recognized at each point.

Note that #TEXT must match at least one character (although that character may be a white-space character where white-space is not otherwise recognized). #TEXT can be followed by ? (like any other mark) to make it optional (a la SGML's #PCDATA).

An identifying mark can appear in a grammar either as part of the syntactic object it identifies or in the context in which the syntactic object is recognized. This difference is illustrated in the example at the start of this paper by paragraph and list in the production for chapter.

The grammar cannot be "left-recursive". In other words, it must not be possible for a syntactic object to contain another syntactic object of the same name without there being at least one mark consisting of at least one character preceding the embedded syntactic object.

1.4 Declarations

Declarations are provided to modify the interpretation of marks. Declarations look like productions except that there is a keyword rather than a syntactic name on the left-hand side. Two declarations are exemplified by the following:

   #MARK-LENGTH ::= 1024;
   #UC-MARKS ::= yes;

The #MARK-LENGTH declaration limits the maximum length of a mark. Any longer mark signals an error. This limit is provided both to make implementing recognizers easier and to improve the quality of error reporting.

The #UC-MARKS declaration indicates whether letters in marks are to be recognized independent of whether they are in upper- or lower-case. Keywords and syntactic names are always recognized independent of whether they are in upper- or lower-case.

If some marks are recognizable in either case and some must be in a particular case, the #CASE keyword can be used preceding literal strings containing letters that must appear in the given case. For example:

   lower-case e acute ::= #CASE "&eacute;";
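
The interaction between #UC-MARKS and #CASE might be sketched in Python as below. The function mark_matches and its parameters are hypothetical names chosen for this illustration; only literal-string marks are considered.

```python
def mark_matches(source, pos, literal, uc_marks=True, case_exact=False):
    """Test whether a literal mark matches source at pos.

    uc_marks   models "#UC-MARKS ::= yes" (letter case is ignored).
    case_exact models a literal preceded by #CASE (letter case must match).
    """
    candidate = source[pos:pos + len(literal)]
    if uc_marks and not case_exact:
        return candidate.lower() == literal.lower()
    return candidate == literal


chapter_ok = mark_matches("<CHAPTER>Title", 0, "<chapter>")
entity_ok = mark_matches("&Eacute;", 0, "&eacute;", case_exact=True)
```

Here chapter_ok is true because case is ignored for ordinary marks, while entity_ok is false because the #CASE literal must appear exactly as written.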

1.5 White Space

The basic rules for the treatment of white space are as follows:

A white space character is discarded whenever only marks are recognized and none of the marks matches the white-space character.
A white space character is considered to match #TEXT whenever #TEXT is allowed.

These rules can be modified by explicitly placing (possibly optional) marks that match white space in a grammar preceding the use of #TEXT. For example:

   name ::= [#WS?, "!"], proper name | #WS?, other name;

In this example, a sequence of (zero or more) white-space characters followed by an exclamation mark identifies a proper name, and anything else, with leading white-space characters removed, is an other name.

The keyword #WS matches a string of one or more white-space characters of any sort. A #WS declaration defines which characters are considered white space:

   #WS ::= " " | #TAB | #RE;
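
The name production can be imitated with two regular expressions in Python. This is a sketch under several assumptions that are not stated in the grammar: #TAB is taken to be the tab character and #RE the newline character, and both proper name and other name are taken to be #TEXT running to the end of the input. The function parse_name is a hypothetical name.

```python
import re

# [#WS?, "!"] with #WS declared as space, tab or record end (assumed "\n").
PROPER_NAME_MARK = re.compile(r"[ \t\n]*!")
# #WS? on its own: optional leading white space.
LEADING_WS = re.compile(r"[ \t\n]*")


def parse_name(source):
    """Return (syntactic name, text) for a string matching the name production."""
    # When the "!" mark matches, it is always longer than the bare #WS match,
    # so trying it first agrees with longest-match-preferred recognition.
    match = PROPER_NAME_MARK.match(source)
    if match:
        return ("proper name", source[match.end():])
    match = LEADING_WS.match(source)
    return ("other name", source[match.end():])


proper = parse_name("  !Ada")
other = parse_name("  Ada")
```

In both cases the leading white space is consumed by a mark and so never reaches the text of the name.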

1.6 Specifying Explicit Matches for Text

Sometimes it is necessary to indicate that certain sequences of text are to be recognized as text and not markup. Usually text is recognized by what ends it, but occasionally it is recognized explicitly (for example, names in a grammar for markup itself). Such text is identified in the grammar by surrounding the characters by braces (e.g. {"<"|">"}). Non-syntactic names (but not syntactic names) can be used within the braces when the right-hand side of the non-syntactic name's production consists entirely of marks.

1.7 Empty Syntactic Objects

The keyword #EMPTY can be used as the right-hand side of a production to allow the absence of other syntactic items. For example:

   document ::= document title, (preface | missing preface), document body;
   preface ::= "<preface>", paragraph+;
   missing preface ::= #EMPTY;

It is an error for an empty syntactic object to be optional (doing so would make its presence or absence ambiguous). Note that an empty syntactic object is not the same thing as a syntactic object that consists entirely of a mark. A mark must always match one or more text characters. An optional mark has the question mark outside the brackets, not inside, making it clear whether the mark was recognized or not.

An empty syntactic object is recognized by recognizing one of the marks or textual content that is allowed following the empty syntactic object. It is an error for one of these following marks to be indistinct from the alternatives to the empty syntactic object.

1.8 Resolving Ambiguities

The marks recognized at any point in parsing consist of all possible marks in the following part of the current production, and, if the current production can end, all possible marks following the current point in the "parent" of the current production (i.e. the production in which the syntactic name on the left-hand side of this production was recognized), and so on. If a mark or text is recognized at more than one level in the current nesting of productions, the ambiguity is resolved by recognizing the mark in the innermost production. In other words, the fewest productions are terminated that result in the recognition of a mark.

Note that, in a production, the parser must "look inside" the productions whose names appear in the production to find which marks can be recognized.
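
The innermost-first rule can be sketched in Python by modelling the current nesting of productions as a stack. The Level class and resolve_mark function are hypothetical, and the set of marks allowed at each level is assumed to have been computed already (including by "looking inside" nested productions).

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str       # syntactic name of the open production
    marks: set      # marks allowed at the current point in this production
    can_end: bool   # whether the production is allowed to end here


def resolve_mark(stack, mark):
    """Return (production that recognizes mark, productions terminated first).

    stack lists the open productions from outermost to innermost.
    """
    terminated = []
    for level in reversed(stack):  # innermost production first
        if mark in level.marks:
            return level.name, terminated
        if not level.can_end:
            break  # cannot look further out past a production that must continue
        terminated.append(level.name)
    raise ValueError(f"mark {mark!r} is not allowed here")


stack = [
    Level("document", {"<chapter>"}, True),
    Level("chapter", {"<section>", "<p>"}, True),
    Level("section", {"<p>", "<list>"}, True),
]
owner, closed = resolve_mark(stack, "<p>")
```

Although "<p>" is allowed in both chapter and section, it is recognized in section, the innermost level, so no production is terminated. A "<section>" mark at the same point would terminate section and be recognized in chapter.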

1.9 Non-Syntactic Objects

Sometimes it is convenient to include syntactic objects in a grammar, not because they are structurally significant, but to simplify the grammar. For example:

   chapter ::= "<chapter>", chapter title,
               (paragraph material, section* | section+);
   section ::= "<section>", section title, paragraph material;
   paragraph material ::= (paragraph | list | example)+;

The paragraph material syntactic object serves simply to make the grammar easier to read. Such objects can be made "non-syntactic objects" by preceding the left-hand side of their production by a percent sign:

   %paragraph material ::= (paragraph | list | example)+;

Non-syntactic objects serve the same role as syntactic objects in recognizing marks and textual content, but are not considered structural objects, and their identity is not passed to any application for which recognition is being performed. For example, in the case of the above definition, use of the percent sign in the definition of paragraph material makes the grammar equivalent to the following from the point of view of the application:

   chapter ::= "<chapter>", chapter title,
               ((paragraph | list | example)+, section* | section+);
   section ::= "<section>", section title, (paragraph | list | example)+;

Note that, because non-syntactic objects can be recursively defined (i.e. use themselves as part of their own definition), it is possible to write a grammar using non-syntactic objects that has no equivalent not using non-syntactic objects.

Non-syntactic names are in the same "name space" as syntactic objects: the percent sign is not part of the object's syntactic name and does not distinguish it in any way. The difference between syntactic objects and non-syntactic objects is purely functional.
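
One way to picture what the application sees is to take a parse tree and splice out every non-syntactic object, leaving its children in place. The Python sketch below assumes a tree of (name, children) tuples in which text is represented by plain strings; this representation and the function name strip_non_syntactic are hypothetical, and the root is assumed to be a syntactic object.

```python
def strip_non_syntactic(node, non_syntactic):
    """Return a copy of the tree with non-syntactic objects spliced out."""
    name, children = node
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)  # textual content is passed through
            continue
        stripped = strip_non_syntactic(child, non_syntactic)
        if stripped[0] in non_syntactic:
            kept.extend(stripped[1])  # hoist the children, drop the object itself
        else:
            kept.append(stripped)
    return (name, kept)


tree = ("chapter", [
    ("chapter title", ["Intro"]),
    ("paragraph material", [
        ("paragraph", ["One"]),
        ("list", [("list item", ["a"]), ("list item", ["b"])]),
    ]),
])
seen_by_application = strip_non_syntactic(tree, {"paragraph material"})
```

After stripping, paragraph and list appear as direct children of chapter, as in the equivalent grammar written without the percent sign.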

1.10 Included Text

Included text, equivalent in function to SGML entities, is defined by a "replaceable mark". For example:

   %generic intro title ::= "&generic-intro;" = "Introduction";

When a replaceable mark is recognized, it is replaced by its replacement part (the text following the equals sign), and parsing continues by parsing the replacement. Marks cannot cross replacement boundaries: they must be completely contained within or outside the replacement text. The #ASIS keyword can be used to indicate that a replacement is not to be reparsed:

   %backslash ::= "\\" = #ASIS "\";

Note the use of non-syntactic objects in these examples, because typically replacements are for convenience in marking up a document and do not affect the document's structure.
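
The behaviour of replaceable marks, including #ASIS and the rule that marks cannot cross replacement boundaries, can be sketched in Python. The function expand is hypothetical and works on flat text only: a real parser would go on to recognize structure in the replacement rather than just produce a string. The nesting limit is an assumption added to keep the sketch from looping on self-referential replacements.

```python
def expand(source, replacements, depth=0):
    """Expand replaceable marks in source.

    replacements maps a mark to (replacement text, asis flag).
    """
    if depth > 16:
        raise ValueError("replacements nested too deeply")
    out, pos = [], 0
    while pos < len(source):
        candidates = [m for m in replacements if source.startswith(m, pos)]
        if not candidates:
            out.append(source[pos])
            pos += 1
            continue
        mark = max(candidates, key=len)  # longest match preferred
        replacement, asis = replacements[mark]
        if asis:
            out.append(replacement)  # #ASIS: inserted without reparsing
        else:
            # The replacement is reparsed on its own, so no mark can start
            # inside it and finish outside it.
            out.append(expand(replacement, replacements, depth + 1))
        pos += len(mark)
    return "".join(out)


REPLACEMENTS = {
    "&generic-intro;": ("Introduction", False),
    "\\\\": ("\\", True),  # two backslashes become one, not reparsed
}
result = expand("&generic-intro; to C:\\\\temp", REPLACEMENTS)
```

Because each replacement is expanded separately from the text around it, a replacement that ends in "&" cannot combine with a following "b;" to form a new mark.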

Text can be included from other sources by use of the #SOURCE keyword:

   %generic intro chapter ::= "&intro;" = #SOURCE "myintro.doc";

The literal following #SOURCE is interpreted by an application: it need not be a file name. To provide a more general mechanism for including external text, the #SOURCE keyword can be used in place of #TEXT on the right-hand side:

   included text ::= "&", #SOURCE, ";";

This example means: anything between an ampersand and a semicolon is the application-specific name of the source of text to be included and reparsed. In either use of #SOURCE, the keyword can be preceded by #ASIS, indicating that the included text is not to be reparsed.
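
Since the source name is interpreted by the application, an implementation might accept a callback that maps names to text. The Python sketch below models the included text production with such a callback; include_sources and its resolve parameter are hypothetical names, the nesting limit is an added assumption, and the result is flat text rather than parsed structure.

```python
import re


def include_sources(source, resolve, asis=False, depth=0):
    """Replace each "&name;" with the text that resolve(name) returns.

    resolve is supplied by the application; name need not be a file name.
    asis models a preceding #ASIS keyword (included text is not reparsed).
    """
    if depth > 16:
        raise ValueError("inclusions nested too deeply")

    def replace(match):
        included = resolve(match.group(1))
        if asis:
            return included
        return include_sources(included, resolve, asis, depth + 1)

    # "&", then the source name up to the next ";" mark.
    return re.sub(r"&([^;]+);", replace, source)


SOURCES = {"intro": "Hello &name;!", "name": "World"}
reparsed = include_sources("<&intro;>", SOURCES.__getitem__)
verbatim = include_sources("<&intro;>", SOURCES.__getitem__, asis=True)
```

Here a dictionary lookup stands in for the application. Without #ASIS the included text is itself scanned for further inclusions; with #ASIS it is inserted unchanged.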

1.11 Non-Issues

The markup-language-definition language described above has all the effective descriptive power of SGML in spite of it missing many of the specific constructs of SGML. Many of SGML's constructs are simply predefined element types, with associated distinctive markup. For example, a comment is text that is to be ignored. This "semantics" could as easily be defined by the application. The common request that SGML parsers return the text of comments to the applications invalidates any claim that there is any utility in predefined elements of this sort: if the application is not allowed to interpret an element, then implementations are not able to provide all the functionality requested by users.

Record ends and white-space handling should also be left up to the application. The attempt in SGML to provide general rules has not worked. The method described above is to treat white space as text when the grammar allows text, and to ignore it otherwise. Ambiguities can be resolved by adding contextually-significant marks to the grammar that match white space when interpreting it as text could cause difficulty.

A non-SGML entity is really a syntactic object. The requirement of SGML to provide an "intermediate" representation of a non-SGML entity in the form of SDATA text is clumsy. In the language described in this paper, the syntactic object identified by a mark identifies the "non-SGML" object. For example:

   upper-case E grave ::= "&Egrave;";

In this case, the application "sees" the syntactic object upper-case E grave and is free to interpret it in the same manner as it would the expansion of a corresponding SDATA entity. External "non-SGML" objects would be interpreted by an application, just as they are for SGML.

The question arises: where do "entities", such as &Egrave;, appear in a grammar? They are not just automatically allowed anywhere: from the point of view of the grammar they are just like any other syntactic object. The answer is that they are included on the right-hand side of appropriate productions. This means, in practice, that #TEXT is only used sparingly in a grammar: usually a non-syntactic text object will be defined that is a repeatable alternation of #TEXT and all allowed "character" entities. Other types of entities can be incorporated elsewhere in appropriate spots. An example of doing this for text is:

   paragraph ::= "<p>", text;
   %text ::= (#TEXT |
              upper-case E acute |
              lower-case e acute |
              upper-case E grave |
              lower-case e grave)+;
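
From the application's side, the content of such a paragraph arrives as a sequence of #TEXT runs and named syntactic objects, which it is free to interpret. A Python sketch of this, assuming each character object is defined by a case-sensitive literal mark such as "&Egrave;" (the ENTITIES table, parse_text and render are hypothetical):

```python
# Mark -> (syntactic name, character the application chooses to render).
ENTITIES = {
    "&Eacute;": ("upper-case E acute", "\u00c9"),
    "&eacute;": ("lower-case e acute", "\u00e9"),
    "&Egrave;": ("upper-case E grave", "\u00c8"),
    "&egrave;": ("lower-case e grave", "\u00e8"),
}


def parse_text(content):
    """Split content into ("#TEXT", run) and (syntactic name, mark) objects."""
    objects, pos, start = [], 0, 0
    while pos < len(content):
        mark = next((m for m in ENTITIES if content.startswith(m, pos)), None)
        if mark is None:
            pos += 1
            continue
        if start < pos:
            objects.append(("#TEXT", content[start:pos]))
        objects.append((ENTITIES[mark][0], mark))
        pos += len(mark)
        start = pos
    if start < len(content):
        objects.append(("#TEXT", content[start:]))
    return objects


def render(objects):
    """Application-side interpretation: map each object to output characters."""
    chars = {name: ch for name, ch in ENTITIES.values()}
    return "".join(value if name == "#TEXT" else chars[name] for name, value in objects)


objects = parse_text("caf&eacute; &Egrave;")
```

The recognizer only reports that a lower-case e acute object occurred; choosing the character to display is left to render, which plays the part of the application.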

#TEXT matches one or more text characters. It can be qualified by "?", or preceded by optional white-space marks to control whether it is optional, and whether white-space counts. This approach does not cover all the issues of text validity, but is a simple system that does a lot more than SGML does in the same circumstances.
