OmniMark® Programmer's Guide Version 3

16. Processing External Entities

Previous chapter is Chapter 15, "Processing SGML Errors".

Next chapter is Chapter 17, "SGML Document and Subdocument Parsing".

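SGML supports several different kinds of external entities. They are divided into two classifications: external text entities, whose replacement text can contain SGML markup and is interpreted by the SGML parser, and external data and subdocument entities (CDATA, SDATA, NDATA and SUBDOC).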

16.1 SGML Entity Managers

An SGML entity manager is a program, or part of a program, that takes an entity reference and returns the text to an SGML parser. The SGML parser manages internal entities, but it requires help from the applications using it to find the text of external entities. Every SGML system contains an entity manager for external entities. The entity manager can find external entity text in a file system or database, or the text can be "hard coded" into the entity manager itself.

The external entity manager is used primarily when a document contains a general or parameter entity reference that requires the SGML system to read in the text of the entity, and requires the SGML parser to interpret that text in the context in which it finds the reference.

OmniMark incorporates a "built-in" entity manager that supports the requirements of most users for locating external entities. As an alternative, when OmniMark's built-in entity manager does not do what an application requires, an OmniMark program can implement its own entity manager, using the EXTERNAL-TEXT-ENTITY rule and other facilities described in this chapter.

16.1.1 OmniMark's Built-in Entity Manager

When OmniMark encounters a reference to an external text entity (i.e. one that can contain SGML markup), its built-in entity manager tries to find a system identifier for the entity. It first looks in the entity's declaration. If there is not a system identifier in the declaration but there is a public identifier, it looks the public identifier up in the OmniMark program's LIBRARY rules, and uses the system identifier found there, if any. Once it has a system identifier, the entity manager tries to use it as the name of a file, using each of the -libpath strings from the command line as prefixes (in the order given) in the process. Once it finds a file on the system on which OmniMark is running, it uses the text of that file as the entity. If the entity manager cannot find a system identifier, cannot find a file with the system identifier as its name, or finds that the file is unreadable, it stops the OmniMark program with an appropriate message.

An additional provision is made by OmniMark's built-in entity manager for "anonymous" entities referred to by public identifiers in the SGML Declaration, to avoid requiring all users to provide definitions for these entities. Some character sets and concrete syntaxes are "built-in" to OmniMark (in particular, those described in the SGML standard, ISO 8879). Unrecognized character sets and capacity sets are given default values. In particular, if one of these public identifiers is not mapped to a system identifier by a LIBRARY rule, the built-in entity manager will do the following:

When a file is found, the built-in entity manager passes the text of that file to the SGML parser unmodified (except possibly for changing the newline sequences in the text file to the RS/RE sequences expected by the SGML parser).

In other words, OmniMark's built-in entity manager assumes that a file identified by an external text entity contains the text of that entity. In particular, it is assumed that no conversion or other processing of the text is required. This is an appropriate assumption -- most text files that are subjects of an entity reference from within an SGML document will themselves be coded in SGML.

The public identifiers recognized by OmniMark's built-in entity manager are described in more detail in Section 16.4.2, "Public Identifiers in the SGML Declaration".
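The lookup order for external text entities described in this section can be modelled outside OmniMark. The following is a minimal Python sketch of that order, not OmniMark code and not the actual implementation. The names resolve_entity_file, EntityNotFound and is_readable are hypothetical, and trying the system identifier without any prefix before the -libpath prefixes is an assumption borrowed from the example in Section 16.3.2.

```python
import os


class EntityNotFound(Exception):
    """Models the point at which the built-in entity manager stops the program."""


def _default_is_readable(path):
    # A file counts as usable only if it exists as a file and can be read.
    return os.path.isfile(path) and os.access(path, os.R_OK)


def resolve_entity_file(system_id, public_id, library, libpath,
                        is_readable=_default_is_readable):
    """Return the name of the file that would supply the entity's text.

    system_id and public_id come from the entity declaration (None if absent).
    library models the LIBRARY rules: public identifier -> system identifier.
    libpath models the -libpath strings, in command-line order.
    """
    # Step 1: a system identifier in the declaration is used as it stands.
    # Step 2: otherwise the public identifier is looked up in the LIBRARY rules.
    if system_id is None and public_id is not None:
        system_id = library.get(public_id)
    if system_id is None:
        raise EntityNotFound("no system identifier for entity")

    # Step 3: try the identifier as a file name with each prefix in turn.
    # Trying the bare name first is an assumption (see the lead-in).
    for prefix in [""] + list(libpath):
        candidate = prefix + system_id
        if is_readable(candidate):
            return candidate
    raise EntityNotFound("no readable file for " + system_id)
```

Passing is_readable as a parameter is only a convenience: it lets the search order be exercised without touching a real file system.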

16.1.2 What Your Own Entity Manager Can Do

The relatively simple model supported by OmniMark's built-in entity manager -- an external entity is a file -- works for most OmniMark applications, but not for all. For those other applications, OmniMark programmers can, in a simple, straightforward way, write their own entity managers, using OmniMark's EXTERNAL-TEXT-ENTITY rule.

The EXTERNAL-TEXT-ENTITY rule allows the OmniMark programmer to do things other than just providing an alternative way of finding a file containing an entity's text. An entity can be any sequence of characters. The entity manager provides a sequence of characters to the SGML parser when the SGML parser passes an external entity reference to the entity manager. The sequence of characters doesn't have to be a "verbatim" copy of the text in a file. Other possibilities include:

OmniMark provides direct access to the entries in the LIBRARY rules and to the -libpath command-line arguments, so that the behavior of OmniMark's entity manager can be duplicated by a program written in OmniMark. This can be useful, for example, when what OmniMark's entity manager does is a good "fall-back" position when an application-specific scheme does not find an entity's text.

16.1.3 Which Entity Manager Is Used?

When there are no EXTERNAL-TEXT-ENTITY rules in an OmniMark program, then OmniMark uses its built-in entity manager to find the text of an SGML text entity, as described in Section 16.1.1, "OmniMark's Built-in Entity Manager".

When there are any EXTERNAL-TEXT-ENTITY rules in an OmniMark program, the built-in entity manager is not used -- the entity manager defined by the OmniMark program (by the EXTERNAL-TEXT-ENTITY rules) is used instead. In this case, one (and only one) of these rules must provide the replacement text for each external text entity, general or parameter, referenced in an SGML document, as follows:

(The above statements apply to both general entities and parameter entities. Similar statements apply independently to the "entities" referenced by the external identifier at the start of the DOCTYPE declaration, and by the public identifiers in the SGML declaration. For more about these entities, see Section 16.4.1, "The Public Identifier at the Start of the DTD", Section 16.4.2, "Public Identifiers in the SGML Declaration" and Section 16.5, "A Default External Text Entity Rule".)

OmniMark provides facilities that allow an OmniMark program to duplicate the behavior of OmniMark's built-in entity manager. An OmniMark program can, for example, provide or find the replacement text for a certain class of external text entities, but "fall back" on OmniMark's built-in behavior for any other external text entity by duplicating what OmniMark would have done otherwise, or by doing some variant of that.


16.2 External Entity Rules

The basic tools for writing an SGML entity manager in OmniMark are the EXTERNAL-TEXT-ENTITY and EXTERNAL-DATA-ENTITY rules.

This section describes those rules in detail.

16.2.1 Processing External Data and Subdocument Entities

Syntax

   EXTERNAL-DATA-ENTITY entity-name condition?
      local-declaration*
      action*

Several OmniMark constructs are used in processing external entities. Section 14.4, "Attributes" and Section 14.4.3.4, "Data Attributes Associated With Entity Attributes" mention features that address the data attributes of external data entities.

This section presents the EXTERNAL-DATA-ENTITY rule.

Internal data entity references and all text entity references simply result in substitution of the entity's replacement text for the reference. The SGML parser automatically processes the replacement text and no special rules are needed in the OmniMark program.

An EXTERNAL-DATA-ENTITY rule is used to process external data and subdocument entities.

An EXTERNAL-DATA-ENTITY rule is triggered when a reference to the named entity occurs in data content and any specified condition is met. An EXTERNAL-DATA-ENTITY rule is never selected when the entity name occurs in the value of an ENTITY or ENTITIES attribute, because use of an entity name in an attribute value is not considered to be an entity reference.

When the same actions apply to more than one entity, the names of all the entities can be listed in the rule header, enclosed in parentheses, and separated by the operator "|" (OR). The entity name can be output using the format item "%q". (See Section 14.2.1, "Formatting Entity Names".)

For example, suppose references to entities named picture1 and picture2 are both to receive similar processing. The following EXTERNAL-DATA-ENTITY rule might be used:

   EXTERNAL-DATA-ENTITY (picture1 | picture2)
     OUTPUT "\picture{%uq.PIC}"

Specific entity names need not be listed in the rule header. The keyword #IMPLIED can be used instead to indicate that the rule should be selected whenever the condition is met and a reference to an external data or subdocument entity occurs. For instance, the following rule header can be used to define actions to be performed whenever a reference occurs to an entity declared with notation "TBL":

   EXTERNAL-DATA-ENTITY #IMPLIED WHEN NOTATION = "TBL"

It is an error if multiple EXTERNAL-DATA-ENTITY rules are selectable for one entity reference. If the same entity name is used in more than one EXTERNAL-DATA-ENTITY rule, then each rule must have a condition which ensures that only one will be selected. It is also an error if no EXTERNAL-DATA-ENTITY rule is selected when a reference occurs to an external data or subdocument entity.

16.2.1.1 Attribute References In EXTERNAL-DATA-ENTITY Rules

In the condition on an EXTERNAL-DATA-ENTITY rule, a test of an attribute value refers to a data attribute. In other rules, an unqualified attribute name refers to attributes of the current element, and data attributes must be accessed explicitly. For example, in the following rule headers, type is a data attribute:

Example A

   EXTERNAL-DATA-ENTITY graphic WHEN ATTRIBUTE type = "TIFF"

Example B

   ELEMENT list WHEN DATA-ATTRIBUTE type (OF ATTRIBUTE id) = "TIFF"

However, in the rule headers shown below, type is an attribute specified in a start-tag:

Example A

   ELEMENT list WHEN ATTRIBUTE type = "BULLET"

Example B

   EXTERNAL-DATA-ENTITY graphic WHEN ATTRIBUTE type OF ELEMENT
                   = "BULLET"

EXTERNAL-DATA-ENTITY rules are not permitted in cross-translations.

16.2.2 Processing External Text Entities

Syntax

   EXTERNAL-TEXT-ENTITY entity-name condition?
      local-declaration*
      action*

An EXTERNAL-TEXT-ENTITY rule is used to provide OmniMark's built-in SGML parser with the text of an external text entity (i.e. an external entity that is not CDATA, SDATA, NDATA or SUBDOC) whenever such an entity is referenced in an SGML document. An EXTERNAL-TEXT-ENTITY rule looks like an EXTERNAL-DATA-ENTITY rule. Its most important property is that everything written to the #SGML stream within the rule is considered to be part of the entity's text.

For example, the following EXTERNAL-TEXT-ENTITY rule specifies that the text of the entity named version is the contents of the file named "version.txt":

   EXTERNAL-TEXT-ENTITY version
      OUTPUT FILE "version.txt"

The following EXTERNAL-TEXT-ENTITY rule handles all external text entities. In a similar manner to its use in the EXTERNAL-DATA-ENTITY rule, #IMPLIED means "all SGML text entities" (actually "all named SGML text entities", see Section 16.4.1, "The Public Identifier at the Start of the DTD"). It is about the simplest entity manager that can be written. It just takes the name of the entity, lower-cases it, appends the extension ".ent", and uses it as a file name.

   EXTERNAL-TEXT-ENTITY #IMPLIED
      OUTPUT FILE "%lq.ent"

Note that "%q" gives the name of the entity, in the same way as in an EXTERNAL-DATA-ENTITY rule, and that "%eq", "%pq" and "%epq" give, respectively, the entity's effective (declared or associated by the LIBRARY rules) system identifier, its declared public identifier, and the system identifier mapped from the public identifier in the LIBRARY rules. However, because text entities do not have associated notations, the "o" modifier must not be used with the "%q" format item in an EXTERNAL-TEXT-ENTITY rule.

The following EXTERNAL-TEXT-ENTITY rule is a bit more complicated (and also a bit more useful). It first checks to see if an external text entity has a system identifier, then checks for a system identifier in the LIBRARY rules, and finally uses the entity name, appending the extension ".ent" if the entity is a parameter entity (the "%" kind), or the extension ".sgm" if the entity is a general entity (the "&" kind). Once it has a file name, it checks to see if the file exists and is readable. If so it passes the text of the file to the SGML parser as the entity's replacement. If not, it issues an error on the error file, and provides an SGML comment as the entity's replacement text.

   EXTERNAL-TEXT-ENTITY #IMPLIED
      LOCAL STREAM file-name
      ; Choose a file name: the system identifier first, then the
      ; LIBRARY mapping, then a name built from the entity name.
      DO WHEN ENTITY IS SYSTEM
         SET file-name TO "%eq"
      ELSE WHEN ENTITY IS IN-LIBRARY
         SET file-name TO "%epq"
      ELSE WHEN ENTITY IS PARAMETER
         SET file-name TO "%lq.ent"
      ELSE
         SET file-name TO "%lq.sgm"
      DONE
      ; Pass the file's text to the parser, or report the failure.
      DO WHEN FILE file-name EXISTS
           & FILE file-name IS READABLE
         OUTPUT FILE file-name
      ELSE
         PUT #ERROR "*** ERROR *** Can't read from %"%g(file-name)%"%n"
         OUTPUT
          "<!-- *** ERROR *** Can't read from %"%g(file-name)%" -->"
      DONE

The "ENTITY IS" tests are described in Section 14.2.2, "Entity Tests". The FILE ... "IS READABLE" and related tests are described in Section 10.4.1, "File Tests".

The SUBMIT action can be used in the EXTERNAL-TEXT-ENTITY rule in place of (or in addition to) the OUTPUT or PUT actions. The uses and implications of doing so are discussed in Section 16.2.2.3, "Using SUBMIT in the External Text Entity Rule".

If an EXTERNAL-TEXT-ENTITY rule outputs no text, then, from the point of view of the SGML parser, the entity's replacement text simply consists of zero characters. This is not an error.

It should be noted that, in a context-translation, an EXTERNAL-TEXT-ENTITY rule can be performed while the FIND-START rules are being performed. This will happen if the FIND-START rules output the text of the SGML Declaration and there are EXTERNAL-TEXT-ENTITY rules for processing the entities represented by the public identifiers in the SGML Declaration.

16.2.2.1 Where Entities Come From and Where They Go

The OUTPUT action is normally used inside an EXTERNAL-TEXT-ENTITY rule to provide the SGML parser with the entity's replacement text. In an EXTERNAL-TEXT-ENTITY rule, the default #CURRENT-OUTPUT stream set contains only the #SGML stream. That allows the replacement text of the entity to be fed to the parser using OUTPUT actions.

16.2.2.1.1 Constructing Entity Text from Multiple Sources

Because anything written to the #SGML stream in an EXTERNAL-TEXT-ENTITY rule becomes part of the entity's text, the entity's text can be made up of one or more pieces from one or more sources.

For example, the following EXTERNAL-TEXT-ENTITY rule processes any external text entity that has a system identifier. It treats the system identifier as a sequence of file names, separated by semicolons (note that this is a good idea on some systems, but not on others), and concatenates the text from all of the files together as the entity's replacement text.

   EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
      REPEAT SCAN "%eq"
      MATCH [ANY EXCEPT ";"]+ => file-name
         OUTPUT FILE file-name
      MATCH ";"
         ; Ignore any semicolon
      AGAIN

An example of an entity with multiple file names is a case where there is a general entity that represents the chapters that comprise the "advanced" part of a textbook:

   <!ENTITY advanced SYSTEM "chapter7.sgm;chapter8.sgm;chapter9.sgm">

Another example is a case where a parameter entity represents more than one set of declarations, as an alternative to having separate declarations for each set:

   <!ENTITY % chars SYSTEM "mathchars.ent;pubchars.ent">
   %chars;
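The concatenation technique above does not depend on OmniMark. As a rough illustration only, the following Python sketch models the same logic: split the system identifier on semicolons and join the text of each named file in order. The names read_file and concatenated_entity_text are hypothetical, and the UTF-8 encoding is an assumption of the sketch.

```python
def read_file(name):
    # Assumption of this sketch: the entity files are UTF-8 text.
    with open(name, encoding="utf-8") as handle:
        return handle.read()


def concatenated_entity_text(system_identifier, reader=read_file):
    """Join the text of every file named in a semicolon-separated identifier."""
    # Empty pieces (for example from a trailing semicolon) are ignored,
    # just as the rule above ignores the semicolons themselves.
    names = [name for name in system_identifier.split(";") if name]
    return "".join(reader(name) for name in names)
```

With the declaration of the entity "advanced" shown above, the replacement text would be the text of chapter7.sgm, chapter8.sgm and chapter9.sgm, in that order.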

16.2.2.1.2 Combining External Entities in a Single File

Alternatively, if the replacement text of some of the external entities is small, all of the entities can be defined in a single file. This technique can be used to construct a "control file" for configurable documents.

The following example reads in a file called "entity.set", which contains a set of entity definitions, each of which consists of an external entity name terminated by an equals sign, a "quote" character, which may be anything other than a newline, and the text of the entity terminated by another "quote" character. Optional, non-significant line breaks are allowed following the equals sign and following the closing "quote" on the entity text.

The example assumes that the file "entity.set" is correctly formatted, and makes no provision for the entity not being defined. (If it is not defined, the entity text will be "", the zero-length string, which may or may not be an appropriate fall-back position.)

   EXTERNAL-TEXT-ENTITY #IMPLIED
      REPEAT SCAN FILE "entity.set"
      MATCH "%q" "=" "%n"? any => quote
            ((lookahead ! another quote) any)* => text
         OUTPUT text
         EXIT
      MATCH [any except "="]+ "=" "%n"? any => quote 
            ((lookahead ! another quote) any)* another quote "%n"?
         ; Skip other entities
      AGAIN

An example of an "entity.set" file is the following, which contains information specific to processing a set of documents. Different "quotes" are used just for illustration.

   orgname=/OmniMark Technologies Corporation/
   prodname='OmniMark'
   rights=
   "All rights reserved by OmniMark Technologies Corporation.  This
   material contains the valuable properties of OmniMark Technologies
   Corporation.  No part of this material may be reproduced, translated
   or transmitted in any form or by any means, electronic, mechanical,
   or otherwise, including photocopying and recording, without the
   permission in writing from OmniMark Technologies Corporation."

Using this example, the entity reference "&orgname;" would have the replacement text "OmniMark Technologies Corporation".
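To make the "entity.set" format concrete, the following Python sketch models the same lookup. It is an illustration of the file format described above, not a translation of the OmniMark rule. The names ENTRY, SAMPLE and lookup_entity are hypothetical, and the sample text is a shortened version of the file shown above. Like the OmniMark example, it assumes the control file is correctly formatted and returns the zero-length string for an undefined entity.

```python
import re

# One definition: name, "=", optional newline, a quote character (anything
# but a newline), the text up to the same quote character, optional newline.
ENTRY = re.compile(r"([^=]+)=\n?([^\n])(.*?)\2\n?", re.DOTALL)

SAMPLE = (
    "orgname=/OmniMark Technologies Corporation/\n"
    "prodname='OmniMark'\n"
    "rights=\n"
    "\"All rights reserved by OmniMark Technologies Corporation.  This\n"
    "material contains the valuable properties of OmniMark Technologies\n"
    "Corporation.\"\n"
)


def lookup_entity(control_text, entity_name):
    """Return the text defined for entity_name, or "" if it is not defined."""
    position = 0
    while position < len(control_text):
        match = ENTRY.match(control_text, position)
        if match is None:
            break  # Malformed remainder: stop scanning.
        if match.group(1) == entity_name:
            return match.group(3)
        position = match.end()  # Skip other entities.
    return ""
```

For example, lookup_entity(SAMPLE, "orgname") gives "OmniMark Technologies Corporation", and a name that is not in the file gives the zero-length string.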

In practice, an entity manager will use a combination of these techniques.

16.2.2.2 The Domains Of The External Text Entity Rule

The EXTERNAL-TEXT-ENTITY rule is unique in OmniMark, in that different parts of it are executed in each domain. The header of the rule and any associated condition are tested in the output processor. If the rule is selected, the actions within the rule body are performed in the input processor.

The reason for this split is:

In practice, the fact that the rule header is tested in the output processor and the actions are executed in the input processor will be irrelevant to most programmers. However, it does have the following implications:

Versions of OmniMark prior to V3 treated the rule header for the EXTERNAL-TEXT-ENTITY rule as if it were evaluated in the input processor. So there may be a change in behaviour for EXTERNAL-TEXT-ENTITY rules which are not in the #IMPLIED group. It is not expected that this will be a significant change for most programs.

16.2.2.3 Using SUBMIT in the External Text Entity Rule

When an external text entity's text does not need processing, it is appropriate for an EXTERNAL-TEXT-ENTITY rule to use an OUTPUT or PUT action to provide the file's text to the SGML parser.

Even if some processing is required, it can be done with a "DO SCAN" or "REPEAT SCAN" in the EXTERNAL-TEXT-ENTITY rule, wherein each MATCH part emits the processed text with OUTPUT or PUT.

On the other hand, if substantial processing is required, it will often be the case that it is more appropriate to SUBMIT the text of the file for processing by FIND rules. In this case, any output of the FIND rules that process the submitted text is considered part of the text of the entity.

If the FIND rules are different from those used to process the main input, it will be necessary to use a "USING GROUP" prefix on the SUBMIT action to specify which FIND rules are used to process the submitted text. The following sample EXTERNAL-TEXT-ENTITY rule illustrates this processing:

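   EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS (PUBLIC & IN-LIBRARY)
      ; "%epq" is the system identifier that the LIBRARY rules map
      ; from the entity's public identifier.
      USING GROUP entity-processing
         SUBMIT FILE "%epq"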

16.2.2.4 OUTPUT-TO in the EXTERNAL-TEXT-ENTITY Rule

An OUTPUT-TO action is allowed in an EXTERNAL-TEXT-ENTITY rule. OUTPUT-TO in an EXTERNAL-TEXT-ENTITY rule remains in effect until the end of the EXTERNAL-TEXT-ENTITY rule, unless it is overridden by a further OUTPUT-TO.

Normally, the only active output stream in an EXTERNAL-TEXT-ENTITY rule is the #SGML stream, so that text written using the OUTPUT action becomes part of the replacement text of the external text entity. The OUTPUT-TO action allows the OmniMark programmer to redirect the output to another destination. The EXTERNAL-TEXT-ENTITY rule's active output can be restored to the #SGML stream using

   OUTPUT-TO #SGML

16.2.2.5 Guarding Entity Expansion

An entity manager may want to skip over unreadable files when it is searching for a file containing the text of an external entity. The file tests described in Section 10.4.1, "File Tests" can be used, as in the following example:

   EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
      REPEAT SCAN "%eq"
      MATCH [ANY EXCEPT ";"]+ => file-name
         DO WHEN FILE file-name IS READABLE &
                 FILE file-name ISNT DIRECTORY
            OUTPUT FILE file-name
            EXIT
         DONE
      MATCH ";"
         ; Skip over the separating semicolons
      MATCH VALUE-END
         PUT #ERROR "None of the files %"%eq%" (for entity %q) " _
                    "are readable.%n"
         HALT
      AGAIN

This example interprets the system identifier as a set of alternative file names separated by semicolons. As the text of the external entity, it uses the first one that is readable and is really a file and not a directory. If there is no such file, the EXTERNAL-TEXT-ENTITY rule terminates the OmniMark program with a message.
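The selection logic in this example can be modelled in a few lines of Python. This is a sketch of the idea only, not OmniMark code. The function name first_readable_file is hypothetical, and returning None where the OmniMark rule halts the program is a simplification chosen for the sketch.

```python
import os


def first_readable_file(system_identifier):
    """Return the first alternative that is a readable, regular file.

    The system identifier is treated as a semicolon-separated list of
    alternative file names. None is returned if no alternative qualifies.
    """
    for candidate in system_identifier.split(";"):
        # Skip empty pieces, directories, missing files and unreadable files.
        if (candidate
                and os.path.isfile(candidate)
                and os.access(candidate, os.R_OK)):
            return candidate
    return None
```

A caller that wants the behavior of the rule above would report an error and stop when the result is None, and otherwise read the named file as the entity's text.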


16.3 Accessing the Library Rules and the Library Path

To provide compatibility between the LIBRARY rules and -libpath command-line argument of OmniMark and the EXTERNAL-TEXT-ENTITY rule, three built-in stream shelves are supported by OmniMark: #LIBRARY, #LIBPATH and #LIBVALUE.

Although these shelves have been provided primarily to support compatibility with earlier versions of OmniMark, they are very useful in their own right. Programmers who are writing their own entity managers should carefully consider how these streams can be used for other purposes. The examples in this section might help some programmers get a start on thinking up new ideas.

16.3.1 Manipulating the Library Rule Mappings

#LIBRARY is a built-in stream that starts out with one item for each entry in every LIBRARY rule in the OmniMark program or in a -library file at run-time. The key of each item is a public identifier, and the value of each item is the corresponding system identifier.

So, for example, if the text of a source element is a public identifier and the OmniMark program is to output the corresponding system identifier, the following element rule could be used:

   ELEMENT source
      OUTPUT #LIBRARY ^ "%c"

The primary use of the #LIBRARY stream is in EXTERNAL-DATA-ENTITY and EXTERNAL-TEXT-ENTITY rules. In such a rule, the following two OUTPUT actions would output the same text:

   OUTPUT "%epq"
   OUTPUT #LIBRARY ^ "%pq"

Any change made to the #LIBRARY stream is immediately reflected in how the "%epq" and "%epv" format items are interpreted. If the OmniMark program contains no EXTERNAL-TEXT-ENTITY rules, then any change made to the #LIBRARY stream also determines how OmniMark's built-in entity manager interprets public identifiers in referenced external text entities.

The default "current item" of the #LIBRARY shelf is the lastmost item, as is the case with programmer-declared shelves.

16.3.2 Manipulating the Library File Search Path

The built-in stream #LIBPATH starts out with one item for each -libpath argument on the command line, in the order that the -libpath arguments appear on the command line. The following EXTERNAL-TEXT-ENTITY rule does what OmniMark's built-in entity manager does for entities with system identifiers (a more complete example, handling public identifiers, is given in a later section):

   EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
      LOCAL STREAM file-name
      DO WHEN FILE "%eq" EXISTS
         SET file-name TO "%eq"
      ELSE
         REPEAT OVER #LIBPATH
            DO WHEN FILE "%g(#LIBPATH)%eq" EXISTS
               SET file-name TO "%g(#LIBPATH)%eq"
               EXIT
            DONE
         AGAIN
      DONE
      DO WHEN file-name IS ATTACHED
         OUTPUT FILE file-name
      ELSE
         PUT #ERROR
             "No file found for entity %"%q%", system id = %"%eq%"%n"
         HALT
      DONE

In this example, provision is made for there being no -libpath command-line argument or for no -libpath prefix producing the name of an existing file. If the HALT action were missing from the EXTERNAL-TEXT-ENTITY rule, then, apart from the message being written to #ERROR, the OmniMark program would just continue.

The default "current item" of the #LIBPATH shelf is the lastmost item, as is the case with programmer-declared shelves.

16.3.3 Manipulating Built-In Entity Replacement Text Values

The built-in stream #LIBVALUE starts out with one item for each public identifier that is "built-in" to OmniMark. These values are used by OmniMark's "built-in" entity manager if an entity is not resolved using the #LIBRARY shelf, which means the OmniMark programmer can often avoid having to write an EXTERNAL-TEXT-ENTITY rule. The OmniMark program can add to, delete or modify these values to suit its needs.

For example, many applications have "parameters" that neither the OmniMark programmer nor the document users want to have hard coded in the programs being used, in the DTDs or in the documents. An example of such a "parameter" is a case in which a company's name is represented by an external general text entity, &company;, that is defined using a public identifier:

   <!ENTITY company PUBLIC "-//miscellany//TEXT company//EN">

The use of the public identifier assures that the entity can be interpreted independently of the document containing the reference. A convenient way to make "&company;" a parameter is to simply specify the entity's replacement text on the command line that runs the OmniMark program that processes the document:

   omnimark -s ... -define ent "company:OmniMark Technologies" ...

The programmer-defined stream ent is used as the command-line parameter. The following OmniMark program fragment illustrates how the parameter is taken from the ent stream and added to the #LIBVALUE shelf. More than one parameter is defined by separating their definitions by semicolons, and a name is separated from its replacement text by a colon.

   GLOBAL STREAM ent

   DOCUMENT-START WHEN ent IS ATTACHED
      REPEAT SCAN ent
      MATCH [ANY EXCEPT ":"]+ => public-id ":"
            [ANY EXCEPT ";"]* => value ";"?
         SET NEW #LIBVALUE ^ "-//miscellany//TEXT %x(public-id)//EN"
            TO value
      AGAIN

An important feature of this example is that the OmniMark program knows nothing about what entities are supported or even how many of them there are: it just knows how to support entities.

The "names" in the ent stream are used as a "public text description" of the public identifiers assigned to the entity, and an "unregistered owner identifier" of "miscellany" is used.

The example assumes that OmniMark's built-in entity manager is going to be used, because it uses the #LIBVALUE stream as a source of entity text. However, a programmer-supplied entity manager, written using EXTERNAL-TEXT-ENTITY rules, can be used instead.
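The way the command-line parameter is split into public identifiers and replacement text can be shown with a small Python model. It is a sketch of the parsing convention only (definitions separated by semicolons, name separated from text by the first colon), not a replacement for the OmniMark fragment above. The function name parse_ent_parameter and the second sample definition are hypothetical, and the sketch assumes well-formed input.

```python
def parse_ent_parameter(ent):
    """Map each name:text definition to a public identifier and its text."""
    values = {}
    for definition in ent.split(";"):
        # The first colon separates the name from its replacement text.
        name, colon, text = definition.partition(":")
        if not colon or not name:
            continue  # Not a definition; well-formed input has none of these.
        # The name becomes the public text description, with the
        # unregistered owner identifier "miscellany", as in the example.
        values["-//miscellany//TEXT " + name + "//EN"] = text
    return values
```

Given the command line shown earlier, parse_ent_parameter("company:OmniMark Technologies") yields a single entry whose key is the public identifier used in the declaration of the entity "company".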

The default "current item" of the #LIBVALUE shelf is the lastmost item, as is the case with programmer-declared shelves.

16.3.4 Restrictions on the #LIBRARY, #LIBPATH and #LIBVALUE Shelf Values

As stated above, the #LIBRARY stream, the #LIBPATH stream and the #LIBVALUE stream start out with the contents of the LIBRARY rules, -libpath command-line arguments and OmniMark's "built-in" public identifiers respectively.

They can be changed by the OmniMark program at any time, in any way, but there are some restrictions that apply:

The restrictions are imposed by OmniMark's built-in entity manager, which takes over in these cases. Entity manager designers will usually impose similar restrictions. The restrictions are:


16.4 Processing Public Identifiers

It is possible for some external entities to have public identifiers with no system identifier. If there is no LIBRARY rule to map the public identifier onto a system identifier, then the OmniMark program may have to process the public identifier itself.

Sometimes this is done because instead of the replacement text of the external entity being contained in a file, the public identifier contains all of the information necessary to fetch or construct the replacement text.

This section describes techniques for parsing public identifiers.

16.4.1 The Public Identifier at the Start of the DTD

An external identifier (a public identifier, a system identifier, or both) is allowed in the DOCTYPE declaration, immediately following the keyword DOCTYPE and the document element name. This external identifier, when present, identifies an entity containing declarations to be included following those in the DOCTYPE declaration.

The OmniMark programmer's entity manager can provide the text (i.e. the declarations) of this entity using an EXTERNAL-TEXT-ENTITY rule. The keyword #DTD is used to identify this entity. For example, the following rule uses a file called "default.dtd" when there is an external identifier at the start of the DTD and it has no public identifier or system identifier (e.g. <!DOCTYPE doc SYSTEM [):

   EXTERNAL-TEXT-ENTITY #DTD WHEN ENTITY ISNT (SYSTEM | PUBLIC)
      OUTPUT FILE "default.dtd"

If the "%q" format is used for the #DTD entity, it will produce the string "#DTD". Note, however, that this entity really doesn't have a name, and that, using a variant SGML syntax, an SGML document can define an entity with the name "#DTD", which produces the same result with "%q". The "ENTITY IS #DTD" test can be used to distinguish the "real" #DTD entity from the user's entity with the same name.

It sometimes happens that the external identifier at the head of a DTD has neither a system identifier nor a public identifier, as in:

   <!DOCTYPE report SYSTEM>

In this case, it may be appropriate for an OmniMark program to use the name of the document element to find the implicitly referred to entity. For example, in this case, it may be that the file "report.dtd" is intended to be used. The name of the document element is available to the OmniMark program in the #DOCTYPE stream (see Section 14.1.3.3, "The Document Element Name"). For example:

   EXTERNAL-TEXT-ENTITY #DTD WHEN ENTITY ISNT (SYSTEM | PUBLIC)
      OUTPUT FILE "%g(#DOCTYPE).dtd"

The "ENTITY IS #DTD" test is used to determine whether the entity is the #DTD one or not. #DTD is used like EXTERNAL, PUBLIC or PARAMETER in an ENTITY test, and can be combined with these other keywords. It is useful when an EXTERNAL-TEXT-ENTITY rule can process either the #DTD entity or another entity, and needs to determine which one it has. In practice, rules of this kind appear mainly in rather complex entity managers, so to illustrate the point, the following somewhat contrived example processes either the #DTD entity or the entities named "my-dtd" or "the-dtd":

   EXTERNAL-TEXT-ENTITY (#DTD | my-dtd | the-dtd)
      DO WHEN ENTITY IS SYSTEM
         OUTPUT FILE "%eq"
      ELSE WHEN ENTITY IS #DTD
         OUTPUT FILE "my.dtd"
      ELSE
         OUTPUT FILE "%q.ent"
      DONE

An EXTERNAL-TEXT-ENTITY rule of the form

   EXTERNAL-TEXT-ENTITY #IMPLIED
      ...

matches all named entities, not including the #DTD one (because it doesn't have a name). This allows the #DTD entity to be processed in a different manner than those defined by entity declarations. To match all named entities and the #DTD one, both #IMPLIED and #DTD have to be used, as in:

   EXTERNAL-TEXT-ENTITY (#IMPLIED | #DTD)
      ...

If there are any EXTERNAL-TEXT-ENTITY rules in an OmniMark program that use the keyword #DTD in their heading, then all #DTD entities must be handled by the OmniMark program. If no #DTD entity is handled by an OmniMark program then all such entities are subject to OmniMark's default processing. See Section 16.5, "A Default External Text Entity Rule" for more information on this default processing.

In the head of the EXTERNAL-TEXT-ENTITY rule, #DTD can be combined with #IMPLIED or with the names of named entities, but not both, because #IMPLIED cannot be combined with the names of entities.

The #DTD entity is considered a parameter entity (not a general entity) for the purpose of the "ENTITY IS GENERAL" and "ENTITY IS PARAMETER" tests.

16.4.2 Public Identifiers in the SGML Declaration

The public identifiers that can appear in the SGML Declaration, for the base character sets, for the capacity set and for the concrete syntax, are processed in much the same way as the #DTD entity. They are identified by the keywords #CHARSET, #CAPACITY and #SYNTAX, respectively. They are like the #DTD entity in most respects:

Entities referenced by the public identifiers in the SGML Declaration have the additional following properties:

The ISO character entities (e.g. the entity referenced by "&Eacute;") are defined in external files rather than being "hard coded" inside OmniMark's built-in entity manager, with the files divided as described in Appendix D.4 of ISO 8879, the SGML standard. These files are shipped with OmniMark, together with a file containing a LIBRARY rule that maps both ISO 8879-1986 and ISO 8879:1986 versions of the public identifiers to the appropriate files.

16.4.2.1 Default Processing

If there is no EXTERNAL-TEXT-ENTITY rule to process an entity associated with a public identifier in the SGML Declaration, and the public identifier is one of those in the following list, then OmniMark provides the entity text corresponding to the meaning of the public identifiers, as defined in the SGML standard:

   ISO 646-1983//CHARSET International Reference Version
                         (IRV)//ESC 2/5 4/0
   ISO 646:1983//CHARSET International Reference Version
                         (IRV)//ESC 2/5 4/0
   ANSI X3.4-1986//CHARSET American Standard Code for
                           Information Interchange (ASCII)//ESC 2/8 4/2
   ISO 8879-1986//SYNTAX Reference//EN
   ISO 8879-1986//SYNTAX Core//EN
   ISO 8879-1986//SYNTAX Multicode Basic//EN
   ISO 8879-1986//SYNTAX Multicode Core//EN
   ISO 8879:1986//SYNTAX Reference//EN
   ISO 8879:1986//SYNTAX Core//EN
   ISO 8879:1986//SYNTAX Multicode Basic//EN
   ISO 8879:1986//SYNTAX Multicode Core//EN

In addition, any capacity set public identifier is accepted and matched with the reference capacity set values (all 35000).

All three character sets are given the same definition: that of the IRV of ISO 646. The concrete syntaxes with colons in their names are given the same definitions as those in the SGML standard with dashes instead.

The #LIBVALUE stream starts out with one item for each of the public identifiers listed above. The key of each item is the public identifier, and the value of each item is the corresponding replacement text: a character set definition for each of the CHARSET public identifiers, and a concrete syntax definition for each of the SYNTAX public identifiers. The #LIBVALUE stream is used by OmniMark to get these text values, so if the OmniMark program changes the #LIBVALUE stream, those changes are reflected in how the SGML Declaration, in particular, is processed.

The values of the #LIBVALUE stream items must conform to the use that is made of the corresponding identifier. In particular, a public identifier for a capacity set or concrete syntax used in the SGML Declaration must have a value that is in the same format as the explicitly described capacity set or concrete syntax that could have been coded in its place in the SGML Declaration. A public identifier for a character set must have a corresponding value in the format described in Section 16.4.2.2, "Base Character Sets". More information about modifying the #LIBVALUE stream is contained in Section 16.3.4, "Restrictions on the #LIBRARY, #LIBPATH and #LIBVALUE Shelf Values".

16.4.2.2 Base Character Sets

Base character sets are identified by (usually formal) public identifiers. These public identifiers are interpreted by the implementation to produce a character set and identify:

The text associated with a character set's public identifier must conform to the description of an "external character set description", which is defined below. An external character set description describes a character set in a manner similar to a described character set portion in an SGML Declaration (ISO 8879 production [175]).

16.4.2.2.1 Defining a Base Character Set

The syntax of an external character set description is, in the notation used in ISO 8879:

   external character set description =
       ps+, (external character description, ps+)*
   external character description =
       external character number, ps+, (number of characters, ps+)?,
       (graphic character assignment | external character assignment)
   graphic character assignment =
       (lit, graphic character*, lit) |
       (lita, graphic character*, lita) |
       "TAB" |
       "B"
   external character assignment =
       "UCLETTER" | ("LCLETTER", ps+, external character number) |
       ("DIGIT", ps+, digit value) | "SPECIAL" | "DATA" | "CONTROL"
   external character number = number
   digit value = number

Where:

A graphic character assignment indicates how characters in parameter literals in the concrete syntax (delimiter strings and the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR strings) are to be interpreted. For example,

   65   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

indicates that the characters in the literal, when encountered in a parameter literal in the concrete syntax, are to be interpreted as characters with numbers 65 through 90 inclusive, in the character set identified by the public identifier of the entity containing the external character set description.

The character with a numeric value of zero ('\0', &#0; or CTRL-@) should not be used in a "graphic character assignment". If it is, it is ignored, as if it did not appear in the string. The "zero digit" character, '0', is not the same thing as the zero value character, and can be used.
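
The numbering implied by a graphic character assignment can be worked through in a few lines. The following is a minimal sketch in Python (not OmniMark), given for illustration only; the function name graphic_assignment is hypothetical. It numbers the characters of a literal consecutively, starting from the external character number, and drops zero-valued characters before numbering, on the assumption that "ignored" means they do not consume a number.

```python
def graphic_assignment(first_number, literal):
    """Map each character of a literal to its external character number."""
    # Zero-valued characters are dropped before numbering (an assumption
    # about what "ignored" means for the characters that follow).
    chars = [char for char in literal if char != "\0"]
    return {char: first_number + offset for offset, char in enumerate(chars)}


table = graphic_assignment(65, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(table["A"], table["Z"])
```

A literal of 26 characters starting at 65 therefore ends at 65 + 26 - 1.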

An external character assignment assigns characters in the base character set, starting with the "external character number", and continuing for "number of characters" to one of the following categories:

Any character (in the range of allowed character values: 0 to 255 in current versions of OmniMark) not placed in one of these categories is classified as a non-significant, non-control data character. (Note that this method of defining base character sets ensures that no character will ever be two or more of LC Letter, UC Letter, Digit, Special or Control.)

Examples of using external character assignments are:

   48  Digit 0
   63  Special
   97  LCLetter 65

These lines mean:

16.4.2.2.2 Sources of Base Character Set Information

If the OmniMark program provides zero characters of text for a #CHARSET public identifier, or only white space and SGML comments, then all characters in the base character set are made to be non-significant data characters. This is very often appropriate for character sets other than those that define the letters, digits and special characters. A reasonable thing for many applications to do is to provide the definition for the "ISO 646 (IRV)" character set when requested to do so and to provide a zero-length definition for all other character set requests.

The following text defines the "ISO 646 (IRV)" character set, which can either be kept in a file or hard coded in an OmniMark program:

     9     tab
    32     ' !"#$%&'
    39     "'()*+,-./0123456789:;<=>?"
    64     "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
    96     "`abcdefghijklmnopqrstuvwxyz{|}~"
     0  32 control
    39   3 special
    43   5 special
    48  10 digit     0
    58     special
    61     special
    63     special
    65  26 ucletter
    66     b
    97  26 lcletter 65
   127     control
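
To make the shape of an external character set description concrete, the following is a rough sketch, in Python rather than OmniMark, of a reader for the syntax given in Section 16.4.2.2.1. It is not OmniMark's implementation. The names tokenize and parse_charset are hypothetical, and two things are assumptions: that an omitted "number of characters" means one character (as the single-character lines above suggest), and that keywords are case-insensitive.

```python
import re

# White space and SGML comments (-- ... --) play the role of "ps".
TOKEN = re.compile(r"""
    \s+ | --.*?--                    # ps: skipped
  | (?P<lit>"[^"]*"|'[^']*')         # literal in double or single quotes
  | (?P<num>[0-9]+)                  # number
  | (?P<name>[A-Za-z]+)              # keyword
""", re.VERBOSE | re.DOTALL)

CATEGORIES = {"UCLETTER", "LCLETTER", "DIGIT", "SPECIAL", "DATA", "CONTROL"}


def tokenize(text):
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}")
        pos = match.end()
        if match.lastgroup:          # None for white space and comments
            yield match.lastgroup, match.group()


def parse_charset(text):
    """Return (graphic, functions, category) tables for a description."""
    tokens = list(tokenize(text))
    graphic, functions, category = {}, {}, {}
    i = 0

    def take(kind):
        nonlocal i
        if i >= len(tokens) or tokens[i][0] != kind:
            raise ValueError(f"expected {kind} at token {i}")
        i += 1
        return tokens[i - 1][1]

    while i < len(tokens):
        start = int(take("num"))
        count = 1                    # assumed default when omitted
        if i < len(tokens) and tokens[i][0] == "num":
            count = int(take("num"))
        if i < len(tokens) and tokens[i][0] == "lit":
            chars = take("lit")[1:-1].replace("\0", "")
            for offset, char in enumerate(chars):
                graphic.setdefault(char, start + offset)  # first one wins
            continue
        word = take("name").upper()
        if word in ("TAB", "B"):
            functions[word] = start
        elif word in CATEGORIES:
            if word in ("LCLETTER", "DIGIT"):
                take("num")          # trailing number, not used in this sketch
            for number in range(start, start + count):
                category[number] = word
        else:
            raise ValueError(f"unknown keyword {word!r}")
    return graphic, functions, category


ISO_646_IRV = r'''
     9     tab
    32     ' !"#$%&'
    39     "'()*+,-./0123456789:;<=>?"
    64     "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
    96     "`abcdefghijklmnopqrstuvwxyz{|}~"
     0  32 control
    39   3 special
    43   5 special
    48  10 digit     0
    58     special
    61     special
    63     special
    65  26 ucletter
    66     b
    97  26 lcletter 65
   127     control
'''

graphic, functions, category = parse_charset(ISO_646_IRV)
print(graphic["A"], category[48], category.get(32))
```

Characters that end up in no category (such as 32 here) correspond to the non-significant data characters described earlier.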

Instead of providing base character set information itself, an OmniMark program can allow the user to provide files or other text containing the definitions of base character sets. This is especially useful where there is a need to define additional letters, as may be appropriate when using the ISO "Added Latin" character sets.

16.4.2.2.3 Using Base Character Sets

More than one base character set can be used in a document. The document character set must be assigned the meanings of all characters in all of the base character sets used. All of the base character sets used in the document character set that contain significant (LC Letter, UC Letter, Digit or Special) characters must be used in defining the syntax-reference character set, and all significant characters in those base character sets must have their meanings assigned to syntax-reference characters. Any other base character set used to assign meanings in the document character set may be used in the syntax-reference character set, as may any base character set not used in the document character set. However, in the latter case, no "meaningful" assignments can be made to the syntax-reference characters, because there are no document characters that take on those syntax-reference character meanings.

The repetition of a public identifier in the document character set is recognized, and the previous definition of the base character set is used. Where (the same or different) base character sets assign an external character number to the same graphic character, the first assignment of the first base character set is used.

16.4.2.3 Capacity Sets

A capacity set can either be specified in the SGML Declaration or it can be described by a public identifier. If a public identifier is provided then its text must be a sequence of zero or more capacity names, each followed by a capacity points number. More precisely:

   external capacity set description =
       ps+, (name, ps+, number, ps+)*

ps+ can be any combination of white space characters and SGML comments.

Each name in an external capacity specification must be that of a capacity. The associated number becomes the limit value of that capacity. Any capacity not mentioned is set to the reference value (35000). An external capacity set description can contain no text, or only comments and white space, in which case all the capacities are set to the reference value.
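
As an illustration of these rules, the following is a small Python sketch (not OmniMark, and not OmniMark's own code) that reads an external capacity set description. The function name parse_capacity_set is hypothetical. The list of capacity names is taken from ISO 8879, and treating names as case-insensitive is an assumption.

```python
import re

REFERENCE_CAPACITY = 35000

# Capacity names as listed in ISO 8879.
CAPACITY_NAMES = (
    "TOTALCAP", "ENTCAP", "ENTCHCAP", "ELEMCAP", "GRPCAP", "EXGRPCAP",
    "EXNMCAP", "ATTCAP", "ATTCHCAP", "AVGRPCAP", "NOTCAP", "NOTCHCAP",
    "IDCAP", "IDREFCAP", "MAPCAP", "LKSETCAP", "LKNMCAP",
)


def parse_capacity_set(text):
    """Return a dict of capacity name -> limit for a description."""
    # Comments count as separators, just like white space.
    words = re.sub(r"--.*?--", " ", text, flags=re.DOTALL).split()
    if len(words) % 2:
        raise ValueError("each capacity name must be followed by a number")
    # Any capacity not mentioned keeps the reference value.
    capacities = dict.fromkeys(CAPACITY_NAMES, REFERENCE_CAPACITY)
    for name, number in zip(words[::2], words[1::2]):
        name = name.upper()          # case folding is an assumption
        if name not in capacities:
            raise ValueError(f"{name!r} is not a capacity name")
        if not re.fullmatch(r"[0-9]+", number):
            raise ValueError(f"{number!r} is not a number")
        capacities[name] = int(number)
    return capacities


caps = parse_capacity_set("-- larger document -- TOTALCAP 200000 ENTCAP 70000")
print(caps["TOTALCAP"], caps["ELEMCAP"])
```

An empty description, or one holding only comments and white space, leaves every capacity at the reference value.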

16.4.2.4 Concrete Syntaxes

A concrete syntax can either be specified in the SGML Declaration or a public identifier can be provided describing a public concrete syntax. If a public identifier is provided, the entity text associated with the public identifier must conform to the part of a concrete syntax defined by production [182] in ISO 8879, the SGML standard, starting with "shunned character number identification". In other words, the entity text must be what would be put in the SGML Declaration following the keyword SYNTAX (but it cannot be another public identifier). More precisely:

   external public concrete syntax description =
       shunned character number identification, ps+,
       syntax-reference character set, ps+,
       function character identification, ps+,
       naming rules, ps+, delimiter set, ps+,
       reserved name use, ps+, quantity set

ps+ can be any combination of white space characters and SGML comments. Unlike a #CHARSET or #CAPACITY entity, a #SYNTAX entity must contain something other than white space and comments: all parts of an "external public concrete syntax description" must be present.

The following is an example of the contents of a file for a concrete syntax. It corresponds to the Reference Concrete Syntax ("ISO 8879-1986//SYNTAX Reference//EN"):

   SHUNCHAR  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
            16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
   BASESET
      "ISO 646-1983//CHARSET International
                         Reference Version (IRV)//ESC 2/5 4/0"
      DESCSET  0 256 0
   FUNCTION
      RE           13
      RS           10
      SPACE        32
      TAB SEPCHAR   9
   NAMING
      LCNMSTRT ""
      UCNMSTRT ""
      LCNMCHAR "-."
      UCNMCHAR "-."
      NAMECASE
         GENERAL YES
         ENTITY NO
   DELIM
      GENERAL SGMLREF
      SHORTREF SGMLREF
   NAMES SGMLREF
   QUANTITY SGMLREF

16.4.3 Parsing Public Identifiers

Some applications use the public text description and/or other parts of a formal public identifier to help construct the file name used to access the associated entity's text. (The "ptd" in "-//owner//TEXT ptd//EN" is the public identifier's public text description.) Formal public identifiers have a strict syntax that can easily be parsed using OmniMark patterns to extract the parts of interest to a particular application. The following EXTERNAL-TEXT-ENTITY rule assumes that all external text entities have a formal public identifier, and that the name of the file containing the entity's text is formed from the public text description, a dot, and the lower-cased first three letters of the public identifier's public text class (CHARSET etc.):

   EXTERNAL-TEXT-ENTITY #IMPLIED
      ; Build the file name from the public text description and class.
      DO SCAN "%pq"
      MATCH (["-+"] "//")?
            ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")* "//"
            [ANY EXCEPT "%_"] {3} => class3 [ANY EXCEPT " "]* " " "-//"?
            ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")* => description
         OUTPUT FILE "%x(description).%lx(class3)"
      DONE

This EXTERNAL-TEXT-ENTITY rule would output the file named "Chapter3.tex" when given a reference to an entity with the following declaration:

   <!ENTITY ch3 PUBLIC "-//All Mine//TEXT Chapter3//EN">
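
For readers more familiar with regular expressions, the same file-naming scheme can be sketched in Python (not OmniMark). This is only an illustration of the idea; the name entity_file_name is hypothetical, and the sketch assumes that a public text class may have any number of letters beyond the first three.

```python
import re

PUBLIC_ID = re.compile(r"""
    (?:[-+]//)?                          # optional "+//" or "-//" prefix
    (?:[^/]|/(?!/))*//                   # owner identifier, up to "//"
    (?P<class3>[^ ]{3})[^ ]*[ ]          # first three letters of the class
    (?:-//)?                             # unavailable text indicator
    (?P<description>(?:[^/]|/(?!/))*)    # public text description
""", re.VERBOSE)


def entity_file_name(public_id):
    """Return "<description>.<first three class letters, lower-cased>"."""
    match = PUBLIC_ID.match(public_id)
    if match is None:
        raise ValueError(f"not a formal public identifier: {public_id!r}")
    return f"{match['description']}.{match['class3'].lower()}"


print(entity_file_name("-//All Mine//TEXT Chapter3//EN"))
```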

A more general pattern, that will parse any formal public identifier, is the following:

   MATCH ("+//" ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
                => registered-owner-identifier |
          "-//" ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
                => unregistered-owner-identifier |
          ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
                => iso-owner-identifier)
         "//"
         [ANY EXCEPT "%_"]+ => public-text-class
         " "
         ("-//" => unavailable-text-indicator)?
         ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
               => public-text-description
         "//"
         (LETTER {2} => public-text-language VALUE-END |
          ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
               => public-text-designating-sequence
             ("//" ANY* => public-text-display-version)?)

This pattern can be used in an EXTERNAL-TEXT-ENTITY rule, in an EXTERNAL-DATA-ENTITY rule, or when processing an entity name valued element attribute. If this pattern were used to parse the public identifier of the previous example ("-//All Mine//TEXT Chapter3//EN"), it would result in the following pattern variable assignments:

This pattern is rather lengthy because of its generality, not to mention the long pattern variable names used. Most applications will not need all parts of the public identifier. Shorter pattern variable names can be used -- the terms in the pattern are those used in the SGML standard to describe the parts of a formal public identifier. On the other hand, some OmniMark programmers will want to extend the pattern to extract details of an ISO owner identifier, public text description or designating sequence.
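
The structure of the general pattern can also be tried out outside OmniMark. The following Python sketch mirrors it alternative by alternative as a regular expression; it is an approximation for experimentation, not a statement of how OmniMark matches patterns. The name parse_fpi and the shortened group names are hypothetical.

```python
import re

PART = r"(?:[^/]|/(?!/))*"   # any text that does not contain "//"

FPI = re.compile(rf"""
    (?: \+// (?P<registered_owner>{PART})
      | -//  (?P<unregistered_owner>{PART})
      |      (?P<iso_owner>{PART}) )
    //
    (?P<text_class>[^ ]+) [ ]
    (?P<unavailable>-//)?
    (?P<description>{PART})
    //
    (?: (?P<language>[A-Za-z][A-Za-z]) \Z
      | (?P<designating_sequence>{PART})
        (?: // (?P<display_version>.*) )? )
""", re.VERBOSE | re.DOTALL)


def parse_fpi(public_id):
    """Return the parts of a formal public identifier that are present."""
    match = FPI.match(public_id)
    if match is None:
        raise ValueError(f"not a formal public identifier: {public_id!r}")
    return {k: v for k, v in match.groupdict().items() if v is not None}


print(parse_fpi("-//All Mine//TEXT Chapter3//EN"))
```

Parts that do not occur in a given identifier are simply absent from the returned dictionary, which keeps application code that needs only one or two parts short.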


16.5 A Default External Text Entity Rule

The behaviour of OmniMark's built-in entity manager is equivalent to the following EXTERNAL-TEXT-ENTITY rule. This rule is provided to help OmniMark programmers mimic OmniMark's default behaviour as a fall-back position to their own entity management strategies.

There are six categories of external text entities: those defined by entity declarations and referenced by explicit entity references, and those represented by each of the keywords #DOCUMENT, #DTD, #CHARSET, #CAPACITY and #SYNTAX, respectively. If an OmniMark program contains EXTERNAL-TEXT-ENTITY rules for any category, then no default rule is provided: the OmniMark program must deal with all entities of that type.

The #DOCUMENT external text entity has very different properties than the others mentioned above, and is described in Section 2.5.3, "Controlling Input to the SGML Parser".

The following EXTERNAL-TEXT-ENTITY rule (actually, an equivalent one with different error messages) is only used for entities in categories not dealt with by the OmniMark program.

   EXTERNAL-TEXT-ENTITY (#IMPLIED | #DTD | #CHARSET | #CAPACITY | #SYNTAX)
      LOCAL STREAM file-name
      DO WHEN ENTITY IS (SYSTEM | IN-LIBRARY)
         DO WHEN FILE "%eq" EXISTS
            SET file-name TO "%eq"
         ELSE
            REPEAT OVER #LIBPATH
               DO WHEN FILE "%g(#LIBPATH)%eq" EXISTS
                  SET file-name TO "%g(#LIBPATH)%eq"
                  EXIT
               DONE
            AGAIN
         DONE
         DO WHEN file-name IS ATTACHED
            OUTPUT FILE file-name
         ELSE
            PUT #ERROR "File '%eq' for "
            DO WHEN ENTITY IS (#DTD | #CHARSET | #CAPACITY | #SYNTAX)
               PUT #ERROR "%q"
               PUT #ERROR " (%g(#DOCTYPE))" WHEN ENTITY IS #DTD
            ELSE WHEN ENTITY IS GENERAL
               PUT #ERROR "entity &%q;"
            ELSE
               PUT #ERROR "entity %%%q;"
            DONE
            PUT #ERROR " with public id%n" _
                       "   PUBLIC %"%pq%"%n" _
                       "  "
                WHEN ENTITY IS PUBLIC
            PUT #ERROR " does not exist!%n"
            HALT
         DONE
      ELSE WHEN ENTITY IS PUBLIC & #LIBVALUE HAS KEY "%pq"
         OUTPUT #LIBVALUE ^ "%pq"
      ELSE WHEN ENTITY IS (#CHARSET | #CAPACITY)
         ; Zero-length entity replacement text.
      ELSE WHEN ENTITY IS PUBLIC
         PUT #ERROR "Public identifier for "
         DO WHEN ENTITY IS (#DTD | #CHARSET | #CAPACITY | #SYNTAX)
            PUT #ERROR "%q"
            PUT #ERROR " (%g(#DOCTYPE))" WHEN ENTITY IS #DTD
         ELSE WHEN ENTITY IS GENERAL
            PUT #ERROR "entity &%q;"
         ELSE
            PUT #ERROR "entity %%%q;"
         DONE
         PUT #ERROR "%n" _
                    "   PUBLIC %"%pq%"%n" _
                    "   is not in the LIBRARY rules!%n"
         HALT
      ELSE
         DO WHEN ENTITY IS #DTD
            PUT #ERROR "#DTD (%g(#DOCTYPE)) "
         ELSE WHEN ENTITY IS GENERAL
            PUT #ERROR "Entity &%q;"
         ELSE
            PUT #ERROR "Entity %%%q;"
         DONE
         PUT #ERROR " has neither a SYSTEM nor a PUBLIC identifier!%n"
         HALT
      DONE

Note that if no file is found for a #CHARSET or #CAPACITY entity, then the zero-length string is used as its replacement text. This has the effect of providing a default of "all data characters" or "all reference values", respectively. No such default is provided in the case of any other entity, including the #SYNTAX entity.
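
The order of the decisions made by the rule above can be summarized in a short Python sketch (not OmniMark). It is a simplified model under stated assumptions, not OmniMark's implementation: the function name resolve_entity and its parameters are hypothetical, the library and libvalue arguments are plain dictionaries standing in for the #LIBRARY and #LIBVALUE shelves, and the error messages are abbreviated.

```python
import os


def resolve_entity(kind, system_id=None, public_id=None, *,
                   library=None, libpath=(), libvalue=None,
                   exists=os.path.exists):
    """Return ("file", name) or ("text", text), or raise LookupError."""
    library = library or {}
    libvalue = libvalue or {}
    # A system identifier comes from the declaration or, failing that,
    # from a library entry for the public identifier.
    name = system_id if system_id is not None else library.get(public_id)
    if name is not None:
        # Try the name as given, then each library path as a prefix.
        for prefix in ("", *libpath):
            if exists(prefix + name):
                return ("file", prefix + name)
        raise LookupError(f"file {name!r} does not exist")
    if public_id is not None and public_id in libvalue:
        return ("text", libvalue[public_id])
    if kind in ("#CHARSET", "#CAPACITY"):
        return ("text", "")          # zero-length replacement text
    if public_id is not None:
        raise LookupError(f"public identifier {public_id!r} is not in the library")
    raise LookupError("entity has neither a system nor a public identifier")


print(resolve_entity("#CAPACITY", public_id="-//Example//CAPACITY Large//EN"))
```

The exists parameter is there only so that the lookup order can be exercised without touching the file system.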


Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.
