"The Official Guide to Programming with OmniMark"
International Edition
Previous chapter is Chapter 15, "Processing SGML Errors".
Next chapter is Chapter 17, "SGML Document and Subdocument Parsing".
SGML supports several different kinds of external entities. They are divided into two classifications:
These are entities that are always properly nested with respect to the element structure. They comprise:
External data entities can be processed by the EXTERNAL-DATA-ENTITY rule.
External text entities may contain markup that begins new elements or ends existing elements. Thus, they may not be properly nested within the element structure.
External text entities can be processed by the EXTERNAL-TEXT-ENTITY rule.
An SGML entity manager is a program, or part of a program, that takes an entity reference and returns the text to an SGML parser. The SGML parser manages internal entities, but it requires help from the applications using it to find the text of external entities. Every SGML system contains an entity manager for external entities. The entity manager can find external entity text in a file system or database, or the text can be "hard coded" into the entity manager itself.
The external entity manager is used primarily when a document contains a general or parameter entity reference. In that case the SGML system must read in the text of the entity, and the SGML parser must interpret that text in the context in which it finds the reference.
OmniMark incorporates a "built-in" entity manager that supports the requirements of most users for locating external entities. As an alternative, when OmniMark's built-in entity manager does not do what an application requires, an OmniMark program can implement its own entity manager, using the EXTERNAL-TEXT-ENTITY rule and other facilities described in this chapter.
When OmniMark encounters a reference to an external text entity (i.e. one that can contain SGML markup), its built-in entity manager tries to find a system identifier for the entity. It first looks in the entity's declaration. If there is not a system identifier in the declaration but there is a public identifier, it looks the public identifier up in the OmniMark program's LIBRARY rules, and uses the system identifier found there, if any. Once it has a system identifier, the entity manager tries to use it as the name of a file, using each of the -libpath strings from the command line as prefixes (in the order given) in the process. Once it finds a file on the system on which OmniMark is running, it uses the text of that file as the entity. If the entity manager cannot find a system identifier, cannot find a file with the system identifier as its name, or finds that the file is unreadable, it stops the OmniMark program with an appropriate message.
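The lookup order described above can be modeled in a few lines. The following is a Python sketch, not OmniMark code. The function name `resolve_entity_file` and its parameters are hypothetical, the LIBRARY rules are represented as a plain mapping, and trying the bare system identifier before any -libpath prefix is an assumption taken from the #LIBPATH example later in this chapter. Readability checks are left out.

```python
from typing import Callable, Mapping, Optional, Sequence


def resolve_entity_file(
    system_id: Optional[str],
    public_id: Optional[str],
    library: Mapping[str, str],
    libpath: Sequence[str],
    file_exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the file name to use, or None where the program would stop."""
    # 1. Prefer the system identifier given in the entity declaration.
    effective = system_id
    # 2. Otherwise map the public identifier through the LIBRARY entries.
    if effective is None and public_id is not None:
        effective = library.get(public_id)
    if effective is None:
        return None  # no system identifier could be found
    # 3. Try the bare name (assumption), then each -libpath prefix in order.
    for prefix in ["", *libpath]:
        if file_exists(prefix + effective):
            return prefix + effective
    return None  # no file matched
```

A `None` result stands for the cases in which the built-in entity manager stops the program with a message.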
An additional provision is made by OmniMark's built-in entity manager for "anonymous" entities referred to by public identifiers in the SGML Declaration, to avoid requiring all users to provide definitions for these entities. Some character sets and concrete syntaxes are "built-in" to OmniMark (in particular, those described in the SGML standard, ISO 8879). Unrecognized character sets and capacity sets are given default values. In particular, if one of these public identifiers is not mapped to a system identifier by a LIBRARY rule, the built-in entity manager will do the following:
When a file is found, the built-in entity manager passes the text of that file to the SGML parser unmodified (except possibly for changing the newline sequences in the text file to the RS/RE sequences expected by the SGML parser).
In other words, OmniMark's built-in entity manager assumes that a file identified by an external text entity contains the text of that entity. In particular, it is assumed that no conversion or other processing of the text is required. This is an appropriate assumption -- most text files that are subjects of an entity reference from within an SGML document will themselves be coded in SGML.
The public identifiers recognized by OmniMark's built-in entity manager are described in more detail in Section 16.4.2, "Public Identifiers in the SGML Declaration".
The relatively simple model supported by OmniMark's built-in entity manager -- an external entity is a file -- works for most OmniMark applications, but not for all. For those other applications, OmniMark programmers can, in a simple, straightforward way, write their own entity managers, using OmniMark's EXTERNAL-TEXT-ENTITY rule.
The EXTERNAL-TEXT-ENTITY rule allows the OmniMark programmer to do things other than just providing an alternative way of finding a file containing an entity's text. An entity can be any sequence of characters. The entity manager provides a sequence of characters to the SGML parser when the SGML parser passes an external entity reference to the entity manager. The sequence of characters doesn't have to be a "verbatim" copy of the text in a file. Other possibilities include:
Direct access to the entries in the LIBRARY rules and the -libpath command-line arguments is provided by OmniMark so that the behavior of OmniMark's entity manager can be duplicated by a program written in OmniMark. This can be useful, for example, when what OmniMark's entity manager does is a good "fall-back" position when an application-specific scheme does not find an entity's text.
When there are no EXTERNAL-TEXT-ENTITY rules in an OmniMark program, then OmniMark uses its built-in entity manager to find the text of an SGML text entity, as described in Section 16.1.1, "OmniMark's Built-in Entity Manager".
When there are any EXTERNAL-TEXT-ENTITY rules in an OmniMark program, the built-in entity manager is not used -- the entity manager defined by the OmniMark program (by the EXTERNAL-TEXT-ENTITY rules) is used instead. In this case, one (and only one) of these rules must provide the replacement text for each external text entity, general or parameter, referenced in an SGML document, as follows:
(The above statements apply to both general entities and parameter entities. Similar statements apply independently to the "entities" referenced by the external identifier at the start of the DOCTYPE declaration, and by the public identifiers in the SGML declaration. For more about these entities, see Section 16.4.1, "The Public Identifier at the Start of the DTD", Section 16.4.2, "Public Identifiers in the SGML Declaration" and Section 16.5, "A Default External Text Entity Rule".)
OmniMark provides facilities that allow an OmniMark program to duplicate the behavior of OmniMark's built-in entity manager. An OmniMark program can, for example, provide or find the replacement text for a certain class of external text entities, but "fall back" on OmniMark's built-in behavior for any other external text entity by duplicating what OmniMark would have done otherwise, or by doing some variant of that.
The basic tools for writing an SGML entity manager in OmniMark are the EXTERNAL-TEXT-ENTITY and EXTERNAL-DATA-ENTITY rules.
This section describes those rules in detail.
EXTERNAL-DATA-ENTITY entity-name condition? local-declaration* action*
Several OmniMark constructs are used in processing external entities. Section 14.4, "Attributes" and Section 14.4.3.4, "Data Attributes Associated With Entity Attributes" mention features that address the data attributes of external data entities.
This section presents the EXTERNAL-DATA-ENTITY rule.
Internal data entity references and all text entity references simply result in substitution of the entity's replacement text for the reference. The SGML parser automatically processes the replacement text and no special rules are needed in the OmniMark program.
An EXTERNAL-DATA-ENTITY rule is used to process external data and subdocument entities.
An EXTERNAL-DATA-ENTITY rule is triggered when a reference to the named entity occurs in data content and any specified condition is met. An EXTERNAL-DATA-ENTITY rule is never selected when an entity name occurs in the value of an ENTITY or ENTITIES attribute, because use of an entity name in an attribute value is not considered to be an entity reference.
When the same actions apply to more than one entity, the names of all the entities can be listed in the rule header, enclosed in parentheses, and separated by the operator "|" (OR). The entity name can be output using the format item "%q". (See Section 14.2.1, "Formatting Entity Names".)
For example, suppose references to entities named picture1 and picture2 are both to receive similar processing. The following EXTERNAL-DATA-ENTITY rule might be used:
EXTERNAL-DATA-ENTITY (picture1 | picture2)
   OUTPUT "\picture{%uq.PIC}"
Specific entity names need not be listed in the rule header. The keyword #IMPLIED can be used instead to indicate that the rule should be selected whenever the condition is met and a reference to an external data or subdocument entity occurs. For instance, the following rule header can be used to define actions to be performed whenever a reference occurs to an entity declared with notation "TBL":
EXTERNAL-DATA-ENTITY #IMPLIED WHEN NOTATION = "TBL"
It is an error if multiple EXTERNAL-DATA-ENTITY rules are selectable for one entity reference. If the same entity name is used in more than one EXTERNAL-DATA-ENTITY rule, then each rule must have a condition which ensures that only one will be selected. It is also an error if no EXTERNAL-DATA-ENTITY rule is selected when a reference occurs to an external data or subdocument entity.
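The "exactly one selectable rule" requirement can be illustrated with a small Python model (not OmniMark code). The `Entity` and `Rule` types and both function names are hypothetical. The sketch covers only rules that name their entities explicitly, and it leaves out #IMPLIED rules, because how they interact with named rules is not stated in this passage.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class Entity(NamedTuple):
    name: str
    notation: Optional[str] = None


class Rule(NamedTuple):
    label: str                           # hypothetical label, for reporting only
    names: frozenset                     # entity names listed in the rule header
    condition: Callable[[Entity], bool]  # the WHEN condition (always true if absent)


def selectable_rules(rules: Sequence[Rule], entity: Entity) -> List[str]:
    """Labels of all rules whose header and condition both accept the entity."""
    return [r.label for r in rules
            if entity.name in r.names and r.condition(entity)]


def selection_error(rules: Sequence[Rule], entity: Entity) -> Optional[str]:
    """Return a description of the error, or None if exactly one rule applies."""
    count = len(selectable_rules(rules, entity))
    if count == 0:
        return "no rule selected"
    if count > 1:
        return "multiple rules selectable"
    return None
```

Two rules may name the same entity as long as their conditions are mutually exclusive, which is what the condition functions express here.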
In the condition on an EXTERNAL-DATA-ENTITY rule, a test of an attribute value refers to a data attribute. In other rules, an unqualified attribute name refers to attributes of the current element, and data attributes must be accessed explicitly. For example, in the following rule headers, type is a data attribute:
Example A
EXTERNAL-DATA-ENTITY graphic WHEN ATTRIBUTE type = "TIFF"
Example B
ELEMENT list WHEN DATA-ATTRIBUTE type (OF ATTRIBUTE id) = "TIFF"
However, in the rule headers shown below, type is an attribute specified in a start-tag:
Example A
ELEMENT list WHEN ATTRIBUTE type = "BULLET"
Example B
EXTERNAL-DATA-ENTITY graphic WHEN ATTRIBUTE type OF ELEMENT = "BULLET"
EXTERNAL-DATA-ENTITY rules are not permitted in cross-translations.
EXTERNAL-TEXT-ENTITY entity-name condition? local-declaration* action*
An EXTERNAL-TEXT-ENTITY rule is used to provide OmniMark's built-in SGML parser with the text of an external text entity (i.e. an external entity that is not CDATA, SDATA, NDATA or SUBDOC) whenever such an entity is referenced in an SGML document. An EXTERNAL-TEXT-ENTITY rule looks like an EXTERNAL-DATA-ENTITY rule. Its most important property is that everything written to the #SGML stream within the rule is considered to be part of the entity's text.
For example, the following EXTERNAL-TEXT-ENTITY rule specifies that the text of the entity named version is the contents of the file named "version.txt":
EXTERNAL-TEXT-ENTITY version
   OUTPUT FILE "version.txt"
The following EXTERNAL-TEXT-ENTITY rule handles all external text entities. In a similar manner to its use in the EXTERNAL-DATA-ENTITY rule, #IMPLIED means "all SGML text entities" (actually "all named SGML text entities", see Section 16.4.1, "The Public Identifier at the Start of the DTD"). It is about the simplest entity manager that can be written. It just takes the name of the entity, lower-cases it, appends the extension ".ent", and uses it as a file name.
EXTERNAL-TEXT-ENTITY #IMPLIED
   OUTPUT FILE "%lq.ent"
Note that "%q" gives the name of the entity, in the same way as in an EXTERNAL-DATA-ENTITY rule, and that "%eq", "%pq" and "%epq" give the entity's effective (declared or associated by the LIBRARY rules) system identifier, declared public identifier, and the system identifier mapped from the public identifier in the LIBRARY rules. However, because text entities do not have associated notations, the "o" modifier must not be used with the "%q" format item in an EXTERNAL-TEXT-ENTITY rule.
The following EXTERNAL-TEXT-ENTITY rule is a bit more complicated (and also a bit more useful). It first checks to see if an external text entity has a system identifier, then checks for a system identifier in the LIBRARY rules, and finally uses the entity name, appending the extension ".ent" if the entity is a parameter entity (the "%" kind), or the extension ".sgm" if the entity is a general entity (the "&" kind). Once it has a file name, it checks to see if the file exists and is readable. If so it passes the text of the file to the SGML parser as the entity's replacement. If not, it issues an error on the error file, and provides an SGML comment as the entity's replacement text.
EXTERNAL-TEXT-ENTITY #IMPLIED
   LOCAL STREAM file-name
   DO WHEN ENTITY IS SYSTEM
      SET file-name TO FILE "%eq"
   ELSE WHEN ENTITY IS IN-LIBRARY
      SET file-name TO FILE "%epq"
   ELSE WHEN ENTITY IS PARAMETER
      SET file-name TO FILE "%lq.ent"
   ELSE
      SET file-name TO FILE "%lq.sgm"
   DONE
   DO WHEN FILE file-name EXISTS & FILE file-name IS READABLE
      OUTPUT FILE file-name
   ELSE
      PUT #ERROR "*** ERROR *** Can't read from %"%g(file-name)%"%n"
      OUTPUT "<!-- *** ERROR *** Can't read from %"%g(file-name)%" -->"
   DONE
The "ENTITY IS" tests are described in Section 14.2.2, "Entity Tests". The FILE ... "IS READABLE" and related tests are described in Section 10.4.1, "File Tests".
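The order in which the rule above chooses a file name can be restated as a Python sketch (not OmniMark code). The function `choose_file_name` and its parameters are hypothetical, and the LIBRARY rules are represented as a plain mapping from public identifier to system identifier.

```python
from typing import Mapping, Optional


def choose_file_name(
    name: str,
    is_parameter: bool,
    system_id: Optional[str],
    public_id: Optional[str],
    library: Mapping[str, str],
) -> str:
    """Pick a file name in the same order as the rule's DO WHEN chain."""
    # A declared system identifier wins.
    if system_id is not None:
        return system_id
    # Next, a system identifier mapped from the public identifier.
    if public_id is not None and public_id in library:
        return library[public_id]
    # Finally, the lower-cased entity name plus an extension that depends
    # on whether the entity is a parameter or a general entity.
    return name.lower() + (".ent" if is_parameter else ".sgm")
```

Whether the chosen file exists and is readable is a separate check, as in the second DO WHEN of the rule.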
The SUBMIT action can be used in the EXTERNAL-TEXT-ENTITY rule in place of (or in addition to) the OUTPUT or PUT actions. The uses and implications of doing so are discussed in Section 16.2.2.3, "Using SUBMIT in the External Text Entity Rule".
If an EXTERNAL-TEXT-ENTITY rule outputs no text, then, from the point of view of the SGML parser, the entity's replacement text simply consists of zero characters. This is not an error.
It should be noted that, in a context-translation, an EXTERNAL-TEXT-ENTITY rule can be performed while the FIND-START rules are being performed. This will happen if the FIND-START rules output the text of the SGML Declaration and there are EXTERNAL-TEXT-ENTITY rules for processing the entities represented by the public identifiers in the SGML Declaration.
The OUTPUT action is normally used inside an EXTERNAL-TEXT-ENTITY rule to provide the SGML parser with the entity's replacement text. In an EXTERNAL-TEXT-ENTITY rule, the default #CURRENT-OUTPUT stream set contains only the #SGML stream. That allows the replacement text of the entity to be fed to the parser using OUTPUT actions.
Because anything written to the #SGML stream in an EXTERNAL-TEXT-ENTITY rule becomes part of the entity's text, the entity's text can be made up of one or more pieces from one or more sources.
For example, the following EXTERNAL-TEXT-ENTITY rule processes any external text entity that has a system identifier. It treats the system identifier as a sequence of file names, separated by semicolons (note that this is a good idea on some systems, but not on others), and concatenates the text from all of the files together as the entity's replacement text.
EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
   REPEAT SCAN "%eq"
   MATCH [ANY EXCEPT ";"]+ => file-name
      OUTPUT FILE file-name
   MATCH ";"
      ; Ignore any semicolon
   AGAIN
An example of an entity with multiple file names is a case where there is a general entity that represents the chapters that comprise the "advanced" part of a textbook:
<!ENTITY advanced SYSTEM "chapter7.sgm;chapter8.sgm;chapter9.sgm">
Another example is a case where a parameter entity represents more than one set of declarations, as an alternative to having separate declarations for each set:
<!ENTITY % chars SYSTEM "mathchars.ent;pubchars.ent">
%chars;
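The semicolon-separated convention used in these declarations can be modeled as follows. This is a Python sketch, not OmniMark code. The function name `replacement_text` is hypothetical, and the file contents in the usage note are invented for illustration.

```python
from typing import Callable


def replacement_text(system_id: str, read_file: Callable[[str], str]) -> str:
    """Concatenate the text of every file named in the system identifier."""
    # Only non-empty runs between semicolons count as file names,
    # mirroring the [ANY EXCEPT ";"]+ pattern in the rule.
    names = [name for name in system_id.split(";") if name]
    return "".join(read_file(name) for name in names)
```

With a hypothetical `files` dictionary mapping file names to text, `replacement_text("chapter7.sgm;chapter8.sgm;chapter9.sgm", files.__getitem__)` returns the three chapter texts joined in the order listed.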
Alternatively, if the replacement text of some of the external entities is small, all of the entities can be defined in a single file. This technique can be used to construct a "control file" for configurable documents.
The following example reads in a file called "entity.set", which contains a set of entity definitions, each of which consists of an external entity name terminated by an equals sign, a "quote" character, which may be anything other than a newline, and the text of the entity terminated by another "quote" character. Optional, non-significant line breaks are allowed following the equals sign and following the closing "quote" on the entity text.
The example assumes that the file "entity.set" is correctly formatted, and makes no provision for the entity not being defined. (If it is not defined, the entity text will be "", the zero-length string, which may or may not be an appropriate fall-back position.)
EXTERNAL-TEXT-ENTITY #IMPLIED
   REPEAT SCAN FILE "entity.set"
   MATCH "%q" "=" "%n"? any => quote
         ((lookahead ! another quote) any)* => text
      OUTPUT text
      EXIT
   MATCH [any except "="]+ "=" "%n"? any => quote
         ((lookahead ! another quote) any)* another quote "%n"?
      ; Skip other entities
   AGAIN
An example of an "entity.set" file is the following, which contains information specific to processing a set of documents. Different "quotes" are used just for illustration.
orgname=/OmniMark Technologies Corporation/
prodname='OmniMark'
rights=
"All rights reserved by OmniMark Technologies Corporation. This material contains the valuable properties of OmniMark Technologies Corporation. No part of this material may be reproduced, translated or transmitted in any form or by any means, electronic, mechanical, or otherwise, including photocopying and recording, without the permission in writing from OmniMark Technologies Corporation."
Using this example, the entity reference "&orgname;" would have the replacement text "OmniMark Technologies Corporation".
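The "entity.set" format can be made concrete with a small parser. This is a Python sketch, not OmniMark code. The function names `parse_entity_set` and `entity_text` are hypothetical, and, like the rule above, the sketch assumes that the input is correctly formatted.

```python
from typing import Dict


def parse_entity_set(source: str) -> Dict[str, str]:
    """Parse name=<quote>text<quote> entries into a name-to-text mapping."""
    entities: Dict[str, str] = {}
    pos = 0
    while pos < len(source):
        equals = source.index("=", pos)
        name = source[pos:equals]
        pos = equals + 1
        if source.startswith("\n", pos):   # optional break after "="
            pos += 1
        quote = source[pos]                # any character serves as the quote
        close = source.index(quote, pos + 1)
        # Keep the first definition, as the rule exits on its first match.
        entities.setdefault(name, source[pos + 1:close])
        pos = close + 1
        if source.startswith("\n", pos):   # optional break after closing quote
            pos += 1
    return entities


def entity_text(source: str, name: str) -> str:
    """Undefined entities yield the zero-length string."""
    return parse_entity_set(source).get(name, "")
```

The sketch builds the whole table at once, whereas the rule rescans the file for each entity reference; for a small control file the outcome is the same.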
In practice, an entity manager will use a combination of these techniques.
The EXTERNAL-TEXT-ENTITY rule is unique in OmniMark, in that different parts of it are executed in each domain. The header of the rule, and any associated condition, are tested in the output processor. If the rule is selected, the actions within the rule body are performed in the input processor.
The reason for this split is:
In practice, the fact that the rule header is tested in the output processor, and the actions executed in the input processor, will be irrelevant to most programmers. However, it does have the following implications:
Versions of OmniMark prior to V3 treated the rule header for the EXTERNAL-TEXT-ENTITY rule as if it were evaluated in the input processor. So there may be a change in behaviour for EXTERNAL-TEXT-ENTITY rules which are not in the #IMPLIED group. It is not expected that this will be a significant change for most programs.
When an external text entity's text does not need processing, it is appropriate for an EXTERNAL-TEXT-ENTITY rule to use an OUTPUT or PUT action to provide the file's text to the SGML parser.
Even if some processing is required, it can be done with a "DO SCAN" or "REPEAT SCAN" in the EXTERNAL-TEXT-ENTITY rule, wherein each MATCH part emits the processed text with OUTPUT or PUT.
On the other hand, if substantial processing is required, it will often be the case that it is more appropriate to SUBMIT the text of the file for processing by FIND rules. In this case, any output of the FIND rules that process the submitted text is considered part of the text of the entity.
If the FIND rules are different from those used to process the main input, it will be necessary to use a "USING GROUP" prefix on the SUBMIT action to specify which FIND rules are used to process the submitted text. The following sample EXTERNAL-TEXT-ENTITY rule illustrates this processing:
EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS (PUBLIC & IN-LIBRARY)
   USING GROUP entity-processing
      SUBMIT FILE "%pq"
An OUTPUT-TO action is allowed in an EXTERNAL-TEXT-ENTITY rule. OUTPUT-TO in an EXTERNAL-TEXT-ENTITY rule remains in effect until the end of the EXTERNAL-TEXT-ENTITY rule, unless it is overridden by a further OUTPUT-TO.
Normally, the only active output stream in an EXTERNAL-TEXT-ENTITY rule is the #SGML stream, so that text written using the OUTPUT action becomes part of the replacement text of the external text entity. The OUTPUT-TO action allows the OmniMark programmer to redirect the output to another destination. The EXTERNAL-TEXT-ENTITY rule's active output can be restored to the #SGML stream using
OUTPUT-TO #SGML
An entity manager may want to skip over unreadable files when it is searching for a file containing the text of an external entity. The file tests described in Section 10.4.1, "File Tests" can be used, as in the following example:
EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
   REPEAT SCAN "%eq"
   MATCH [ANY EXCEPT ";"]+ => file-name
      DO WHEN FILE file-name IS READABLE & FILE file-name ISNT DIRECTORY
         OUTPUT FILE file-name
         EXIT
      DONE
   MATCH ";"
      ; Skip over the separating semicolons
   MATCH VALUE-END
      PUT #ERROR "None of the files %"%eq%" (for entity %q) " _
                 "are readable.%n"
      HALT
   AGAIN
This example interprets the system identifier as a set of alternative file names separated by semicolons. It uses the first file that is really a file and not a directory, and that is readable, as the text of the external entity. If there is no such file, the EXTERNAL-TEXT-ENTITY rule terminates the OmniMark program with a message.
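The "first usable alternative" search can be sketched in Python (not OmniMark code). The function name `first_usable_file` and the two predicate parameters are hypothetical stand-ins for the file tests.

```python
from typing import Callable, Optional


def first_usable_file(
    system_id: str,
    is_readable: Callable[[str], bool],
    is_directory: Callable[[str], bool],
) -> Optional[str]:
    """Return the first alternative that is a readable, non-directory file."""
    for name in system_id.split(";"):
        # Empty pieces (from adjacent semicolons) are not file names.
        if name and is_readable(name) and not is_directory(name):
            return name
    # Corresponds to reaching VALUE-END, where the rule reports and halts.
    return None
```

Unlike the concatenating rule shown earlier, only one file contributes to the replacement text here; the remaining names are fall-backs.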
To provide compatibility between the LIBRARY rules and -libpath command-line argument of OmniMark and the EXTERNAL-TEXT-ENTITY rule, three built-in stream shelves are supported by OmniMark: #LIBRARY, #LIBPATH and #LIBVALUE.
Although these shelves have been provided primarily to support compatibility with earlier versions of OmniMark, they are very useful in their own right. Programmers who are writing their own entity managers should carefully consider how these streams can be used for other purposes. The examples in this section might help some programmers get a start on thinking up new ideas.
#LIBRARY is a built-in stream that starts out with one item for each entry in every LIBRARY rule in the OmniMark program or in a -library file at run-time. The key of each item is a public identifier, and the value of each item is the corresponding system identifier.
So, for example, if the text of a source element is a public identifier and the OmniMark program is to output the corresponding system identifier, the following element rule could be used:
ELEMENT source
   OUTPUT #LIBRARY ^ "%c"
The primary use of the #LIBRARY stream is in EXTERNAL-DATA-ENTITY and EXTERNAL-TEXT-ENTITY rules. In such a rule, the following two OUTPUT actions would output the same text:
OUTPUT "%epq"
OUTPUT #LIBRARY ^ "%pq"
Any change made to the #LIBRARY stream is immediately reflected in how the "%epq" and "%epv" format items are interpreted. If the OmniMark program contains no EXTERNAL-TEXT-ENTITY rules, then any change made to the #LIBRARY stream also determines how OmniMark's built-in entity manager interprets public identifiers in referenced external text entities.
The default "current item" of the #LIBRARY shelf is the lastmost item, as is the case with programmer-declared shelves.
The built-in stream #LIBPATH starts out with one item for each -libpath argument on the command line, in the order that the -libpath arguments appear on the command line. The following EXTERNAL-TEXT-ENTITY rule does what OmniMark's built-in entity manager does for entities with system identifiers (a more complete example, handling public identifiers, is given in a later section):
EXTERNAL-TEXT-ENTITY #IMPLIED WHEN ENTITY IS SYSTEM
   LOCAL STREAM file-name
   DO WHEN FILE "%eq" EXISTS
      SET file-name TO "%eq"
   ELSE
      REPEAT OVER #LIBPATH
         DO WHEN FILE "%g(#LIBPATH)%eq" EXISTS
            SET file-name TO "%g(#LIBPATH)%eq"
            EXIT
         DONE
      AGAIN
   DONE
   DO WHEN file-name IS ATTACHED
      OUTPUT FILE file-name
   ELSE
      PUT #ERROR "No file found for entity %"%q%", system id = %"%eq%"%n"
      HALT
   DONE
In this example, provision is made for there being no -libpath command-line argument or for no -libpath prefix producing the name of an existing file. If the EXTERNAL-TEXT-ENTITY rule was missing the HALT action, then, apart from the message being written to #ERROR, the OmniMark program would just continue.
The default "current item" of the #LIBPATH shelf is the lastmost item, as is the case with programmer-declared shelves.
The built-in stream #LIBVALUE starts out with one item for each public identifier that is "built-in" to OmniMark. These values are used by OmniMark's "built-in" entity manager if an entity is not resolved using the #LIBRARY shelf, which means the OmniMark programmer can often avoid having to write an EXTERNAL-TEXT-ENTITY rule. The OmniMark program can add to, delete or modify these values to suit its needs.
For example, many applications have "parameters" that neither the OmniMark programmer nor the document users want to have hard coded in the programs being used, in the DTDs, or in the documents. An example of such a "parameter" is a case in which a company's name is represented by an external general text entity, &company; that is defined using a public identifier:
<!ENTITY company PUBLIC "-//miscellany//TEXT company//EN">
The use of the public identifier assures that the entity can be interpreted independently of the document containing the reference. A convenient way to make "&company;" a parameter is to simply specify the entity's replacement text on the command line that runs the OmniMark program that processes the document:
omnimark -s ... -define ent "company:OmniMark Technologies" ...
The programmer-defined stream ent is used as the command-line parameter. The following OmniMark program fragment illustrates how the parameter is taken from the ent stream and added to the #LIBVALUE shelf. More than one parameter is defined by separating their definitions by semicolons, and a name is separated from its replacement text by a colon.
GLOBAL STREAM ent

DOCUMENT-START WHEN ent IS ATTACHED
   REPEAT SCAN ent
   MATCH [ANY EXCEPT ":"]+ => public-id ":" [ANY EXCEPT ";"]* => value ";"?
      SET NEW #LIBVALUE ^ "-//miscellany//TEXT %x(public-id)//EN" TO value
   AGAIN
An important feature of this example is that the OmniMark program knows nothing about what entities are supported or even how many of them there are: it just knows how to support entities.
The "names" in the ent stream are used as a "public text description" of the public identifiers assigned to the entity, and an "unregistered owner identifier" of "miscellany" is used.
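The way the ent parameter is split into keyed #LIBVALUE entries can be modeled as follows. This is a Python sketch, not OmniMark code. The function name `libvalue_entries` is hypothetical, and a dictionary stands in for the keyed shelf.

```python
import re
from typing import Dict

# name ":" value, with an optional trailing ";" between definitions,
# mirroring the MATCH pattern in the DOCUMENT-START rule.
_DEFINITION = re.compile(r"([^:]+):([^;]*);?")


def libvalue_entries(ent: str) -> Dict[str, str]:
    """Map each constructed public identifier to its replacement text."""
    entries: Dict[str, str] = {}
    for match in _DEFINITION.finditer(ent):
        name, value = match.group(1), match.group(2)
        # The name becomes the public text description; "miscellany"
        # is the unregistered owner identifier.
        entries[f"-//miscellany//TEXT {name}//EN"] = value
    return entries
```

For the command line shown earlier, the single entry produced has the key "-//miscellany//TEXT company//EN", which is exactly the public identifier in the declaration of the company entity.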
The example assumes that OmniMark's built-in entity manager is going to be used, because it uses the #LIBVALUE stream as a source of entity text. However, a programmer-supplied entity manager, written using EXTERNAL-TEXT-ENTITY rules, can be used instead.
The default "current item" of the #LIBVALUE shelf is the lastmost item, as is the case with programmer-declared shelves.
As stated above, the #LIBRARY stream, the #LIBPATH stream and the #LIBVALUE stream start out with the contents of the LIBRARY rules, -libpath command-line arguments and OmniMark's "built-in" public identifiers respectively.
They can be changed by the OmniMark program at any time, in any way, but there are some restrictions that apply:
The restrictions are imposed by OmniMark's built-in entity manager, which takes over in these cases. Entity manager designers will usually impose similar restrictions. The restrictions are:
The item must not be opened for writing, it must not be a file, and it must not be unattached. This restriction is imposed so that the built-in entity manager always has an immediately accessible value that can be determined without opening a file.
On the other hand, there are no restrictions on items of the #LIBRARY shelf that do not have keys (because they will never be accessed by the built-in entity manager), or on items that for other, program-specific reasons are never accessed. Also, there are no restrictions as long as the built-in entity manager is not invoked.
When trying to find a file matching an external entity reference, the items of the #LIBPATH shelf are examined one by one, and used as prefixes for the system identifier. As soon as a prefix and system identifier combination is found that is the name of an existing file, it is used, and any following items of #LIBPATH are ignored.
The items of #LIBPATH can be keyed or not -- the built-in entity manager never examines the keys. Because it is only used when trying to open a referenced external text entity, the #LIBPATH stream is never used by the built-in entity manager if there is any EXTERNAL-TEXT-ENTITY rule in an OmniMark program.
It must not be opened for writing, it must not be a file, and it must not be unattached. This restriction is imposed so that the built-in entity manager always has an immediately accessible value that it can determine without opening a file.
On the other hand, there are no restrictions on items of the #LIBVALUE shelf that do not have keys (because they will never be accessed by the built-in entity manager), or on items that for other, program-specific reasons are never accessed. Also, there are no restrictions as long as the built-in entity manager is not invoked.
It is possible for some external entities to have public identifiers with no system identifier. If there is no LIBRARY rule to map the public identifier onto a system identifier, then the OmniMark program may have to process the public identifier itself.
Sometimes this is done because instead of the replacement text of the external entity being contained in a file, the public identifier contains all of the information necessary to fetch or construct the replacement text.
This section describes techniques for parsing public identifiers.
An external identifier (a public identifier and/or a system identifier, introduced by the keyword PUBLIC or SYSTEM) is allowed immediately following the keyword DOCTYPE and the document element name. This external identifier, when present, identifies an entity containing declarations to be included following those in the DOCTYPE declaration.
The OmniMark programmer's entity manager can provide the text (i.e. the declarations) of this entity using an EXTERNAL-TEXT-ENTITY rule. The keyword #DTD is used to identify this entity. For example, the following rule uses a file called "default.dtd" when there is an external identifier at the start of the DTD and it has no public identifier or system identifier (e.g. <!DOCTYPE doc SYSTEM [):
EXTERNAL-TEXT-ENTITY #DTD WHEN ENTITY ISNT (SYSTEM | PUBLIC)
   OUTPUT FILE "default.dtd"
If the "%q" format is used for the #DTD entity, it will produce the string "#DTD". Note, however, that this entity really doesn't have a name, and that, using a variant SGML syntax, an SGML document can define an entity with the name "#DTD", which produces the same result with "%q". The "ENTITY IS #DTD" test can be used to distinguish the "real" #DTD entity from the user's entity with the same name.
It sometimes happens that the external identifier at the head of a DTD has neither a system identifier nor a public identifier, as in:
<!DOCTYPE report SYSTEM>
In this case, it may be appropriate for an OmniMark program to use the name of the document element to find the entity that is implicitly referred to. For example, in this case, it may be that the file "report.dtd" is intended to be used. The name of the document element is available to the OmniMark program in the #DOCTYPE stream (see Section 14.1.3.3, "The Document Element Name"). For example:
EXTERNAL-TEXT-ENTITY #DTD WHEN ENTITY ISNT (SYSTEM | PUBLIC)
   OUTPUT FILE "%g(#DOCTYPE).dtd"
The "ENTITY IS #DTD" test is used to determine whether the entity is the #DTD one or not. #DTD is used like EXTERNAL, PUBLIC or PARAMETER in an ENTITY test, and can be combined with these other keywords. It is useful when an EXTERNAL-TEXT-ENTITY rule can process either the #DTD entity or another entity, and needs to determine which one it has. Examples of this are going to be rather complex entity managers in practice, so to illustrate the point, the following somewhat contrived example processes either the #DTD entity or the entities named "my-dtd" or "the-dtd":
EXTERNAL-TEXT-ENTITY (#DTD | my-dtd | the-dtd)
   DO WHEN ENTITY IS SYSTEM
      OUTPUT FILE "%eq"
   ELSE WHEN ENTITY IS #DTD
      OUTPUT FILE "my.dtd"
   ELSE
      OUTPUT FILE "%q.ent"
   DONE
An EXTERNAL-TEXT-ENTITY rule of the form
EXTERNAL-TEXT-ENTITY #IMPLIED ...
matches all named entities, not including the #DTD one (because it doesn't have a name). This allows the #DTD entity to be processed in a different manner than those defined by entity declarations. To match all named entities and the #DTD one, both #IMPLIED and #DTD have to be used, as in:
EXTERNAL-TEXT-ENTITY (#IMPLIED | #DTD) ...
If there are any EXTERNAL-TEXT-ENTITY rules in an OmniMark program that use the keyword #DTD in their heading, then all #DTD entities must be handled by the OmniMark program. If no #DTD entity is handled by an OmniMark program then all such entities are subject to OmniMark's default processing. See Section 16.5, "A Default External Text Entity Rule" for more information on this default processing.
In the head of the EXTERNAL-TEXT-ENTITY rule, #DTD can be combined with #IMPLIED or with the names of named entities, but not both, because #IMPLIED cannot be combined with the names of entities.
The #DTD entity is considered a parameter entity (not a general entity) for the purpose of the "ENTITY IS GENERAL" and "ENTITY IS PARAMETER" tests.
The public identifiers that can appear in the SGML Declaration, for the base character sets, for the capacity set and for the concrete syntax, are processed in much the same way as the #DTD entity. They are identified by the keywords #CHARSET, #CAPACITY and #SYNTAX, respectively. They are like the #DTD entity in most respects:
ENTITY IS (#CHARSET | #CAPACITY | #SYNTAX)
Entities referenced by the public identifiers in the SGML Declaration have the following additional properties:
The ISO character entities (e.g. the entity referenced by "&Eacute;") are defined in external files rather than being "hard coded" inside OmniMark's built-in entity manager, with the files divided as described in Appendix D.4 of ISO 8879, the SGML standard. These files are shipped with OmniMark, together with a file containing a LIBRARY rule that maps both ISO 8879-1986 and ISO 8879:1986 versions of the public identifiers to the appropriate files.
If there is no EXTERNAL-TEXT-ENTITY rule to process an entity associated with a public identifier in the SGML Declaration, and the public identifier is one of those in the following list, then OmniMark provides the entity text corresponding to the meaning of the public identifiers, as defined in the SGML standard:
ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0
ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0
ANSI X3.4-1986//CHARSET American Standard Code for Information Interchange (ASCII)//ESC 2/8 4/2
ISO 8879-1986//SYNTAX Reference//EN
ISO 8879-1986//SYNTAX Core//EN
ISO 8879-1986//SYNTAX Multicode Basic//EN
ISO 8879-1986//SYNTAX Multicode Core//EN
ISO 8879:1986//SYNTAX Reference//EN
ISO 8879:1986//SYNTAX Core//EN
ISO 8879:1986//SYNTAX Multicode Basic//EN
ISO 8879:1986//SYNTAX Multicode Core//EN
In addition, any capacity set public identifier is accepted and matched with the reference capacity set values (all 35000).
All three character sets are given the same definition: that of the IRV of ISO 646. The concrete syntaxes with colons in their names are given the same definitions as those in the SGML standard with dashes instead.
The #LIBVALUE stream starts out with one item for each of the public identifiers listed above. The key of each item is the public identifier, and the value of each item is the corresponding replacement text: a character set definition for each of the CHARSET public identifiers, and a concrete syntax definition for each of the SYNTAX public identifiers. The #LIBVALUE stream is used by OmniMark to get these text values, so if the OmniMark program changes the #LIBVALUE stream, those changes are reflected in how the SGML Declaration, in particular, is processed.
The values of the #LIBVALUE stream items must conform to the use that is made of the corresponding identifier. In particular, a public identifier for a capacity set or concrete syntax used in the SGML Declaration must have a value that is in the same format as the explicitly described capacity set or concrete syntax that could have been coded in its place in the SGML Declaration. A public identifier for a character set must have a corresponding value in the format described in Section 16.4.2.2, "Base Character Sets". More information about modifying the #LIBVALUE stream is contained in Section 16.3.4, "Restrictions on the #LIBRARY, #LIBPATH and #LIBVALUE Shelf Values".
Base character sets are identified by (usually formal) public identifiers. These public identifiers are interpreted by the implementation to produce a character set and identify:
The text associated with a character set's public identifier must conform to the description of an "external character set description", which is defined below. An external character set description describes a character set in a manner similar to a described character set portion in an SGML Declaration (ISO 8879 production [175]).
The syntax of an external character set description is, in the notation used in ISO 8879:
external character set description =
   ps+, (external character description, ps+)*
external character description =
   external character number, ps+, (number of characters, ps+)?,
   (graphic character assignment | external character assignment)
graphic character assignment =
   (lit, graphic character*, lit) | (lita, graphic character*, lita) |
   "TAB" | "B"
external character assignment =
   "UCLETTER" | ("LCLETTER", ps+, external character number) |
   ("DIGIT", ps+, digit value) | "SPECIAL" | "DATA" | "CONTROL"
external character number = number
digit value = number
Where:
A graphic character assignment indicates how characters in parameter literals in the concrete syntax (delimiter strings and the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR strings) are to be interpreted. For example,
65 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
indicates that the characters in the literal, when encountered in a parameter literal in the concrete syntax, are to be interpreted as characters with numbers 65 through 90 inclusive (26 letters, starting at 65), in the character set identified by the public identifier of the entity containing the external character set description.
The character with a numeric value of zero ('\0', &#0; or CTRL-@) should not be used in a "graphic character assignment". If it is, it is ignored, as if it did not appear in the string. The "zero digit" character, '0', is not the same thing as the zero-valued character, and can be used.
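The numbering in a graphic character assignment is simple arithmetic: the first character of the literal receives the stated external character number, and each following character receives the next number. The sketch below models this in Python, used here only as a neutral modelling language; the names `first_number`, `literal` and `numbers` are illustrative and are not OmniMark identifiers.

```python
# Model of one graphic character assignment: 65 "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
# `first_number` and `literal` are illustrative names, not OmniMark identifiers.
first_number = 65
literal = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Each character takes the next consecutive number, starting at first_number.
numbers = {ch: first_number + offset for offset, ch in enumerate(literal)}

print(numbers["A"], numbers["Z"])  # prints: 65 90
```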
An external character assignment assigns characters in the base character set, starting with the "external character number", and continuing for "number of characters" to one of the following categories:
Every character (in the range of allowed character values: 0 to 255 in current versions of OmniMark) not placed in one of these categories is classified as a non-significant, non-control data character. (Note that this method of defining base character sets ensures that no character will ever be two or more of LC Letter, UC Letter, Digit, Special or Control.)
Examples of using external character assignments are:
48 Digit 0
63 Special
97 LCLetter 65
These lines mean:
If the OmniMark program provides zero characters of text for a #CHARSET public identifier, or only white space and SGML comments, then all characters in the base character set are made to be non-significant data characters. This is very often appropriate for character sets other than those that define the letters, digits and special characters. A reasonable thing for many applications to do is to provide the definition for the "ISO 646 (IRV)" character set when requested to do so and to provide a zero-length definition for all other character set requests.
The following text defines the "ISO 646 (IRV)" character set, which can either be kept in a file or hard coded in an OmniMark program:
9 tab
32 ' !"#$%&'
39 "'()*+,-./0123456789:;<=>?"
64 "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
96 "`abcdefghijklmnopqrstuvwxyz{|}~"
0 32 control
39 3 special
43 5 special
48 10 digit 0
58 special
61 special
63 special
65 26 ucletter
66 b
97 26 lcletter 65
127 control
As an alternative to providing base character set information itself, an OmniMark program can allow the user to provide files or other text containing the definitions of base character sets. This is especially useful where there is a need to define additional letters, as may be appropriate when using the ISO "Added Latin" character sets.
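Text in this format is regular enough to be read mechanically. The following is a minimal sketch of such a reader, written in Python purely as a model and not as a description of how OmniMark itself parses the text. It assumes that "ps" separators are white space or SGML comments of the form "-- ... --", that keywords are case-insensitive (as the examples above suggest), and that the number following LCLETTER or DIGIT can simply be read and set aside. The "TAB" and "B" keywords are recorded without being interpreted. All function and variable names are hypothetical.

```python
import re

# Token shapes assumed from the grammar: SGML comments, literals delimited
# by " (lit) or ' (lita), numbers, and keywords. White space is skipped.
TOKEN = re.compile(
    r"""--.*?--|"(?P<lit>[^"]*)"|'(?P<lita>[^']*)'|(?P<number>\d+)|(?P<name>[A-Za-z]+)|\s+""",
    re.DOTALL,
)
CATEGORIES = {"UCLETTER", "LCLETTER", "DIGIT", "SPECIAL", "DATA", "CONTROL"}
WITH_ARGUMENT = {"LCLETTER", "DIGIT"}  # keywords followed by one more number


def tokenize(text):
    """Split the text into (kind, value) pairs, dropping comments and spaces."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup == "number":
            tokens.append(("number", int(m.group())))
        elif m.lastgroup == "name":
            tokens.append(("name", m.group().upper()))
        elif m.lastgroup in ("lit", "lita"):
            tokens.append(("literal", m.group(m.lastgroup)))
    return tokens


def parse_charset(text):
    """Return (categories, graphic, named) for a character set description."""
    tokens = tokenize(text)
    categories, graphic, named = {}, {}, {}
    i = 0
    while i < len(tokens):
        kind, start = tokens[i]
        if kind != "number":
            raise ValueError(f"expected a character number, got {start!r}")
        i += 1
        count = 1
        if i < len(tokens) and tokens[i][0] == "number":
            count = tokens[i][1]  # optional "number of characters"
            i += 1
        if i >= len(tokens):
            raise ValueError("missing assignment after character number")
        kind, value = tokens[i]
        i += 1
        if kind == "literal":
            # The first assignment of a graphic character wins.
            for offset, ch in enumerate(value):
                graphic.setdefault(ch, start + offset)
        elif value in ("TAB", "B"):
            named[value] = start  # recorded only, not interpreted here
        elif value in CATEGORIES:
            if value in WITH_ARGUMENT:
                if i >= len(tokens) or tokens[i][0] != "number":
                    raise ValueError(f"{value} must be followed by a number")
                i += 1  # argument is read but not interpreted in this sketch
            for number in range(start, start + count):
                categories[number] = value
        else:
            raise ValueError(f"unknown assignment {value!r}")
    return categories, graphic, named


sample = '''
  -- a fragment of the ISO 646 (IRV) definition --
  65 "ABC"
  48 10 digit 0
  63 special
  97 26 lcletter 65
'''
categories, graphic, named = parse_charset(sample)
print(graphic["C"], categories[57], categories[63], categories[122])
# prints: 67 DIGIT SPECIAL LCLETTER
```

Characters that never appear in `categories` correspond to the non-significant data characters described above.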
More than one base character set can be used in a document. The document character set must be assigned the meanings of all characters in all of the base character sets used. All of the base character sets used in the document character set that contain significant (LC Letter, UC Letter, Digit or Special) characters must be used in defining the syntax-reference character set, and all significant characters in those base character sets must have their meanings assigned to syntax-reference characters. Any other base character set used to assign meanings in the document character set may be used in the syntax-reference character set, as may any base character set not used in the document character set. However, in the latter case, no "meaningful" assignments can be made to the syntax-reference characters, because there are no document characters that take on those syntax-reference character meanings.
The repetition of a public identifier in the document character set is recognized, and the previous definition of the base character set is used. Where (the same or different) base character sets assign an external character number to the same graphic character, the first assignment of the first base character set is used.
A capacity set can either be specified in the SGML Declaration or it can be described by a public identifier. If a public identifier is provided then its text must be a sequence of zero or more capacity names, each followed by a capacity points number. More precisely:
external capacity set description = ps+, (name, ps+, number, ps+)*
ps+ can be any combination of white space characters and SGML comments.
Each name in an external capacity specification must be that of a capacity. The associated number becomes the limit value of that capacity. Any capacity not mentioned is set to the reference value (35000). An external capacity set description can contain no text, or only comments and white space, in which case all the capacities are set to the reference value.
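The defaulting behaviour described above can be modelled in a few lines. The following Python sketch is an illustration only: it assumes the seventeen capacity names defined by ISO 8879, treats names as case-insensitive (an assumption), and treats "-- ... --" comments as separators. The function name `parse_capacity_set` is hypothetical.

```python
import re

REFERENCE_VALUE = 35000

# Capacity names as defined by ISO 8879.
CAPACITY_NAMES = (
    "TOTALCAP", "ENTCAP", "ENTCHCAP", "ELEMCAP", "GRPCAP", "EXGRPCAP",
    "EXNMCAP", "ATTCAP", "ATTCHCAP", "AVGRPCAP", "NOTCAP", "NOTCHCAP",
    "IDCAP", "IDREFCAP", "MAPCAP", "LKSETCAP", "LKNMCAP",
)


def parse_capacity_set(text):
    """Return a dict of capacity limits; unmentioned capacities keep 35000."""
    capacities = dict.fromkeys(CAPACITY_NAMES, REFERENCE_VALUE)
    # Replace SGML comments with a space, then split on white space.
    words = re.sub(r"--.*?--", " ", text, flags=re.DOTALL).split()
    if len(words) % 2:
        raise ValueError("each capacity name must be followed by a number")
    for name, number in zip(words[0::2], words[1::2]):
        name = name.upper()  # assumption: names are case-insensitive
        if name not in capacities:
            raise ValueError(f"{name} is not a capacity name")
        if re.fullmatch(r"[0-9]+", number) is None:
            raise ValueError(f"{number} is not a number")
        capacities[name] = int(number)
    return capacities


limits = parse_capacity_set("-- larger documents -- TOTALCAP 70000 ENTCAP 40000")
print(limits["TOTALCAP"], limits["ENTCAP"], limits["GRPCAP"])
# prints: 70000 40000 35000
```

An empty description, or one holding only comments and white space, leaves every capacity at the reference value.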
A concrete syntax can either be specified in the SGML Declaration or a public identifier can be provided describing a public concrete syntax. If a public identifier is provided, the entity text associated with the public identifier must conform to the part of a concrete syntax defined by production [182] in ISO 8879, the SGML standard, starting with "shunned character number identification". In other words, the entity text must be what would be put in the SGML Declaration following the keyword SYNTAX (but it cannot be another public identifier). More precisely:
external public concrete syntax description =
   shunned character number identification, ps+,
   syntax-reference character set, ps+,
   function character identification, ps+,
   naming rules, ps+,
   delimiter set, ps+,
   reserved name use, ps+,
   quantity set
ps+ can be any combination of white space characters and SGML comments. Unlike a #CHARSET or #CAPACITY entity, a #SYNTAX entity must contain something other than white space and comments: all parts of an "external public concrete syntax description" must be present.
The following is an example of the contents of a file for a concrete syntax. It corresponds to the Reference Concrete Syntax ("ISO 8879-1986//SYNTAX Reference//EN"):
SHUNCHAR 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 256 0
FUNCTION RE 13
         RS 10
         SPACE 32
         TAB SEPCHAR 9
NAMING LCNMSTRT ""
       UCNMSTRT ""
       LCNMCHAR "-."
       UCNMCHAR "-."
       NAMECASE GENERAL YES
                ENTITY NO
DELIM GENERAL SGMLREF
      SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
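Because every part of a concrete syntax description must be present, an application that accepts user-supplied syntax files may want a quick completeness check before handing the text to the parser. The Python sketch below is one rough way to do that, under the assumption that each part is introduced by the keyword used in the example above (SHUNCHAR, BASESET, DESCSET, FUNCTION, NAMING, DELIM, NAMES, QUANTITY) and that the parts appear in that order. It checks only for the keywords, not for the validity of what follows them. The function name `missing_parts` is hypothetical.

```python
import re

# Keywords that introduce each required part, in the order they must appear.
REQUIRED = ["SHUNCHAR", "BASESET", "DESCSET", "FUNCTION",
            "NAMING", "DELIM", "NAMES", "QUANTITY"]


def missing_parts(text):
    """Return the required keywords that are absent or out of order."""
    # Blank out comments and quoted literals so that words inside them
    # (for example "CHARSET" in a public identifier) are not seen as keywords.
    stripped = re.sub(r"--.*?--|\"[^\"]*\"|'[^']*'", " ", text, flags=re.DOTALL)
    words = stripped.upper().split()
    missing, position = [], 0
    for keyword in REQUIRED:
        try:
            position = words.index(keyword, position) + 1
        except ValueError:
            missing.append(keyword)
    return missing


reference_syntax = '''
SHUNCHAR 0 1 2 3 127 255
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 256 0
FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9
NAMING LCNMSTRT "" UCNMSTRT "" LCNMCHAR "-." UCNMCHAR "-."
       NAMECASE GENERAL YES ENTITY NO
DELIM GENERAL SGMLREF SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
'''
print(missing_parts(reference_syntax))             # prints: []
print(missing_parts("-- nothing but a comment --"))  # prints all eight keywords
```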
Some applications use the public text description and/or other parts of a formal public identifier to help construct the file name used to access the associated entity's text. (The "ptd" in "-//owner//TEXT ptd//EN" is the public identifier's public text description.) Formal public identifiers have a strict syntax that can easily be parsed using OmniMark patterns to extract the parts of interest to a particular application. The following is an EXTERNAL-TEXT-ENTITY rule that assumes that all external text entities have a formal public identifier and that the file containing the entity's text is formed by the public text description, a dot, and the lower-cased version of the first three letters of the public identifier's public text class (CHARSET etc.):
EXTERNAL-TEXT-ENTITY #IMPLIED
   DO SCAN "%pq"
   MATCH (["-+"] "//")?
         ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")* "//"
         ; The first three characters of the public text class are captured;
         ; the rest of the class, if any, is skipped up to the space.
         [ANY EXCEPT "%_"] {3} => class3 [ANY EXCEPT " "]* " "
         "-//"? ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")* => description
      OUTPUT FILE "%x(description).%lx(class3)"
   DONE
This EXTERNAL-TEXT-ENTITY rule would output the file named "Chapter3.tex" when given a reference to an entity with the following declaration:
<!ENTITY ch3 PUBLIC "-//All Mine//TEXT Chapter3//EN">
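The derivation of the file name can be checked outside OmniMark. The following Python sketch mirrors the intent of the rule above using a regular expression; it is a model for experimenting with public identifiers, not OmniMark code, and the names `entity_file_name`, `class3` and `description` are chosen here only to echo the pattern variables.

```python
import re

# One component of a public identifier: characters in which a "/" is never
# immediately followed by another "/".
PART = r"(?:[^/]|/(?!/))*"

PATTERN = re.compile(
    rf"(?:[-+]//)?{PART}//"          # optional +// or -//, owner identifier
    rf"(?P<class3>[^ ]{{3}})[^ ]* "  # first three letters of the text class
    rf"(?:-//)?"                     # optional unavailable text indicator
    rf"(?P<description>{PART})"      # public text description
)


def entity_file_name(public_id):
    """Return '<description>.<first three class letters, lower-cased>'."""
    m = PATTERN.match(public_id)
    if m is None:
        return None
    return f"{m.group('description')}.{m.group('class3').lower()}"


print(entity_file_name("-//All Mine//TEXT Chapter3//EN"))  # prints: Chapter3.tex
```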
A more general pattern, that will parse any formal public identifier, is the following:
MATCH ("+//" ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
             => registered-owner-identifier |
       "-//" ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
             => unregistered-owner-identifier |
       ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
             => iso-owner-identifier)
      "//"
      [ANY EXCEPT "%_"]+ => public-text-class " "
      ("-//" => unavailable-text-indicator)?
      ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
             => public-text-description
      "//"
      (LETTER {2} => public-text-language VALUE-END |
       ([ANY EXCEPT "/"]+ | "/" LOOKAHEAD ! "/")*
             => public-text-designating-sequence
       ("//" ANY* => public-text-display-version)?)
This pattern can be used in an EXTERNAL-TEXT-ENTITY rule, in an EXTERNAL-DATA-ENTITY rule, or when processing an entity name valued element attribute. If this pattern were used to parse the public identifier of the previous example ("-//All Mine//TEXT Chapter3//EN"), it would result in the following pattern variable assignments:
This pattern is rather lengthy because of its generality, not to mention the long pattern variable names used. Most applications will not need all parts of the public identifier. Shorter pattern variable names can be used -- the terms in the pattern are those used in the SGML standard to describe the parts of a formal public identifier. On the other hand, some OmniMark programmers will want to extend the pattern to extract details of an ISO owner identifier, public text description or designating sequence.
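For readers who want to experiment with the structure of formal public identifiers outside OmniMark, the general pattern can be approximated with a regular expression. The Python sketch below follows the same alternatives in the same order; it is an approximation for illustration, and the group names (underscored versions of the pattern variable names) and the function `parse_fpi` are hypothetical.

```python
import re

# One component: characters in which "/" is never followed by another "/".
PART = r"(?:[^/]|/(?!/))*"

FPI = re.compile(
    rf"(?:\+//(?P<registered_owner>{PART})"
    rf"|-//(?P<unregistered_owner>{PART})"
    rf"|(?P<iso_owner>{PART}))"
    rf"//(?P<text_class>[^ ]+) "
    rf"(?P<unavailable>-//)?"
    rf"(?P<description>{PART})//"
    rf"(?:(?P<language>[A-Za-z]{{2}})$"
    rf"|(?P<designating_sequence>{PART})(?://(?P<display_version>.*))?)",
    re.DOTALL,
)


def parse_fpi(public_id):
    """Return the parts that are present as a dict, or None if no match."""
    match = FPI.fullmatch(public_id)
    if match is None:
        return None
    return {k: v for k, v in match.groupdict().items() if v is not None}


print(parse_fpi("-//All Mine//TEXT Chapter3//EN"))
# prints: {'unregistered_owner': 'All Mine', 'text_class': 'TEXT',
#          'description': 'Chapter3', 'language': 'EN'}  (on one line)
```

Parts that do not occur in a given identifier are simply absent from the result, which keeps application code that needs only one or two parts short.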
The behaviour of OmniMark's built-in entity manager is equivalent to the following EXTERNAL-TEXT-ENTITY rule. This rule is provided to help OmniMark programmers mimic OmniMark's default behaviour as a fall-back position to their own entity management strategies.
There are six categories of external text entities: those defined by entity declarations and referenced by explicit entity references, and those represented by each of the keywords #DOCUMENT, #DTD, #CHARSET, #CAPACITY and #SYNTAX, respectively. If an OmniMark program contains EXTERNAL-TEXT-ENTITY rules for any category, then no default rule is provided: the OmniMark program must deal with all entities of that type.
The #DOCUMENT external text entity has very different properties than the others mentioned above, and is described in Section 2.5.3, "Controlling Input to the SGML Parser".
The following EXTERNAL-TEXT-ENTITY rule (actually, an equivalent one with different error messages) is only used for entities in categories not dealt with by the OmniMark program.
EXTERNAL-TEXT-ENTITY (#IMPLIED | #DTD | #CHARSET | #CAPACITY | #SYNTAX)
   LOCAL STREAM file-name

   DO WHEN ENTITY IS (SYSTEM | IN-LIBRARY)
      DO WHEN FILE "%eq" EXISTS
         SET file-name TO "%eq"
      ELSE
         REPEAT OVER #LIBPATH
            DO WHEN FILE "%g(#LIBPATH)%eq" EXISTS
               SET file-name TO "%g(#LIBPATH)%eq"
               EXIT
            DONE
         AGAIN
      DONE
      DO WHEN file-name IS ATTACHED
         OUTPUT FILE file-name
      ELSE
         PUT #ERROR "File '%g(file-name)' for "
         DO WHEN ENTITY IS (#DTD | #CHARSET | #CAPACITY | #SYNTAX)
            PUT #ERROR "%q"
            PUT #ERROR " (%g(#DOCTYPE))" WHEN ENTITY IS #DTD
         ELSE WHEN ENTITY IS GENERAL
            PUT #ERROR "entity &%q;"
         ELSE
            PUT #ERROR "entity %%%q;"
         DONE
         PUT #ERROR " with public id%n" _
                    " PUBLIC %"%pq%"%n" _
                    " " WHEN ENTITY IS PUBLIC
         PUT #ERROR " does not exist!%n"
         HALT
      DONE
   ELSE WHEN ENTITY IS PUBLIC & #LIBVALUE HAS KEY "%pq"
      OUTPUT #LIBVALUE ^ "%pq"
   ELSE WHEN ENTITY IS (#CHARSET | #CAPACITY)
      ; Zero-length entity replacement text.
   ELSE WHEN ENTITY IS PUBLIC
      PUT #ERROR "Public identifier for "
      DO WHEN ENTITY IS (#DTD | #CHARSET | #CAPACITY | #SYNTAX)
         PUT #ERROR "%q"
         PUT #ERROR " (%g(#DOCTYPE))" WHEN ENTITY IS #DTD
      ELSE WHEN ENTITY IS GENERAL
         PUT #ERROR "entity &%q;"
      ELSE
         PUT #ERROR "entity %%%q;"
      DONE
      PUT #ERROR "%n" _
                 " PUBLIC %"%pq%"%n" _
                 " is not in the LIBRARY rules!%n"
      HALT
   ELSE
      DO WHEN ENTITY IS #DTD
         PUT #ERROR "#DTD (%g(#DOCTYPE)) "
      ELSE WHEN ENTITY IS GENERAL
         PUT #ERROR "Entity &%q;"
      ELSE
         PUT #ERROR "Entity %%%q;"
      DONE
      PUT #ERROR " has neither a SYSTEM nor a PUBLIC identifier!%n"
      HALT
   DONE
Note that if no file is found for a #CHARSET or #CAPACITY entity, then the zero-length string is used as its replacement text. This has the effect of providing a default of "all data characters" or "all reference values", respectively. No such default is provided in the case of any other entity, including the #SYNTAX entity.
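The order of the decisions made by the default rule can be easier to follow as a small model. The Python sketch below reproduces only the control flow, under several simplifying assumptions: `system_id` stands for the effective system identifier (from the declaration or from the LIBRARY rules), `libpath` and `libvalue` stand in for the #LIBPATH and #LIBVALUE shelves, and errors are raised as exceptions instead of being written to #ERROR. The function name `resolve_entity` and its parameters are hypothetical.

```python
import os


def resolve_entity(kind, system_id=None, public_id=None,
                   libpath=(), libvalue=None, exists=os.path.exists):
    """Model of the default lookup order; returns ('file', path) or ('text', s).

    `kind` is '#DTD', '#CHARSET', '#CAPACITY', '#SYNTAX' or 'entity'.
    """
    libvalue = libvalue or {}
    if system_id is not None:
        # Try the system identifier as given, then with each prefix in order.
        for prefix in ("", *libpath):
            candidate = prefix + system_id
            if exists(candidate):
                return ("file", candidate)
        raise LookupError(f"file '{system_id}' for {kind} does not exist")
    if public_id is not None and public_id in libvalue:
        return ("text", libvalue[public_id])
    if kind in ("#CHARSET", "#CAPACITY"):
        return ("text", "")  # zero-length replacement text
    if public_id is not None:
        raise LookupError(f"public identifier for {kind} is not in the LIBRARY rules")
    raise LookupError(f"{kind} has neither a SYSTEM nor a PUBLIC identifier")


# Usage with a stand-in for the file system: only "lib/iso.ent" exists.
only_lib = lambda path: path == "lib/iso.ent"
print(resolve_entity("entity", system_id="iso.ent",
                     libpath=("other/", "lib/"), exists=only_lib))
# prints: ('file', 'lib/iso.ent')
```

Passing `exists` as a parameter is a convenience of the sketch: it allows the lookup order to be exercised without creating files.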
Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.