External entities fall into two categories: text entities and data entities. XML supports only text entities in content. SGML supports both text and data entities. Different processing is required for text and data entities.
An external text entity is an entity whose replacement text forms part of the document being parsed. You must resolve an external entity and feed its replacement text to the parser so that it is parsed as part of the current document.
To resolve an external text entity you can write an external-text-entity rule. This rule will be fired whenever the corresponding entity occurs in the document. When an external-text-entity rule is fired, a new output scope is created. The destination of this output scope is the parser. This means that you can send the replacement text of the entity to the parser simply by outputting it in the external-text-entity rule.
Here is the simplest case of an external-text-entity rule. This rule simply outputs a replacement string for the entity "company-name":

external-text-entity "company-name"
   output "Stilo International plc"
In most cases, however, you will want to use a SYSTEM or PUBLIC identifier to locate the replacement text. A SYSTEM identifier is a reference to a data source in system-specific terms. In XML, SYSTEM identifiers are URIs. A PUBLIC identifier is an abstract string that can be mapped to a SYSTEM identifier on a system-by-system basis.
You can retrieve the SYSTEM identifier of an entity with the format string "%eq" and the PUBLIC identifier with "%pq". You can retrieve the name of the entity itself with "%q". This example uses the SYSTEM identifier to retrieve an entity definition:
external-text-entity #implied
   local stream replacement-text-file-name

   set replacement-text-file-name to "%eq" drop (ul "file://")
   do when file replacement-text-file-name exists
      output file replacement-text-file-name
   else
      output ""
   done

   catch #program-error
      output ""
This rule is written to support only SYSTEM identifiers that are files on the local system. To support a full range of URIs, you must write code to determine the type of URI used in a SYSTEM identifier and retrieve the replacement text from that location.
Since the location of entity replacement files can vary from one system to another, you may want to use PUBLIC identifiers to reference system identifiers appropriate to the local system. In the following code, the PUBLIC identifier is used to index into a shelf of SYSTEM identifiers to retrieve the replacement text:
global stream system-identifiers variable
...
external-text-entity #implied
   local stream replacement-text-file-name

   set replacement-text-file-name to system-identifiers {"%pq"} drop (ul "file://")
   do when file replacement-text-file-name exists
      output file replacement-text-file-name
   else
      output ""
   done

   catch #program-error
      output ""
This code assumes that you have initialized the shelf system-identifiers with SYSTEM identifiers as items and PUBLIC identifiers as keys. Since you control the SYSTEM identifiers that are loaded onto the shelf, you can map public identifiers to any set of system identifiers that are appropriate for the system your program will run on.
Notice that the code above presumes that the system identifiers on the system-identifiers shelf are plain system file names, not URIs. If all the resources in your catalog are local files, there is no point in adding full URI syntax, just to strip it out again.
Instead of creating your own shelf to do the mapping, such as system-identifiers in the code above, you can use the built-in OmniMark shelf #library. There are two advantages to using #library rather than declaring your own shelf:
- You can load the #library shelf with a library declaration.
- If all your entities are local files, you can omit the external-text-entity rule and rely on OmniMark's default behavior.
The following sample shows the code above rewritten to use #library and a library declaration:

library "-//ES//DTD for foo//EN"  "foo.dtd"
        "-//ES//TEXT for bar//EN" "bar.txt"
        "-//ES//TEXT for baz//EN" "baz.txt"
        "-//ES//TEXT for bat//EN" "bat.txt"

external-text-entity #implied
   do when file #library {"%pq"} exists
      output file #library {"%pq"}
   else
      output ""
   done
Usually, mappings of PUBLIC identifiers to SYSTEM identifiers are maintained in files called catalogs. Keeping the mappings in a catalog, rather than in each program, makes entity management easier to maintain.
The simplest way to create a catalog is to place an OmniMark library declaration in a separate file and then include it in any program that needs it. If you want to avoid hardcoding catalog references into your program, you can reference a catalog file using the "-library" command-line option, which is described in the OmniMark Engine documentation.
There are other catalog formats that you can use. The SOCAT catalog format is a standard for use with SGML. You may need to use one of these catalog formats if you are using other XML or SGML tools that use a particular format. To use another catalog format, simply omit the library declaration and write your own function to load the #library shelf by scanning the catalog.
You can use plain file names rather than whole paths as system identifiers on #library. You can then specify the paths that OmniMark will search for these files by adding them to the #libpath shelf. You should end all such paths with a directory separator (usually a "/" or "\") as the built-in entity manager will form paths by appending the system identifier to each entry on the #libpath shelf, without interpreting the strings in any way. The following code sample uses #libpath to help resolve PUBLIC identifiers:
declare catch entity-replacement-found

external-text-entity #implied
   do when entity is in-library
      do when file #library{"%pq"} exists
         output file #library{"%pq"}
      else
         repeat over #libpath
            do when file (#libpath || #library{"%pq"}) exists
               output file (#libpath || #library{"%pq"})
               throw entity-replacement-found
            done
         again
         output "<? No definition for entity %q ?>"
      done

      catch entity-replacement-found
   done
You can add items to the #libpath shelf with an ordinary set new statement or with the -libpath command-line option, which is described in the OmniMark Engine documentation.
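As a minimal sketch of the set new approach, the following rule adds two search directories to #libpath before any parsing starts. The directory names are hypothetical placeholders, and the choice of a process-start rule is an assumption made for illustration; each path ends with a directory separator, as recommended above.

```omnimark
process-start
   ; Hypothetical directories: substitute the locations of your own
   ; entity files. Note the trailing separator on each path.
   set new #libpath to "entities/"
   set new #libpath to "shared/entities/"
```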
If you have an XML document that references a DTD with a SYSTEM or PUBLIC identifier, you can write an external-text-entity rule to retrieve the DTD. The only difference from the examples above is the form of the external-text-entity rule itself. The external-text-entity rule header must refer to the DTD entity using the keyword #dtd:

external-text-entity #dtd
Note that an external-text-entity #implied rule will fire for all normal entities but does not fire for an external DTD. You must write a separate external-text-entity #dtd rule to retrieve an external DTD.
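A complete #dtd rule might look like the following sketch. It reuses the SYSTEM identifier technique shown earlier and assumes the DTD is a local file whose SYSTEM identifier may carry a "file://" prefix; the variable name dtd-file-name is introduced here for illustration only.

```omnimark
external-text-entity #dtd
   local stream dtd-file-name

   ; Assumption: the DTD is a local file, so strip any "file://" prefix.
   set dtd-file-name to "%eq" drop (ul "file://")
   do when file dtd-file-name exists
      output file dtd-file-name
   else
      output ""
   done
```

A program that needs both the DTD and ordinary entities would pair this rule with an external-text-entity #implied rule.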
If all your entity replacement files are local (rather than being retrieved from the web), you can take advantage of OmniMark's default entity manager. If you don't put any external-text-entity rules in your program, OmniMark will provide a default external-text-entity rule and compile it into your program for you. The source code for this rule is provided below.
The built-in entity manager assumes all the system identifiers it sees are local file names. It makes use of #library and #libpath, so you can load the appropriate values into those shelves to get the behavior you need.
In SGML, SYSTEM identifiers are not required to be URIs. In fact there is no specific requirement about their contents, though it is common practice that they are file paths on the local system. This makes the basic code for retrieving an entity definition by its SYSTEM identifier even simpler:
external-text-entity #implied
   do when file "%eq" exists
      output file "%eq"
   else
      output ""
   done
Apart from this difference in SYSTEM identifiers, resolving entities in SGML works the same as resolving them in XML.
You can resolve external data entities by writing an external-data-entity rule. An external-data-entity rule behaves just like an external-text-entity rule except that no new output scope is created. Output in an external-data-entity rule goes to the existing output scope.
You can use exactly the same methods to resolve SYSTEM and PUBLIC identifiers as you do with an external-text-entity rule. However, the built-in entity manager is not active for external data entities.
How you process the data referenced by an external data entity is entirely up to you.
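For instance, if the data entities in a document refer to image files and the program is producing HTML, one option is to emit a reference to the file rather than its contents. This is only a sketch: the HTML output format is an assumption made for illustration, and it presumes that the "%q" and "%eq" format items described above are available in this rule as they are in an external-text-entity rule.

```omnimark
external-data-entity #implied
   ; Output goes to the current output scope, not to the parser.
   ; Assumption: the SYSTEM identifier is usable as an image location.
   output '<img src="%eq" alt="%q">'
```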
A special case of an external data entity is a SUBDOC entity. A SUBDOC entity references another SGML document that does not form part of the text of the current document, that may conform to a completely different DTD, but which shares the SGML declaration of the document that references it.
To process the document referenced by the SUBDOC entity, you use the do sgml-parse action. To cause this parse to inherit the SGML declaration of the outer parse, you specify subdocument instead of document in the do sgml-parse statement:

external-data-entity #implied when entity is subdoc-entity
   do sgml-parse subdocument scan file "%eq"
      output "%c"
   done

   catch #external-exception
      output ""
There are a number of standard entities that are referred to by public identifiers in most SGML declarations. OmniMark predefines these entities so that users do not have to supply definitions for them. The definitions are provided on the #libvalue shelf, which is used by the default entity manager. You can add or change items on the #libvalue shelf if you need to.
OmniMark's default entity manager is an external-text-entity rule that is compiled into any module that does not specify any external-text-entity rules itself. This default rule resolves external text entity references using the built-in shelves #library and #libpath that can be set from the command line, as well as #libvalue.