Entities, external

External entities fall into two categories: text entities and data entities. XML supports only text entities in content. SGML supports both text and data entities. Different processing is required for text and data entities.

Resolving external text entities

An external text entity is an entity whose replacement text forms part of the document being parsed. You must resolve an external text entity and feed its replacement text to the parser so that it is parsed as part of the current document.

To resolve an external text entity you can write an external-text-entity rule. This rule will be fired whenever the corresponding entity occurs in the document. When an external-text-entity rule is fired, a new output scope is created. The destination of this output scope is the parser. This means that you can send the replacement text of the entity to the parser simply by outputting it in the external-text-entity rule.

Here is the simplest case of an external-text-entity rule. This rule simply outputs a replacement string for the entity "company-name":

  external-text-entity "company-name"
     output "Stilo International plc"

In most cases, however, you will want to use a SYSTEM or PUBLIC identifier to locate the replacement text. A SYSTEM identifier is a reference to a data source in system-specific terms. In XML, SYSTEM identifiers are URIs. A PUBLIC identifier is an abstract string that can be mapped to a SYSTEM identifier on a system-by-system basis.

You can retrieve the SYSTEM identifier of an entity with the format string "%eq" and the PUBLIC identifier with "%pq". You can retrieve the name of the entity itself with "%q". This example uses the SYSTEM identifier to retrieve an entity definition:

  external-text-entity #implied
     local stream replacement-text-file-name
     set replacement-text-file-name
      to "%eq" drop (ul "file://")
     do when file replacement-text-file-name exists
        output file replacement-text-file-name
     else
        output ""
     done
     catch #program-error
        output ""

This rule is written to support only SYSTEM identifiers that are files on the local system. To support a full range of URIs, you must write code to determine the type of URI used in a SYSTEM identifier and retrieve the replacement text from that location.
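
As a rough sketch of that kind of dispatch, the rule below scans the SYSTEM identifier and handles only the "file://" scheme. The branch for other schemes is a placeholder, and treating an identifier with no scheme as a local file name is an assumption made for illustration:

  external-text-entity #implied
     do scan "%eq"
     match ul "file://" any* => file-name
        ; local file named by a file:// URI
        do when file file-name exists
           output file file-name
        done
     match letter+ "://"
        ; some other URI scheme: retrieval code for it would go here
        output ""
     match any* => file-name
        ; no scheme, so assume a plain local file name
        do when file file-name exists
           output file file-name
        done
     done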

Since the location of entity replacement files can vary from one system to another, you may want to use PUBLIC identifiers to reference system identifiers appropriate to the local system. In the following code, the PUBLIC identifier is used to index into a shelf of SYSTEM identifiers to retrieve the replacement text:

  global stream system-identifiers variable
  ...
  external-text-entity #implied
     local stream replacement-text-file-name
     set replacement-text-file-name
      to system-identifiers {"%pq"}
      drop (ul "file://")
     do when file replacement-text-file-name exists
        output file replacement-text-file-name
     else
        output ""
     done
     catch #program-error
        output ""

This code assumes that you have initialized the shelf system-identifiers with SYSTEM identifiers as items and PUBLIC identifiers as keys. Since you control the SYSTEM identifiers that are loaded onto the shelf, you can map public identifiers to any set of system identifiers that are appropriate for the system your program will run on.

Notice that the code above presumes that the system identifiers on the system-identifiers shelf are plain system file names, not URIs. If all the resources in your catalog are local files, there is no point in adding full URI syntax, just to strip it out again.
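
A minimal sketch of that initialization, assuming the mappings are loaded in a process-start rule. The PUBLIC identifiers and file names are illustrative only, and the shelf is declared here with initial-size 0 on the assumption that it should start empty. Each item's key is a PUBLIC identifier and its value is the matching local file name:

  global stream system-identifiers variable initial-size 0

  process-start
     ; key = PUBLIC identifier, value = local file name
     set new system-identifiers {"-//ES//TEXT for bar//EN"} to "bar.txt"
     set new system-identifiers {"-//ES//TEXT for baz//EN"} to "baz.txt"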

Using the #library shelf

Instead of creating your own shelf to do the mapping, such as system-identifiers in the code above, you can use the built-in OmniMark shelf #library. There are two advantages to using #library rather than declaring your own shelf:

You can load the values onto the shelf automatically using the library declaration.
OmniMark's default entity manager will use the #library shelf. In many cases you will be able to avoid writing an external-text-entity rule and rely on OmniMark's default behavior.

The following sample shows the code above rewritten to use #library and a library declaration:

  library "-//ES//DTD for foo//EN" "foo.dtd"
          "-//ES//TEXT for bar//EN" "bar.txt"
          "-//ES//TEXT for baz//EN" "baz.txt"
          "-//ES//TEXT for bat//EN" "bat.txt"
  
  external-text-entity #implied
     do when file #library {"%pq"} exists
        output file #library {"%pq"}
     else
        output ""
     done

Using OmniMark catalogs

Usually, mappings of PUBLIC identifiers to SYSTEM identifiers are maintained in files called catalogs. Keeping the mappings in a catalog simplifies entity management.

The simplest way to create a catalog is to place an OmniMark library declaration in a separate file and then include it in any program that needs it. If you want to avoid hardcoding catalog references into your program, you can reference a catalog file using the "-library" command-line option, which is described in the OmniMark Engine documentation.
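
As an illustration, the catalog could live in a file of its own and be pulled into each program with an include declaration. The file name "catalog.xin" is hypothetical, and the entries are taken from the library declaration shown earlier in this section:

  ; catalog.xin: shared catalog of PUBLIC-to-SYSTEM mappings
  library "-//ES//DTD for foo//EN" "foo.dtd"
          "-//ES//TEXT for bar//EN" "bar.txt"

  ; in each program that needs the catalog
  include "catalog.xin"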

There are other catalog formats that you can use. The SOCAT catalog format is a standard for use with SGML. You may need to use one of these catalog formats if you are using other XML or SGML tools that use a particular format. To use another catalog format, simply omit the library declaration and write your own function to load the #library shelf by scanning the catalog.

Using #libpath

You can use plain file names rather than whole paths as system identifiers on #library. You can then specify the paths that OmniMark will search for these files by adding them to the #libpath shelf. You should end all such paths with a directory separator (usually a "/" or "\") as the built-in entity manager will form paths by appending the system identifier to each entry on the #libpath shelf, without interpreting the strings in any way. The following code sample uses #libpath to help resolve PUBLIC identifiers:

  declare catch entity-replacement-found
  external-text-entity #implied
     do when entity is in-library
        do when file #library{"%pq"} exists
           output file #library{"%pq"}
        else
           repeat over #libpath
              do when file (#libpath || #library{"%pq"}) exists
                 output file (#libpath || #library{"%pq"})
                 throw entity-replacement-found
              done
           again
           output "<? No definition for entity %q ?>"
        done
     catch entity-replacement-found
     done

You can add items to the #libpath shelf with an ordinary set new statement or with the -libpath command-line option, which is described in the OmniMark Engine documentation.
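
For example, a program might add its search directories in a process-start rule. The directory names below are hypothetical; note the trailing directory separator on each, as described above:

  process-start
     ; each entry ends with a separator because the system
     ; identifier is appended to it without interpretation
     set new #libpath to "entities/"
     set new #libpath to "/usr/local/share/entities/"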

Resolving a reference to an external DTD

If you have an XML document that references a DTD with a SYSTEM or PUBLIC identifier, you can write an external-text-entity rule to retrieve the DTD. The only difference from the examples above is the form of the external-text-entity rule itself. The external-text-entity rule header must refer to the DTD entity using the keyword #dtd:

  external-text-entity #dtd

Note that an external-text-entity #implied rule fires for all normal entities but does not fire for an external DTD. You must write a separate external-text-entity #dtd rule to retrieve an external DTD.
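
For instance, the #library rule shown earlier can be adapted for the DTD by changing only the rule header. This sketch assumes the DTD is referenced by a PUBLIC identifier that has an entry on #library (such as the "-//ES//DTD for foo//EN" entry in the library declaration above):

  external-text-entity #dtd
     ; look up the DTD's PUBLIC identifier on #library
     do when file #library {"%pq"} exists
        output file #library {"%pq"}
     else
        output ""
     done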

Using the default entity manager

If all your entity replacement files are local (rather than being retrieved from the web), you can take advantage of OmniMark's default entity manager. If you don't put any external-text-entity rules in your program, OmniMark will provide a default external-text-entity rule and compile it into your program for you. This rule is described under "The default entity manager" below.

The built-in entity manager assumes all the system identifiers it sees are local file names. It makes use of #library and #libpath, so you can load the appropriate values into those shelves to get the behavior you need.

Resolving external text entities in SGML

In SGML, SYSTEM identifiers are not required to be URIs. In fact there is no specific requirement about their contents, though it is common practice that they are file paths on the local system. This makes the basic code for retrieving an entity definition by its SYSTEM identifier even simpler:

  external-text-entity #implied
     do when file "%eq" exists
        output file "%eq"
     else
        output ""
     done

Apart from this difference in SYSTEM identifiers, resolving entities in SGML works the same as resolving them in XML.

Resolving external data entities in SGML

You can resolve external data entities by writing an external-data-entity rule. An external-data-entity rule behaves just like an external-text-entity rule except that no new output scope is created. Output in an external-data-entity rule goes to the existing output scope.

You can use exactly the same methods to resolve SYSTEM and PUBLIC identifiers as you do with an external-text-entity rule. However, the built-in entity manager is not active for external data entities.

How you process the data referenced by an external data entity is entirely up to you.

Resolving SUBDOC entities in SGML

A special case of an external data entity is a SUBDOC entity. A SUBDOC entity references another SGML document that does not form part of the text of the current document, that may conform to a completely different DTD, but which shares the SGML declaration of the document that references it.

To process the document referenced by the SUBDOC entity, you use the do sgml-parse action. To cause this parse to inherit the SGML declaration of the outer parse, you specify subdocument instead of document in the do sgml-parse statement:

  external-data-entity #implied when entity is subdoc-entity
     do sgml-parse subdocument
      scan file "%eq"
        output "%c"
     done
     catch #external-exception
        output ""

Anonymous entities

There are a number of standard entities that are referred to by public identifiers in most SGML declarations. OmniMark predefines these entities so that users do not have to provide definitions for them. The definitions for these entities are provided on the #libvalue shelf, which is used by the default entity manager. You can add or change items on the #libvalue shelf if you need to.

The default entity manager

OmniMark's default entity manager is an external-text-entity rule that is compiled into any module that does not specify any external-text-entity rules itself. This default rule resolves external text entity references using the built-in shelves #library and #libpath that can be set from the command line, as well as #libvalue.
