"The Official Guide to Programming with OmniMark"
International Edition
Previous chapter is Chapter 16, "Processing External Entities".
Next chapter is Chapter 18, "How Asynchronous Concurrent Context Translations Work".
In-line subdocument parsing in OmniMark is supported by a very general facility that allows any SGML document to be parsed at any point in an OmniMark program. Processing SUBDOC entities is a special case of this. Issues related specifically to subdocuments are discussed in Section 17.6, "An Example of Subdocument Processing".
SGML subdocuments are not the only other SGML documents that may need to be processed while processing an SGML document or a general text document:
In addition, SGML subdocuments may need to be processed prior to or following the processing of the main document, at the point of reference, or some combination of the two. So there is no particular distinction between when and how subdocuments are processed and when and how "main" SGML documents are processed.
The consequence of all this is that, rather than OmniMark processing subdocuments explicitly, OmniMark programs are provided with the means to implement the SUBDOC facility. Among other things this means that the OmniMark programmer is responsible for detecting when the allowed number of nested subdocuments is exceeded (see Section 17.6, "An Example of Subdocument Processing").
Besides SGML documents and subdocuments, the following also need to be parsed:
This requirement comes from database applications that store partial SGML documents and need to process them separately.
Once an SGML DTD has been parsed, it may be desirable to parse one or more instances that conform to that DTD separately from "compiling" the DTD.
SGML documents have the property that, context issues aside, any complete subelement of a document instance can be parsed as if it were a full document instance. The only difference from parsing the "full" instance is the selection of an appropriate document element for the instance. OmniMark supports full instance and instance part parsing by allowing for an optional program specification of the DOCUMENT-ELEMENT when doing instance parsing.
In this document, the phrase "instance part" is used to describe both full instances and complete subelements of instances -- a part or a whole.
The primary utility of instance part parsing is expected to be in the context of a database system, in which "instance parts" of SGML documents are kept, for the purpose of independent editing and multiple use. A key aspect of instance part use, and an increasingly important requirement in the documentation industry, is "reusability". A document database really takes on value when its components (instance parts) can be used in multiple contexts. Even where a component may actually be used only once, it is often the case that the context in which it is used, even what book or manual it will finally appear in, is not decided until after most of the components have already been created.
Both reuse and "use after authoring" require that the form of an instance part not depend on the context of its use. Certainly there should be no assumptions about what elements appear around it, or how deep it is in the document structure hierarchy.
These considerations lead one to the point of view that an instance part should not depend on any context other than the DTD itself.
On the other hand, an instance part should be allowed to take advantage of SGML minimization features. Using minimization provides flexibility in how instance parts are created and edited. Those minimization features that have been found most successful, OMITTAG and (most of) SHORTTAG, tend to be independent of context, at least in their successful forms. The minimization features that have been found least successful, RANK and #CURRENT attributes, tend to be most dependent on context. So a context-free approach to instance parts is consistent with using the more successful types of minimization.
A "DO SGML-PARSE" action launches the parsing of an SGML document. It can be used anywhere in the output processor.
"DO SGML-PARSE" provides a set of actions that are performed in the output processor and a function whose "function body" is performed in the input processor. The function provides the text of the new SGML document to the SGML parser.
A simple example of a "DO SGML-PARSE" action is the following:
DO SGML-PARSE SUBDOCUMENT SCAN FILE "subdoc.doc"
   OUTPUT "%c"
DONE
This "DO SGML-PARSE" action launches the parsing of a subdocument. FILE "subdoc.doc" identifies the file containing the text of the subdocument. The "%c" provides the output destination for the subdocument as a whole.
The full syntax of the "DO SGML-PARSE" action is:
DO SGML-PARSE sgml-parse-type
      SCAN (input-source | (INPUT input-function-call))
   local-declaration*
   action+
DONE
The sgml-parse-type is described in Section 17.2.3, "Types of SGML Document Parsing". The input-source and input-function-call are described in Section 17.2.1, "Specifying the Input of the SGML Parser".
The actions which form the body of the "DO SGML-PARSE" are performed in the output processor. The actions, or a function that they call, must perform exactly one "%c" format item or SUPPRESS action. This "%c" or SUPPRESS provides the default #CURRENT-OUTPUT stream set for the output processor rules that process the output of the SGML parser for the document being parsed. The actions in the "DO SGML-PARSE" action prior to the "%c" or SUPPRESS are performed prior to any document processing, and the actions after the "%c" or SUPPRESS are performed after the document processing. (These groups of actions can be viewed analogously to the actions in DOCUMENT-START and DOCUMENT-END rules, respectively.)
Like an EXTERNAL-TEXT-ENTITY rule, a "DO SGML-PARSE" action can be used in a down-translation, a context-translation or a process program, but not an up-translation or a cross-translation.
No FIND-START rule or DOCUMENT-START rule is performed prior to the "DO SGML-PARSE", and no FIND-END rule or DOCUMENT-END rule is performed following the parsing of the new document entity. These rules are performed only once, prior to and following parsing the "main" document. The input function, and the body of the "DO SGML-PARSE" action, together with the ability to call functions there, provide locations at which equivalent pre- and post-processing can be done.
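The placement of actions around the "%c" can be sketched as follows. This is an illustrative sketch only: the bracketed marker text is hypothetical, and the file name is taken from the earlier example.

```omnimark
EXTERNAL-DATA-ENTITY #IMPLIED WHEN ENTITY IS SUBDOC
   DO SGML-PARSE SUBDOCUMENT SCAN FILE "subdoc.doc"
      OUTPUT "[begin subdocument]%n"   ; performed before any document processing
      OUTPUT "%c"                      ; the subdocument is processed here
      OUTPUT "%n[end subdocument]%n"   ; performed after document processing
   DONE
```

The first and last OUTPUT actions stand in for what a DOCUMENT-START or DOCUMENT-END rule would do for the "main" document.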
The source of input to the SGML parser can be one of two things:
An example of this is:
EXTERNAL-DATA-ENTITY #IMPLIED WHEN ENTITY IS SUBDOC
   DO SGML-PARSE SUBDOCUMENT SCAN FILE "subdoc.doc"
      OUTPUT "%c"
   DONE
This example effectively performs a down-translation on the file containing the subdocument, incorporating the results in the parent document.
A simple example of this is:
DEFINE FUNCTION submit-file VALUE STREAM file-name AS
   SUBMIT FILE file-name

EXTERNAL-DATA-ENTITY #IMPLIED WHEN ENTITY IS SUBDOC
   DO SGML-PARSE SUBDOCUMENT SCAN INPUT submit-file "subdoc.doc"
      OUTPUT "%c"
   DONE
This example effectively performs a context-translation on the named input file, incorporating the results into the current document.
If there are errors in the SGML Declaration or prolog (DTD), then the "DO SGML-PARSE" will terminate, but the amount of input read is undefined in this situation. That is, OmniMark may choose to consume the entire input source, it may stop reading the input immediately, or it may do something in between.
The input function is called from the output processor, and its arguments are evaluated in the calling domain. The arguments are evaluated prior to actually entering the "DO SGML-PARSE" or establishing a new SGML parsing activity, so any SGML enquiries refer to the surrounding context, not the new SGML parse.
However, the body of the input function is performed in the input processor:
These provisions apply not only to the function, but to any function or (FIND) rule it initiates.
When the input function is called, it is provided with a #SGML stream, created to provide input to the newly launched SGML parse activity. The new #SGML stream is the initial current output stream of the input function.
Writing to the SGML parser can either be done directly, with OUTPUT or PUT, or by using SUBMIT, in which case any submitted text will be processed by FIND rules and the FIND rules' output submitted to the SGML parser.
The text provided to the new SGML stream within the input function constitutes the text of the SGML document entity or SGML subdocument entity being parsed.
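The two ways of writing to the SGML parser might be contrasted as in the sketch below. The function name copy-file is hypothetical, and the sketch assumes that the FILE operator may be used as the operand of OUTPUT inside an input function; submit-file repeats the function from the earlier example.

```omnimark
; Writes the file's text straight to the #SGML stream.
; No FIND rules see the text.
DEFINE FUNCTION copy-file VALUE STREAM file-name AS
   OUTPUT FILE file-name

; Passes the file's text through the FIND rules first.
; Whatever the FIND rules output is what the SGML parser receives.
DEFINE FUNCTION submit-file VALUE STREAM file-name AS
   SUBMIT FILE file-name

EXTERNAL-DATA-ENTITY #IMPLIED WHEN ENTITY IS SUBDOC
   DO SGML-PARSE SUBDOCUMENT SCAN INPUT copy-file "subdoc.doc"
      OUTPUT "%c"
   DONE
```

Replacing copy-file with submit-file in the call gives the FIND rules a chance to filter the text before it is parsed.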
An input function called by a "DO SGML-PARSE" action has a lot in common with an EXTERNAL-TEXT-ENTITY rule, especially in its treatment of groups, SUBMIT and #CURRENT-OUTPUT. This is in keeping with the fact that an SGML subdocument is an external entity that is processed by the SGML parser, just like an external text entity.
An input function differs from an EXTERNAL-TEXT-ENTITY in that an input function is explicitly invoked by the programmer. (It is declared as a regular function: it is its call in a "DO SGML-PARSE" action that makes it an input function.)
When an input function terminates (by RETURN or "dropping off the bottom") a signal is sent to the SGML parser that the end of the text being parsed has been encountered. OmniMark will resume processing of the most recently suspended document, if any. Execution is resumed following the "%c" or SUPPRESS in the "DO SGML-PARSE" action.
The input function of a "DO SGML-PARSE" action must be an internal function.
The body of the input function does not inherit any scanning source.
"SOURCE #CURRENT-INPUT" should not be passed to an input function if "SOURCE #CURRENT-INPUT" could be data content in an SGML document being parsed.
A "DO SGML-PARSE" action saves the current setting of SGML-IN and SGML-OUT, and restores them at the end of the action. Any SGML-IN action or SGML-OUT action performed while parsing a nested document, subdocument or instance part only affects that nested document, subdocument or instance part.
If an error occurs in the SGML Declaration or prolog (DTD), it is not specified whether OmniMark will complete the input function or whether the input function will be terminated. Thus, the programmer cannot rely on side effects of the input function when errors occur in the SGML Declaration or prolog (DTD).
The #CURRENT-INPUT established for an input function and the #CURRENT-OUTPUT established for the body of a "DO SGML-PARSE" derive from where the "DO SGML-PARSE" is invoked:
The idea here is that the invoking domain is bifurcated into a new input processor and output processor, with the current input going into the new input processor and the current output going out of the new output processor.
The sgml-parse-type portion of the "DO SGML-PARSE" action indicates the type of document, subdocument, instance or "instance part" to be parsed as follows:
Syntax
DOCUMENT (CREATING DTDS ^ string-expression)?
When DOCUMENT is specified, then the action prepares OmniMark's built-in SGML parser to parse a full SGML document.
Creating a DTDS item is described in Section 17.3, "The DTDS Shelf".
The keyword KEY can be used as a synonym for the operator "^".
Syntax
SUBDOCUMENT (CREATING DTDS ^ string-expression)?
If the "DO SGML-PARSE" action designates SUBDOCUMENT, then the action prepares OmniMark's built-in SGML parser to parse an SGML subdocument. The concrete syntax defined by the document whose parsing is suspended by the "DO SGML-PARSE" action is used to parse the subdocument: in accordance with the SGML standard, the subdocument's text must not contain an SGML Declaration.
Creating a DTDS item is described in Section 17.3, "The DTDS Shelf".
The keyword KEY can be used as a synonym for the operator "^".
Syntax
INSTANCE (WITH DOCUMENT-ELEMENT string-expression)? WITH DTDS ^ string-expression
If the "DO SGML-PARSE" action designates INSTANCE, then the action prepares OmniMark's built-in SGML parser to parse all or part of an SGML document instance. The SGML DTD and concrete syntax identified by the dtd-item is used to parse the instance part.
The element's start- and end-tags can be present, or they can be omitted if the element allows it. SGML comments, processing instructions and even marked sections can precede and follow the element's start- and end-tags, but anything else (particularly other elements, data, entity references or USEMAP declarations) is in error.
When "WITH DOCUMENT-ELEMENT" is specified, the instance part is parsed as if the element specified in the string-expression is the document element.
Selecting a DTDS item is described in Section 17.3, "The DTDS Shelf".
The keyword KEY can be used as a synonym for the operator "^".
When parsing an SGML document from within a "DO SGML-PARSE" action, all the SGML document region rules are performed in the usual manner: DTD-START, DTD-END, PROLOG-END and EPILOG-START. The SGML-DECLARATION-END rule is performed in the usual manner if the "DO SGML-PARSE" action specifies DOCUMENT. However, if it specifies SUBDOCUMENT, then no SGML-DECLARATION-END rule is performed -- a subdocument starts, in effect, after the end of the SGML Declaration.
When parsing an instance part (with INSTANCE), none of these rules are performed.
The choice between whether the DOCUMENT or SUBDOCUMENT form of the "DO SGML-PARSE" action is used depends on the context in which it is used and the type of SGML entity to be parsed:
When the "DO SGML-PARSE" action specifies "CREATING DTDS" and a key, it terminates parsing at the end of the SGML document prolog, creates a "compiled" DTD, and saves it as the specified item of the DTDS shelf. The saved DTDS item can then be used later to parse an instance or instance part.
If the DTDS shelf does not already have such an item, a new item, with the given key, is added to the end of the DTDS shelf, and the newly compiled DTD is stored in that item. If the DTDS shelf already has an item with that key, then the newly created DTD replaces the previous one.
The new DTDS item is not created until after the entire "DO SGML-PARSE" has completed. If there is an error in the SGML Declaration or prolog (DTD) then the DTDS shelf item will not be created or updated.
All the usual output processor rules are processed during parsing of the (optional SGML Declaration and) DTD. This allows information in processing instructions associated with the DTD, for example, to be captured by the program.
The document instance is not parsed.
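Compiling a DTD once and then reusing it for an instance part might look like the following sketch. The file name "manual.sgm" is hypothetical and is assumed to begin with the prolog to be compiled; the key "DOC", the element name "SECTION" and the file "sect37.txt" are illustrative values.

```omnimark
PROCESS
   ; Compile the DTD. Parsing stops at the end of the prolog,
   ; so nothing needs to be output here.
   DO SGML-PARSE DOCUMENT CREATING DTDS ^ "DOC" SCAN FILE "manual.sgm"
      SUPPRESS
   DONE

   ; A prolog error leaves the shelf without a new item,
   ; so test for the key before using it.
   DO WHEN DTDS HAS KEY "DOC"
      DO SGML-PARSE INSTANCE WITH DOCUMENT-ELEMENT "SECTION" WITH DTDS ^ "DOC"
            SCAN FILE "sect37.txt"
         OUTPUT "%c"
      DONE
   ELSE
      OUTPUT "The DTD could not be compiled.%n"
   DONE
```

The test is placed after the DONE of the first action because the DTDS item is not created until the whole "DO SGML-PARSE" has completed.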
When parsing an INSTANCE, the string-expression following "WITH DOCUMENT-ELEMENT" identifies an element within the DTD used to create the DTDS item. This element is the element used as the base for parsing an instance part. The instance cannot contain a document prolog.
The string-expression must match an element name exactly: no extra spaces and it must be in upper-case if the element name is in upper-case (e.g. because NAMECASE GENERAL is set to YES).
INSTANCE, like DOCUMENT, saves and resets the counter value returned by "NUMBER OF CURRENT SUBDOCUMENTS" (and restores the saved value when the action has finished).
A "DO SGML-PARSE" action can be used anywhere in the output processor, including in DOCUMENT-START, DOCUMENT-END, PROCESS, PROCESS-START and PROCESS-END rules, except that a "DO SGML-PARSE" action cannot be called from within an SGML-ERROR rule or from any function called within an SGML-ERROR rule.
When a "DO SGML-PARSE" action is used, a duplicate is made of the current output stream set in the calling domain, complete with the streams' options and modifiers. That duplicate becomes the #CURRENT-OUTPUT stream set inherited by rules that process the parsed document, including the SGML-DECLARATION-END rule (if any), DTD-START rule, DTD-END rule, PROLOG-END rule and the ELEMENT rule for the document element.
If any of the SGML document region rules modify the #CURRENT-OUTPUT stream set (using OUTPUT-TO), then the modified set applies to the following rules for the parsed document, including the document element's ELEMENT rule. The modifications do not, however, apply to any output processor rule in the context in which the "DO SGML-PARSE" action was used -- it is the duplicate of the context's output stream set that is modified, not the context's.
There are three restrictions on using the "DO SGML-PARSE" action:
To "DO SGML-PARSE" with SUBDOCUMENT, the SGML Declaration of the "parent" document has to have been encountered and parsed, or it has to have been determined that the "parent" document does not have an SGML Declaration.
If a "DO SGML-PARSE" with SUBDOCUMENT is used within a PROLOG-IN-ERROR rule, the action will be allowed if the error in the prolog occurred following successful parsing of the SGML Declaration. Otherwise the PROLOG-IN-ERROR rule is in error.
Compiled DTDs are kept as items on the DTDS shelf:
When "DO SGML-PARSE" is used to parse an instance or instance part using a previously compiled DTD, a "copy" of that compiled DTD is used. Even if the DTD is removed from the DTDS shelf during the "DO SGML-PARSE", there is no problem: OmniMark retains the compiled DTD until all "DO SGML-PARSE" actions using it have completed.
DO SGML-PARSE INSTANCE WITH DOCUMENT-ELEMENT "SECTION" WITH DTDS ^ "DOC"
      SCAN FILE "sect37.txt"
   SET FILE "sect37.out" TO "%c"
DONE
Most of the common accesses, shelf actions and tests can be performed on the DTDS shelf:
CLEAR DTDS
which removes all the compiled DTDs permanently, subject to the provisos for REMOVE.
ELEMENT #IMPLIED WHEN ATTRIBUTE source IS SPECIFIED
      ; attribute "source" is a #CONREF attribute
   DO SGML-PARSE INSTANCE
         WITH DOCUMENT-ELEMENT "%q"   ; reparse current element
         WITH CURRENT DTD
         SCAN FILE "%ev(source)"
      OUTPUT "%c"
   DONE
"CURRENT DTD" can only be used when there clearly is a currently active DTD:
The "CURRENT DTD" need not be an item of the DTDS shelf -- it can be established by a "DO SGML-PARSE" DOCUMENT or SUBDOCUMENT.
"CURRENT DTD" cannot be used immediately following CREATING -- it doesn't make sense.
WHEN DTDS HAS KEY ". . ."
This form of the test succeeds only if the DTDS shelf currently has an item with a key that matches the string value in the test.
The NEW action cannot be applied to the DTDS shelf: the CREATING option of "DO SGML-PARSE" adds items to the DTDS shelf. In addition, none of the other shelf modification operations ("SET KEY OF", "REMOVE KEY OF", etc.) are allowed on the DTDS shelf, except that the programmer can remove compiled DTDs, either one at a time (REMOVE) or all at once (CLEAR).
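The permitted removals might be written as in the sketch below. It assumes that REMOVE takes the shelf name followed by a key indexer, as it does for other keyed shelves; the key "DOC" is illustrative.

```omnimark
PROCESS-END
   ; Remove one compiled DTD, but only if it is actually on the shelf.
   REMOVE DTDS ^ "DOC" WHEN DTDS HAS KEY "DOC"

   ; Alternatively, remove every compiled DTD at once.
   CLEAR DTDS
```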
DTDS can be thought of as a "shelf", except that its items' values cannot be created or used except by "DO SGML-PARSE". It is not possible to "output" a compiled DTD, for example.
"NUMBER OF" allows the programmer to quickly determine how deeply the subdocuments being parsed are nested. "NUMBER OF" "CURRENT SUBDOCUMENTS" returns the number of subdocuments of the current document or instance part (that is not a subdocument) being parsed.
The value returned by "NUMBER OF" "CURRENT SUBDOCUMENTS" is determined as follows:
Certain built-in stream and counter shelves exist only to support SGML parsing activity. These shelves must not be accessed outside SGML parsing -- in other words, they must not be accessed in a rule that is not being run in the output processor or the input processor (note that a FIND rule can be run in either the output processor or the input processor). The following stream and counter built-in shelves must not be accessed in PROCESS, PROCESS-START or PROCESS-END rules:
Whether or not #APPINFO and #DOCTYPE actually provide information depends on the state of SGML parsing. They only provide information if the SGML document being parsed, or the DTD being used to parse it, provides the information. In particular, this means that they are never attached:
The mechanisms in the previous subsections provide the programmer with what is required to implement fully conforming SGML subdocument processing, complete with error checking. The following is an example of how to cause references to SGML subdocument entities to trigger parsing of the subdocument entities. The source of the subdocument entity text in the example is assumed to be a file whose name is either the system identifier, provided by a LIBRARY rule, the "public text description" part of the public identifier, or the name of the entity (upper-cased and with ".ENT" file extension appended).
EXTERNAL-DATA-ENTITY #IMPLIED WHEN ENTITY IS SUBDOC
   LOCAL STREAM file-name

   OUTPUT "SUBDOC depth exceeded!%n"
      WHEN NUMBER OF CURRENT SUBDOCUMENTS > 100
   DO WHEN ENTITY IS SYSTEM
      SET file-name TO "%eq"
   ELSE WHEN ENTITY IS IN-LIBRARY
      SET file-name TO "%epq"
   ELSE WHEN ENTITY IS PUBLIC
      DO SCAN "%pq"
         MATCH (["+-"] "//")? ((LOOKAHEAD ! "//") ANY)* "//"
               [ANY EXCEPT " "]* " " "-//"?
               ((LOOKAHEAD ! "//") ANY)* => public-text-description
            SET file-name TO public-text-description
      DONE
   ELSE
      SET file-name TO "%uq.ENT"
   DONE
   DO SGML-PARSE SUBDOCUMENT SCAN FILE file-name
      OUTPUT "%c"
   DONE
Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.