"The Official Guide to Programming with OmniMark"
International Edition
Previous chapter is Chapter 1, "Introduction".
Next chapter is Chapter 3, "Generalized Document Processing".
OmniMark provides a variety of ways to configure programs, and a lot of help in doing so. There are two basic types of programs: batch processing programs and server-based processing programs.
OmniMark makes it easy to write both types of programs. Although all types of OmniMark programs can make good use of most facilities in the language, there are some features which are primarily designed to support one type of programming or the other.
Batch processing programs can take advantage of OmniMark's built-in translation types, which aid the programmer by pre-configuring OmniMark for different types of rule-based conversions.
Server-based processing programs directly issue OmniMark's easy-to-use "DO SGML-PARSE" actions for SGML processing, and SUBMIT actions for general text processing.
Essentially, OmniMark programs run with one or two program-controlled subsystems or threads, called domains. Programs can have one domain, or they can have two domains, joined together by OmniMark's built-in SGML parser.
General text processing is usually done with SUBMIT actions and FIND rules. SUBMIT actions feed input data to OmniMark, and FIND rules use pattern matching to analyse the input data. These features are enhanced by OmniMark's SGML support -- general text processing can be used to help create SGML documents, to further process the results of SGML parsing, or do both. (FIND rules are described in Section 3.1, "General Document Processing Rules". SUBMIT is described in Section 3.2.1, "Submitting Input to FIND Rules".)
Additionally, OmniMark's unique "source" functions and "output" functions mean that input data and output data can be transmitted from and to any location: the local file system, databases, across networks, or across the Internet. Source functions are described in Section 12.3.3, "Externally-Defined Sources". Output functions are described in Section 12.3.4, "External Output Functions".
Multi-domain programs can initiate SGML parsing themselves (as is typical in server-based processing) or use one of OmniMark's built-in translation types. They can even combine a built-in translation type with program-controlled initiation of further SGML parsing.
The OmniMark run-time environment is a system composed of one or more of the following three cooperating subsystems: the input processor, the SGML parser, and the output processor.
The input processor provides the input to the SGML parser. In the process it may need to convert non-SGML data into SGML. The output processor converts the result of SGML parsing into some other form (which may even be SGML conforming to a different DTD).
The input processor has traditionally been called the find domain, because FIND rules figure largely in the kind of processing done there. Similarly, the output processor has traditionally been called the element domain, because that's where ELEMENT rules and other SGML processing is done. The terms input processor and output processor better characterize the role of the domains.
For translation programs, the translation type determines which subsystems are involved. For process programs, subsystems are started and suspended dynamically and explicitly by the actions in the program.
Writing batch processing programs is made easier using a translation type, which automatically sets up the interaction of the OmniMark subsystems and sets up the input and output processing. By choosing different translation types, the programmer can control whether the input, the output, or an intermediate stage is parsed by the SGML parser, and what is produced as the "main output" of the program.
Batch processing programs are also referred to as translation programs because they are classified according to their translation type.
The translation type is specified by a single keyword at the start of the program. There are four translation types:
CROSS-TRANSLATE: This is the only one of the four translation types that does not make use of the SGML parser. Cross translations consequently use only one domain.
DOWN-TRANSLATE: A DOWN-TRANSLATE is typically used to convert SGML documents into non-SGML forms, or into other SGML documents.
UP-TRANSLATE: An UP-TRANSLATE differs from the other translation types in that the input to the SGML parser is also the main output of the program. The SGML parser is used for validation, and, often more importantly, to provide contextual information that is used to drive the conversion into SGML.
CONTEXT-TRANSLATE: A CONTEXT-TRANSLATE converts its input into SGML, as an UP-TRANSLATE does, and then processes the resulting SGML document further. This further processing can be used to convert the document into some other form, to "clean up" the markup and data in the converted document, or to convert it into an SGML document conforming to another DTD (or Document Type Definition).
Where a programmer does not want an OmniMark program to be configured as one of the above types of translations -- for example in a server-based application -- the programmer can specify directly what is to be subjected to FIND rule processing and what are to be the inputs and outputs of SGML parsing.
To avoid automatic configuration of the program, the programmer merely omits the translation type at the start.
The following subsections describe each of OmniMark's four "built-in" translation types in more detail.
A cross-translation is a translation that converts a document from one arbitrary form to another. A cross-translation program does not make any use of the SGML parser. A cross-translation must begin with:
CROSS-TRANSLATE
Figure 2 -- Cross Translation Block Diagram shows a simple block diagram of a cross-translation.
In a cross-translation, the OmniMark programmer must define their own conversion events. To this end, OmniMark provides a rich language for specifying patterns to match text in the input. When text in the input matches a pattern, the associated rule is executed. OmniMark also provides a very expressive mechanism for saving the text matched by pieces of the pattern, so that the matched text can play a role in the actions that will be executed.
The operation of a cross-translation is:
The following is an example of a simple OmniMark CROSS-TRANSLATE that removes all spaces at the starts and ends of lines, collapses runs of spaces between words into a single space character, and upper-cases every word starting with the letter "j". In the process tabs are converted into spaces:
CROSS-TRANSLATE

FIND LINE-START BLANK+ | BLANK+ LINE-END
   ; just ignore spaces and tabs at the start or end of lines

FIND BLANK {2}+
   ; two or more "blank" (space or tab) characters
   OUTPUT " "

FIND WORD-START (UL "J" LETTER*) => word
   OUTPUT "%ux(word)"   ; The "u" modifier upper-cases the word.
OmniMark's pattern recognition capability is not limited by line boundaries, nor does it arbitrarily break up the text into fields. Because of this, cross-translation is a technique that is applicable to a wide range of data analysis and conversion tasks.
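The three FIND rules in the CROSS-TRANSLATE example can be loosely mirrored outside OmniMark. The following is a minimal Python sketch, not OmniMark code. It assumes that a "blank" is a space or a tab, that a word is a run of ASCII letters, and that trying regular-expression alternatives from left to right is an adequate stand-in for trying FIND rules in program order. The names PATTERN and cross_translate are made up for illustration.

```python
import re

# One alternation, tried left to right at each position, loosely mirroring
# the way the three FIND rules are tried in program order.
PATTERN = re.compile(
    r"(?P<edge>^[ \t]+|[ \t]+$)"      # blanks at the start or end of a line
    r"|(?P<run>[ \t]{2,})"            # two or more blanks between words
    r"|(?P<jword>\b[jJ][A-Za-z]*)",   # a word starting with "j" or "J"
    re.MULTILINE,
)


def cross_translate(text: str) -> str:
    """Apply the three rules; text matched by no rule is copied unchanged."""

    def replace(match: re.Match) -> str:
        if match.group("edge") is not None:
            return ""                  # the first rule outputs nothing
        if match.group("run") is not None:
            return " "                 # the second rule outputs one space
        return match.group("jword").upper()

    return PATTERN.sub(replace, text)
```

For instance, cross_translate("  jack   and jill  ") gives "JACK and JILL". As in the OmniMark program, input that no rule matches passes straight through to the output.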
A down-translation is a translation whose input is a complete SGML document or an SGML document instance corresponding to a specified Document Type Definition.
The output of a down-translation can be SGML or some other format. It could be a document suitable for input into a text formatter, for example. Or a down-translation can be used to enter information from the SGML document into a database. A down-translation can even be used to transform an SGML document into another SGML document, for instance by "cleaning up" the input, or by restructuring it.
A down-translation is defined by entering:
Syntax
DOWN-TRANSLATE
at the start of an OmniMark program.
A down-translation is composed of rules that recognize SGML events, like elements. The basic operation of a down-translation program is:
If the component found is one that has content, such as an element or an SGML comment, then the actions in the rule control when the content is processed. The content is processed when a "%c" format item or a SUPPRESS action is encountered.
Should the SGML parser detect any errors in the markup of the SGML document it will report the errors. The OmniMark program can customize the manner in which these errors are reported. (See Chapter 15, "Processing SGML Errors").
The SGML parser will always recover from a markup error and return meaningful information to OmniMark, allowing processing to continue. Because of this, as many errors as possible will be detected in one run of the program.
The following simple example displays the titles of all the chapter elements in a document, prefixed by chapter numbers. All other elements are suppressed:
DOWN-TRANSLATE

GLOBAL COUNTER chapter-count INITIAL {0}

ELEMENT title WHEN PARENT IS chapter
   INCREMENT chapter-count
   PUT #MAIN-OUTPUT "Chapter %d(chapter-count): %c%n"

ELEMENT #IMPLIED
   SUPPRESS   ; Suppress the output of all other elements.
When the content of a component is processed by a rule, that rule is temporarily suspended. Events within the component's content (such as subelements) can cause other rules to execute. When the content has been completely processed, the suspended rule is resumed, and the remainder of the actions are executed. This behaviour gives the execution of an OmniMark program the same hierarchical structure that an SGML document has.
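The suspend-and-resume behaviour described above can be pictured as ordinary recursion. The sketch below is a Python analogue of the chapter-title example, under stated assumptions: the input is well-formed XML rather than SGML (so that Python's standard xml.etree parser can stand in for the SGML parser), and the function names down_translate, fire_rule and process_content are hypothetical.

```python
import xml.etree.ElementTree as ET


def down_translate(xml_text: str) -> str:
    out = []
    chapter_count = 0

    def process_content(element):
        # Stand-in for SUPPRESS: the calling rule is suspended here while
        # rules fire for each child, and it resumes when the loop ends.
        for child in element:
            fire_rule(child, element)

    def fire_rule(element, parent):
        nonlocal chapter_count
        if element.tag == "title" and parent is not None and parent.tag == "chapter":
            chapter_count += 1
            out.append(f"Chapter {chapter_count}: ")   # actions before "%c"
            out.append("".join(element.itertext()))    # "%c": the content
            out.append("\n")                           # actions after "%c"
        else:
            # Plays the role of ELEMENT #IMPLIED: nothing is output for the
            # element itself, but rules still fire for what it contains.
            process_content(element)

    fire_rule(ET.fromstring(xml_text), None)
    return "".join(out)
```

The call stack of fire_rule follows the nesting of the elements, which is the hierarchical structure the paragraph above refers to.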
Care must be taken with programs written for earlier releases of OmniMark, which didn't support programs without a translation type. For the earlier releases, a program without a translation type was assumed to be a DOWN-TRANSLATE. If the program cannot be modified to add the translation type, then the current releases of OmniMark can still be made to process such programs correctly using the -herald command-line option. See Section 19.1.4.8, "Version 2 Compatibility".
An up-translation is a translation whose output is generally a complete SGML document. OmniMark parses the SGML document as it is generated, and any errors are reported. The same SGML document that is parsed is the "main output" of the program.
OmniMark also provides the ability to send information to the main output or the SGML parser individually. This allows programmers to send the SGML prolog to the SGML parser without sending it to the main output, for example. That way the output consists solely of the document instance. This is very useful for environments where the document instances are stored separately from the DTDs.
OmniMark places no restrictions on the format of the input to an up-translation; most often the input is a data file compatible with a non-SGML text processing system.
An up-translation must begin with:
Syntax
UP-TRANSLATE
Figure 4 -- Up Translation Block Diagram shows a simple block diagram for an up-translation.
When writing an up-translation, the OmniMark programmer uses FIND rules to describe the patterns of interest in a document and the actions to take to transform the document into an SGML document.
The operation of an up-translation is:
As markup is found and submitted to the parser, OmniMark will collect context information; that is, it will collect information about the document hierarchy being formed. This context information can be used in FIND rules to qualify subsequent FIND rules.
In an up-translation, the SGML document created is strictly a result of the patterns which can be found, in context, in the input document. The final SGML document provided at the output is identical to the document provided to the parser.
If there are errors in the generated markup, the parser will report the markup errors and perform as much error correction as possible. The error reports can be customized and even acted upon by the OmniMark programmer, to help the program recover from such markup errors.
The following example of an UP-TRANSLATE program converts RTF (Rich Text Format) into SGML. It:
In practice, preamble material (style sheets) will need to be skipped over, and other styles and paragraph commands (such as "\par") will need to be recognized.
UP-TRANSLATE

FIND "\s23" LOOKAHEAD ! DIGIT
   OUTPUT "<P>"

FIND "\" LETTER [LETTER | DIGIT | "-"]* |   ; RTF command
     "\" ANY |                              ; other RTF code
     ["{}"]                                 ; RTF grouping
   ; Output nothing for these.

FIND "\" ["{}\"] => protected-character
   OUTPUT protected-character   ; Some characters are protected by \

FIND "\'" ANY {2} => hex-code   ; Some characters are in hexadecimal
   LOCAL COUNTER character-value
   SET character-value TO hex-code BASE 16
   OUTPUT "&#%d(character-value);"

FIND "<"
   OUTPUT "<<!>"   ; "<" often needs protecting in the SGML

FIND "&"
   OUTPUT "&<!>"   ; Likewise "&"
Up-translations work well for relatively simple documents. For complex documents context-translations are almost always preferable.
A context-translation is the most general of the built-in translation types. A context-translation is a translation that converts data from one form to another, using SGML as an intermediate form. A context-translation can be viewed as an up-translation to produce an intermediate SGML document combined with a simultaneous down-translation of that SGML document.
Patterns in the original document suggest its structure and allow (a possibly partial) conversion to SGML. OmniMark parses the SGML form and, using the SGML parser, corrects structure errors. The final output makes use of the structure discovered by the parser to produce a fully marked-up document, a minimized document, or some other form of data.
A context-translation begins with
Syntax
CONTEXT-TRANSLATE
Figure 5 -- Context Translation Block Diagram shows a simple block diagram of a context-translation. A context-translation combines the best features of an up-translation and a down-translation with the powerful error recovery and context tracking capability of the parser.
Although the following example is simple, it nonetheless illustrates the typical roles of the input and output processors in a context-translation.
Consider an input document like the following:
Context-Translation


This is a simple context-translation. It takes an ASCII text
file and produces SGML. The find rules just insert the markup.
The element rules add white-space to make the document look
more readable.


The Input Document


The input document consists of paragraphs and chapter titles.
Chapter titles are preceded and followed by two blank lines to
make them stand out.

Paragraphs are separated from each other by a single
blank line.
If the file "my.dtd" contains the element declarations:
<!ELEMENT doc     - o (chapter+)>
<!ELEMENT chapter - o (title, p+)>
<!ELEMENT title   - o (#PCDATA)>
<!ELEMENT p       - o (#PCDATA)>
The following program will convert the input to an SGML document conforming to those element declarations.
CONTEXT-TRANSLATE

FIND-START
   OUTPUT "<!DOCTYPE doc SYSTEM 'my.dtd'>%n"_
          "<DOC><CHAPTER><TITLE>"

FIND "%n"{2}+ ANY-TEXT+ => title-text "%n"{2}+
   OUTPUT "<CHAPTER><TITLE>%x(title-text)</TITLE><P>"

FIND "%n"{2}+
   OUTPUT "<P>"

ELEMENT doc
   OUTPUT "%c"

ELEMENT chapter
   OUTPUT "<CHAPTER>%n%c"

ELEMENT #IMPLIED
   OUTPUT "<%q>%sc</%q>%n"
The FIND-START rule ensures that the first line of the document is interpreted as a chapter title. Following that, each single line of text surrounded by blank lines is interpreted as a further chapter title. ("%n"{2}+ matches a sequence of two or more line end characters.) All other blocks of text are interpreted as paragraphs.
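The input-processor half of this example can be approximated in a few lines of Python. This is only a sketch of the classification the FIND rules perform, not of OmniMark itself: it assumes that blocks are separated by one or more blank lines, that the first block is a title, and that any later single-line block is a title. The function name mark_up is hypothetical. End tags for chapters and paragraphs are left out, as in the OmniMark program, because the element declarations above mark them as omissible and the SGML parser is expected to infer them.

```python
import re


def mark_up(text: str) -> str:
    # Two or more consecutive newlines (at least one blank line) end a block.
    blocks = [b for b in re.split(r"\n{2,}", text.strip("\n")) if b]
    out = ["<DOC>"]
    for index, block in enumerate(blocks):
        is_title = index == 0 or "\n" not in block
        if is_title:
            out.append(f"<CHAPTER><TITLE>{block}</TITLE>")
        else:
            out.append(f"<P>{block}")
    return "".join(out)
```

One consequence, shared with the FIND rules, is that a paragraph consisting of a single line is indistinguishable from a chapter title.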
On the output processor side, the resulting SGML is "cleaned up":
A detailed explanation of how OmniMark manages a context-translation can be found in Chapter 18, "How Asynchronous Concurrent Context Translations Work".
A programmer can take control of whether an OmniMark program does SGML parsing, where its main input comes from, and where its main output goes to, by omitting the translation type. A program without a translation type is called a "process program", because its "main" processing is specified in PROCESS rules. A process program can explicitly invoke SGML parsing using the "DO SGML-PARSE" action (see Section 17.2, "The "DO SGML-PARSE" Action").
Unlike translation programs, process programs do not perform any automatic processing of files named on the command-line. If files are named on the command-line, the names must be accessed using the #COMMAND-LINE-NAMES shelf (see Section 2.6, "Accessing The Command-Line Arguments").
The main processing in a process program is performed by PROCESS rules:
PROCESS condition? local-declaration* action*
A process program can contain more than one PROCESS rule. These rules are examined in the order that they occur in the program. When the rule is examined, if it has no condition, or if it has a condition which is true, the rule is executed. OmniMark then examines the next PROCESS rule.
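The selection of PROCESS rules described above amounts to a simple loop. The following Python sketch models it under the assumption that each rule can be represented as a pair of an optional condition and an action; the names Rule and run_process_rules are invented for this illustration and are not part of OmniMark.

```python
from typing import Callable, List, Optional, Tuple

# (condition, action): a condition of None means the rule has no condition.
Rule = Tuple[Optional[Callable[[], bool]], Callable[[], None]]


def run_process_rules(rules: List[Rule]) -> None:
    for condition, action in rules:          # examined in program order
        if condition is None or condition():
            action()                         # the rule is executed
        # either way, examination moves on to the next rule
```

Every rule whose condition holds is executed, not just the first one, which is what distinguishes this loop from a chain of alternatives.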
The following is a simple program fragment that individually SGML-parses the files named on the command line, and puts the result of processing each of them into a file with ".out" appended to its name:
PROCESS
   REPEAT OVER #COMMAND-LINE-NAMES
      DO SGML-PARSE DOCUMENT WITH SCAN FILE #COMMAND-LINE-NAMES
         SET FILE (#COMMAND-LINE-NAMES || ".out") TO "%c"
      DONE
   AGAIN
Of course, the program also requires ELEMENT rules to process the contents of each document. This program is the equivalent of performing a down-translation on each file individually.
It is not an error for a process program to contain no PROCESS rules, or for all of the PROCESS rules to be unselectable because they either have conditions that cannot be satisfied or because they are in an inactive group. In this case only the PROCESS-START and PROCESS-END rules, if any, are performed. Such programs are frowned upon. It is intended that the PROCESS rules in a program contain its "main" processing. (PROCESS-START and PROCESS-END rules are described in Section 2.4, "Program Initialization and Termination".)
One of the main advantages of translation programs is that they will automatically take their input from the files named on the command-line. These files are processed as if they are all components of one big document that has been broken down into files. This can be especially convenient for processing SGML documents where the SGML declaration, DTD, and document instance are in separate files.
Because the processing of these files begins automatically, OmniMark provides initialization and termination rules to allow the programmer to gain control before the first file is processed, and after the last one is processed.
The initialization and termination rules that may be used depend on the domain in which they are used:
The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule replaces the automatic processing of command-line arguments, so it is used when the programmer wishes to manage the command-line explicitly. Otherwise, if the files named on the command-line are to be processed automatically, FIND-START and FIND-END rules should be used.
The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is described in Section 2.5.3, "Controlling Input to the SGML Parser".
Initialization and termination rules that can be used in any kind of OmniMark program are needed to allow programs to provide generic services that are applicable to any type of program.
PROCESS-START and PROCESS-END rules fulfil this need.
PROCESS-START condition? local-declaration* action*
PROCESS-END condition? local-declaration* action*
PROCESS-START and PROCESS-END are provided to allow initiation and termination processing to be placed adjacent to the GLOBAL declarations, function definitions and processing rules with which it is associated. This promotes a declarative style of programming.
Because they are independent of program type, PROCESS-START and PROCESS-END rules can be used in translation programs as well as process programs. This makes them suitable for use in INCLUDE files that, for example, contain:
Such an INCLUDE file can be used in a DOWN-TRANSLATE (which doesn't allow FIND-START and FIND-END rules), in a CROSS-TRANSLATE (which doesn't allow DOCUMENT-START and DOCUMENT-END rules), and in a process program (which does not allow FIND-START, FIND-END, DOCUMENT-START, or DOCUMENT-END rules).
PROCESS-START and PROCESS-END rules have the following properties:
all rules.
all output processor rules.
all rules.
all output processor rules.
can affect any following rule.
can only affect output processor rules.
Of course, changing groups never affects a subsequent rule if it is done within an action preceded by a "USING GROUP AS" prefix. See Chapter 5, "Organizing Rules With Groups" for more information on groups.
if it is done in any rule.
only if it is done in an output processor rule.
Many programs can contain more than one type of initialization or termination rule. Context-translations, for instance, can contain FIND-START, DOCUMENT-START, and PROCESS-START rules, all in the same program.
In general, the order in which rules are performed (if the rule is available in that program type) is:
The list above is a general guideline, but there are special cases which fall outside this ordering:
In some situations, the FIND-START rule may supply the SGML declaration (if any), the DTD, and the initial part of a document instance to the #SGML stream. In this case, ELEMENT rules will begin to execute before all of the FIND-START rules have completed.
Similarly, a program may use a FIND-END rule to provide the trailing part of a document instance. In this case, ELEMENT rules will continue to execute even though the FIND-END rules have begun.
In a process program, there is no real distinction between the PROCESS-START, PROCESS and PROCESS-END rules, except that they are performed in that order. OmniMark doesn't distinguish what can be done in these rules. However, PROCESS-START and PROCESS-END rules should only be used for whole-program initiation and termination functions. PROCESS rules should be used for "main" processing, and for initiation and termination of individual transactions in a multi-transaction process.
OmniMark helps the programmer by dealing with a lot of the details of input and output, including, if the programmer wants, most of the reading and writing of files. However, OmniMark also allows the programmer to take control by giving them direct access to the command-line arguments, and by making the input sources and output streams explicitly available.
The OmniMark program reads data from the outside world through input sources, and writes data to the outside world through streams attached to files and externally-defined output streams.
OmniMark also provides a predefined set of streams and sources, which can be thought of as data's "main roads" into and out of an OmniMark program, and between the input and output processor.
The following subsections describe how the input sources and output streams are accessed and used.
The built-in streams that provide the output for the program are: #PROCESS-OUTPUT, #MAIN-OUTPUT, #SGML, #SUPPRESS and #ERROR.
#PROCESS-OUTPUT identifies the default output destination supplied to the OmniMark program by the system. This corresponds to what is usually referred to as "standard output" ("stdout") on UNIX systems.
#PROCESS-OUTPUT can be written to from either domain.
The "DECLARE #PROCESS-OUTPUT" declarations can be used to change some of the characteristics of the #PROCESS-OUTPUT stream.
#PROCESS-OUTPUT was identified by the name #CONSOLE in earlier releases of OmniMark, and #CONSOLE can be used as a synonym for #PROCESS-OUTPUT. However, the use of #CONSOLE is deprecated -- there's no reason to use two names for the same thing.
#MAIN-OUTPUT identifies the output destination described by the -of or -aof directive on the OmniMark program's command line. When there isn't an -of or -aof on the command line, then #MAIN-OUTPUT identifies the same destination as #PROCESS-OUTPUT (i.e. "standard output").
Even when #MAIN-OUTPUT and #PROCESS-OUTPUT identify the same destination they are considered distinct streams. They can each have their own set of properties. For example, #MAIN-OUTPUT could be written in TEXT-MODE and #PROCESS-OUTPUT in BINARY-MODE (although in most cases it would be strange to do so).
#MAIN-OUTPUT is the default output (#CURRENT-OUTPUT) in an OmniMark program. It is "owned" by (and part of the #CURRENT-OUTPUT set in) the output processor in a context-translation, down-translation or process program and the input processor in a cross-translation or up-translation.
The "DECLARE #MAIN-OUTPUT" declarations can be used to change some of the characteristics of the #MAIN-OUTPUT stream.
#MAIN-OUTPUT was identified by the name OUTPUT in earlier releases of OmniMark. OUTPUT can still be used as a synonym for #MAIN-OUTPUT (except in contexts where the programmer has declared a shelf, argument or function with the name OUTPUT) but its use is deprecated. Using #MAIN-OUTPUT produces easier-to-understand programs.
#SGML identifies the input to the SGML parser. It is part of the current output set (#CURRENT-OUTPUT) in the following contexts:
Furthermore, it is available in any function called from the above contexts, or any FIND rules performed as a result of a SUBMIT in any of the above contexts.
It cannot be written to in any other contexts.
#SGML was identified by the name SGML in earlier releases of OmniMark. SGML can still be used as a synonym for #SGML (except in contexts where the programmer has declared a shelf, argument or function with the name SGML) but its use is deprecated. Using #SGML produces easier-to-understand programs, because it clearly identifies the #SGML stream to be an OmniMark artifact and not a programmer-declared name.
The #SUPPRESS stream is used to discard data. Output can be directed to #SUPPRESS for actions which may produce output in situations where output is not wanted. If #SUPPRESS is specified with other destination streams in an OUTPUT-TO, "USING OUTPUT AS", or PUT action, then the data is still written to the other streams. The data is only discarded if #SUPPRESS is the only stream being written to.
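The rule for #SUPPRESS in a set of destinations can be sketched as follows. This is a Python model of the behaviour just described, not OmniMark's implementation: SUPPRESS is a hypothetical marker object standing in for the #SUPPRESS stream, and put is a made-up helper standing in for an action that writes to several streams at once.

```python
import io

SUPPRESS = object()   # hypothetical stand-in for the #SUPPRESS stream


def put(destinations, data: str) -> None:
    for stream in destinations:
        if stream is SUPPRESS:
            continue            # nothing is written for this entry
        stream.write(data)      # every other destination still gets the data
```

With this model the data is lost only when SUPPRESS is the sole destination, because no real stream is left to receive it.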
Output processor rules which process content may use the SUPPRESS action to discard the content. The SUPPRESS action sets the current output stream set to #SUPPRESS so that any rules invoked during the processing of the content will also discard their output.
The #SUPPRESS stream can be written to from either domain. It is also the initial current output set for the output processor in an up-translation.
In a similar manner to the #PROCESS-OUTPUT special stream, #ERROR always refers to "standard error" ("stderr") and can be used in the same manner as #PROCESS-OUTPUT. (If no -log command-line argument is given, #ERROR is where OmniMark places all error and informative messages. Defining -log does not change the definition of #ERROR: the latter remains the "standard error" output.) If a system does not distinguish between "standard output" and "standard error", they are defined to be the same destination.
An example of using the #ERROR stream to report errors is:
LOCAL COUNTER list-items
DO WHEN NUMBER OF list-items != list-count
   LOCAL COUNTER temp
   PUT #ERROR "Found a condition that shouldn't have happened:%n"
   SET temp TO NUMBER OF list-items
   PUT #ERROR "  list-count = %d(list-count), but " _
              "list-items has %d(temp) item"
   PUT #ERROR "s" WHEN temp != 1
   PUT #ERROR ".%n"
   HALT WITH 2   ; signal an error condition while stopping
DONE
The built-in streams can be heralded with STREAM, in contexts where such names are allowed to be heralded. The following two examples are equivalent:
Example A
DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
   PUT s "%n"

FIND-START
   put-nl STREAM #MAIN-OUTPUT
Example B
DEFINE FUNCTION put-nl MODIFIABLE STREAM s AS
   PUT s "%n"

FIND-START
   put-nl #MAIN-OUTPUT
The built-in input sources are #PROCESS-INPUT and #MAIN-INPUT.
#PROCESS-INPUT identifies the default input source supplied to the OmniMark program by the system. This corresponds to what is usually referred to as "standard input" ("stdin") on UNIX systems.
When the -term command-line option is given, #PROCESS-INPUT is unavailable to the program, and access of #PROCESS-INPUT is an error.
How #PROCESS-INPUT is used is described in Section 2.5.2.1, "Making Use of Built-In Input Sources".
In a translation program, #MAIN-INPUT identifies the text that will be automatically processed. Thus, when files are named on the command-line, #MAIN-INPUT supplies the text of each of the files in the order that their names appear on the command line. When there isn't any file named on the command line, then #MAIN-INPUT identifies the same source as #PROCESS-INPUT (i.e. "standard input").
In a process program, #MAIN-INPUT always identifies the same source as #PROCESS-INPUT.
In earlier releases of OmniMark, FIND rule pattern matching was unable to match across file "boundaries" -- a pattern couldn't match part of one file and part of the following. As of OmniMark V3, the files are joined together as if by the JOIN string concatenation operator, and pattern matching can match across any number of input files.
OmniMark gives the programmer control over how the command-line files are read by not actually opening any of them until absolutely required. The OmniMark program opens a command-line file if:
Otherwise, a programmer can be sure that OmniMark does not open any of the named files.
How #MAIN-INPUT is used is described in Section 2.5.2.1, "Making Use of Built-In Input Sources".
The appropriate type herald for #PROCESS-INPUT and #MAIN-INPUT is SOURCE, as in:
SUBMIT SOURCE #MAIN-INPUT
#PROCESS-INPUT and #MAIN-INPUT are built-in input sources. They explicitly identify sources of input and can be used as the scanning source in:
For example, the following simple rule exchanges square brackets in the input for tag open/tag close characters and vice versa, and provides the result to the SGML parser:
EXTERNAL-TEXT-ENTITY #DOCUMENT
   REPEAT SCAN #MAIN-INPUT
   MATCH "["
      OUTPUT "<"
   MATCH "]"
      OUTPUT ">"
   MATCH "<"
      OUTPUT "["
   MATCH ">"
      OUTPUT "]"
   MATCH [ANY EXCEPT "[]<>"]+ => other-text
      OUTPUT other-text
   AGAIN
Other uses can be made of #PROCESS-INPUT and #MAIN-INPUT, as required by a programmer.
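For comparison, the bracket-swapping scan shown above can be expressed as a character translation table. The Python sketch below is an analogue only: it assumes the input arrives as an iterable of text chunks, and the names SWAP and swap_brackets are invented for the illustration. Like the REPEAT SCAN loop, it handles the input piece by piece rather than as one large string.

```python
from typing import Iterable, Iterator

# "[" <-> "<" and "]" <-> ">"; every other character maps to itself.
SWAP = str.maketrans("[]<>", "<>[]")


def swap_brackets(chunks: Iterable[str]) -> Iterator[str]:
    for chunk in chunks:
        yield chunk.translate(SWAP)
```

Because each character is translated independently, the result does not depend on where the chunk boundaries fall.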
#PROCESS-INPUT and #MAIN-INPUT are each subject to a variety of constraints:
If #PROCESS-INPUT or #MAIN-INPUT is used in any other way, an attempt is made to read in the whole of their input data into a string buffer in memory, and that string buffer is used in further operations. For example, using #MAIN-INPUT as the first argument of the "||" (JOIN) operator in the following action causes it to be read in its entirety prior to concatenating the period:
OUTPUT #MAIN-INPUT || "."
Further difficulty arises where either #PROCESS-INPUT or #MAIN-INPUT does not have an "end". This can happen when they are piped from a keyboard or other such device, where the input can wait forever for another character. This will have the effect of "hanging" a program that attempts to read in all of #PROCESS-INPUT or #MAIN-INPUT at once.
As a consequence of these difficulties, care should be taken that #PROCESS-INPUT and #MAIN-INPUT are normally read in an incremental manner.
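The difference between reading a source all at once and reading it incrementally can be shown with a small Python sketch. It is an illustration of the general idea rather than of OmniMark's internals, and it assumes the source behaves like a Python text file object. The function name read_incrementally and the default chunk size are arbitrary choices.

```python
import io
from typing import Iterator, TextIO


def read_incrementally(source: TextIO, chunk_size: int = 4096) -> Iterator[str]:
    """Yield the source a piece at a time instead of buffering all of it."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:        # an empty string signals the end of the input
            return
        yield chunk
```

A consumer of this generator holds at most one chunk in memory and can start producing output before the end of the input has been seen, whereas a single source.read() call returns only once the input has ended.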
EXTERNAL-TEXT-ENTITY #DOCUMENT
An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used to take control of where input comes from in a down-translation, in much the same way that the #COMMAND-LINE-NAMES shelf gives input control to process programs and cross-translations. Up-translations and context-translations can use either technique for input control.
An "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule provides a framework for explicitly providing an entire "SGML document entity" to the SGML parser.
The following example demonstrates how the names on the OmniMark command line can be interpreted as URLs of HTML documents. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule uses SUBMIT to make sure FIND rules can be used to convert the HTML into appropriately conforming SGML. An externally defined "source" function called get-url is assumed to be available for getting the text of HTML files via the Internet:
CONTEXT-TRANSLATE

EXTERNAL-TEXT-ENTITY #DOCUMENT
   REPEAT OVER #COMMAND-LINE-NAMES
      SUBMIT get-url ("http://" || #COMMAND-LINE-NAMES)
   AGAIN
...
If a CONTEXT-TRANSLATE or UP-TRANSLATE contains an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule then there is no automatic SUBMIT of either the files named on the command line, or the #PROCESS-INPUT. The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule is used for processing the "SGML document entity". It can examine the #COMMAND-LINE-NAMES built-in stream (Section 2.6, "Accessing The Command-Line Arguments") if it needs to access files named on the command-line.
The "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule differs from other kinds of EXTERNAL-TEXT-ENTITY rules, in that:
EXTERNAL-TEXT-ENTITY #DOCUMENT | #DTD ...
is not allowed.
The other forms of the EXTERNAL-TEXT-ENTITY rule are described in Section 16.2.2, "Processing External Text Entities".
OmniMark defines a global unkeyed read-only built-in stream shelf, #COMMAND-LINE-NAMES, that contains, as its values, the "words" on the command-line that are not recognized as OmniMark command-line options.
The #COMMAND-LINE-NAMES shelf can be used to:
This can be useful in programmer-generated messages. The current item of the #COMMAND-LINE-NAMES shelf can be used in the messages to identify the file which caused the message to be generated.
This can be useful for checking the arguments to a program which takes a long time to execute. By immediately checking for any files named on the command-line that do not exist, the program can let the user running it know immediately when they have made a mistake typing in the command-line.
This is useful when the programmer wishes to avoid processing the files automatically. The most prominent cases where automatic file reading is bypassed are process programs, and programs that use an "EXTERNAL-TEXT-ENTITY #DOCUMENT" rule to provide the text of the SGML document entity to be parsed.
#COMMAND-LINE-NAMES is available to all types of OmniMark programs.
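The early check for missing files described above can be sketched in Python. This is an analogue for illustration, not OmniMark code: the list passed in stands in for the items of the #COMMAND-LINE-NAMES shelf, and the function names missing_names and check_names_or_exit are hypothetical.

```python
import os
import sys


def missing_names(names):
    """Return, in order, the names that do not refer to an existing file."""
    return [name for name in names if not os.path.isfile(name)]


def check_names_or_exit(names):
    """Report every missing file and stop before any slow work begins."""
    missing = missing_names(names)
    if missing:
        for name in missing:
            print("error: no such file: " + name, file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    # sys.argv[1:] stands in for the names left over once options are removed.
    check_names_or_exit(sys.argv[1:])
    print("all named files exist")
```

Reporting all missing names at once, rather than stopping at the first, lets the user correct the whole command line in one pass.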
In the process of populating #COMMAND-LINE-NAMES, the following components of the command line are recognized as command-line options and are not placed in the shelf:
All other words on the command-line are recognized as "names" and not commands. In particular, the following, so long as they are not recognized as the arguments that follow a dash command, are placed on the #COMMAND-LINE-NAMES shelf:
If there are no words on the command line recognized as names, then the #COMMAND-LINE-NAMES shelf has no items.
Variables initialized on the command-line are often referred to as "command-line arguments" (such as a stream value set using -d; see the companion manual, Using OmniMark 3 [eum13]). These are different from the #COMMAND-LINE-NAMES shelf: the former are entered on the command line together with an identification of the global shelf they parameterize, whereas the latter is the set of "unidentified" names on the command line.
This section provides a sample set of programs that illustrate some basic uses of OmniMark.
The first example shows how an SGML document might be translated into an HTML document for publishing on the Internet. The second example shows how legacy TeX documents may be converted into SGML.
Because OmniMark rules in a down-translation are defined in terms of document structure rather than markup, output is not affected by markup minimization or by ignored record ends in the SGML source document. Comments and marked sections also do not affect the output unless the program has rules for them.
The predominant rule in a down-translation is the ELEMENT rule. An SGML element is described as the part of the document that spans from the beginning of a start tag to the end of the corresponding end tag for a particular element name. Elements may contain text, entity references, processing instructions, and more elements, thus forming a hierarchy.
It is the OmniMark programmer's task to identify the elements which may occur in an SGML document and set up a rule for each one. The elements which may occur are identified by examining the Document Type Definition. The OmniMark programmer may define more than one rule for any one element. In this case, the programmer must specify the conditions or qualifications under which the rule becomes relevant.
For a simple but practical example, suppose a programmer wishes to present simple glossaries using a Web browser. The most straightforward way of doing this is to convert the glossaries into HTML. A glossary begins with a title that is followed by one or more entries. Entries in turn consist of the term being defined followed by a single, one-paragraph definition. The input is entered in SGML to correspond to the following Document Type Definition:
<!DOCTYPE glossary [
<!ELEMENT glossary o o (title, entry+)>
<!ELEMENT title    o o (#PCDATA)>
<!ELEMENT entry    - o (term, def)>
<!ELEMENT term     o - (#PCDATA)>
<!ELEMENT def      o o (#PCDATA)>
<!ENTITY end-term ENDTAG "term">
<!SHORTREF term-map "&#RE;" end-term>
<!USEMAP term-map term>
]>
This Document Type Definition permits some markup minimization. Since it is assumed, for instance, that the defined terms are never longer than one input record, a term is ended by a record end ("&#RE;"). Various start- and end-tags may be omitted. Using these conventions, a typical source document might appear as shown below:
SGML Definitions
<entry>containing element
An element within which a subelement occurs.
<entry>data entity
An entity that was declared to be data and therefore is not parsed when referenced.
<entry>name
A name token whose first character is a name start character.
An OmniMark program to process this glossary contains a rule for each element type. The actions in the OmniMark rules below indicate how HTML tags are inserted around the contents of each element. In these actions, "%c" represents an element's content (possibly including the content of subelements), and "%n" indicates insertion of a line break in the output.
DOWN-TRANSLATE

ELEMENT glossary
   OUTPUT "<HTML>%n<HEAD>%n%c</UL>%n" ||
          "</BODY></HTML>%n"

ELEMENT title
   LOCAL STREAM title-text
   SET title-text TO "%c"
   OUTPUT "<TITLE>" || title-text || "</TITLE>%n" ||
          "</HEAD><BODY>%n" ||
          "<H1>" || title-text || "</H1>%n" ||
          "<UL>%n"

ELEMENT entry
   OUTPUT "<LI>%c%n"

ELEMENT term
   OUTPUT "<STRONG>%c</STRONG>%n"

ELEMENT def
   OUTPUT "%c"
The first rule specifies that the glossary's content is to be output prefixed by the HTML start tags "<HTML>" and "<HEAD>", and followed by the HTML end tags "</UL>", "</BODY>" and "</HTML>". The title rule specifies the tags surrounding the glossary's title. It also closes the head, opens the body, and opens the "<UL>" list that the glossary rule later closes. The title is output twice: once for the top of the browser window, and once within the text area -- so a temporary variable, title-text, is defined and used. The rule for entries simply indicates that the content of each entry is output as a "<LI>" list item. Each entry consists of a term and a definition, whose text is output with "<STRONG>" tagging surrounding the term.
The translation of the sample glossary source document shown above is the following HTML source file:
<HTML>
<HEAD>
<TITLE>SGML Definitions</TITLE>
</HEAD><BODY>
<H1>SGML Definitions</H1>
<UL>
<LI><STRONG>containing element</STRONG>
An element within which a subelement occurs.
<LI><STRONG>data entity</STRONG>
An entity that was declared to be data and therefore is not parsed when referenced.
<LI><STRONG>name</STRONG>
A name token whose first character is a name start character.
</UL>
</BODY></HTML>
It is important to observe that identical output is generated if the source document is edited by inserting all omitted tags and placing the existing entry start-tags on separate lines. When translating SGML documents, the writer of an OmniMark program need never be concerned with variations of an SGML source document that, according to the provisions of ISO 8879, do not affect its interpretation.
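The rule-per-element approach can be imitated outside OmniMark to see how "%c" composes output from the inside out. The Python sketch below is an analogue under stated assumptions, not OmniMark's implementation: it uses the standard xml.etree.ElementTree module, which parses XML rather than SGML, so it only accepts the fully tagged form of the glossary with no omitted tags. The names RULES, translate and down_translate are hypothetical.

```python
import xml.etree.ElementTree as ET


def glossary_rule(content):
    return "<HTML>\n<HEAD>\n" + content + "</UL>\n</BODY></HTML>\n"


def title_rule(content):
    # The title text is used twice, as with the title-text stream.
    return ("<TITLE>" + content + "</TITLE>\n" +
            "</HEAD><BODY>\n" +
            "<H1>" + content + "</H1>\n" +
            "<UL>\n")


def entry_rule(content):
    return "<LI>" + content + "\n"


def term_rule(content):
    return "<STRONG>" + content + "</STRONG>\n"


def def_rule(content):
    return content


# One rule per element type; each receives the element's translated
# content, which plays the role of "%c".
RULES = {
    "glossary": glossary_rule,
    "title": title_rule,
    "entry": entry_rule,
    "term": term_rule,
    "def": def_rule,
}


def translate(element):
    """Translate the children first, then apply this element's rule."""
    parts = [element.text or ""]
    for child in element:
        parts.append(translate(child))
        parts.append(child.tail or "")
    return RULES[element.tag]("".join(parts))


def down_translate(document):
    return translate(ET.fromstring(document))


if __name__ == "__main__":
    sample = ("<glossary><title>SGML Definitions</title>"
              "<entry><term>name</term>"
              "<def>A name token whose first character is a "
              "name start character.</def></entry></glossary>")
    print(down_translate(sample), end="")
```

Because the handlers see only element names and content, the same table of rules applies however the tags were originally written, which is the point made above about markup variations.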
An up-translation starts with an arbitrary data file and produces an SGML document or document instance. Since the SGML document is parsed as it is generated, the translation can be guided by the structure of the SGML document.
Suppose the glossary described in the previous section was just one of many similar documents originally written in TeX. Rather than convert all of them to SGML by hand, which would be an error-prone task, it makes sense to write a program to do the conversion.
The TeX source for the glossary in the previous example might look like this:
\input glossmac
\title{SGML Definitions}
\term{containing element}{%
An element within which a subelement occurs.}
\term{data entity}{%
An entity that was declared to be data and therefore is not parsed when referenced.}
\term{name}{%
A name token whose first character is a name start character.}
\bye
Later, an application arises for the same material in SGML, in an environment that does not support the OMITTAG feature. The following OmniMark program defines the translation:
UP-TRANSLATE

; pass the DTD to the SGML parser
FIND-START
   OUTPUT FILE "file.dtd"

; to start the translation
FIND "\input glossmac" WHITE-SPACE*
   OUTPUT "<glossary>%n"

; look for the start of the title
FIND "\title{"
   OUTPUT "<title>"

; translate }
FIND "}" "%n"?
   DO WHEN ELEMENT IS title
      OUTPUT "</title>%n"
   ELSE WHEN ELEMENT IS term
      OUTPUT "</term>%n"
   ELSE WHEN ELEMENT IS def
      OUTPUT "</def>%n</entry>%n"
   DONE

; look for start of term
FIND "\term{"
   OUTPUT "<entry>%n<term>"

; look for start of definition
FIND "{%%%n"
   OUTPUT "<def>"

; look for end of glossary
FIND "\bye" ANY
   OUTPUT "</glossary>%n"
The program begins by identifying the translation type. As mentioned earlier, this is an up-translation, whose result is an SGML document corresponding to a given Document Type Definition. The bulk of the translation consists of FIND rules.
As it reads the TeX file, OmniMark looks for strings corresponding to the patterns defined by the FIND rules. When one is found, the actions in the rule are performed. As the output is generated, the SGML parser verifies that it corresponds to the Document Type Definition.
The first rule is a FIND-START rule that passes the DTD to the SGML parser. It assumes that the file named "file.dtd" contains the DTD used to guide the translation.
The first FIND rule is
FIND "\input glossmac" WHITE-SPACE*
   OUTPUT "<glossary>%n"
This rule tells OmniMark to look for the string \input glossmac followed by any number of spaces, tabs, or end-of-line sequences. When the pattern is found, the action within the rule writes the <glossary> start-tag. The next rule uses a similar technique to search for the start of the title. The third rule is a little more complicated:
FIND "}" "%n"?
   OUTPUT "</title>%n" WHEN ELEMENT IS title
   OUTPUT "</term>%n" WHEN ELEMENT IS term
   OUTPUT "</def>%n</entry>%n" WHEN ELEMENT IS def
This rule searches for a right brace, possibly followed by an end-of-line sequence. The action taken when this pattern is found depends on the context. The appropriate end-tags are written according to the state of the SGML parser. This ability to qualify an action distinguishes OmniMark from other pattern-matching languages.
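To make the role of context concrete, the following Python sketch imitates the rules above by hand. It is an analogue only, under clearly different assumptions from OmniMark: a list named open_elements stands in for the state of the SGML parser, no DTD is read or checked, and the input is assumed to be well formed (an unmatched right brace raises an error). The function name up_translate is hypothetical.

```python
def up_translate(tex):
    """Convert the sample TeX glossary markup into fully tagged SGML."""
    out = []
    open_elements = []   # hand-kept stack standing in for the parser state
    i = 0
    while i < len(tex):
        if tex.startswith("\\input glossmac", i):
            i += len("\\input glossmac")
            while i < len(tex) and tex[i].isspace():   # WHITE-SPACE*
                i += 1
            out.append("<glossary>\n")
            open_elements.append("glossary")
        elif tex.startswith("\\title{", i):
            i += len("\\title{")
            out.append("<title>")
            open_elements.append("title")
        elif tex.startswith("\\term{", i):
            i += len("\\term{")
            out.append("<entry>\n<term>")
            open_elements.extend(["entry", "term"])
        elif tex.startswith("{%\n", i):
            i += len("{%\n")
            out.append("<def>")
            open_elements.append("def")
        elif tex.startswith("\\bye", i):
            i += len("\\bye") + 1          # "\bye" plus one more character
            out.append("</glossary>\n")
        elif tex[i] == "}":
            i += 1
            if tex.startswith("\n", i):    # optional end of line
                i += 1
            current = open_elements.pop()  # the element being closed
            if current == "def":
                out.append("</def>\n</entry>\n")
                open_elements.pop()        # the enclosing entry closes too
            else:
                out.append("</" + current + ">\n")
        else:
            out.append(tex[i])             # unmatched text is copied through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = ("\\input glossmac\n"
              "\\title{SGML Definitions}\n"
              "\\term{name}{%\n"
              "A name token whose first character is a name start character.}\n"
              "\\bye\n")
    print(up_translate(sample), end="")
```

The same right brace produces three different outputs depending on which element is open. In OmniMark that information comes from the SGML parser itself, so the program does not need to maintain a stack of its own.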
The remainder of the program is straightforward. It produces the following document instance:
<glossary>
<title>SGML Definitions</title>
<entry>
<term>containing element</term>
<def>An element within which a subelement occurs.</def>
</entry>
<entry>
<term>data entity</term>
<def>An entity that was declared to be data and therefore is not parsed when referenced.</def>
</entry>
<entry>
<term>name</term>
<def>A name token whose first character is a name start character.</def>
</entry>
</glossary>
Copyright © OmniMark Technologies Corporation, 1988-1997. All rights reserved.
EUM27, release 2, 1997/04/11.