Parsing (XML and SGML)

OmniMark has integrated XML and SGML parsers. Because they are part of the language, the parsers do not use a parser interface like DOM or SAX. Instead, they are integrated into the streaming model of OmniMark. You can also use a third-party parser.

In general terms, any program that reads a data stream and analyzes it to reveal its structure is a parser. Almost all OmniMark programs are parsers in this sense. XML and SGML parsers perform a specific and formal kind of parsing that corresponds to the requirements of the XML and SGML specifications respectively.

XML and SGML parsers perform three basic functions:

  1. separate markup from data and report the structure of the document,
  2. validate the structure of the document and report errors, and
  3. expand entities (sometimes with help from your program).

This behavior is appropriate in all cases in which you are attempting to interpret the XML or SGML document based on its structure and content. If you want to process an XML or SGML document in another way (for example, to programmatically edit existing XML or SGML documents) it may be appropriate to write your own parsing routine using scanning techniques.

OmniMark fits parsing into the streaming and hierarchical model of OmniMark processing. The parser takes over the job of scanning the input source and reports the structure of the parsed document by converting all its markup tags into markup events. The result of parsing is thus a stream of data content and markup events; it can be either accessed unprocessed as #content or used to fire markup rules with %c or suppress. In the latter case, you write code in the body of the markup rules to respond to the reported structure of the document.

The data content of a parsed document is streamed through to current output unless you explicitly process it. See "data content, processing" and "parsed data, formatting".

Parsing and processing example

Here is a simple XML document:

  <person>
   <name>Mary</name>
   <bio>
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
   </bio>
  </person>

Here is a program that processes this XML document and produces HTML output:

  process
     using output as file "output.txt"
     do xml-parse scan file "input.xml"
        output "<HTML>%c</HTML>"
     done
  
  element "person"
     output "<BODY>%c</BODY>"
  
  element "name"
     output "<H1>%c</H1>"
  
  element "bio"
     output "%c"
  
  element "p"
     output "<p>%c</p>"

The output of the program is:

  <HTML><BODY>
   <H1>Mary</H1>
  
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
  
  </BODY></HTML>

The do xml-parse statement is used to configure the parser before parsing begins. It tells the parser what form of parsing to use and what source to scan. To parse SGML, you would use do sgml-parse instead of do xml-parse. You could also apply do markup-parse with custom-made or third-party parsers such as the Xerces XML parser.
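
As a rough sketch of the SGML variant, the process rule might look like the following. The file name input.sgml is hypothetical, and the sketch assumes that the file carries its own DOCTYPE declaration, since SGML parsing requires a DTD; the document keyword is used here on that assumption.

  process
     using output as file "output.txt"
     ; assumption: input.sgml is a complete SGML document, DTD included
     do sgml-parse document scan file "input.sgml"
        output "<HTML>%c</HTML>"
     done

The element rules would be written in the same way as for the XML version.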

Parsing actually begins when the parse continuation operator %c is encountered within the body of the do xml-parse block. This initiates parsing, which then proceeds until the first markup rule is fired; in this case, that is the person element rule. Within the element rule you can do any processing you want before and after the content of the person element is parsed. Anything you do before %c is invoked happens before the element's content is parsed. Anything after %c happens after the element's content is parsed. In this case, the HTML tag <BODY> is output before the element's content is parsed and the tag </BODY> is output after.

The next element rule to fire is name. It fires as a result of the parsing initiated by %c in the person element rule. The person rule is suspended at the %c until all its content is parsed. In this way OmniMark builds up a hierarchy of fired rules that corresponds to the hierarchy of the document being parsed.

The name element rule contains a single action: output "<H1>%c</H1>". This causes the string <H1> to be output. Then the %c causes the parser to resume. The name element of the document contains only the data content Mary. The parser streams this data content to the current output scope. The name element rule then resumes and outputs the string </H1>. The result of these three output events, in this order, is that the current output scope receives the text <H1>Mary</H1>.
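
To make the three output events explicit, the same rule could be written with three separate actions. This is only an illustrative rewrite of the rule above; it should produce the same text.

  element "name"
     ; happens before the content of name is parsed
     output "<H1>"
     ; resumes the parser; the data content Mary goes to current output
     output "%c"
     ; happens after the content of name is parsed
     output "</H1>"

The single-action form is simply more compact.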

Input and output of markup rules

You can assign "%c" to a shelf:

  element "name"
     local string name-text
  
     set name-text to "%c"

The shelf name-text will contain Mary. However, do not be misled into believing that %c returns the data content of an element. All %c does is force the parser to continue. The parser then outputs the data content of the element to the current output scope. The reason that the text Mary ends up in the shelf name-text is that the set command creates a new current output scope and makes its first argument the destination for that output scope. This change of output scopes lasts only as long as the set action, but since %c occurs within the set action, that scope is in effect when the parser outputs the data content Mary.
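
A minimal sketch of how the captured text might then be used, assuming the rule should still produce the same heading as before:

  element "name"
     local string name-text

     ; the set action is the output scope while %c runs
     set name-text to "%c"
     ; the set action has ended, so this goes to the enclosing output scope
     output "<H1>" || name-text || "</H1>"

Because the output scope created by set lasts only for the duration of the set action, the final output action writes to whatever scope was current when the rule fired.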

The consequence of this mechanism becomes clear if we introduce a set action into the person element rule:

  element "person"
     local string person-text
  
     set person-text to "%c"

This will place the following text into the shelf person-text:

  <BODY>
   <H1>Mary</H1>
  
    <p>Mary had a little lamb</p>
    <p>Its fleece was white as snow</p>
  
  </BODY>

The shelf has become the output destination for all the processing of the content of element person.

If you want to capture the actual content of element person before it is processed, you can use markup-buffer instead:

  element "person"
     local markup-buffer person-content
  
     using output as person-content
        output #content

In this case, the name and other element rules will not fire at all, because the content of element person has not been processed; it has only been parsed and stored in person-content. To process the stored content, apply do markup-parse to it and trigger the rules with %c:

     do markup-parse person-content
        output "%c"
     done

For other ways to control the processing of element content, see markup processing control.
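
One such control is suppress, which consumes the content of an element without firing any rules for it and without sending its data content to current output. As a sketch, replacing the bio rule of the earlier example with the following would drop both paragraphs from the result:

  element "bio"
     ; parse past the content of bio, discarding it
     suppress

The p rule would then never fire for this document, because its only p elements are inside bio.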

Validation

OmniMark's XML and SGML parsers validate the documents they parse. The kind of validation done depends on how the parser is configured. The example above does well-formed parsing, so the only validation done is to ensure that the document is well formed. You can also configure the parsers for DTD or schema validation, or you can separately validate the markup stream that is produced by the parser.
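
As a hedged sketch of DTD validation, adding the document keyword asks the XML parser to read a DTD and validate against it. The file name below is hypothetical, and the sketch assumes the file begins with a DOCTYPE declaration that supplies or references its DTD; the sample document shown earlier has no DOCTYPE and would not qualify.

  process
     ; assumption: validated-input.xml carries a DOCTYPE declaration
     do xml-parse document scan file "validated-input.xml"
        output "%c"
     done
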

When the parser encounters invalid input it inserts a markup error event into the markup stream. Then it attempts to recover and continue parsing the input as best it can. In order to do that, the parser may omit erroneous parts of the input document, or it may insert the missing parts. After the error recovery, the parser always produces a well-formed markup stream.

When you process a markup error event using %c or suppress, a markup-error rule is fired. A markup error is not a program error; it is simply an event that you can deal with in your program by writing the appropriate code in a markup-error rule.
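
A minimal sketch of such a rule follows. It assumes that the text of the error is available in the rule as #message, and it chooses to report the error on #error and carry on; both the wording of the report and the destination are illustrative choices, not requirements.

  markup-error
     ; report the problem and let parsing continue
     put #error "Markup error: " || #message || "%n"
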

For more information, see markup errors.

Retrieving parse state information

Because they are streaming parsers, the OmniMark XML and SGML parsers do not build a parse tree in memory. However, the hierarchy of rules existing at any point in a parsing operation contains all the information you need about the current state of the parse and the structure of the document at that point.

For example, you can test to see if an element has a particular parent:

  element "name" when parent is "person"
     output "<H1>%c</H1>"
  
  element "name" when parent isnt "person"
     output "%c"

You can also make the test inside the rule body:

  element "name"
     do when parent is "person"
        output "<H1>%c</H1>"
     else
        output "%c"
     done

Different parse state information is available in different rules. See the various markup rules for specific information.
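
The parent test looks only one level up. As a sketch of a test that looks further up the hierarchy, ancestor can be used in the same positions; the element names are those of the sample document above.

  ; fires for a p element anywhere inside a bio element
  element "p" when ancestor is "bio"
     output "<p>%c</p>"

  ; fires for any other p element
  element "p" when ancestor isnt "bio"
     output "%c"
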
