do xml-parse

control structure

Syntax
do xml-parse document (with id-checking switch-expression)? 
                      (with utf-8 switch-expression)? 
                      (creating xml-dtds{string-expression})? scan string-source-expression 
   local-declaration*
   action*
done


do xml-parse scan string-source-expression 
   local-declaration*
   action*
done


do xml-parse instance (with document-element string-expression)? 
                      (with (xml-dtds{string-expression} | current xml-dtd))?
                      (with id-checking switch-expression)? scan string-source-expression 
   local-declaration*
   action*
done
    


Purpose

Basic usage

do xml-parse invokes the XML parser. Every use of do xml-parse must do three things:

  1. Specify the type of parse to be launched: well-formed or validating.
  2. Provide a string source that will be used for input.
  3. Consume the resulting markup source, either directly as #content or by executing exactly one parse continuation operator (%c or suppress) to fire markup rules.
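
The three steps above can be sketched as a complete program. This is only an illustration: the inline sample document and the greeting element name are made up for the example, and a literal string stands in for the input.

  process
     ; Steps 1 and 2: no document or instance keyword, so this is a
     ; well-formed parse; the string after scan is the input.
     do xml-parse scan "<greeting>Hello, World!</greeting>"
        ; Step 3: exactly one parse continuation operator; %c fires
        ; the markup rules as the parser works through the input.
        output "%c"
     done

  element "greeting"
     output "Greeting: %c%n"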

Well-formed parsing

Well-formed parsing (that is, XML parsing without validation against a DTD) is invoked by omitting the document keyword and all other do xml-parse arguments, leaving only the scan keyword followed by the input.

  do xml-parse scan file #args[1]
     output "%c"
  done
            

Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. This use of the keyword instance is deprecated: the instance keyword should be reserved for validating against a pre-compiled DTD.

The instance supplied to the parser when performing a well-formed parse may still include a DTD: the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.
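
As a rough sketch of this behaviour, the following well-formed parse reads an entity definition from an internal DTD subset. The note element and the who entity are invented for the example. The DTD declares no elements at all, so a validating parse would be expected to reject the same input.

  process
     do xml-parse scan "<!DOCTYPE note [<!ENTITY who 'World'>]>"
                    || "<note>Hello, &who;!</note>"
        output "%c"
     done

  element "note"
     ; The entity reference should arrive here already expanded.
     output "%c%n"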

Validating parsing

The document keyword is used to configure the XML parser to validate its input against a DTD. The DTD is supplied as part of the input.

  do xml-parse document scan file #args[1]
     output "%c"
  done
            

This assumes that the file named on the command line contains a complete XML document, including its DTD. If the DTD and the instance are in different files, they can be joined:

  do xml-parse document scan file #args[1] || file #args[2]
     output "%c"
  done
            

Validating multiple documents

If the same DTD is to be used to parse several input instances, it is best to pre-compile the DTD and store it on the built-in xml-dtds shelf:

  do xml-parse document creating xml-dtds{"my-dtd"} scan file #args[1]
     suppress
  done
            
If the instance file names are stored on a shelf my-instances, each instance can then be processed in turn:

  repeat over my-instances
     do xml-parse instance with xml-dtds{"my-dtd"} scan file my-instances 
        output "%c"
     done
  again   
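
The my-instances shelf is not declared in the fragments above. One possible way to set it up and use it is sketched here, assuming (for this sketch only) that the first command-line argument names the DTD file and every remaining argument names an instance file. The element rules that the second parse would fire are omitted.

  global stream my-instances variable

  process
     ; Copy every argument except the first onto the shelf.
     repeat over #args
        set new my-instances to #args unless #first
     again

     ; Compile the DTD once ...
     do xml-parse document creating xml-dtds{"my-dtd"} scan file #args[1]
        suppress
     done

     ; ... then validate each instance against the compiled DTD.
     repeat over my-instances
        do xml-parse instance with xml-dtds{"my-dtd"} scan file my-instances
           output "%c"
        done
     again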
            

Validating against an outer DTD

A nested XML parse can validate its own input against the DTD of an outer XML parse. For instance:

  process
     using group "one"
     do xml-parse document scan "<!DOCTYPE a ["
                              || "<!ELEMENT a (#PCDATA | b)*>"
                              || "<!ELEMENT b (#PCDATA)>]>"
                              || "<a><b>Hello, World!</b></a>"
        output "%c"
     done
  
  
  group "one"
     element "a"
        using group "b"
        do xml-parse instance with current xml-dtd scan "<a>Salut, Monde!</a>"
           output "%c"
        done
        output "%c"
  
  
     element "b"
        output "%c"
  
  
  group "b"
     element "a"
        output "%c"
            

In this program, the XML parse launched in the element rule for a inside group one uses the same DTD as the parse launched in the process rule.

Validating a partial instance

It is possible to parse a partial instance: a piece of data comprising an element from a DTD which is not the doctype element of that DTD. In this case, the element to be used as the effective doctype for parsing the data is specified using the document-element argument:

  do xml-parse instance with document-element "lamb" with xml-dtds{"my-dtd"} scan file #args[1]
     output "%c"
  done
            
XML comments, processing instructions and even marked sections can precede and follow the element's start and end tags, but anything else (particularly other elements, data, or entity references) is an error.
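
As an illustration of what may surround the element, here is a sketch using a made-up DTD in which lamb is not the doctype element. The flock DTD, the sample data, and the comment text are all assumptions for the example.

  process
     ; Compile a DTD whose doctype element is flock, not lamb.
     do xml-parse document creating xml-dtds{"my-dtd"}
           scan "<!DOCTYPE flock [<!ELEMENT flock (lamb+)>"
             || "<!ELEMENT lamb (#PCDATA)>]>"
             || "<flock><lamb>Shaun</lamb></flock>"
        suppress
     done

     ; A comment before the start tag is allowed; data or a second
     ; element outside <lamb>...</lamb> would be reported as an error.
     do xml-parse instance with document-element "lamb" with xml-dtds{"my-dtd"}
           scan "<!-- a lone lamb --><lamb>Dolly</lamb>"
        output "%c"
     done

  element "lamb"
     output "lamb: %c%n"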

Controlling ID/IDREF checking

By default, OmniMark checks all XML IDREF attributes to make sure they reference valid IDs. This checking may not be appropriate when processing a partial instance, and it also takes time. It can be disabled using with id-checking followed by a switch expression that evaluates to false. The following code will parse the specified document without checking IDREFs:

  do xml-parse document with id-checking false scan file #args[1]
     output "%c"
  done
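
Because with id-checking takes a switch expression rather than a fixed keyword, the decision can also be made at run time. A minimal sketch, assuming a global switch named check-ids (a hypothetical name introduced here) that the program sets elsewhere:

  global switch check-ids initial {false}

  process
     ; IDREFs are checked only if check-ids is true at this point.
     do xml-parse document with id-checking check-ids scan file #args[1]
        output "%c"
     done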
            

Parsing documents with different character set encodings

The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark can process documents in any of these encodings: see Character set encoding for details.

One character encoding issue that arises in markup processing is the question of how numeric character entities must be encoded. By default, in both validating and well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. This behaviour can be changed when doing validating parsing: the parser can be configured either to use UTF-8 encoding or to flag a markup error when it encounters a character entity whose value is greater than 255. This has a number of consequences.

  • Processing of documents that do not contain character entities greater than 127 is not affected, unless the document uses a character encoding that does not correspond with 7-bit ASCII for characters 0 to 127. In this case, it is best to convert the document to UTF-8 for processing and convert it back afterwards.
  • Documents encoded in UTF-8 do not need special considerations.
  • When performing a validating parse of a document encoded in something other than UTF-8 and containing numeric character entities with values no greater than 255 (for example: Latin 1), the with utf-8 modifier should be passed to do xml-parse, with a switch expression that evaluates to false:
      do xml-parse document with utf-8 false scan file #args[1]
         output "%c"
      done
                    
    
  • If the document being processed is encoded in something other than UTF-8 and does contain character entities with values greater than 255, the document should be converted to UTF-8 for processing and converted back afterwards.