control structure
do xml-parse
Syntax
  do xml-parse document (with id-checking Boolean-expression)? (with utf-8 Boolean-expression)? (creating xml-dtds key keyname)?
     scan (input source | (input input-function-call))
     action+
  done

  do xml-parse instance? (with document-element element-name)?
        (with (xml-dtds key key | current xml-dtd))?
        (with id-checking Boolean-value)?
     scan (input source | (input input-function-call))
     action+
  done
You can parse an XML document using OmniMark's integrated XML parser.
You can invoke the XML parser with do xml-parse. To invoke the SGML parser, use do sgml-parse.
The do xml-parse statement prepares the parser to parse a document. Actual parsing begins with a call to the parse continuation operator "%c" from within the do xml-parse block.
To prepare the parser for parsing, you configure it for either well-formed or validating parsing and specify the source of the document, as described below.
To configure the parser for well-formed parsing, use the following syntax. (In most of the examples that follow, file #args[1] is used as the source of the document to be parsed. You can use any valid OmniMark source.)
  do xml-parse
     scan file #args[1]
     output "%c"
  done
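The fragment above can be placed in a complete program along the following lines. This is a minimal sketch, not a definitive implementation: the process rule and the catch-all element rule are assumptions added for illustration, and the program simply writes out the text content of the document.

```omnimark
; Minimal sketch: well-formed parse of the file named by the
; first command-line argument.
process
   do xml-parse
      scan file #args[1]
      output "%c"   ; parsing actually starts here
   done

; Assumed catch-all rule (not part of the original example):
; continue parsing into the content of every element.
element #implied
   output "%c"
```

The "%c" inside the element rule plays the same role as the one in the do xml-parse block: it tells the parser to continue into the content of the current element.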
Earlier versions of OmniMark required the keyword instance following do xml-parse when configuring the parser for well-formed parsing. The use of the keyword instance is now optional.
You may include a DTD in the instance if you wish. If you do, the parser will read and use entity definitions from the DTD but will not validate against the structural information in the DTD.
To have the parser validate an XML document against its DTD, you specify that you are giving the parser a complete XML document, including DTD, using the document keyword:
  do xml-parse document
     scan file #args[1]
     output "%c"
  done
The parser will validate the document against the DTD. You must supply the DTD as part of the source. If the DTD is specified as an external text entity using SYSTEM or PUBLIC, you may need to write an external-text-entity rule to locate the DTD and provide it to the parser.
Suppose you have 20 instances to process, all of which use the same DTD. It is wasteful to parse the same DTD 20 times. To avoid doing this, you can pre-compile the DTD and place it on the built-in shelf xml-dtds:
  do xml-parse document
     creating xml-dtds {"my-dtd"}
     scan file "my-dtd.dtd"
     suppress
  done
You can then process each instance in turn. The following code assumes you have placed the filenames of the instances on a shelf called "my-instances":
  repeat over my-instances
     do xml-parse
        with xml-dtds {"my-dtd"}
        scan file my-instances
        output "%c"
     done
  again
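Put together, the pre-compile step and the loop might look like the sketch below. Several parts are assumptions made for illustration rather than anything this page specifies: the declaration of the my-instances shelf, the choice to fill it from the command-line arguments, and the catch-all element rule.

```omnimark
; Assumption: a variable-size shelf holding the instance filenames.
global stream my-instances variable initial-size 0

process
   ; Assumption: every command-line argument names one instance.
   repeat over #args
      set new my-instances to #args
   again

   ; Compile the DTD once and store it on the built-in xml-dtds shelf.
   do xml-parse document
      creating xml-dtds {"my-dtd"}
      scan file "my-dtd.dtd"
      suppress
   done

   ; Parse each instance against the pre-compiled DTD.
   repeat over my-instances
      do xml-parse
         with xml-dtds {"my-dtd"}
         scan file my-instances
         output "%c"
      done
   again

; Assumed catch-all rule: continue parsing into every element.
element #implied
   output "%c"
```

The DTD is parsed only once, however many instances are processed, because each instance parse reuses the item keyed "my-dtd" on the xml-dtds shelf.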
If you start an XML parse in the scope of an existing XML parse and you want to use the DTD of the current parse for the nested parse, you can specify that the nested parse use the current DTD:
  do xml-parse instance
     with current xml-dtd
     scan file my-instances
     output "%c"
  done
In some cases you may wish to parse a partial instance, that is, a piece of data comprising an element from a DTD which is not the doctype element of that DTD. In this case, you can specify the element to be used as the effective doctype for parsing the data using the document-element keyword:
  do xml-parse
     with document-element "lamb"
     with xml-dtds {"my-dtd"}
     scan file "partinst.xml"
     output "%c"
  done
By default, OmniMark checks all XML IDREF attributes to make sure they reference a valid ID. This checking may not be appropriate in processing a partial instance. It also takes time. You can turn this checking on and off using with id-checking followed by a Boolean expression. The following code will parse the specified document without checking IDREFs:
  do xml-parse document
     with id-checking false
     scan file "my-xml.xml"
     output "%c"
  done
The XML standard specifies that XML documents use the Unicode character set. However, there are many different encodings of the Unicode character set. OmniMark lets you process documents in any of these encodings. See Character set encoding for details.
One character encoding issue that arises in markup processing is which character set encoding the parser uses when resolving numeric character entities. When doing well-formed parsing, OmniMark uses UTF-8 encoding to represent numeric character entities. When doing validating parsing, you can select whether to use UTF-8 or Latin-1 encoding for numeric character entities. The default is Latin-1.
To select UTF-8 encoding for a validating parse, use with utf-8 true:
  process
     do xml-parse document
        with utf-8 true
        scan file "myfile.sgm"
        output "%c"
     done
Related Syntax: #current-output, creating, document-end, document-start, external-text-entity, find-end, find-start, suppress
Related Concepts: Input; Input functions; XML DTDs: creating; XML/SGML parsing: built-in shelves