XML Parsing and UTF-8 Encoding

Version 4.0.1 of the OmniMark programming language supports UTF-8 encoding as part of the XML parser. So that characters can be processed uniformly, regardless of how they reach the XML parser, OmniMark converts numeric character references (such as "&#161;") and hexadecimal character references (such as "&#xA1;") into their corresponding UTF-8 encodings.
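
The conversion can be illustrated outside OmniMark. The following Python sketch is not OmniMark's implementation; the function name expand_char_refs and the regular expression are assumptions made for illustration only. It replaces each numeric or hexadecimal character reference with the character it names, then encodes the result as UTF-8.

```python
import re

# Matches numeric (&#161;) and hexadecimal (&#xA1;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char_refs(text: str) -> bytes:
    """Return text as UTF-8 bytes with character references expanded.

    Hypothetical helper for illustration; it does no validation of
    the referenced code point beyond what chr() itself enforces.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)

    return _CHAR_REF.sub(replace, text).encode("utf-8")


# Both forms of the reference produce the same two-byte UTF-8 sequence.
print(expand_char_refs("&#161;Hola!"))   # b'\xc2\xa1Hola!'
print(expand_char_refs("&#xA1;Hola!"))   # b'\xc2\xa1Hola!'
```

Once both forms have been reduced to the same byte sequence, later processing does not need to know which form appeared in the input.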

Version 4.0 of OmniMark supported XML and UTF-8, but handled some character references incorrectly. Version 4.0.1 fixes the following problems that occurred in version 4.0:

Numeric character references with values greater than 255 were rejected by the XML parser (consumed and ignored), and an error was reported.
Hexadecimal character references were reduced to their low-order 8 bits, and no error was signalled.
Both kinds of character reference with values between 128 and 255, when treated as bytes, looked like UTF-8 encoding to the XML parser and were treated as such. This caused characters to be misrepresented when passed through the parser.
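
The effect of the second and third problems can be reproduced with a few lines of Python. This is only an illustration of the byte arithmetic involved, under the assumption that version 4.0 emitted one raw byte per reference; the helper names are hypothetical and nothing here is OmniMark code.

```python
def low_order_byte(code_point: int) -> int:
    """Keep only the low-order 8 bits, as described for hexadecimal references."""
    return code_point & 0xFF


def as_raw_bytes(code_points) -> bytes:
    """Emit each value (0..255) as a single byte, with no UTF-8 encoding."""
    return bytes(code_points)


def as_utf8(code_points) -> bytes:
    """Emit each value as the UTF-8 encoding of that character."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


# A reference to U+20AC silently becomes the unrelated value 0xAC.
print(hex(low_order_byte(0x20AC)))

# The two references &#xC3; and &#xA9; emitted as raw bytes happen to form
# the valid UTF-8 sequence for U+00E9, so two characters are read as one.
misread = as_raw_bytes([0xC3, 0xA9]).decode("utf-8")
print(len(misread))

# Encoding each value as UTF-8 keeps the two characters distinct.
intact = as_utf8([0xC3, 0xA9]).decode("utf-8")
print(len(intact))
```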

The following translate rule converts UTF-8 encoded characters outside the ASCII range back into hexadecimal character references:

  translate utf8-char => c
     local counter n
     set n to utf8-char-number c
     do when n <= "%16r{7F}"
        output c
     else
        output "&#x%16rud(n);"
     done
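
For readers who do not use OmniMark, the same logic can be sketched in Python. The function name escape_non_ascii is an assumption made for illustration; the sketch takes text that has already been decoded from UTF-8, passes ASCII characters through, and writes every other character as an uppercase hexadecimal character reference.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each character above U+007F with a hexadecimal reference."""
    parts = []
    for ch in text:
        n = ord(ch)                      # the character's number
        if n <= 0x7F:
            parts.append(ch)             # ASCII: output unchanged
        else:
            parts.append(f"&#x{n:X};")   # otherwise: &#x...; in uppercase hex
    return "".join(parts)


# The input bytes are UTF-8; decode them before escaping.
utf8_input = b"\xc2\xa1Hola!"
print(escape_non_ascii(utf8_input.decode("utf-8")))   # &#xA1;Hola!
```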

Prerequisite Concepts
XML document processing
