Syntax
define external source function utf16-input-file
      value stream filename
   endian value counter endian optional initial {0}
   strip-bom value switch strip-bom optional initial {true}
   exceptions-to value io-exception exceptions-to optional
Purpose
This external function reads the file named by the "filename" argument and returns the text of that file converted from a UTF-16 encoding to a UTF-8 encoding. The file is in UTF-16, but the program sees UTF-8.
Arguments:
- "filename". This is the name of the UTF-16 encoded file you want to read and convert to UTF-8. If a zero-length "filename" is used (that is, ""), then utf16-input-file does not open a file, but reads from standard input instead. This allows the conversion to be used in an OmniMark program that runs as a filter.
- "endian". This optional argument determines the binary ordering of pairs of octets in the UTF-16 encoding (in the file). The default is 0 (high-order octet first) for this function.
- "strip-bom". When true, this optional argument indicates to the function that a leading Byte Order Mark (BOM), if found, is not to be passed on to the program, but is to be stripped from the data. When false, the UTF-8 encoding of the BOM is passed to the program. The default value for this argument is "true".
- "exceptions-to". This optional argument indicates that errors are to be recorded in the passed "io-exception" object, and that the OmniMark program is not to be immediately terminated. There are three types of errors, categorized according to how they are handled:
  - Whenever an invalid or out-of-range encoding is found, it is converted to the UTF-8 encoding of the Unicode "REPLACEMENT CHARACTER" (0xFFFD). If "exceptions-to" is specified, the "io-exception" object is marked for a data encoding error, and the function continues processing.
  - If the external source function cannot be created, either because the declaration does not match what is expected or because there is not enough memory to create the source object, an error is signalled to OmniMark, and your program is terminated.
  - If "exceptions-to" is specified, then for any other type of error that occurs during memory allocation, file opening or closing, or reading or writing, the "io-exception" object is marked for the error found, and processing continues. If "exceptions-to" is not specified, an error is signalled to OmniMark and your program is terminated.
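The effect of the "endian" and "strip-bom" arguments, and of the replacement-character rule, can be sketched outside OmniMark. The Python function below is an illustration only, not the library's implementation. The name utf16_to_utf8 is hypothetical, the returned flag merely stands in for the marking of the "io-exception" object, and treating any non-zero "endian" value as low-order octet first is an assumption, since only the value 0 is described above.

```python
from typing import Tuple

BOM = "\ufeff"  # Byte Order Mark as a Unicode character


def utf16_to_utf8(data: bytes, endian: int = 0,
                  strip_bom: bool = True) -> Tuple[bytes, bool]:
    """Convert UTF-16 octets to UTF-8.

    Returns (utf8_octets, had_encoding_error). The flag is a stand-in
    for marking an io-exception object; it is not OmniMark's API.
    """
    # endian 0 means high-order octet first, as documented above.
    # ASSUMPTION: any other value selects low-order octet first.
    codec = "utf-16-be" if endian == 0 else "utf-16-le"
    try:
        text = data.decode(codec)
        had_error = False
    except UnicodeDecodeError:
        # Invalid sequences become U+FFFD and processing continues.
        text = data.decode(codec, errors="replace")
        had_error = True
    # A leading BOM is dropped only when strip_bom is true; otherwise
    # it is passed through and ends up UTF-8 encoded in the output.
    if strip_bom and text.startswith(BOM):
        text = text[1:]
    return text.encode("utf-8"), had_error
```

With the defaults, the octets FE FF 00 48 00 69 yield the UTF-8 text "Hi". With strip_bom set to false, the same input yields the three-octet UTF-8 form of the BOM followed by "Hi".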
Example:
; Submitting a file of UTF-16 encoded characters for scanning by find rules.
submit utf16-input-file "widetext.txt"