A character class matches any one character from a set of characters. A character in the input data matches a character class if it is one of the characters in that class.
For example, the OmniMark built-in character class digit includes the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Given the input data 123ABC, the following pattern will match 1:
find digit
And the following pattern will match 123:
find digit+
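To try these patterns, a find rule needs input to scan. The following is a minimal sketch of a complete program, assuming the sample data is supplied with submit; the pattern variable name found-digits is illustrative and not part of the original example:

```omnimark
; Hypothetical harness: scan the sample data against the find rules.
process
   submit "123ABC"

; digit+ consumes the whole run of digits; the pattern variable
; found-digits (an illustrative name) holds the matched text.
find digit+ => found-digits
   output "digits: " || found-digits || "%n"
```

In this sketch the rule fires once, for the run 123. The remaining characters ABC are not matched by any find rule and should, by default, pass through to the current output.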
OmniMark provides the following predefined character classes:
letter: matches a single letter character, uppercase or lowercase
uc: matches a single uppercased letter
lc: matches a single lowercased letter
digit: matches a single digit (0-9)
space: matches a single space character
blank: matches a single space or tab character
white-space: matches a single space, tab, or newline character
any-text: matches any single character except for a newline
any: matches any single character
Since the predefined character classes may not always meet your needs, OmniMark lets you define your own
character classes. A programmer-defined character class is contained between square brackets. For example, the
following pattern matches an arithmetic operator:
find ["+-*/"]
This character class matches any one of the characters in the string +-*/. If your character class will contain many characters, you can include every character except those you specify by preceding the string of characters with the operator \. For example, the following pattern matches any character except the XML markup characters <, &, and >:
find [\ "<&>"]
You can also specify a character set by adding or subtracting characters from a built-in character set. To add characters, you join character classes and strings with the or operator |. For example, the following pattern matches any hexadecimal digit:
find [digit | "AaBbCcDdEeFf"]
To subtract characters, you use the operator \. For example, the following pattern matches any octal digit:
find [digit \ "89"]
You can also use the operator | to join two or more built-in character classes, as in this pattern that matches any alphanumeric character:
find [letter | digit]
Note that while you can use the operator | as many times as you like, you can only use the operator \ once in a character class. Thus this pattern is not valid:
find [letter \ "xyz" | digit \ "7"]
You must rewrite it as follows:
find [letter | digit \ "xyz7"]
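As a sketch of how such a combined class behaves inside a rule, the following program marks every run of letters and digits other than x, y, z, and 7. The input string and the pattern variable name kept are hypothetical, chosen only for illustration:

```omnimark
; Hypothetical input chosen to exercise both the included and excluded characters.
process
   submit "abc xyz 123 789"

; One \ only: everything joined by | before it is included,
; and every character in the string after it is excluded.
find [letter | digit \ "xyz7"]+ => kept
   output "[" || kept || "]"
```

With this input the rule should fire on abc, 123, and 89, but not on xyz or on the 7, because those characters were subtracted from the class.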
You can also specify ranges of characters using to. For example, the following code fragment matches any character between the lowercase letters a and m:
find ["a" to "m"]
You can combine ranges or exclude them from other things in a character set, including other ranges. For example, the following pattern matches any character between the lowercase letters a and z as well as the period, the comma, or the question mark; it does not match the lowercase letters between i and n or the lowercase letter t:
find ["a" to "z" | ".,?" \ "i" to "n" | "t"]
Take care when using character set ranges because the letters of the alphabet are not always contiguous in a character set. In the EBCDIC character encoding, for example, there are non-alphabetic characters between A and Z.
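The combined range pattern above can be exercised with a short program. This is a sketch only: the input string and the pattern variable name matched-run are assumptions made for illustration, and it assumes an ASCII-based encoding in which a to z is contiguous:

```omnimark
; Hypothetical input containing both allowed and excluded characters.
process
   submit "hello, world?"

; Included: a to z and the characters . , ?
; Excluded (after the single \): i to n, and t.
find ["a" to "z" | ".,?" \ "i" to "n" | "t"]+ => matched-run
   output "(" || matched-run || ")"
```

Because l falls between i and n, it is excluded, so the rule should match he, then o followed by the comma, then wor, and finally d followed by the question mark.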
Don't confuse a character class with a pattern. If you want to match any number of characters up to the first
colon you can write either:
find [\ ":"]*
or
find any ** lookahead ":"
But if you need to match any number of characters up to a multi-character delimiter such as
</price>, you must write:
find any ** lookahead "</price>"
and not
find [\ "</price>"]*
The latter will match any number of characters up to the first <, /, p, r, i, c, e, or > character, not any number of characters up to the string </price>.
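A minimal sketch of the multi-character case follows. The input string and the pattern variable name price-text are hypothetical, and the sketch assumes that a pattern variable can be attached directly to any ** ahead of the lookahead that terminates it:

```omnimark
; Hypothetical input: the content contains p, r, i, c, and e,
; which would stop a [\ "</price>"]* match almost immediately.
process
   submit "<price>9.99 per piece</price>"

; any ** keeps consuming until the whole string "</price>" is next;
; lookahead checks for the delimiter without consuming it.
find "<price>" any ** => price-text lookahead "</price>"
   output "price text: " || price-text || "%n"
```

With the character class version, the match would end at the first p in "per", which is why the any ** lookahead form is needed for multi-character delimiters.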
In previous versions of OmniMark, the keyword any was required before the \ operator in creating an "any except" character class. Thus the character class [\ "aeiou"] would be written [any \ "aeiou"]. The form [any \ "aeiou"] is still permitted and is identical in meaning to [\ "aeiou"].
The word except is a deprecated synonym for the except operator \.