Character classes

A character class matches any one of a set of characters. A character in the input data matches a character class if it is one of the characters in that class.

For example, the OmniMark built-in character class digit includes the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Given the input data 123ABC, the following pattern will match 1:

  find digit

And the following pattern will match 123:

  find digit+
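
As a minimal sketch of how this pattern behaves in a complete program, the following assumes a process rule that submits the sample data and a catch-all rule that discards unmatched input. The pattern variable name found-digits is illustrative only, and the sketch assumes a version of OmniMark that allows pattern variables in string expressions.

```omnimark
; Sketch: report each run of digits in the sample input.
process
   submit "123ABC"

find digit+ => found-digits
   output "digits: " || found-digits || "%n"

; Consume everything else so it is not copied to the output.
find any
```

With the input 123ABC the first rule should fire once, for 123. Replacing digit+ with digit would make the rule fire three times, once per digit.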

OmniMark provides the following predefined character classes:

  • letter—matches a single letter character, uppercase or lowercase
  • uc—matches a single uppercased letter
  • lc—matches a single lowercased letter
  • digit—matches a single digit (0-9)
  • space—matches a single space character
  • blank—matches a single space or tab character
  • white-space—matches a single space, tab, or newline character
  • any-text—matches any single character except for a newline
  • any—matches any single character
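
The predefined classes can be used directly in find rules. The following is a small sketch, not a definitive tokenizer: the input string and the pattern variable names found-word and found-number are assumptions made for illustration.

```omnimark
; Sketch: split input into words and numbers using predefined classes.
process
   submit "Route 66%nopens at 9"

find letter+ => found-word
   output "word: " || found-word || "%n"

find digit+ => found-number
   output "number: " || found-number || "%n"

; Spaces, tabs, and newlines separate tokens and are dropped.
find white-space+

; Anything else is discarded.
find any
```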

Since the predefined character classes may not always meet your needs, OmniMark lets you define your own character classes. A programmer-defined character class is contained between square brackets. For example, the following pattern matches an arithmetic operator:

  find ["+-*/"]

This character class consists of any of the characters in the string +-*/. If your character class needs to contain most characters, it is easier to list the characters to leave out: preceding the string of characters with the operator \ creates a class that matches every character except those you specify. For example, the following pattern matches any character except the XML markup characters <, &, and >:

  find [\ "<&>"]
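
One place an "everything except" class is useful is in copying ordinary text through while treating the excluded characters specially. The sketch below assumes a simple escaping task; the input string and the pattern variable name plain-run are illustrative.

```omnimark
; Sketch: escape XML markup characters, copying all other text unchanged.
process
   submit "if a < b & b > c"

; A run of characters that are not markup characters.
find [\ "<&>"]+ => plain-run
   output plain-run

find "<"
   output "&lt;"

find "&"
   output "&amp;"

find ">"
   output "&gt;"
```

For this input the output should be: if a &lt; b &amp; b &gt; c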

You can also define a character class by adding characters to, or subtracting characters from, a built-in character class. To add characters, you join character classes and strings with the or operator |. For example, the following pattern matches any hexadecimal digit:

  find [digit | "AaBbCcDdEeFf"]

To subtract characters, you use the operator \. For example, the following pattern matches any octal digit:

  find [digit \ "89"]

You can also use the operator | to join two or more built-in character classes, as in this pattern that matches any alpha-numeric character:

  find [letter | digit]
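
The hexadecimal and octal classes above might be used as follows. This is a sketch under assumed conventions: the "#" and "0" prefixes, the input string, and the names hex-run and octal-run are not part of OmniMark and are chosen only for the example.

```omnimark
; Sketch: pick out a "#"-prefixed hex value and a "0"-prefixed octal value.
process
   submit "color: #FF8800; octal mode: 0755"

; digit plus the letters A-F in either case.
find "#" [digit | "AaBbCcDdEeFf"]+ => hex-run
   output "hex: " || hex-run || "%n"

; digit minus 8 and 9.
find "0" [digit \ "89"]+ => octal-run
   output "octal: " || octal-run || "%n"

find any
```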

Note that while you can use the operator | as many times as you like, you can use the operator \ only once in a character class. Thus this pattern is not valid:

  find [letter \ "xyz" | digit \ "7"]

You must rewrite it as follows:

  find [letter | digit \ "xyz7"]
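
To illustrate that the single \ removes characters from the whole union to its left, here is a short sketch using the rewritten class. The input and the pattern variable name kept-run are assumptions for the example.

```omnimark
; Sketch: letters and digits, minus x, y, z, and 7.
process
   submit "xylophone 1776"

find [letter | digit \ "xyz7"]+ => kept-run
   output "[" || kept-run || "]"

; Excluded characters and the space fall through to this rule.
find any
```

If the class behaves as described, the rule should match lophone, 1, and 6, skipping x, y, both 7s, and the space.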

You can also specify ranges of characters using to. For example, the following code fragment matches any character between the lowercase letters a and m:

  find ["a" to "m"]

You can combine ranges with other items in a character class, or exclude them, and those other items can themselves be ranges. For example, the following pattern matches any character between the lowercase letters a and z as well as the period, the comma, or the question mark; it does not match the lowercase letters between i and n or the lowercase letter t:

  find ["a" to "z" | ".,?" \ "i" to "n" | "t"]
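
A quick way to see which characters such a class accepts is to bracket each run it matches. The sketch below assumes an ASCII-compatible encoding, where a to z is contiguous; the input and the name matched-run are illustrative.

```omnimark
; Sketch: show the runs matched by a class built from ranges.
process
   submit "what time?"

find ["a" to "z" | ".,?" \ "i" to "n" | "t"]+ => matched-run
   output "[" || matched-run || "]"

; t, i, m, and the space are not in the class and are discarded here.
find any
```

Under those assumptions the matched runs should be wha and e?.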

Take care when using ranges in character classes, because the letters of the alphabet are not always contiguous in a character encoding. In the EBCDIC character encoding, for example, there are non-alphabetic characters between A and Z.

Don't confuse a character class with a pattern. If you want to match any number of characters up to the first colon you can write either:

  find [\ ":"]*
or
  find any ** lookahead ":"

But if you need to match any number of characters up to a multi-character delimiter such as </price>, you must write:

  find any ** lookahead "</price>"
and not
  find [\ "</price>"]*

The latter will match any number of characters up to the first <, /, p, r, i, c, e, or > character, not any number of characters up to the string </price>.
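
The difference can be seen in a small sketch. The input string and the pattern variable name price-body are assumptions; the parentheses are used so that the variable captures everything matched before the delimiter.

```omnimark
; Sketch: capture the text between <price> and </price>.
process
   submit "<price>9.99 per item</price>"

; lookahead checks for the closing tag without consuming it.
find "<price>" (any ** lookahead "</price>") => price-body
   output "price: " || price-body || "%n"

; The closing tag and any other input are discarded.
find any
```

Here price-body should contain 9.99 per item. Had the rule used [\ "</price>"]* in place of the parenthesized pattern, the match would stop at the p in per, because p is one of the excluded characters.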

In previous versions of OmniMark, the keyword any was required before the \ operator when creating an "any except" character class. Thus the character class [\ "aeiou"] had to be written [any \ "aeiou"]. The form [any \ "aeiou"] is still permitted and is identical in meaning to [\ "aeiou"].

Deprecated syntax

The word except is a deprecated synonym for the except operator \.

Prerequisite Concepts