Pattern matching

OmniMark allows you to search for particular strings in input data using find rules. For example, the following find rule will fire if the string Hamlet: is encountered in the input:

  find "Hamlet:"
     output "<b>Hamlet</b>: "

Using this method, however, you would have to write a separate find rule for each character name you wanted to enclose in HTML bold tags. For example:

  find "Hamlet:"
     output "<b>Hamlet</b>: "

  find "Horatio:"
     output "<b>Horatio</b>: "

  find "Bernardo:"
     output "<b>Bernardo</b>: "

This approach does not scale well: every additional name requires another, nearly identical rule.

This is where OmniMark patterns come in. OmniMark has rich built-in pattern-matching capabilities that let you match input against an abstract description of a string rather than against one specific string. For example:

  find letter+ ":"

This find rule will match any string that consists of one or more letters followed immediately by a colon.

Unfortunately, the pattern described in this find rule isn't specific enough to flawlessly match only character names. It will match any string of letters that is followed by a colon that appears anywhere in the text, meaning that words in the middle of sentences will be matched.

Words that appear in the middle of sentences rarely begin with an uppercased letter, while names usually do. This allows us to add further details to our find rule:

  find uc letter+ ":"

This find rule matches any string that begins with an uppercase letter (uc) followed by at least one other letter (letter+) and a colon (":").

If we were actually trying to mark up an ASCII copy of Hamlet, however, our find rule would only match character names that contain a single word, such as Hamlet, Ophelia, or Horatio. Only the second part of two-part names would be matched, so the names Queen Gertrude, Lord Polonius, and so forth, would be incorrectly marked up.

In order to match these more complex names as well as the single-word names, we'll have to further refine our find rule:

  find uc letter+ (white-space+ uc letter+)? ":"

In this version of the find rule, the pattern can match a second word prior to the colon. The pattern (white-space+ uc letter+)? can match one or more white-space characters followed by an uppercase letter and one or more letters. All of this allows the find rule to match character names that consist of one or two words.
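
As a minimal sketch of how this pattern might sit in a complete program, the rule below captures the matched name in a pattern variable and wraps it in bold tags. The process rule, the sample text passed to submit, and the variable name speaker are assumptions made for illustration; they are not part of the original example.

```omnimark
; Illustrative sketch: the sample text and the name "speaker" are assumptions.
process
   submit "Hamlet: Words, words, words.%nLord Polonius: What is the matter?%n"

; The parentheses group the whole name so that "=> speaker" captures all of
; it. The colon is matched, but it stays outside the captured text.
find (uc letter+ (white-space+ uc letter+)?) => speaker ":"
   output "<b>" || speaker || "</b>:"
```

Input that no find rule matches is passed through to the output unchanged, so only the speaker names should be rewritten.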

The operators + and ? are occurrence indicators: they indicate how many occurrences of a pattern are expected in the input. OmniMark has other occurrence indicators. For example, if you wanted to match a series of exactly three digits, you could use the following pattern:

  find digit{3}

If you wanted to match either a four- or a five-digit number, you could use the following pattern:

  find digit{4 to 5}

To match a date that occurs in the yy/mm/dd format, the following pattern could be used:

  find digit{2} "/" digit{2} "/" digit{2}

A Canadian postal code could be matched with the following pattern:

  find letter digit letter " " digit letter digit
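
The counted patterns above can be combined with pattern variables to tag what they match. The following is only a sketch: the sample input, the variable names found-date and postal-code, and the output markup are all assumptions chosen for illustration.

```omnimark
; Illustrative sketch: input text, variable names, and tags are assumptions.
process
   submit "Shipped 24/03/15 to K1A 0B1%n"

; A yy/mm/dd date: three two-digit groups separated by slashes.
find (digit{2} "/" digit{2} "/" digit{2}) => found-date
   output "<date>" || found-date || "</date>"

; A Canadian postal code: letter digit letter, a space, digit letter digit.
find (letter digit letter " " digit letter digit) => postal-code
   output "<postal-code>" || postal-code || "</postal-code>"
```

The parentheses matter here: without them, the pattern variable would capture only the last sub-pattern rather than the whole date or code.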

The letter and uc keywords that are used to create the patterns shown above are called character classes. OmniMark provides a variety of these built-in character classes:

letter matches a single letter character, uppercase or lowercase,
uc matches a single uppercased letter,
lc matches a single lowercased letter,
digit matches a single digit (0-9),
space matches a single space character,
blank matches a single space or tab character,
white-space matches a single space, tab, or newline character,
any-text matches any single character except for a newline, and
any matches any single character.

Any pattern can be modified through the use of occurrence indicators:

+ (one or more),
* (zero or more),
? (zero or one),
** (zero or more upto), and
++ (one or more upto).

So, as shown in the find rules above, for example, letter+ matches one or more letters, letter* matches zero or more letters, and uc? matches zero or one uppercased letter.
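
As a small sketch of a character class combined with an occurrence indicator, the rule below collapses every run of white space into a single space. The sample input is an assumption; %t and %n stand for a tab and a newline.

```omnimark
; Illustrative sketch: the sample input is an assumption.
process
   submit "To   be,%tor not%n%nto be"

; Any run of spaces, tabs, and newlines is replaced by a single space.
find white-space+
   output " "
```

Replacing white-space+ with blank+ would leave the newlines in place, since blank matches only spaces and tabs.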

Using the identity operator

You must use the identity operator to match an item on a shelf, or its key:

  find ~foo[2]
  find ~foo{"bar"}

Using the "upto" occurrence indicators

The expressions any* and any+ are voracious. They will gobble up all the remaining input regardless of any other pattern that follows them. To contain their appetite, you can use the "upto" forms of these occurrence indicators, any ** and any ++. These forms match only up to the next pattern. Thus to match everything between the words start and end, you could write a pattern:

  find "start" any ** => middle "end"

There are two restrictions on the "upto" occurrence indicators:

They can only modify a character class.
There must be a following pattern (the thing they match up to).

You can apply the upto occurrence indicators to any character class, built-in or user-defined, but in practice they are most commonly used with any. Other possible applications include using them with any-text, which will match any characters up to the specified delimiter, as long as the delimiter occurs on the same line.
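
The any-text case might look like the following sketch, which reads simple name=value; fields that each sit on one line. The input format and the variable names field-name and field-value are assumptions made for this example.

```omnimark
; Illustrative sketch: the field format and variable names are assumptions.
process
   submit "title=Hamlet;%nauthor=Shakespeare;%n"

; any-text ** stops at the first ";" and cannot run past the end of the
; line, so a field with a missing ";" is not matched by this rule.
find letter+ => field-name "=" any-text ** => field-value ";"
   output field-name || " is " || field-value
```

Using any-text rather than any is a defensive choice: a missing delimiter on one line cannot cause the match to swallow the lines that follow.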

To match up to a delimiter without consuming that delimiter, use lookahead:

  find "start" any ** => middle lookahead "end"

When matching up to a delimiter, ask yourself if the end of the data is an alternative delimiter. For instance, if you are separating values which are delimited by the sequence \\ and you write the pattern:

  find any ++ => stuff "\\"

you will miss the last item in the sequence, because it is not followed by \\. To grab the last item, change the pattern to specify the end of the data as an alternate delimiter:

  find any ++ => stuff ("\\" | value-end)
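
Placed in a complete program, this rule might be used as in the sketch below. The process rule, the sample data, and the output format are assumptions made for illustration.

```omnimark
; Illustrative sketch: the sample data and output format are assumptions.
process
   submit "red\\green\\blue"

; Each item ends either at the \\ delimiter or at the end of the data,
; so the final item is picked up as well.
find any ++ => stuff ("\\" | value-end)
   output "item: " || stuff || "%n"
```

With the sample data above, the rule is expected to fire three times, once per item.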

It is important to understand how the any ++ operator works. Two examples illustrate its properties.

First, consider the data {} and the pattern "{" any ++ "}". The pattern will not match the data because any ++ must match at least one character before the delimiter.

Second, consider the data OXX and the pattern "O" any ++ "X". The pattern does not match this data either. Although one character (the first "X") is followed by the delimiter (the second "X"), any ++ encounters its delimiter at the first "X" before it has consumed any data, just as in the first example. It never looks at the second "X".
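
These two cases can be probed with do scan, as in the sketch below. The messages are assumptions made for illustration, and the third scan (the data {a}) is an added contrast case that is not part of the original discussion.

```omnimark
; Illustrative sketch: probing any ++ with do scan.
process
   ; Nothing lies between the braces, so any ++ has no character to consume.
   do scan "{}"
      match "{" any ++ "}"
         output "{} matched%n"
      else
         output "{} did not match%n"
   done

   ; any ++ meets its delimiter (the first X) before consuming anything.
   do scan "OXX"
      match "O" any ++ "X"
         output "OXX matched%n"
      else
         output "OXX did not match%n"
   done

   ; Contrast case: one character precedes the delimiter.
   do scan "{a}"
      match "{" any ++ "}"
         output "{a} matched%n"
      else
         output "{a} did not match%n"
   done
```

According to the behavior described above, the first two scans should take the else branch, while the third should match.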

any ++ is useful in those situations where you are certain that there is data before the delimiter character, or where you do not want to match at all if there is no data before the next delimiter. In choosing between ++ and ** you should also be aware of the properties of the following pattern:

  find any ** => data lookahead ("\\" | value-end)

This pattern attempts to match data up to a delimiter without consuming the delimiter. Because the lookahead leaves the delimiter unconsumed, and because ** can match zero characters, the pattern can succeed on zero characters whenever it is positioned at \\. As a result, this rule will probably fire twice. The first time, it consumes data up to the delimiter \\. The second time, it is positioned at the delimiter and fires again without consuming anything (unless a previous rule matches \\). It will not fire a third time because OmniMark does not permit two consecutive zero-length pattern matches.

Rewriting the code with ++ solves the problem:

  find any ++ => data lookahead ("\\" | value-end)

Other possible solutions include rewriting the pattern to allow it to consume either a leading or trailing delimiter:

  find any ** => data "\\"
  
  find "\\" any ** => data lookahead ("\\" | value-end)

Defining your own character classes

You can define your own character classes. For example:

  find ["+-*/"]
     output "found an arithmetic operator%n"

This find rule would fire if any one of the four arithmetic operators was encountered in the input data.

Compound character classes can be created using except or |:

  find [any except "}"]

The find rule above would match any character except for a right brace.

This find rule would match any one of the arithmetic operators or a single digit:

  find ["+-*/" | digit]

This one would match any of the arithmetic operators or any digit except zero (0):

  find ["+-*/" | digit except "0"]

A backslash (\) can be used as a short-hand for except: the previous example can be written

  find ["+-*/" | digit \ "0"]
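
User-defined character classes take occurrence indicators and pattern variables just as the built-in classes do. The following sketch shows two such rules in a small program; the sample input, the variable names inner and op-char, and the output format are assumptions made for illustration.

```omnimark
; Illustrative sketch: the sample input and variable names are assumptions.
process
   submit "{12+30}*4"

; A run of characters other than a right brace, enclosed in braces.
find "{" [any except "}"]* => inner "}"
   output "(" || inner || ")"

; Any single arithmetic operator outside the braces.
find ["+-*/"] => op-char
   output " " || op-char || " "
```

Because the first rule consumes everything between the braces, the + inside them never reaches the second rule.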

Zero-length pattern matching

The occurrence indicators ? and * allow for a pattern to succeed if it is matched zero (or more) times. In effect, this means that these patterns always match, since the zero in zero or more really means that the pattern succeeds even if it is not found in the data.

This is very useful behavior when there is an optional element in a pattern. For example, this pattern matches a currency amount in dollars whether or not cents are specified:

  find "$" digit+ ("." digit{2})?

The sub-pattern ("." digit{2})? will match a cents amount like .34 if it exists, but if it does not, the pattern succeeds anyway. The pattern always matches. Sometimes it matches zero characters.
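
A sketch of this rule in use follows. The sample input, the variable name amount, and the output tags are assumptions made for illustration.

```omnimark
; Illustrative sketch: the sample input, variable name, and tags are assumptions.
process
   submit "Prices: $5, $12.50 and $7.25%n"

; The optional group matches the cents when they are present and matches
; zero characters when they are not, so both forms are captured.
find ("$" digit+ ("." digit{2})?) => amount
   output "<price>" || amount || "</price>"
```

The optional group is safe here because the rest of the pattern always consumes at least the dollar sign and one digit, so the rule as a whole never matches zero characters.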

Because a pattern can succeed while matching zero characters, a rule can fire without consuming any data:

  find ("$" digit+ "." digit{2})?

The entire pattern above has a zero-or-one occurrence indicator. While it will match a currency value if one exists, it will also match zero characters at any point in the input. This means that it will fire whenever no previous pattern fires, no matter where it is in the data.

Since no data has been consumed, the pattern matching context has not changed and the rule would then fire again and again. However, OmniMark does not let this happen. OmniMark does not allow two consecutive zero-length pattern matches.

Once any pattern has matched zero characters, all rules in the current scan are prevented from matching zero characters until at least one character has been consumed. You can remove this restriction using the null pattern modifier.
