Pattern matching
OmniMark allows you to search for particular strings in input data using find rules. For example, the following find rule will fire if the string "Hamlet:" is encountered in the input:
find "Hamlet:"
   output "<b>Hamlet</b>: "
Using this method, however, you would have to write a separate find rule for each character name you wanted to enclose in HTML bold tags. For example:
find "Hamlet:"
   output "<b>Hamlet</b>: "

find "Horatio:"
   output "<b>Horatio</b>: "

find "Bernardo:"
   output "<b>Bernardo</b>: "
As you can imagine, this is a pretty inefficient way to program.
This is where OmniMark "patterns" come in. OmniMark has rich, built-in, pattern-matching capabilities which allow you to match strings by way of a more abstract "model" of a string rather than matching a specific string. For example:
find letter+ ":"
This find rule will match any string of one or more letters followed immediately by a colon.
Unfortunately, the pattern described in this find rule isn't specific enough to flawlessly match only character names. It will match any string of letters that is followed by a colon that appears anywhere in the text, meaning that words in the middle of sentences will be matched.
Words that appear in the middle of sentences rarely begin with an uppercase letter, while names usually do. This allows us to add further detail to our find rule:
find uc letter+ ":"
This find rule matches any string that begins with an uppercase letter (uc) followed by at least one other letter (letter+) and a colon (":").
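For readers more familiar with regular expressions, the difference between the two patterns can be approximated in Python. This is only a rough analogue and not OmniMark code: it assumes ASCII letters, and the names any_word and capitalized are made up for illustration.

```python
import re

# Rough regex analogues of the OmniMark patterns (ASCII letters assumed).
any_word = re.compile(r"[A-Za-z]+:")          # letter+ ":"
capitalized = re.compile(r"[A-Z][A-Za-z]+:")  # uc letter+ ":"

line = "Hamlet: I shall obey: so be it"

word_hits = any_word.findall(line)     # also picks up the mid-sentence "obey:"
name_hits = capitalized.findall(line)  # only the capitalized "Hamlet:"
print(word_hits)
print(name_hits)
```

The looser pattern reports the mid-sentence word as well as the name, while the capitalized version reports only the name.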
If we were actually trying to mark up an ASCII copy of "Hamlet", however, our find rule would only match character names that contain a single word, such as "Hamlet", "Ophelia", or "Horatio". Only the second part of two-part names would be matched, so the names of "Queen Gertrude", "Lord Polonius", and so forth, would be incorrectly marked up.
In order to match these more complex names as well as the single-word names, we'll have to further refine our find rule:
find uc letter+ (white-space+ uc letter+)? ":"
In this version of the find rule, the pattern can match a second word prior to the colon. The pattern (white-space+ uc letter+)? can match one or more white-space characters followed by an uppercase letter and one or more letters. All of this allows the find rule to match character names that consist of one or two words.
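The effect of the optional second word can be sketched with a Python regular expression. This is an approximation for experimenting, not OmniMark syntax; the sample text and the name speaker are illustrative assumptions, and only ASCII letters are considered.

```python
import re

# Approximate analogue of: uc letter+ (white-space+ uc letter+)? ":"
# The optional non-capturing group plays the role of the parenthesized
# sub-pattern followed by "?".
speaker = re.compile(r"[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)?:")

text = "Queen Gertrude: Good Hamlet.\nHamlet: Ay, madam."
names = speaker.findall(text)
print(names)
```

Both the two-word name and the single-word name are matched in full, while "Good Hamlet." is skipped because no colon follows it.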
If you wanted to match a series of three digits, you could use the following pattern:
find digit {3}
To match a date that occurs in the "yy/mm/dd" format, the following pattern could be used:
find digit {2} "/" digit {2} "/" digit {2}
A postal code could be matched with the following pattern:
find letter digit letter "-" digit letter digit
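As a quick way to try these three patterns on sample data, here is a hedged Python sketch using approximate regular-expression equivalents. The sample string and variable names are invented for illustration, and the character ranges assume ASCII input.

```python
import re

# Approximate regex analogues of the three OmniMark patterns above.
three_digits = re.compile(r"[0-9]{3}")                    # digit{3}
date = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2}")          # yy/mm/dd
postal = re.compile(r"[A-Za-z][0-9][A-Za-z]-[0-9][A-Za-z][0-9]")

sample = "Shipped 99/12/31 to K1A-0B1, box 427."

digit_hits = three_digits.findall(sample)
date_hits = date.findall(sample)
postal_hits = postal.findall(sample)
print(digit_hits, date_hits, postal_hits)
```

Note that none of these patterns checks that the values are sensible: the date pattern would accept "99/99/99" just as readily as a real date.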
The letter and uc keywords that are used to create the patterns shown above are called "character classes". OmniMark provides a variety of these built-in character classes:
letter -- matches a single letter character, uppercase or lowercase
uc -- matches a single uppercased letter
lc -- matches a single lowercased letter
digit -- matches a single digit (0-9)
space -- matches a single space character
blank -- matches a single space or tab character
white-space -- matches a single space, tab, or newline character
any-text -- matches any single character except for a newline
any -- matches any single character
Any pattern can be modified through the use of occurrence operators:
+ (one or more)
* (zero or more)
? (zero or one)
So, as shown in the find rules above, for example, letter+ matches one or more letters, letter* matches zero or more letters, and uc? matches zero or one uppercase letter.
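The three occurrence operators behave much like their regular-expression counterparts. The following Python sketch is an analogue rather than OmniMark code; it tests whether an entire sample string is accepted, and the helper full is a hypothetical name introduced here.

```python
import re

samples = ["", "a", "abc"]

def full(pattern, text):
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

plus = [full(r"[A-Za-z]+", s) for s in samples]      # letter+  (one or more)
star = [full(r"[A-Za-z]*", s) for s in samples]      # letter*  (zero or more)
optional = [full(r"[A-Za-z]?", s) for s in samples]  # letter?  (zero or one)
print(plus, star, optional)
```

The empty string is rejected only by the "+" form, and the three-letter string is rejected only by the "?" form.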
It is also possible for you to define your own customized character classes. For example:
find ["+-*/"]
   output "found an arithmetic operator%n"
This find rule would fire if any one of the four arithmetic operators was encountered in the input data.
Compound character classes can be created using the keywords except and or:
find [any except "}"]
The find rule above would match any character except for a right brace.
This find rule would match any one of the arithmetic operators or a single digit:
find ["+-*/" or digit]
This one would match any of the arithmetic operators or any digit except zero ("0"):
find ["+-*/" or digit except "0"]
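The user-defined and compound classes above correspond loosely to bracketed character sets in regular expressions. The Python sketch below is an approximate analogue for illustration, not OmniMark code; the sample expression and the variable names are assumptions.

```python
import re

# Approximate regex analogues of the OmniMark character classes above.
operators = re.compile(r"[+\-*/]")          # ["+-*/"]
not_brace = re.compile(r"[^}]")             # [any except "}"]
op_or_digit = re.compile(r"[+\-*/0-9]")     # ["+-*/" or digit]
op_or_nonzero = re.compile(r"[+\-*/1-9]")   # ["+-*/" or digit except "0"]

expr = "10+2*{3}"

ops = operators.findall(expr)
no_brace = not_brace.findall(expr)
ops_digits = op_or_digit.findall(expr)
ops_nonzero = op_or_nonzero.findall(expr)
print(ops, no_brace, ops_digits, ops_nonzero)
```

In the regular expression the hyphen has to be escaped because it would otherwise denote a range, which is one place where the two notations differ.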