Looking ahead

The keyword lookahead in a compound pattern precedes data to be recognized but not consumed by the pattern-matching process. For example, the pattern

  digit+ lookahead blank* "+"

matches a string of digits that is followed by optional spaces and tabs and then a plus sign. However, only the digits are selected. The white space characters, if any, and plus sign remain in the source and can be selected by other patterns.
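
For readers more familiar with regular expressions, the same idea can be sketched with a lookahead assertion in Python's re module. This is an analogy rather than OmniMark code: the regular expression, the sample input, and the variable names are all assumptions made for illustration.

```python
import re

# Rough analogue of:  digit+ lookahead blank* "+"
# (?=...) tests the text that follows without consuming it.
OPERAND = re.compile(r"[0-9]+(?=[ \t]*\+)")

source = "12  + 34"                # hypothetical input
m = OPERAND.match(source)

selected = m.group()               # only the digits are selected
remainder = source[m.end():]       # white space and "+" are still in the source
```

Here selected holds only the digits, and remainder still begins with the white space and the plus sign, so a later pattern is free to select them.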

Following lookahead with the keyword not verifies that the selected data is not followed by input matching a given pattern. For example,

  digit+ lookahead not letter

selects a string of digits as long as the digits are not immediately followed by a letter. Only one letter needs to be found for the lookahead test to fail, so there is no need to put a "+" after letter in the example above.
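
A comparable negative test can be sketched in Python's re module with a negative lookahead assertion. Again this is only an analogy: the names below are hypothetical, and the sketch assumes that either the whole run of digits is selected or the match fails. Because a regular expression engine may give back a digit to satisfy the test, digits are included in the forbidden set so that it cannot.

```python
import re

# Rough analogue of:  digit+ lookahead not letter
# (?!...) succeeds only when the following text does NOT match.
# Listing 0-9 in the class keeps the engine from backing off a digit.
NUMBER = re.compile(r"[0-9]+(?![0-9A-Za-z])")

def select_digits(source):
    """Return the leading digits, or None when a letter follows them."""
    m = NUMBER.match(source)
    return m.group() if m else None
```

With this sketch, select_digits("123 apples") returns "123", while select_digits("123abc") returns None because the first letter after the digits makes the test fail.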

Positive and negative lookahead can be combined in one pattern. For example, in data files for the TeX formatter, instructions (called "control sequences") consist of a backslash followed by letters.

The control sequence to end a paragraph is \par. However, standard control sequences such as \parskip or \parindent as well as programmer-defined macro names can begin with the same string.

Suppose paragraphs consist only of letters, punctuation, and space characters. In other words, suppose that no control sequences occur within a paragraph. The following pattern matches paragraph text terminated by the \par control sequence; it fails to match input terminated by another control sequence beginning with the characters \par:

  [letter | ".,!?" | blank]+ lookahead "\par" lookahead not letter
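
The combination of a positive and a negative test can be illustrated in Python's re module by nesting a negative lookahead inside a positive one. This is a sketch of the idea, not OmniMark code. The character class is an assumed stand-in for letters, the listed punctuation, spaces, and tabs, and the function name is hypothetical.

```python
import re

# Paragraph text, then "\par" that is not followed by another letter.
# Neither "\par" nor the character after it is consumed.
PARAGRAPH = re.compile(r"[A-Za-z.,!? \t]+(?=\\par(?![A-Za-z]))")

def paragraph_text(source):
    """Return the paragraph text before a terminating \\par, else None."""
    m = PARAGRAPH.match(source)
    return m.group() if m else None
```

Under these assumptions, input such as `Hello world.\par Next` yields `Hello world.`, whereas `Hello world.\parskip` yields no match, since the letter after `\par` makes the inner test fail.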

Recall that any pattern can be enclosed in parentheses and used as a subpattern. lookahead patterns can be used in this way. For example,

  ((lookahead not "xyz") any)+

matches input up to, but not including, the first occurrence of the sequence "xyz". If the input does not contain "xyz", the pattern matches through to the end of the input. Note that both sets of parentheses are necessary. Without the inner set, any becomes part of the lookahead pattern. Without the outer set, the lookahead is not repeated as successive characters are selected.

The example above works as follows, starting at the current position in the input (the file, the data content, or the data being scanned):

  1. If the next three characters are "xyz", then the lookahead pattern fails, and the pattern terminates.
  2. If the next three characters do not match "xyz", or if there are fewer than three characters left, then the lookahead pattern succeeds. The current position is not advanced.
  3. If there are no more characters, then the pattern any will fail, and the whole pattern terminates.
  4. Otherwise, the pattern any matches the next character.
  5. The current point in the input is advanced a single character.
  6. The "+" indicator causes the above steps to be repeated.

If the pattern any has matched at least one character, then the pattern succeeds. Otherwise, it fails.
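
The numbered steps can be traced with a small hand-written loop. The Python function below is a minimal sketch of that procedure, not a description of how OmniMark implements it. The function name, its parameters, and the convention of returning None on failure are assumptions made for illustration.

```python
def match_until(source, start=0, stop="xyz"):
    """Emulate ((lookahead not stop) any)+ beginning at `start`.

    Returns the position just past the matched text, or None if
    no character could be matched.
    """
    pos = start
    while True:
        # Steps 1 and 2: the negative lookahead. Nothing is consumed.
        if source.startswith(stop, pos):
            break
        # Step 3: `any` fails when no characters are left.
        if pos >= len(source):
            break
        # Steps 4 and 5: `any` matches one character; advance past it.
        pos += 1
        # Step 6: the "+" indicator repeats the steps above.
    # The pattern succeeds only if at least one character was matched.
    return pos if pos > start else None
```

For example, match_until("abcxyzdef") stops at position 3, just before "xyz", and match_until("xyzabc") returns None because the lookahead fails before any character is matched. In regular-expression notation the same loop is often written as (?:(?!xyz).)+, although in Python that form also needs the re.DOTALL flag for "." to match line ends.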
