Lookaheads are an advanced, rarely used feature of regular expression syntax. A lookahead lets the regex engine validate a portion of the text internally while keeping that portion out of the externally returned match. However, if this internal lookahead match fails, the match attempt fails at that position, even if the rest of the expression would have matched otherwise.
Lookaheads get their name because the regex engine “looks ahead” at the upcoming text to check whether it matches, but passes over it without including it in the externally returned match.
The syntax for a regex lookahead looks like this:
(?=internally-matched-lookahead-regex)
Example regular expression with a lookahead matching a dash-separated phone number with a 408 area code:
/^(?=408)\d{3}-\d{3}-\d{4}$/g
In the expression above, the 408 area code is matched twice: first internally, per the lookahead, as the digits 4, 0, and 8 in that order, and then externally as part of a generic phone number, which is a set of 3 digits, followed by a dash, followed by another 3 digits, followed by another dash, and then by 4 more digits.
At the conclusion of the internal lookahead match, the regex engine returns to the point in the string where it started validating the lookahead (the beginning of the string in this case), and goes on from there to validate the rest of the expression in the conventional way. That second match is just a generic phone number, but it fails if the number has an area code other than 408, because the lookahead fails first. Had the lookahead been removed from this expression, the new expression would match any dash-separated 10-digit phone number regardless of its area code.
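As a rough sketch of the behavior described above, the snippet below runs the expression with and without the lookahead in JavaScript. The sample phone numbers are made up for illustration. The g flag is left off here on purpose: in JavaScript it makes test() and exec() remember their position between calls, and it adds nothing to a pattern anchored with ^ and $.

```javascript
// Lookahead version from the text, without the g flag (see note above).
const with408 = /^(?=408)\d{3}-\d{3}-\d{4}$/;
// The same pattern with the lookahead removed.
const anyAreaCode = /^\d{3}-\d{3}-\d{4}$/;

// Hypothetical sample numbers, chosen only for illustration.
const samples = ["408-555-0123", "415-555-0123", "408-5550-123"];

for (const s of samples) {
  // Prints the sample, then the result of each pattern.
  console.log(s, with408.test(s), anyAreaCode.test(s));
}

// The lookahead consumes nothing, so the returned match still
// starts at the beginning of the string and includes "408".
const m = with408.exec("408-555-0123");
console.log(m[0], m.index); // "408-555-0123" 0
```

Only the first sample passes both patterns; the second passes only the version without the lookahead, and the third has misplaced dashes and fails both.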
Regex lookaheads are rarely used because most validations can be accomplished without them. For example, the expression above can be simplified to just:
/^408-\d{3}-\d{4}$/g
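A quick way to convince yourself that the two expressions are interchangeable is to run both over the same inputs. This is only a spot check on a handful of made-up strings, not a proof:

```javascript
const withLookahead = /^(?=408)\d{3}-\d{3}-\d{4}$/;
const simplified = /^408-\d{3}-\d{4}$/;

// Hypothetical inputs: one valid number, a wrong area code,
// a number with no dashes, and a number with an extra digit.
const inputs = ["408-555-0123", "415-555-0123", "4085550123", "408-555-01234"];

// True when both patterns give the same verdict on every input.
const agree = inputs.every((s) => withLookahead.test(s) === simplified.test(s));
console.log(agree); // true
```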
However, certain situations related to enforcing string lengths can be very challenging to resolve without utilizing regex lookaheads.
For a trivial hypothetical example, imagine that the objective is to validate a variable-length token consisting of two groups of digits separated by a single dash, and that the total length of such a token must not exceed 10 characters. So the following tokens would be considered valid:
While the upper limit on the number of characters is 10, the lower limit would be 3, as that’s the shortest possible length to accommodate 2 single digits with a dash in-between.
And the following tokens below would be considered invalid, either because they contain characters other than digits, are not separated by a dash, have multiple dashes, and/or because their lengths exceed 10 characters:
Enforcing just the format, that the token must consist only of digits with a single dash separating them, is fairly easy. The corresponding regular expression would be:
/^\d+-\d+$/g
And enforcing just the 10 character length limit on such a token would also be easy, with this regular expression:
/^[\d-]{3,10}$/g
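To see what each of these two expressions checks on its own, here is a small sketch that runs both against a few tokens. The tokens are my own illustrative picks, not the original lists, and the g flag is again omitted because it makes test() stateful in JavaScript.

```javascript
const formatOnly = /^\d+-\d+$/;     // two digit groups, exactly one dash
const lengthOnly = /^[\d-]{3,10}$/; // 3 to 10 digits or dashes, any order

// Hypothetical tokens, chosen only for illustration.
const tokens = ["1-2", "12345-6789", "123456-7890", "12-34-56", "1234567890"];

for (const t of tokens) {
  // Columns: token, passes format check, passes length check.
  console.log(t.padEnd(12), formatOnly.test(t), lengthOnly.test(t));
}
```

The 11-character token "123456-7890" passes only the format check, while "12-34-56" and "1234567890" pass only the length check.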
However, this last expression above does not enforce the requirement that one and only one dash separate the 2 groups of digits. Enforcing both the token format and the token length with a single regular expression is not trivial. Consider, for example, a regex like this:
/^\d{1,8}-\d{1,8}$/g
The above regex would enforce a maximum length of 17 rather than 10, which is not what we want. Here’s another pathetic attempt:
/^(\d+-\d+){3,10}$/g
The above would not work either, as it would validate multiple concatenated tokens, enforcing the limit on the number of these concatenated tokens rather than on the number of characters in the token. The curly brackets syntax specifies how many times the preceding group must repeat, not how many characters that group may contain.
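Both failed attempts are easy to reproduce. The sketch below uses made-up inputs to show each one misbehaving:

```javascript
const attempt1 = /^\d{1,8}-\d{1,8}$/;
const attempt2 = /^(\d+-\d+){3,10}$/;

// Attempt 1 accepts a 17-character token (8 digits, a dash, 8 digits).
const long17 = "12345678-12345678";
console.log(long17.length, attempt1.test(long17)); // 17 true

// Attempt 2 rejects a perfectly valid short token, because the
// whole group has to repeat at least 3 times...
console.log(attempt2.test("1-2")); // false

// ...and accepts three tokens run together: "1-2", "3-4", "5-6".
console.log(attempt2.test("1-23-45-6")); // true
```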
As is typically the case with perplexing problems of all sorts, there usually exists a simple but less than ideal brute force method that will break the impasse, and this case is no exception. Here it would involve the alternation (OR) operator, written with the pipe character ‘|’. It gives us this unwieldy monster expression that does work, but which can’t be considered practical for problems of this sort, as it needs a separate alternative for every allowed split of the length and grows quickly as the limit or the number of groups increases:
/^((\d{1,5}-\d{1,4})|(\d{6}-\d{1,3})|(\d{7}-\d{1,2})|(\d{8}-\d)|(\d{1,4}-\d{5})|(\d{1,3}-\d{6})|(\d{1,2}-\d{7})|(\d-\d{8}))$/g
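The brute force expression can be checked mechanically. The sketch below generates every combination of 1 to 10 digits on each side of the dash and compares the regex verdict with a plain length check. The generated tokens are synthetic (runs of 1s and 2s) and the loop bounds are my own choice.

```javascript
const bruteForce = /^((\d{1,5}-\d{1,4})|(\d{6}-\d{1,3})|(\d{7}-\d{1,2})|(\d{8}-\d)|(\d{1,4}-\d{5})|(\d{1,3}-\d{6})|(\d{1,2}-\d{7})|(\d-\d{8}))$/;

let mismatches = 0;
for (let a = 1; a <= 10; a++) {
  for (let b = 1; b <= 10; b++) {
    // Synthetic token: a digits, a dash, then b digits.
    const token = "1".repeat(a) + "-" + "2".repeat(b);
    // The format is valid by construction, so only the length decides.
    const expected = token.length <= 10;
    if (bruteForce.test(token) !== expected) mismatches++;
  }
}
console.log(mismatches); // 0
```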
However, by using regex lookaheads, it is possible to combine multiple regular expressions into one. In essence, a regular expression lookahead can be used like a conditional AND operator, which is sadly missing from the regex syntax in explicit form. (Perhaps regex lookaheads would make an explicit AND operator superfluous.)
Combining the two parts of this validation problem into a single expression with a regex lookahead would look like this:
/^(?=[\d-]{3,10}$)\d+-\d+$/g
In the expression above, the regex engine algorithm first checks to see if the token contains only digits and dashes (in any order), and that the token is between 3 and 10 characters long. If the length of the token is either less than 3 or greater than 10, then the lookahead fails, failing the rest of the match along with it. And so this takes care of the string length enforcement component of the problem.
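Here is a minimal sketch of the combined expression at work. The tokens are hypothetical examples of my own, covering a valid short token, a valid token at the 10-character limit, and several kinds of invalid input.

```javascript
// Length check in the lookahead AND format check in the main pattern.
const combined = /^(?=[\d-]{3,10}$)\d+-\d+$/;

// Hypothetical tokens, chosen only for illustration.
const tokens = [
  "1-2",         // shortest valid token
  "12345-6789",  // exactly 10 characters
  "123456-7890", // 11 characters, too long
  "12-34-56",    // more than one dash
  "1234567890",  // no dash
  "-123",        // missing first digit group
  "12a-45",      // non-digit character
];

for (const t of tokens) {
  console.log(t.padEnd(12), combined.test(t));
}
```

Only the first two tokens pass; each of the others fails either the lookahead or the format part of the expression.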
A critical part of the solution above is the dollar sign $ at the conclusion of the lookahead (in addition to the regular dollar sign at the conclusion of the whole expression). This lookahead dollar sign is there to indicate to the regex engine that the string must terminate once the lookahead has matched its 3 to 10 characters, so no 11th character can follow. Without it, the lookahead would be satisfied by the first few characters of a longer string. This little caveat is easy to miss, but doing so would break the enforcement of the 10-character limit in this case.
If the purpose of any lookahead is to enforce a whole string length limit, then it must terminate with a dollar sign to tell the regex engine that the string must end there.
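The effect of the missing dollar sign is easy to demonstrate. In this sketch, the only difference between the two expressions is the $ inside the lookahead, and the 14-character token is a made-up example:

```javascript
const withDollar = /^(?=[\d-]{3,10}$)\d+-\d+$/;
const withoutDollar = /^(?=[\d-]{3,10})\d+-\d+$/;

// Hypothetical token that is well over the 10-character limit.
const tooLong = "123456-7890123";

console.log(tooLong.length);              // 14
console.log(withDollar.test(tooLong));    // false
// Without the $, the lookahead is satisfied by the first 10 characters
// and says nothing about what follows, so the limit is not enforced.
console.log(withoutDollar.test(tooLong)); // true
```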
In this blog, I demonstrated how the regular expression lookahead feature can be used as an implicit conditional AND operator, and utilized to validate whole string length with simplicity and elegance.
A single regular expression can also contain multiple lookaheads, and these can be separated by the conditional OR operator. Such a scheme would allow varying the length limit enforced depending on various conditions within the string.
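As one possible illustration of that idea, suppose (purely as an invented rule) that tokens starting with the digit 9 may be up to 12 characters long, while all other tokens keep the 10-character limit. Two lookaheads joined by OR can express this; the rule and the sample tokens below are assumptions of mine, not part of the original problem.

```javascript
// Either: starts with 9 and is 3 to 12 characters long,
// or:     starts with 0-8 and is 3 to 10 characters long.
// The format check (\d+-\d+) applies in both cases.
const variableLimit = /^(?:(?=9[\d-]{2,11}$)|(?=[0-8][\d-]{2,9}$))\d+-\d+$/;

console.log(variableLimit.test("912345-67890"));   // true:  12 chars, starts with 9
console.log(variableLimit.test("812345-67890"));   // false: 12 chars, starts with 8
console.log(variableLimit.test("81234-6789"));     // true:  10 chars, starts with 8
console.log(variableLimit.test("9123456-678901")); // false: 14 chars
```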
Beyond that, the usefulness of regex lookaheads is not limited to validating whole string lengths. With a minor tweak, lookaheads can also validate the lengths of sub-tokens within larger strings. This involves inserting one or more lookaheads at various positions within the expression, depending on the number and placement of the sub-tokens being validated, and replacing the terminating dollar sign used in the lookaheads in this blog with an expression matching some other boundary within the string.
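For instance, assuming a hypothetical input format of comma-separated tokens, each of which must individually satisfy the 10-character limit, a lookahead can be placed in front of every token, with the lookahead's closing $ replaced by "a comma or the end of the string". The format and the helper names below are invented for this sketch.

```javascript
// One token: the lookahead allows 3 to 10 digits or dashes and then
// requires a comma or the end of the string as the boundary.
const token = String.raw`(?=[\d-]{3,10}(?:,|$))\d+-\d+`;

// A list: one token, followed by zero or more ",token" repetitions.
const tokenList = new RegExp(`^${token}(?:,${token})*$`);

console.log(tokenList.test("1-2,12345-6789,77-88")); // true
console.log(tokenList.test("1-2,123456-7890"));      // false: second token has 11 chars
console.log(tokenList.test("1-2,"));                 // false: trailing comma
```

Building the pattern from a string keeps the repeated token expression in one place; String.raw is used so the backslashes do not need to be doubled.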
Have fun with lookaheads!
Copyright (c) 2010-2018 Marat Nepomnyashy
Except where otherwise noted, this webpage is licensed under a Creative Commons Attribution 3.0 Unported License.