Regular Expression Matching
Interview Prep
The problem
Implement regex matching with two metacharacters:
.matches any single character.*matches zero or more of the preceding element.
The match must cover the entire string (not just a
substring). The classic ambiguity: "a*" means "zero or
more a's", treated as a single pattern unit.
Why this is famously tricky
The * isn't a postfix on the preceding character in
isolation — it's a quantifier that creates a variable-length
match. The recursion has to decide, at each step, whether to
"consume zero copies of the starred pattern" (move past it in the
pattern) or "consume one copy" (move forward in the string but
keep the same pattern position). Almost everyone gets one of those
two branches wrong on the first try.
Recursive solution (for intuition)
def is_match_recursive(s: str, p: str) -> bool:
if not p:
return not s
first = bool(s) and (p[0] == s[0] or p[0] == '.')
if len(p) >= 2 and p[1] == '*':
# Either consume 0 chars of '<char>*' and skip it, or consume 1 char if first matches
return is_match_recursive(s, p[2:]) or (first and is_match_recursive(s[1:], p))
return first and is_match_recursive(s[1:], p[1:])
Correct, exponential time on pathological inputs (the classic
"aaaa...a", "a*a*a*...a*" blowup). Memoising on the
pair (i, j) of suffix indices is the standard fix —
same recurrence, polynomial time.
Pattern: 2D DP on prefix lengths
Define dp[i][j] = does s[:i] fully match
p[:j]? Three cases for each cell:
-
p[j−1] == s[i−1]orp[j−1] == '.': depends ondp[i−1][j−1](consume one char from both). -
p[j−1] == '*': two sub-cases — "zero copies of the starred unit" givesdp[i][j−2]; "one or more" givesdp[i−1][j]provided the preceding pattern char matchess[i−1]. -
Else:
False.
Edge case the unwary trip on: an empty string can match patterns
like "a*b*c*". Initialize dp[0][j]
accordingly by propagating along even-indexed '*'
columns.
Bottom-up DP
def is_match(s: str, p: str) -> bool:
"""dp[i][j] = does s[:i] match p[:j]? O(m*n)."""
m, n = len(s), len(p)
dp = [[False] * (n + 1) for _ in range(m + 1)]
dp[0][0] = True
# Handle empty s vs patterns like a*, a*b*, a*b*c*
for j in range(1, n + 1):
if p[j - 1] == '*':
dp[0][j] = dp[0][j - 2]
for i in range(1, m + 1):
for j in range(1, n + 1):
if p[j - 1] == '*':
# zero occurrences of <char>*
zero = dp[i][j - 2]
# one or more: only if the preceding pattern char matches s[i-1]
prev = p[j - 2]
one_or_more = dp[i - 1][j] and (prev == s[i - 1] or prev == '.')
dp[i][j] = zero or one_or_more
elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
dp[i][j] = dp[i - 1][j - 1]
# else: stays False
return dp[m][n] Trace
s = "aab", p = "c*a*b"
dp is (4 x 6). rows = s prefix length 0..3, cols = p prefix length 0..5.
Initialization:
dp[0][0] = True.
dp[0][2] = dp[0][0] = True (c* matches empty)
dp[0][4] = dp[0][2] = True (c*a* matches empty)
dp[0][1] = dp[0][3] = dp[0][5] = False
Filling row i=1 (s="a"):
j=1 p="c" : 'c' != 'a' -> False
j=2 p="c*" : zero=dp[1][0]=False, one+ requires prev='c' to match 'a' -> False
j=3 p="c*a": p[2]='a' matches s[0]='a' -> dp[1][3] = dp[0][2] = True
j=4 p="c*a*": zero=dp[1][2]=False, one+: prev='a' matches 'a', dp[0][4]=True -> True
j=5 p="c*a*b": p[4]='b' vs s[0]='a' -> False
Filling row i=2 (s="aa"):
... eventually dp[2][4] = True (c*a* matches "aa")
dp[2][5] False (no 'b' yet)
Filling row i=3 (s="aab"):
j=5 p="c*a*b": p[4]='b' matches s[2]='b' -> dp[3][5] = dp[2][4] = True.
return dp[3][5] = True ✓ Complexity
- Time:
O(m · n). - Space:
O(m · n)for the DP table.O(n)with a rolling row.
Variations worth knowing
- Wildcard Matching (LeetCode 44): different
semantics —
?matches any single character (like.);*matches any sequence, including empty (it's not a quantifier!). Same DP skeleton, simpler recurrence for the star. - NFA construction (Thompson's algorithm): the
industrial-strength approach used by real regex engines. Build a
nondeterministic automaton from the pattern, run it against the
string.
O(m · n)deterministically; handles full regex syntax. - Backtracking regex engines: the route taken by
PCRE / Python's
remodule. Supports backreferences, lookaround. Worst-case exponential — see "catastrophic backtracking" and ReDoS attacks. - Edit Distance: different problem, same DP grid shape. Useful sanity check that the "i × j prefix grid" is the right mental model for both.