Regular Expression Matching

Interview Prep

LegendaryDynamic Programmingdpstrings

The problem

Implement regex matching with two metacharacters:

. matches any single character.
* matches zero or more of the preceding element.

The match must cover the entire string (not just a substring). The classic ambiguity: "a*" means "zero or more a's", treated as a single pattern unit.

Why this is famously tricky

The * isn't a postfix on the preceding character in isolation — it's a quantifier that creates a variable-length match. The recursion has to decide, at each step, whether to "consume zero copies of the starred pattern" (move past it in the pattern) or "consume one copy" (move forward in the string but keep the same pattern position). Almost everyone gets one of those two branches wrong on the first try.

Recursive solution (for intuition)

def is_match_recursive(s: str, p: str) -> bool:
    if not p:
        return not s
    first = bool(s) and (p[0] == s[0] or p[0] == '.')
    if len(p) >= 2 and p[1] == '*':
        # Either consume 0 chars of '<char>*' and skip it, or consume 1 char if first matches
        return is_match_recursive(s, p[2:]) or (first and is_match_recursive(s[1:], p))
    return first and is_match_recursive(s[1:], p[1:])

Correct, exponential time on pathological inputs (the classic "aaaa...a", "a*a*a*...a*" blowup). Memoising on the pair (i, j) of suffix indices is the standard fix — same recurrence, polynomial time.

Pattern: 2D DP on prefix lengths

Define dp[i][j] = does s[:i] fully match p[:j]? Three cases for each cell:

p[j−1] == s[i−1] or p[j−1] == '.': depends on dp[i−1][j−1] (consume one char from both).
p[j−1] == '*': two sub-cases — "zero copies of the starred unit" gives dp[i][j−2]; "one or more" gives dp[i−1][j] provided the preceding pattern char matches s[i−1].
Else: False.

Edge case the unwary trip on: an empty string can match patterns like "a*b*c*". Initialize dp[0][j] accordingly by propagating along even-indexed '*' columns.

Bottom-up DP

def is_match(s: str, p: str) -> bool:
    """dp[i][j] = does s[:i] match p[:j]?  O(m*n)."""
    m, n = len(s), len(p)
    dp = [[False] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = True

    # Handle empty s vs patterns like a*, a*b*, a*b*c*
    for j in range(1, n + 1):
        if p[j - 1] == '*':
            dp[0][j] = dp[0][j - 2]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[j - 1] == '*':
                # zero occurrences of <char>*
                zero = dp[i][j - 2]
                # one or more: only if the preceding pattern char matches s[i-1]
                prev = p[j - 2]
                one_or_more = dp[i - 1][j] and (prev == s[i - 1] or prev == '.')
                dp[i][j] = zero or one_or_more
            elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            # else: stays False
    return dp[m][n]

Trace

s = "aab", p = "c*a*b"

dp is (4 x 6).  rows = s prefix length 0..3, cols = p prefix length 0..5.

Initialization:
  dp[0][0] = True.
  dp[0][2] = dp[0][0] = True  (c* matches empty)
  dp[0][4] = dp[0][2] = True  (c*a* matches empty)
  dp[0][1] = dp[0][3] = dp[0][5] = False

Filling row i=1 (s="a"):
  j=1 p="c"  : 'c' != 'a' -> False
  j=2 p="c*" : zero=dp[1][0]=False, one+ requires prev='c' to match 'a' -> False
  j=3 p="c*a": p[2]='a' matches s[0]='a' -> dp[1][3] = dp[0][2] = True
  j=4 p="c*a*": zero=dp[1][2]=False, one+: prev='a' matches 'a', dp[0][4]=True -> True
  j=5 p="c*a*b": p[4]='b' vs s[0]='a' -> False

Filling row i=2 (s="aa"):
  ... eventually dp[2][4] = True (c*a* matches "aa")
  dp[2][5] False (no 'b' yet)

Filling row i=3 (s="aab"):
  j=5 p="c*a*b": p[4]='b' matches s[2]='b' -> dp[3][5] = dp[2][4] = True.

return dp[3][5] = True   ✓

Complexity

Time: O(m · n).
Space: O(m · n) for the DP table. O(n) with a rolling row.

Variations worth knowing

Wildcard Matching (LeetCode 44): different semantics — ? matches any single character (like .); * matches any sequence, including empty (it's not a quantifier!). Same DP skeleton, simpler recurrence for the star.
NFA construction (Thompson's algorithm): the industrial-strength approach used by real regex engines. Build a nondeterministic automaton from the pattern, run it against the string. O(m · n) deterministically; handles full regex syntax.
Backtracking regex engines: the route taken by PCRE / Python's re module. Supports backreferences, lookaround. Worst-case exponential — see "catastrophic backtracking" and ReDoS attacks.
Edit Distance: different problem, same DP grid shape. Useful sanity check that the "i × j prefix grid" is the right mental model for both.