Find Median from Data Stream

Interview Prep

HardHeap / Priority Queueheapdesignstreaming

The problem

Design a class supporting two operations on a growing stream of integers: addNum(x) inserts a value; findMedian() returns the median of all values seen so far. addNum must be amortised O(log n); findMedian must be O(1).

Why naive approaches don't fit

A sorted list gives O(1) median but O(n) insertion (need to shift). A BST balances the cost across both operations but is complex to implement and gets ugly with duplicates. The classic clean answer is two heaps: a max-heap of the lower half and a min-heap of the upper half. The median sits right between them.

class MedianFinderSorted:
    """Keep the data sorted; add is O(n), median is O(1)."""
    def __init__(self): self.a = []
    def addNum(self, x):
        # bisect.insort would do this in O(log n) lookup + O(n) shift
        import bisect; bisect.insort(self.a, x)
    def findMedian(self):
        n = len(self.a)
        return self.a[n // 2] if n % 2 else 0.5 * (self.a[n//2 - 1] + self.a[n//2])

Pattern: maintain a partition with heaps

Split the stream into lo (the smaller half, max-heap) and hi (the larger half, min-heap). Two invariants: every value in lo is ≤ every value in hi; the sizes differ by at most one (we let lo hold the extra when the total count is odd). With those invariants:

If the total is odd, the median is the top of lo.
If the total is even, the median is the average of the two tops.

To insert: push to lo, then shuffle its top into hi (preserves the "lo ≤ hi" invariant), then rebalance sizes by moving from hi back to lo if needed.

Solution: two heaps

import heapq

class MedianFinder:
    """Two heaps. Lo is a MAX-heap (stores negated values), hi is a MIN-heap.
    Invariants:
      - All of lo <= all of hi.
      - |lo| - |hi| ∈ {0, 1}.    (lo gets the extra when total is odd)
    Median is lo's top, or the average of both tops.
    """
    def __init__(self):
        self.lo = []   # max-heap (negated)
        self.hi = []   # min-heap

    def addNum(self, x: int) -> None:
        # 1) Tentatively push to lo (max-heap) and shuffle its top to hi
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # 2) Rebalance: keep |lo| >= |hi|
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def findMedian(self) -> float:
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return 0.5 * (-self.lo[0] + self.hi[0])

Python's heapq is min-only, so the max-heap is faked by negating on push and pop. A different language with a real max-heap (or a std::priority_queue<int>) makes this even cleaner.

Trace

Sequence: addNum(1), addNum(2), findMedian, addNum(3), findMedian

State invariants:  lo (max-heap) <= hi (min-heap), |lo| ∈ {|hi|, |hi|+1}.

addNum(1):
  push 1 to lo:        lo=[-1] (i.e. {1}), hi=[]
  shuffle top to hi:   lo=[],              hi=[1]
  rebalance:           lo=[-1],            hi=[]
  state: lo={1}, hi={}.

findMedian -> 1  (|lo|>|hi|, return -lo[0])  [not called yet at this point]

addNum(2):
  push 2 to lo:        lo=[-2,-1]={1,2}, hi=[]
  shuffle top to hi:   lo=[-1]={1},      hi=[2]
  rebalance:           same.            lo={1}, hi={2}.

findMedian -> (1 + 2) / 2 = 1.5

addNum(3):
  push 3 to lo:        lo=[-3,-1]={1,3}, hi=[2]
  shuffle top to hi:   lo=[-1]={1},      hi=[2,3]
  rebalance: |hi| > |lo|, move 2 to lo: lo=[-2,-1]={1,2}, hi=[3].
  state: lo={1,2}, hi={3}.

findMedian -> 2

Complexity

addNum: O(log n) — three heap operations.
findMedian: O(1).
Space: O(n).

Variations worth knowing

Constrained value range (e.g., 0..100): use a bucket count instead — O(1) addNum and O(range) median. Beats the heaps if the range is tiny.
Sliding window median: the heaps approach generalizes but needs lazy deletion of expired elements (or use a balanced BST / SortedList).
Find k-th smallest in a stream: a single bounded heap of size k handles top-k; the median trick generalizes to two heaps with a fractional offset for arbitrary quantiles.
Approximate median / quantiles: for huge streams, the t-digest and Greenwald-Khanna structures give sub-percent error with O(log² n) memory. Standard in observability stacks.