Find Median from Data Stream
Interview Prep
The problem
Design a class supporting two operations on a growing stream of
integers: addNum(x) inserts a value;
findMedian() returns the median of all values seen so
far. addNum must be amortised O(log n);
findMedian must be O(1).
Why naive approaches don't fit
A sorted list gives O(1) median but O(n)
insertion (need to shift). A BST balances the cost across both
operations but is complex to implement and gets ugly with
duplicates. The classic clean answer is two heaps:
a max-heap of the lower half and a min-heap of the upper half. The
median sits right between them.
class MedianFinderSorted:
"""Keep the data sorted; add is O(n), median is O(1)."""
def __init__(self): self.a = []
def addNum(self, x):
# bisect.insort would do this in O(log n) lookup + O(n) shift
import bisect; bisect.insort(self.a, x)
def findMedian(self):
n = len(self.a)
return self.a[n // 2] if n % 2 else 0.5 * (self.a[n//2 - 1] + self.a[n//2]) Pattern: maintain a partition with heaps
Split the stream into lo (the smaller half, max-heap)
and hi (the larger half, min-heap). Two invariants:
every value in lo is ≤ every value in hi;
the sizes differ by at most one (we let lo hold the
extra when the total count is odd). With those invariants:
- If the total is odd, the median is the top of
lo. - If the total is even, the median is the average of the two tops.
To insert: push to lo, then shuffle its top into
hi (preserves the "lo ≤ hi" invariant), then rebalance
sizes by moving from hi back to lo if
needed.
Solution: two heaps
import heapq
class MedianFinder:
"""Two heaps. Lo is a MAX-heap (stores negated values), hi is a MIN-heap.
Invariants:
- All of lo <= all of hi.
- |lo| - |hi| ∈ {0, 1}. (lo gets the extra when total is odd)
Median is lo's top, or the average of both tops.
"""
def __init__(self):
self.lo = [] # max-heap (negated)
self.hi = [] # min-heap
def addNum(self, x: int) -> None:
# 1) Tentatively push to lo (max-heap) and shuffle its top to hi
heapq.heappush(self.lo, -x)
heapq.heappush(self.hi, -heapq.heappop(self.lo))
# 2) Rebalance: keep |lo| >= |hi|
if len(self.hi) > len(self.lo):
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def findMedian(self) -> float:
if len(self.lo) > len(self.hi):
return -self.lo[0]
return 0.5 * (-self.lo[0] + self.hi[0])
Python's heapq is min-only, so the max-heap is faked
by negating on push and pop. A different language with a real
max-heap (or a std::priority_queue<int>) makes
this even cleaner.
Trace
Sequence: addNum(1), addNum(2), findMedian, addNum(3), findMedian
State invariants: lo (max-heap) <= hi (min-heap), |lo| ∈ {|hi|, |hi|+1}.
addNum(1):
push 1 to lo: lo=[-1] (i.e. {1}), hi=[]
shuffle top to hi: lo=[], hi=[1]
rebalance: lo=[-1], hi=[]
state: lo={1}, hi={}.
findMedian -> 1 (|lo|>|hi|, return -lo[0]) [not called yet at this point]
addNum(2):
push 2 to lo: lo=[-2,-1]={1,2}, hi=[]
shuffle top to hi: lo=[-1]={1}, hi=[2]
rebalance: same. lo={1}, hi={2}.
findMedian -> (1 + 2) / 2 = 1.5
addNum(3):
push 3 to lo: lo=[-3,-1]={1,3}, hi=[2]
shuffle top to hi: lo=[-1]={1}, hi=[2,3]
rebalance: |hi| > |lo|, move 2 to lo: lo=[-2,-1]={1,2}, hi=[3].
state: lo={1,2}, hi={3}.
findMedian -> 2 Complexity
- addNum:
O(log n)— three heap operations. - findMedian:
O(1). - Space:
O(n).
Variations worth knowing
- Constrained value range (e.g., 0..100): use a
bucket count instead —
O(1)addNum andO(range)median. Beats the heaps if the range is tiny. - Sliding window median: the heaps approach generalizes but needs lazy deletion of expired elements (or use a balanced BST / SortedList).
- Find k-th smallest in a stream: a single
bounded heap of size
khandles top-k; the median trick generalizes to two heaps with a fractional offset for arbitrary quantiles. - Approximate median / quantiles: for huge
streams, the t-digest and Greenwald-Khanna structures give
sub-percent error with
O(log² n)memory. Standard in observability stacks.