DC: Median

PDF of Eric’s handwritten notes are here.

In this lecture we’ll show a clever algorithm for computing the median of an unsorted list of numbers in linear time.

Median

Given an unsorted list A = [a_1, a_2, \dots, a_n] of size n, find the median of A. For concreteness we’ll define the median as the {\lceil \frac{n}{2} \rceil}^{th} smallest element of the list.

k-Selection

Just as in inductive proofs where one often needs to strengthen the inductive hypothesis and thus prove a stronger statement, here we will give an algorithm for a more general problem:

Given an unsorted list A = [a_1, a_2, \dots, a_n] of size n and an integer k, find the {k}^{th} smallest element of A.

Naive Algorithm

Sort the list A and return the {k}^{th} element of the sorted array. This can be done in O(n \log{n}) by a standard sorting algorithm, such as MergeSort.

Median of Medians Algorithm [BFPRT 1973]

We’ll present an O(n) time algorithm that was devised by Blum, Floyd, Pratt, Rivest and Tarjan in 1973.  The high-level strategy is divide-and-conquer, following the same scheme as in QuickSort.  As in QuickSort we choose a pivot p.  We then partition the input array A into three sets based on the pivot: A_{<p}, A_{=p} and A_{>p}.  Whereas QuickSort then recursively sorts the two subarrays A_{<p} and A_{>p}, in our setting we only need to search in at most one subarray. The choice of the pivot is the main challenge in the design of the fast algorithm.

High-level algorithm:
Here is pseudocode for the outline of the algorithm.

Select(A,k)

  • Choose a pivot p (How do we do this? This will be our main task.)
  • Partition A into three lists:
    A_{<p} = \{a \in A \vert a < p\}, A_{=p} = \{a \in A \vert a = p\}, A_{>p} = \{a \in A \vert a > p\}
  • If k \le \vert A_{<p} \vert, return Select(A_{<p}, k)
    If \vert A_{<p} \vert < k \le \vert A_{<p} \vert + \vert A_{=p} \vert, return p If k > \vert A_{<p} \vert + \vert A_{=p} \vert, return Select(A_{>p}, k-\vert A_{<p} \vert - \vert A_{=p} \vert)

Here is an example:
Consider the input A = [5,2,20,17,11,13,8,9,11]. and suppose we use pivot p=11.
Then: A_{<p}=[5,2,8,9], A_{=p}=[11,11], and A_{>p}=[20,17,13], and therefore:

  • if k \le 4, then recurse on Select(A_{<p}, k).
  • if k=5 or k=6, then output 11.
  • if k>6, then recurse on Select(A_{>p},k-6).

Good pivot

In QuickSort the choice of a bad pivot can lead to a O(n^2) time algorithm.  In our case we also need to be clever with the choice of the pivot.  For example if the pivot is always the smallest element in the array then we only remove one element from our problem at each iteration, which gives an O(n^2) algorithm! This follows from solving the recurrence T(n) \le T(n-1)+O(n)=O(n^2).

Recall the recurrence for MergeSort:  T(n) = 2T(n/2) + O(n) which solves to T(n)=O(nlog{n}).  We’re aiming for an O(n) time algorithm, and hence we’re hoping for a recurrence such as: T(n) = T(n/2) + O(n) which solves to T(n)=O(n).  Notice that if our one subproblem was still of size (3/4)n or even .99n, then our running time is still O(n).
In fact, for any constant a<1 the recurrence:
T(n) = T(an) + O(n) solves to T(n)=O(n).

Our goal is to choose a pivot that guarantees we eliminate a constant fraction of the elements at each iteration.  We will aim to find a pivot that guarantees that the subproblems are of size \leq (3/4)n.  Hence we make the following definition.

Definition (Good Pivot).  We say a pivot p is good if \vert A_{<p} \vert \le \frac{3}{4}n and \vert A_{>p} \vert \le \frac{3}{4}n.

Finding a good pivot

How do we find a good pivot?

Simple Randomized Scheme.  When in doubt, act randomly!  Choose an element r uniformly at random from A and set this random element r to be the pivot p.
What is the probability that r is a good pivot?
Consider a sorted version of (this is just for the analysis, we don’t actually sort A).  Notice that in sorted A, all elements between n/4 and 3n/4 are good pivots.  Thus there are n/2 good pivots.  Therefore:

Pr(r is a good pivot) = \frac{n/2}{n} = \frac{1}{2}.

Notice that we can easily check if a proposed pivot p is good or not: simply compute A_{<p} and A_{>p} and check their sizes.  Thus we have the following randomized procedure to find a good pivot.

  1. Choose a random element from A.
  2. Check if is a good pivot or not.  If it’s a good pivot set p=r and use it.  If not then repeat step 1.

Since is a good pivot with probability 1/2 then in expectation we have to repeat the procedure two times to find a good pivot.  Hence this procedure yields an algorithm whose expected running time is O(n).

Deterministic scheme. Notice that if we find a good pivot in O(n) time then our recurrence will be T(n) = T(.75n) + O(n) so we still have roughly T(.2n) time left at our disposal.   We’ll use the fact that for constants a and b where a+b<1 the following recurrence:

T(n) = T(an) + T(bn) + O(n) solves to T(n) = O(n).

We will aim for an approach that has a=.75 and b=.2 (In your homework you will prove that this case solves to O(n) time).  So we will use this additional time of T(.2n) to find a good pivot.  This suggests a recursive approach:  Choose a subset of A where |S|=n/5, recursively find the median of S, and set p=median(S).

We want to choose a subset  that is representative of the entire set A.  Suppose we set to be the first n/5 elements of A.  In the worst case will be the n/5 smallest elements of and then p=median(S) will be the (n/10)-th smallest element of and hence we’ll have: \vert A_{>p} \vert = .9n.  This gives the recurrence T(n) = T(\frac{n}{5}) + T(\frac{9n}{10}) + O(n) which does not solve to O(n) because 1/5 + 9/10 > 1.

So we need to find a set that is more representative of the entire set A.  We want a set S where for every x\in S, there are few elements of which are ≤ x and a few which are ≥ x.  Consider the first 5 elements of A.  We want to choose a good representative for these 5 elements.  Let’s sort these 5 elements (this takes only O(1) time since there are only 5 elements).  Take the median of these 5 elements.  That’s a good representative since at least 3 elements are ≤ m and at least 3 elements are  ≥ x.  So let’s continue with this approach.

Break into n/5 groups of 5 elements each (assume n is a power of 5 for simplicity so all groups are the same size).  Call the groups G_1, G_2, \dots, G_{n/5}.  Sort each of these groups and let m_i denote the median of the i-th group for i=1,\dots,n/5.  Let S=\{m_1,\dots,m_{n/5}\} denote the set of all n/5 medians.  Finally, recursively run our median finding procedure to find the median of S, and set p=median(S).  We will prove that this p is a good pivot.  We now have our complete algorithm, here is the pseudocode:

FastSelect(A,k)

  • Partition A into \frac{n}{5} groups G_1, G_2, \dots, G_{\frac{n}{5}} of 5 elements each.
  • For i=1 to \frac{n}{5}:
    • Sort G_i and let m_i = Median(G_i)
  • S = [m_1, m_2, \dots, m_{\frac{n}{5}}]
  • p = FastSelect(S, \frac{n}{10})
  • Partition A into three lists A_{<p} = \{a \in A \vert a < p\}, A_{=p} = \{a \in A \vert a = p\}, A_{>p} = \{a \in A \vert a > p\}
  • If k \le \vert A_{<p} \vert, return FastSelect(A_{<p}, k)
    If \vert A_{<p} \vert < k \le \vert A_{<p} \vert + \vert A_{=p} \vert, return p If k > \vert A_{<p} \vert + \vert A_{=p} \vert, return FastSelect(A_{>p}, k-\vert A_{<p} \vert - \vert A_{=p} \vert)

It just remains to prove that p is a good pivot.

Running time analysis.

Claim 1: Pivot p selected above is a good pivot

Proof:
Without loss of generality, assume the groups G_1, G_2, \dots, G_{\frac{n}{5}} are sorted by their medians i.e. m_1 \le m_2 \le \dots \le m_{\frac{n}{5}}.  Note we are not doing this in the algorithm, we are simply relabelling the names of the groups in the proof.  Now we have that the median of S is m_{n/10}.
Consider p = m_{\frac{n}{10}}.

Let’s analyze how many elements of A are guaranteed to be ≤ p.
We have that half of is ≤ p, i.e. m_1 \le m_2 \le \dots \le m_{\frac{n}{10}} \le p.
Also, for each of the first n/10 groups G_1, G_2, \dots, G_\frac{n}{10}\}, we have that m_i is greater than or equal to at least 3 elements in G_i.
Thus, p is greater than or equal to at least \frac{n}{10}\times 3 = \frac{3n}{10} elements of A.
Thus, the number of elements strictly greater than p = \vert A_{>p}\vert \le n - \frac{3n}{10} = \frac{7n}{10}.
Using a similar argument, the number of elements smaller than p = \vert A_{<p}\vert \le \frac{7n}{10}.
This proves that p is a good pivot.

Claim 2: FastSelect(A,k) runs in O(n) time in the worst case

Proof:
Let T(n) denote the time to run FastSelect on a list of size n.
Time required to construct S = O(n)
Time required to find pivot p = T(\frac{n}{5})
Time required to partition A into A_{<p}, A_{=p}, A_{>p} = O(n)
Since p is a good pivot, the time required to recursively solve one subproblem for A_{<p} or A_{>p} is at most T(\frac{3n}{4}) (actually it’s \leq T(\frac{7n}{10}) if you recall the proof of Claim 1).
Thus, we have the recurrence relation T(n) = T(\frac{n}{5}) + T(\frac{3n}{4}) + O(n).
Since \frac{1}{5} + \frac{3}{4} = 0.95 < 1, we have that T(n) = O(n).