A technique for selection sampling (sampling without replacement)

Tags: programming, howtos

Published on
« Previous post: Visualizing the sentiments in U.S. … — Next post: Simple unit tests with C++ and CMake »

Many applications require the selection of a subset of objects from a larger set. For example, suppose you are down-sampling a large data set by choosing the indices that you want to keep at random. If you are satisfied with obtaining duplicate indices (in essence, if you are sampling with replacement), this is a trivial matter in most programming language. Python certainly makes this absurdly easy:

import random

def sample(k,n):
  return [random.choice( [x for x in range(n)] ) for _ in range(k)]

But what about sampling without replacement? More precisely, what if you require the selection of k indices from an array of n values without choosing duplicate values but still want the probability of choosing a certain element to be uniform? Well, if you use Python, you are in luck again:

import random

def sample_without_replacement(k,n):
  return random.sample([x for x in range(n)], k)

But what about other programming languages that may not have similar builtin functionality, such as C++? In this case, a very simple and elegant technique is given by none other than the great Donald E. Knuth. The following algorithm originally was given in The Art of Computer Programming, Volume 2, Section 3.4.2 on p. 137. Briefly put, it works like this:

  1. Set t = 0 and m = 0. m represents the number of items selected so far, while t is the total number of items that have already been traversed.

  2. Generate a random number U that is uniformly distributed between zero and one.

  3. If (N-t)*U >= n-m, skip the current item by increasing t by 1 and going back to step 2. Else, select the current item by increasing both t and m by 1. Afterwards, either go back to step 2 or stop the algorithm (if sufficiently many items have been sampled already).

A very basic implementation of this algorithm (in C++) is given in the gist below:

I do not know about you, but my gut reaction upon seeing this algorithm for the first time was How can this possibly work? So, let us briefly prove the algorithm to be correct. Basically, we have to show that every item has the same probability of being chosen. The key insight is to realize that the probability has to change depending on the number of elements that have already been selected. More precisely, we need to determine the probability of choosing item t+1 if m items have already been selected. This requires some combinatorics. There are

Equation StartBinomialOrMatrix upper N hyphen t Choose n hyphen m EndBinomialOrMatrix ways of choosing `n` items from `N` such that `m` items are picked as the first `t` items. Or, equivalently, these are the number of potential permutations of the *remaining* `n-m` elements. Out of these, we are interested in all the ones that contain item `t+1`—but this is easy, because we can just take that item as granted and count the *remaining* combinations as Equation StartBinomialOrMatrix upper N hyphen t hyphen 1 Choose n hyphen m hyphen 1 EndBinomialOrMatrix , i.e. the number of ways to choose `n-m-1` items out of `N-t-1` ones. The quotient of these two numbers is exactly the probability with which we must choose item `t+1` if we want to have a uniform probability for choosing a certain item. This quotient turns out to be Equation StartFraction n minus m Over upper N minus t EndFraction , which looks familiar. Finally, note that since `U` was chosen to be uniformly distributed between zero and one, the condition `(N-t)*U >= n-m` in the algorithm is satisfied with the required probability. Consequently, this method samples without replacement in the required manner!

If you are interested or require a more efficient algorithm, please refer to the paper An Efficient Algorithm for Sequential Random Sampling by Jeffrey Scott Vitter in ACM Transactions on Mathematical Software, Volume 13, Issue 1, pp. 58–67, for more details. The paper should be available free-of-charge from the link provided above, but there is also a mirror available, thanks to the good folks at INRIA, who require their publications to be kept in an open archive. Time permitting, I might provide a second blog post and implementation about the more efficient algorithm.

That is all for now, until next time!