Bkkbrad's approach is perhaps the most efficient, in that the number of records sent out from each mapper is at most K. On the other hand, note that it assumes the sample itself (i.e. K elements) fits in the memory of a single reducer.
When this is not the case, you may be tempted to use a fully distributed approach where the mapper assigns each matching row a random integer in {1,..,K} and the reduce phase picks one element for each key (also see this question). The problem with this approach is that, by chance, no row may be assigned to certain keys, in which case the final sample has fewer than K elements. Even though this happens with small probability if K is much smaller than the total number of rows N, it happens with constant probability if K is a constant fraction of N (say, when K=N/3).
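To see the failure mode concretely, here is a quick single-machine simulation (a sketch, not the MapReduce job itself; `naive_sample_size` is a hypothetical helper name). The final sample size equals the number of distinct keys that received at least one row, so any empty key makes the sample short:

```python
import random

def naive_sample_size(n, k, seed=None):
    """Assign each of n rows a random key in {1,..,k}; the reduce phase
    keeps one row per key, so the sample size is the number of distinct
    keys that received at least one row."""
    rng = random.Random(seed)
    keys = {rng.randint(1, k) for _ in range(n)}
    return len(keys)

# With K = N/3, a given key stays empty with probability about
# (1 - 1/K)^N ~ e^(-3) ~ 5%, so ~0.05*K keys are empty on average.
n, k = 9000, 3000
trials = 200
short = sum(naive_sample_size(n, k, seed=t) < k for t in range(trials))
print(f"{short}/{trials} trials produced fewer than K elements")
```

With these parameters roughly 150 of the 3000 keys are empty in expectation, so virtually every trial comes up short.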
A solution that works is the following: suppose we have B buckets and generate a random ordering of the elements, first by putting each element into a random bucket and then by generating a random ordering within each bucket. The elements in the first bucket are considered smaller (with respect to the ordering) than the elements in the second bucket, and so on. Then, to pick a sample of size K, we can collect all of the elements in the first j buckets if together they contain a number of elements t less than K, and then pick the remaining K−t elements from the next bucket. Here B is a parameter chosen so that N/B elements fit into memory. The key aspect is that the buckets can be processed in parallel.
Mapper: Output all qualifying rows, each with a random key (j, r), where j is a random integer in {1,..,B} and r is a random float. In addition, keep track of the number of elements with key less than j (for 1<=j<=B) and transmit this information to the reducers.
Shuffle: Partition over j, and secondary sort over r.
Reducer: Consider bucket j and assume the reducer knows how many elements fall in buckets less than j and how many in bucket j itself (by aggregating the counts received from the mappers). If the number of elements in buckets less than or equal to j is at most K, output all elements in bucket j. If the number of elements in buckets strictly less than j is some t < K (but including bucket j would exceed K), run reservoir sampling to pick K−t random elements from the bucket. In the remaining case, that is, when the buckets less than j already contain at least K elements, don't output anything.
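The steps above can be sketched in a single process as follows (a simulation, not real MapReduce code: grouping into `buckets` stands in for the shuffle, and the serial loop over j mimics what each reducer decides independently from the aggregated counts; all names are illustrative):

```python
import random
from collections import defaultdict

def reservoir_sample(stream, m, rng):
    """Classic reservoir sampling: a uniform random m-subset of a stream."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < m:
            reservoir.append(x)
        else:
            pos = rng.randint(0, i)
            if pos < m:
                reservoir[pos] = x
    return reservoir

def sample_k(rows, K, B, seed=0):
    """Pick K of the rows uniformly at random via the bucket scheme."""
    rng = random.Random(seed)
    # "Map": tag each row with key (j, r) and count rows per bucket.
    buckets = defaultdict(list)
    counts = [0] * (B + 1)
    for row in rows:
        j = rng.randint(1, B)
        buckets[j].append((rng.random(), row))
        counts[j] += 1
    # "Shuffle" + "Reduce": buckets in increasing j, rows sorted by r.
    sample = []
    below = 0  # rows in buckets strictly less than j
    for j in range(1, B + 1):
        if below >= K:
            break  # earlier buckets already filled the sample
        bucket = [row for _, row in sorted(buckets[j])]
        if below + counts[j] <= K:
            sample.extend(bucket)  # the whole bucket fits
        else:
            # Bucket j straddles position K: take K - below rows at random.
            sample.extend(reservoir_sample(bucket, K - below, rng))
        below += counts[j]
    return sample

s = sample_k(list(range(100)), K=30, B=10)
print(len(s))  # -> 30
```

Unlike the naive key-in-{1,..,K} scheme, the sample here always has exactly K elements (as long as at least K rows qualify), since each bucket is either taken whole or topped up to exactly K.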
I'm not aware of a simpler solution to this problem, but it would be nice if there were one.
You can find more details here on my blog.