Generating random distributions

最新推荐文章于 2021-03-16 13:10:50 发布

lllcfr1

最新推荐文章于 2021-03-16 13:10:50 发布

阅读量744

点赞数

分类专栏： Trick C++ TopCoder

C++ 同时被 3 个专栏收录

21 篇文章 0 订阅

订阅专栏

Trick

11 篇文章 0 订阅

订阅专栏

TopCoder

7 篇文章 0 订阅

订阅专栏

Related Materials: Generation of Random Variables (Computational Statistics)

http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=randomNumbers
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=randomNumbers2

Generating random distributions
Author
By bmerry
TopCoder Member

One of the features that I have found interesting about the Marathon Match format is that the data is hidden, yet you have a pretty good idea of what it is going to look like. This is because all the cases are randomly generated, and the problem always contains a detailed specification of the generation method. While there is sometimes a visualization tool that can generate test cases for you, other times you'll have to do it yourself. In this tutorial, we'll look at a few probability distributions and how to generate values based on them.

I'll be discussing the issues in terms of C/C++ (my language of choice). Accordingly, some of the initial, low-level stuff won't be applicable to other languages that have more user-friendly random number libraries. However, the later information will be applicable to all.

Uniform scalar and discrete distributions

Let's start with something simple: generating a single variable. The basic building block is the uniform discrete distribution, meaning that each of N possibilities is equally likely. The approach that is often used when you want an arbitrary number and aren't worried about things being exactly uniform is rand() % N, but there are a few problems with this. Firstly, some random number generators are badly designed, with the low-order bits repeating quite quickly. To be safe, I'd recommend that you use something like:
(int) (N * (rand() / (RAND_MAX + 1.0)))

This will first obtain a real number in the range [0, 1), then scale and truncate it. Another problem is that unless N is a factor of RAND_MAX + 1, either method will generate a small amount of bias, particularly if N is quite large. Think about trying to make a choice among 5 options using a 6-sided die: there is no way to map the 6 die rolls to the 5 options in an unbiased way. The solution is the metaphorical equivalent of rolling the die again if a 6 comes up: we identify a useful range of random outputs (a multiple of N), and if anything else comes up, we discard it. The simplest solution is to reject any rand() that is at least RAND_MAX - RAND_MAX % N. This uses as much of the range as possible, and since this is always at least half of it, the average number of calls to rand() is at most 2 and usually far less.

If your favourite language doesn't provide you with a way to generate a random real number in the range [0, 1), you can generate one as rand() / (RAND_MAX + 1.0). This is not perfect, because it will only generate one of typically 2³¹ values, but for most purposes that is good enough.

If I'm writing any kind of data generator, I start by writing utility functions to implement these algorithms, so that I don't need to worry about the details when I'm doing the interesting parts.

Non-uniform discrete distributions

Let's say that you have a loaded die, with specified probabilities for each outcome. This can be simulated by mapping each outcome to a suitable range of the [0, 1) interval, then generating a uniform random number. For example, suppose a die has probabilities of 0.1, 0.2, 0.1, 0.3, 0.15 and 0.15 of coming up 1, 2, 3, 4, 5 or 6. We can map these events to the ranges [0, 0.1), [0.1, 0.3), [0.3, 0.4), [0.4, 0.7), [0.7, 0.85) and [0.85, 1.0). We then generate a random number in [0, 1) with a uniform (unbiased) distribution, and check which range it lies in. If there are many possible events, a binary search can accelerate this lookup.

In some cases, the probabilities will change at run-time. A real-world instance of this is in streaming data compression, where the probability of seeing a piece of text is updated depending on the number of times it has already been seen. A binary indexed tree is ideal for this type of problem.

Non-uniform continuous distributions

The idea of the previous section can be generalized to continuous distributions. As an example, let's suppose that some number lies in the range [0, 1), but the probability of it being x is directly proportional to x(so values close to 1 are more likely). First, let's find out exactly what the probability density function is: it's going to have the form p(x) = ax, but we need to choose a so that the integral is 1. Doing a little calculus shows that a = 2. Next, we want the cumulative density function, which is the indefinite integral of the probability density function, in this case c(x) = x². To generate a random number with a given distribution, the procedure is to start with a uniform random number in the range [0, 1), then pass it through the inverse of the cumulative distribution function. In this case, we just take sqrt(r), where r is a uniform random number in [0, 1).

This works when the cumulative density function is both known and reasonably easy to invert (although binary search can always be used for the inversion). Unfortunately, the most important distribution, the Gaussian or normal distribution, does not have a formula for its cumulative density function. Curiously, it is easier to generate twonormally distributed random numbers. The following algorithm will do this. I won't provide a proof, but if you want to try to prove it yourself, think about integrating the bivariate Gaussian distribution in polar form.

Pick two random numbers x and y uniformly from the interval [-1, 1].

Let r² = x² + y².

If r² > 1, go back to step 1.

Let u = sqrt(-2 log(r²) / r²)x.

Let v = sqrt(-2 log(r²) / r²)y.

u and v both have the standard normal distribution.

Although you get two random numbers out, I usually discard one rather than try to keep it around for the next time I need a random number. The numbers also have the standard distribution, which has a mean of 0 and a standard deviation of 1. If you need a different mean and/or standard deviation, it is sufficient to scale and bias the number obtained.

Coming soon

So far we've seen how to generate random distributions with a single variable. In part 2, we'll look at generating random collections of elements, such as permutations, combinations and subsets.

Generating random distributions, Part 2

In Part 1, I looked at how to generate a single random number or event. This provides an essential toolbox for test case generation, but many marathon matches require more complicated lists or sets of items, and in Part 2 we'll see how to tackle those.

I will be using Scruffle, the first-round problem from TCCC 2007, as an example where several of the techniques may be applied. Since this problem did not provide a test case generator, it was a real advantage to be able to write your own.

Random seeds

Before getting into the details, I should first explain at a very high level what a random number generator does. Since it is very difficult to generate true randomness, library functions are used to generate a stream of numbers that appear to be more or less random. The random number generator holds some internal state, and each call to rand() generates a new random number and updates the state. There is a lot of mathematical theory that goes into the design of a good random number generator, but for our purposes the key point is that the stream of random numbers is entirely determined by the initial state, known as the seed.

Any decent random number library should have a means to explicitly specify the seed. The default varies; in C++ there is a fixed default seed, while in other languages the seed is initialised from some external source of variability, such as the time of day. If you use random numbers in your submission, then explicitly setting a fixed seed will ensure that your program will make the same random decisions every time, making it possible to compare different variants of the program.

You can make different variants of your program more comparable by always consuming a fixed number of random numbers in each piece of code, even if this means throwing some of them away. If you don't, then a change in an early piece of code may desynchronise the random number streams in the two versions of the program, thus affecting all random numbers chosen later. In Scruffle, for example, some number of blocks must be made special (obstacles, double-word or triple-word blocks), and the rest each receive a random weight from 0 to 9. Based on this principle, I picked a random weight for every block, even when I knew it would not be used. When I later discovered that I was generating the wrong number of triple-word blocks due to a copy-paste error, I could fix the bug and know that everything else about the generated test cases would be the same. While this isn't so important when generating test cases off-line, if you incorporate randomness into a submission then this would reduce the amount of noise in your results due to randomness.

I should clarify what I mean by a “fixed” number of random numbers. The algorithm I gave for generating the standard normal distribution in part 1 checks if the random numbers it generates falls in an acceptable range, and if not, it picks some new numbers. However, as long as this function is always called at the same point in the random number stream, it will generate the same numbers initially, and so it will always consume a fixed amount of the random number stream. The cases you should try to avoid are those where you branch on something that is likely to change between versions of your code, particularly a tuning parameter, and you generate a different number of random numbers in each branch.

Random permutations

Let's suppose that you have a list of objects, and you need them shuffled into a random order (like shuffling a deck of cards), with all possible orders being equally likely. A novice attempt might be to repeatedly swap pairs of randomly-chosen elements. However, elements will always be slightly more likely to remain in their original positions. (RandomSwaps from SRM 338 demonstrates this quite nicely). Even worse, if the number of swaps is fixed, then 50% of the possible permutations have a zero chance of being used. This is because the number of pairs that are out of their original order changes parity with every swap.

Fortunately, there is a simple and efficient algorithm that gives a uniform distribution. Assume there are N elements to be arranged in an array. Any one of them could be the first element, so pick one at random and swap it to the front. You've now changed the relative order of the remaining elements, but the original order is irrelevant anyway. Now, we are left with N - 1 elements, which we need to permute, and it is not hard to see that all (N - 1)! permutations ought to be equally likely. So we can apply the same procedure recursively until all the elements are permuted.

Random subsets

In the Scruffle problem, we need to label a certain number of squares as obstacles. The wording of the problem implicitly suggests one approach: pick a block at random, and if it has already been marked, then try again. This works well enough in this case, but there a number of drawbacks:

The number of random numbers you generate depends on both N and K.
You need some auxiliary data structure to efficiently determine whether a selected item has already been chosen.
If the items are generated by some sequential process (reading in from a large file, repeatly calling a library function, or generated from some formula like the Fibonacci sequence), then you need to either store all N items in memory or use O(N) auxiliary memory.
If K is very close to N, then you will need to generate O(Nlog N) random numbers on average.
In this case, there are several practical alternatives, with some more applicable to certain situations than others.

Pick items to exclude

If K is very close to N, then it is cheaper to pick N - K items to leave out that K items to put in. However, this does not overcome any of the other drawbacks.

Extend the permutation algorithm

If the N items are already available in an array, we can take a random permutation of them, then take the first K of this permutation. In fact, our permutation algorithm generates the permutation one element at a time, so we only need to generate the first K of them. This gives an O(K) algorithm, although if we want to preserve the original order of the items then O(N) time and space is required.

Decide for each item

If K is much smaller than N and we either need to preserve the original order of the N items or generate them sequentially, we may be able to afford O(N) time but not O(N) extra space. In this case we can iterate over the N items in turn, deciding for each whether it will be chosen or not. The naive approach might be to pick each item with probability K over N, but this would only give a total of K items on average. A small adjustment makes this idea work. Let's suppose there are M items still left to consider (including the current one), and we still need to pick L of them. Then they're all equally likely, so we pick the current item with probability L over M. The idea can fairly easily be extended to multiple events, as was the case in the marathon match (we needed obstacles, double-word and triple-word blocks).

It is also possible to pick a random set of K items from N items when N isn't known in advance. In Scruffle, a word could be chosen multiple times from the dictionary, but imagine that this was not case. Then this allows us to simply pull a line at a time out of a dictionary, without doing a first pass to determine the size of the dictionary, or storing the whole dictionary in memory. We keep a list of K chosen items, which at every stage is a valid random sample of the words we have read so far. Let's say we had N-1 words loaded, and we now read an Nth one. With probability K/N, pick this word, and use it to replace one of the previously selected words chosen at random. Proving that every possible subset has equal probability is left as an exercise for the reader. Note that while the list of chosen items is not ordered, it is also not uniformly random. If you want the items in random order, you will need to randomly permute them as described above.

Which is right?

In most cases, any of these algorithms will work. You can then consider things like implementation time, run-time performance (e.g., if you have to choose 5 items out of 1000 many times, you don't want an O(N) algorithm), and which algorithm is most likely to generate a fixed number of random numbers (for example, K might be a tuning parameter, in which case you might prefer to generate N random numbers).

Conclusions

The example tests in a marathon match are seldom a representative sample, since very small cases are included to help people get started, as well as very large cases to test for boundary conditions. There are also seldom enough example cases to exercise all the subtleties of the problem, and the four-hour wait between full submissions makes it impractical for use during development. Being able to write a local test case generator is thus vital to success in a marathon match.

lllcfr1

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Generating random distributions

Related Materials: Generation of Random Variables (Computational Statistics)http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=randomNumbershttp://community.topcoder.com/tc?modul
复制链接

扫一扫