Going Back to the Monkey Typewriter Problem

I began this blog with two articles on the Monkey Typewriter Theorem, and in the last one I said that it should be possible to stop the Monte Carlo simulation at exactly the right point (as soon as the required precision has been attained).

The truth is that there may be at least one method to do this.

Let’s suppose that our simulation is run in order to obtain a single real value (the theory presented here can be extrapolated to the multidimensional case with a little pain and some tricks, but it works). We will treat the Monte Carlo simulation as a process in which we experimentally measure some natural quantity.

Suppose we call the result of the i-th measurement $x_i$. We will denote the true value (the one we cannot obtain) by $a_x$, and we will take

$\displaystyle\widetilde{x_n} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$

to be the average of the n computations (n runs of the Monte Carlo simulation). By the Law of Large Numbers (the same law that stands behind the Monte Carlo simulation itself) we can be certain that, with high probability $p$,

$a_x \in [\widetilde{x_n} - z_l\sigma(\widetilde{x_n}), \widetilde{x_n} + z_l\sigma(\widetilde{x_n})]$

where $z_l$ and $p$ are strongly related to each other. We will assume that $a_x$ is uniformly distributed over this interval, with a pdf of constant value $C$. By integrating this pdf we obtain

$1 = \displaystyle\int_{\widetilde{x_n}-z_l\sigma(\widetilde{x_n})}^{\widetilde{x_n}+z_l\sigma(\widetilde{x_n})}{C\,da_x} = 2Cz_l\sigma(\widetilde{x_n})$

from which one obtains $C = \dfrac{1}{2z_l\sigma(\widetilde{x_n})}$.
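To make this concrete, here is a small Python sketch of the quantities so far. Everything specific in it is my own choice for illustration: the toy estimator (the mean of uniform draws), the value $z_l = 1.96$ (which corresponds to $p \approx 95\%$ for a normal sampling distribution), and the variable names.

```python
import math
import random

random.seed(42)

# n Monte Carlo measurements x_i; here a toy estimator of E[U(0,1)] = 0.5
xs = [random.random() for _ in range(1000)]

n = len(xs)
mean = sum(xs) / n                                   # \widetilde{x_n}
var = sum((x - mean) ** 2 for x in xs) / (n - 1)     # sample variance of the x_i
sigma_mean = math.sqrt(var / n)                      # \sigma(\widetilde{x_n})

z_l = 1.96                                           # assumed: p ~ 95%
C = 1.0 / (2.0 * z_l * sigma_mean)                   # pdf height from the normalization

print(mean, sigma_mean, C)
```

The interval $[\mathrm{mean} - z_l\,\sigma_{\mathrm{mean}},\ \mathrm{mean} + z_l\,\sigma_{\mathrm{mean}}]$ then plays the role of the confidence interval above.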

Now, let’s turn our attention to the concept of information entropy. We will use it to derive a relation that tells us when to stop making further simulations. Using its general definition, we have

$H(p(x)) = -a\displaystyle\int_{-\infty}^{\infty}p(x)\log_b(p(x)\Delta x)\,dx$

$H(p(x)) = -a\displaystyle\int_{\widetilde{x_n}-z_l\sigma(\widetilde{x_n})}^{\widetilde{x_n}+z_l\sigma(\widetilde{x_n})}C\log_b(C\Delta x)\,dx$

$H(p(x)) = -2az_l\sigma(\widetilde{x_n})C\log_b(C\Delta x)$

By taking into account the formula for $C$ above, we obtain

$H(p(x)) = a\log_b(2z_l\sigma(\widetilde{x_n})) - a\log_b(\Delta x)$

$H_n = a\log_b\displaystyle\frac{\sigma_n}{\Delta x}$

which is close to what we were looking for. (In the last line I have changed the notation slightly, without changing the names or the meanings: the subscript $n$ makes explicit that the result depends strongly on the number of measurements, and the constant factor $2z_l$ has been absorbed, since it only shifts the entropy and will cancel when we take differences.) Turning our attention to the (apparent) information obtained from the last measurement,

$I_{app} = H_{n-1} - H_{n}$

we see that, because $0 < \sigma_{n} < \sigma_{n-1}$, this quantity is always positive, and it decreases as $n$ grows.
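A quick numerical illustration of how $H_n$ and $I_{app}$ behave. All the concrete choices here are assumptions of mine: $a = 1$, the natural logarithm, a toy uniform estimator, $\Delta x = 10^{-6}$, and batches of 500 runs.

```python
import math
import random

random.seed(0)

def sigma_of_mean(xs):
    """Standard error of the mean of xs (the sigma_n of the text)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var / n)

delta_x = 1e-6                       # assumed resolution of interest
xs = [random.random() for _ in range(10)]

entropies = []
for _ in range(4):
    xs.extend(random.random() for _ in range(500))
    # H_n = a * log_b(sigma_n / delta_x), with a = 1 and b = e
    entropies.append(math.log(sigma_of_mean(xs) / delta_x))

# Apparent information gained by each additional batch of runs
i_app = [h_prev - h for h_prev, h in zip(entropies, entropies[1:])]
print(i_app)   # positive, and shrinking as n grows
```

Since $\sigma_n$ behaves like $1/\sqrt{n}$, each $I_{app}$ value is roughly $\tfrac{1}{2}\log(n_k/n_{k-1})$, which is why the gains fade as batches pile up.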

Thus, if one could deduce a relationship between the precision of the measurement and the value of $I_{app}$, we could stop the Monte Carlo process early, or decide algorithmically (eliminating any need for human intervention) to continue for, say, another 1000 runs.
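Such a stopping rule could be sketched as follows. This is only a sketch under my own assumptions: the threshold `eps`, the batch size, the safety cap `max_runs`, and the stand-in `run_simulation` are all arbitrary placeholders, and $\Delta x$ drops out because only entropy differences are compared.

```python
import math
import random

random.seed(1)

def run_simulation():
    """Stand-in for one Monte Carlo run; replace with the real estimator."""
    return random.gauss(10.0, 2.0)

def stderr_of_mean(xs):
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1) / n)

def run_until_uninformative(batch=1000, eps=0.05, max_runs=200_000):
    """Simulate in batches until the apparent information
    I_app = H_{n-1} - H_n gained by the last batch drops below eps."""
    xs = [run_simulation() for _ in range(batch)]
    h_prev = math.log(stderr_of_mean(xs))      # delta_x cancels in differences
    while len(xs) < max_runs:
        xs.extend(run_simulation() for _ in range(batch))
        h = math.log(stderr_of_mean(xs))
        if h_prev - h < eps:                   # last batch taught us little: stop
            break
        h_prev = h
    return sum(xs) / len(xs), len(xs)

estimate, n_runs = run_until_uninformative()
print(estimate, n_runs)
```

Note that `eps` is exactly the quantity one would want to relate to the target precision; here it is just a knob chosen by hand.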

However, there’s a glitch: a single piece of malformed data can damage the computation badly. Thus, we need a way to eliminate outliers from the early stages of the computation.
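One possible filter, sketched here as a suggestion rather than a prescription, rejects points that lie too far from the median on a robust scale; the MAD-based rule and the cutoff `k` are illustrative choices.

```python
import statistics

def reject_outliers(xs, k=5.0):
    """Drop points farther than k robust sigmas from the median,
    using the median absolute deviation (MAD) as a robust scale."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0.0:
        return list(xs)          # degenerate data: nothing sensible to reject
    scale = 1.4826 * mad         # MAD -> sigma under normality
    return [x for x in xs if abs(x - med) <= k * scale]

samples = [9.8, 10.1, 10.3, 9.9, 10.0, 1e6]    # one malformed run
print(reject_outliers(samples))                 # -> [9.8, 10.1, 10.3, 9.9, 10.0]
```

Because the median and MAD are themselves insensitive to outliers, a single corrupt value cannot drag the filter’s own reference point, which is precisely what would happen with a mean-and-sigma cut.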

And we need some actual code and results.