Random Number Generating Functions and Properties of the ...

RANDOM NUMBER GENERATING FUNCTIONS AND PROPERTIES OF THE LINEAR CONGRUENTIAL METHOD

Jacqueline Borgert

Co-advised by Dr. Michael Jury, Department of Mathematics, Dr. Lauren McIntyre, Department of Molecular Genetics and Microbiology

Introduction

Simulation experiments are a widely used tool in both statistical and scientific research, presenting a method for validating or comparing statistical methods and generating large amounts of data under controlled conditions. Statistical research relies on simulation studies for testing or comparing performance measures of statistical methods, including bias, power of a test, and type I error rates [27]. In a scientific study, computer simulation software allows a user to generate data according to a specified model and observe a process or conduct an experiment. Simulating data that reflects the random variation found in real experiments is often achieved by random number generation, a process that introduces stochastic variation in the output that imitates the properties of numbers drawn from a specified distribution. A random number is defined as a value in a set with a probability of being selected from the total population based on the model desired; further, a random number is an instance of an unbiased random variable [2]. Generating these random variates is itself a mathematical problem, requiring methods to obtain sequences of independent numbers with a specified distribution. While it is impossible to produce truly random numbers, since any mechanism for generating data is deterministic, we instead rely on "pseudo-random number generators." A pseudo-random number generator is a mathematical algorithm that will return the same sequence of variates under a given distribution from a given input [2]. The input required need not be a single value, but can instead be a vector or multiple vectors. This input is known as the "seed," a string of bits used to initialize the generator. Varying the seed will vary the resulting set of `random numbers.' Seeds should be stored for reproducibility of results, as the same sequence of output can be obtained from its instantiating seed value.

The outputs, or `pseudo random numbers' should provide sufficient unpredictability for the statistical purpose [2]; that is, the sequence of outputs should be statistically independent and be consistent with the specified distribution at a desired confidence level under empirical testing. Every number generating algorithm has a "period length", or the number of random variates (output sequence) it produces from a given seed before the sequence begins to repeat [9]. While the deterministic nature of algorithms prevents output sequences from being truly random or independent, a set of criteria can be applied to select generators with properties that will produce a sequence of output numbers with an apparently independent and random pattern. Thus, we can analyze and compare the properties of generators by developing criteria for the behavior of the output sequence and use these criteria to evaluate the suitability of a particular generator for a particular experiment or scenario. We begin by identifying these desirable properties of a pseudo-random number generator and the criteria used for evaluating their behavior.

Desirable Properties of a Random Number Generating Algorithm

The key property of a pseudo-random number generator is that it produces a sequence of numbers that simulates typical realizations of independent random variables from the desired distribution [17]. Ideally, the output of the selected generator should also mimic the random variable of practical interest in dimension and algebraic structure [10]. Since a sequence from a uniform distribution can be transformed to produce a sequence from most other parametric distributions, it is reasonable to focus on developing a generating algorithm for a uniform distribution [6]. In order to guarantee that a new distribution can be obtained, a property of

number generating algorithms should be that the sequence of seemingly random outputs produced remains seemingly random after transformations are applied. To imitate uniformly distributed random variables, it is required that a generator produces an apparently random uniform distribution and that these numbers are apparently independent from each other. The period length is one aspect that determines randomness, since it dictates how many variables can be produced before the sequence repeats itself identically, while correlation behavior determines independence between output sequences. It is therefore desirable to choose a generator with a long period length and apparent independence of output sequences, so that the next number appearing in the sequence cannot be predicted and the function used to generate the output cannot be estimated by observing values in the sequence.

It is difficult to evaluate apparent randomness and independence. In practice, statistical tests are applied to detect correlation and assess uniformity [17]. These tests serve to evaluate the output generated by a function. As a first concern, a generator should be selected for the application, given the type of data under investigation, the number of output sequences to be simulated, and computational limitations. A generator that is suitable for one application may not be suitable for a different experiment if the required period length or type of data changes. Practical considerations for number generating algorithms include computing memory usage, computational efficiency and reproducibility. Computational efficiency is defined as a measure of the computational resources required for an algorithm to complete its process, including time and space. We will refer to reproducibility as the ability to achieve identical results from a given input value.

Whenever simulations of datasets in excess of several million values are concerned, efficiency of the algorithm is always a factor of practical importance. Statistical experiments typically require sufficient replication of datasets to estimate both parameters and their standard errors for the hypothesis under investigation. As computing power has increased, the scale of simulation has also increased. The random number generating algorithm employed should produce a sufficiently large number of output values, but also do so within the limits of available computing time, CPU, and memory [5]. Reproducibility is a related practical concern, since it would be inefficient and cumbersome to store all the output sequences and downstream results. A reproducible number generating algorithm allows for storing only the seed values used rather than the entire simulation. It is good practice to permanently store and share seed values to ensure that others can replicate the experimental results.

Importance of the Seed Value

The seed value(s), or the input value(s) that instantiates the algorithm, is chosen according to both the requirements of the generator used and the data under investigation. The seed must contain entropy, or randomness. Further, two different seeds should produce sequences of output that show both apparent randomness within each sequence and between distinct sequences so that no predictions about the underlying function can be made from multiple sets of output. This implies that the seed values themselves should be independent and

selected at random. This is a difficult problem. In practice, a generator instantiated with a seed obtained from a computer's entropy source extends the random seed into a sequence of values that appears and behaves with the same randomness as the seed value [15]. Sources of entropy typically include information about the current state of the computer's internal mechanisms [2]. For example, information from the computer's date/time recorder or information on the user's inconsistent behavior when responding to computer prompts can be used for seeding [12]. The seed length required, measured in bits, varies depending on the generating algorithm used. Once a seed is used to instantiate an algorithm, only finitely many output values can be produced before the sequence is repeated, indicating the period length has been exhausted. Plesser and Jahnsen [22] warn that correlation of sequences can occur if seed values are correlated; since the sequence generated from a random seed is a mathematical extension of that seed, correlated seed values will produce correlated sequences. If the need for output values exceeds the period multiple instances of the generating function are required and multiple seed values will be needed (reseeding). Reseeding can help maintain apparent randomness and independence of results. Reseeding with correlated values, such as clock times closely related, will result in correlation among sequences [22]. However, introducing a new source of entropy can avoid this issue [2]. One possibility is to use chained algorithms for your random generating process, where one algorithm is designated to generate random variates for use as seed values in another [2]. Since generating algorithms produce seemingly random outputs, selecting seed values from the output sequence of a different algorithm should eliminate correlations in reseeding. Another method for reseeding is to introduce a new source of entropy when obtaining a new seed value; for example, information from the computer's date/time recorder could be combined with a new source of random data to reseed.

Classical and Recent Pseudo Random Number Generators

Perhaps the most classic example of a pseudo-random number generator are Linear Congruential Generators (LCG), given by

= -1 + ( ) .

While this type of generator can be easily implemented on 32-bit computers, its period length is often limiting and is therefore unsuitable for many "big data" applications. One approach for increasing the period length is to use a combined method of LCGs. This is essentially the result of adding two or more LCGs with different moduli, producing an efficient LCG with a period length equivalent (or approximately equivalent) to LCGs with very large moduli. The Wichmann-Hill generator (see Table 1) is an example of this combined approach.

A generalization of the linear congruential generator is the Multiple Recursive Generator (MRG), based on recurrences of higher order. To illustrate, a Multiple Recursive Generator of order k is based on the kth order linear recursive form:

= 1-1 + ? ? ? +- ( ),

where m and k are positive integers, and a, b = {0, 1, .... m-1}. This generator is very fast and can have period length up to mk - 1. The combined approach can be applied to MRGs, as shown by L'Ecuyer [16], producing a generator with larger period length but with reduced computing speed.

Another improvement to the linear congruential generator is the matrix linear congruential generator, such as the Mersenne-Twister algorithm [19]. The Mersenne-Twister algorithm is an iteration of a fixed linear transformation on a fixed vector space, using bitwise operations for computation. All operations of the Mersenne-Twister algorithm are bitwise operations, which are logical operations performed to calculate values on a computer. Bitwise operations are performed at the bit-level rather than the byte-level, which is a smaller unit of information on a computer and can thus save time and computing power. The Mersenne-Twister generator requires as initial seeds vectors , . . . ,-, w-dimensional vectors over the field 2 = {0,1} defined by machine words of size w. The algorithm produces a sequence of uniform pseudorandom integers between 0 and 2 - 1 in the form of word vectors, which result in a real number in [0,1] when divided by 2 - 1. The recurrence is of the form:

+ = + (|+), = {0, 1, . . . },

where n is an integer denoting the degree of the recurrence, m an integer 1 m n, and A is a constant w x w matrix with elements from 2. With k = 0, the algorithm generates , and +, +, ... with k = 1, 2, ... Observe that denotes the "upper w ? r" bits of , where r is an integer 0 r w ? 1, and +denotes the "lower r bits" of +. Thus, and +are w ? r bit and r bit vectors, respectively. Then, (|+) is a vector obtained by the concatenation of these two vectors. The concatenation is then multiplied on the right by matrix A. Finally, is the bitwise addition by modulo two [19]. The matrix form and bitwise operations make the Mersenne-Twister extremely efficient compared to classical generators and is now the default number generator in programming software such as R and Python.

Theoretical Evaluation and Testing of Pseudo Random Number Generators

Evaluation of number generating algorithms include tests of both theoretical and empirical properties. Theoretical testing of the algorithm derives an expected period length and predicts the algebraic structure of the sequence of output values from evaluation of the function. Tests of uniformity and independence based on the predicted algebraic structure exist for certain types of generators and evaluate the function over its entire period length [14]. We will also consider statistical properties of the random variable simulated by the generated sequence.

The period length of a number generating function is defined as the number of unique output values that the function generates before a repetition is observed. Period length can be determined theoretically by analyzing the function. Because it is not ideal to construct a generator with less than full period length, it is worthwhile to determine conditions that guarantee maximal period length. Theoretical tests of uniformity and independence begin with the assumption that the period length is m.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download