About Random Numbers

Latest recommended article published on 2024-08-07 22:25:18

DB架构

Category column: Statistical Methods. Article tags: big data, python, machine learning

Article link: https://blog.csdn.net/u011868279/article/details/125458583

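This tutorial explores the sources of randomness in machine learning, including randomness in the data, in evaluation, and in algorithms. It focuses on the concept of the pseudorandom number generator and shows how to generate random numbers in Python with the random and NumPy libraries. Setting the seed controls the sequence of random numbers and so ensures reproducibility. In addition, controlling for randomness is essential for evaluating model performance and avoiding overfitting.

Summary generated by CSDN using AI technology.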

Randomness is a big part of machine learning.
Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions.
The source of randomness in machine learning is a mathematical trick called a pseudorandom number generator.

After completing this tutorial, you will know:

The sources of randomness in applied machine learning, with a focus on algorithms.
What a pseudorandom number generator is and how to use one in Python.
When to control the sequence of random numbers and when to control for randomness.

1.1 Tutorial Overview

Randomness in Machine Learning
Pseudorandom Number Generators
Random Numbers with Python
Random Numbers with NumPy
When to Seed the Random Number Generator
How to Control for Randomness
Common Questions

1.2 Randomness in Machine Learning

There are many sources of randomness in applied machine learning. Randomness is used as a tool to help the learning algorithms be more robust and ultimately result in better predictions and more accurate models.

1.2.1 Randomness in Data

There is a random element to the sample of data that we have collected from the domain that we will use to train and evaluate the model.

1.2.2 Randomness in Evaluation

We do not have access to all the observations from the domain. We work with only a small sample of the data. Therefore, we harness randomness when evaluating a model, such as using k-fold cross-validation to fit and evaluate the model on different subsets of the available dataset. We do this to see how the model works on average rather than on a specific set of data.
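As a rough illustration of how randomness enters k-fold cross-validation, the sketch below shuffles row indices and deals them into folds using only the standard library. The helper name kfold_indices and its parameters are hypothetical, chosen for this example; libraries such as scikit-learn provide their own implementations.

```python
from random import Random

def kfold_indices(n_samples, k, seed):
    """Shuffle row indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    # a private generator, so the global random state is left untouched
    Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index, starting at position i
    return [indices[i::k] for i in range(k)]

folds = kfold_indices(n_samples=10, k=5, seed=1)
for i, test_idx in enumerate(folds):
    # every fold except the current one is used for training
    train_idx = [j for fold in folds if fold is not test_idx for j in fold]
    print("fold %d: train=%s test=%s" % (i, sorted(train_idx), sorted(test_idx)))
```

Each row lands in exactly one test fold, and a different seed gives a different assignment of rows to folds.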

1.2.3 Randomness in Algorithms

Machine learning algorithms use randomness when learning from a sample of data. This is a feature: the randomness lets the algorithm find a better-performing mapping of the data than it would without it, helping it avoid overfitting the small training set and generalize to the broader problem. Algorithms that use randomness are often called stochastic algorithms rather than random algorithms. This is because, although randomness is used, the resulting model is constrained to a narrower range of outcomes; the randomness is limited. Some clear examples of randomness used in machine learning algorithms include:

The shuffling of training data prior to each training epoch in stochastic gradient descent.
The random subset of input features chosen for split points in a random forest algorithm.
The random initial weights in an artificial neural network.
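The three examples above can be sketched with the standard library alone. This is only an illustration of where the random draws happen, not how any particular library implements them; the feature names and the weight scale of 0.1 are assumptions made for the example.

```python
from random import Random

rng = Random(1)  # private generator so the sketch is repeatable

# 1. Shuffle the order of training rows before each epoch (as in SGD).
row_indices = list(range(6))
for epoch in range(2):
    rng.shuffle(row_indices)
    print("epoch", epoch, "row order:", row_indices)

# 2. Pick a random subset of features to consider at a split (as in a random forest).
feature_names = ["age", "height", "weight", "income"]  # hypothetical feature names
candidate_features = rng.sample(feature_names, 2)
print("features considered at this split:", candidate_features)

# 3. Draw small random initial weights (as in a neural network).
initial_weights = [rng.gauss(0.0, 0.1) for _ in range(4)]
print("initial weights:", initial_weights)
```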

1.3 Pseudorandom Number Generators

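The source of randomness that we inject into our programs and algorithms is a mathematical trick called a pseudorandom number generator. A random number generator is a system that generates random numbers from a true source of randomness. A pseudorandom number generator, by contrast, produces numbers that look random but are generated by a deterministic process.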
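To make the idea of a deterministic "trick" concrete, here is a minimal sketch of one very simple kind of pseudorandom number generator, a linear congruential generator. It is not the generator Python uses, and it is far weaker than a production generator; the function name make_lcg is hypothetical and the constants are a commonly quoted textbook choice.

```python
def make_lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Return a function that yields floats in [0, 1) from a linear congruential generator."""
    state = seed % m

    def next_value():
        nonlocal state
        state = (a * state + c) % m  # deterministic update rule
        return state / m             # scale the integer state into [0, 1)

    return next_value

gen_a = make_lcg(seed=1)
gen_b = make_lcg(seed=1)
seq_a = [gen_a() for _ in range(3)]
seq_b = [gen_b() for _ in range(3)]
print(seq_a)
print(seq_a == seq_b)  # same seed, same sequence
```

The numbers look arbitrary, yet two generators started from the same seed produce exactly the same values.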

1.4 Random Numbers with Python

The Python standard library provides a module called random that offers a suite of functions for generating random numbers. Python uses a popular and robust pseudorandom number generator called the Mersenne Twister. In this section, we will look at a number of use cases for generating and using random numbers and randomness with the standard Python API.

1.4.1 Seed The Random Number Generator

The pseudorandom number generator is a mathematical function that generates a sequence of nearly random numbers. It takes a parameter to start off the sequence, called the seed. The function is deterministic, meaning that the same seed always produces the same sequence of numbers. The seed() function seeds the pseudorandom number generator, taking an integer value as an argument. If seed() is not called before randomness is used, Python seeds the generator itself from a randomness source provided by the operating system when one is available, and otherwise from the current system time. The example below demonstrates seeding the pseudorandom number generator, generates some random numbers, and shows that reseeding the generator will result in the same sequence of numbers being generated.

# seed the pseudorandom number generator
from random import seed
from random import random
# seed random number generator
seed(1)
# generate some random numbers
print(random(), random(), random())
# reset the seed
seed(1)
# generate the same random numbers again
print(random(), random(), random())

Running the example seeds the pseudorandom number generator with the value 1, generates 3 random numbers, reseeds the generator, and shows that the same three random numbers are generated.
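Rather than comparing the printed numbers by eye, the two sequences can be compared in code. The sketch below also uses random.Random, the standard library class behind the module-level functions, to show that a separate generator instance seeded with the same value yields the same numbers; the variable names are illustrative.

```python
from random import Random, random, seed

# first run with seed 1
seed(1)
first = [random() for _ in range(3)]

# second run after reseeding with the same value
seed(1)
second = [random() for _ in range(3)]
print(first == second)

# a separate Random instance keeps its own state,
# independent of the module-level functions
private = Random(1)
third = [private.random() for _ in range(3)]
print(first == third)
```

Using a private Random instance is one way to keep a piece of code reproducible without touching the state shared by the rest of the program.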

1.4.2 Random Floating Point Values

Random floating point values can be generated using the random() function. Values will be generated in the range between 0 and 1, specifically in the interval [0,1). Values are drawn from a uniform distribution, meaning each value has an equal chance of being drawn. The example below generates 10 random floating point values.

# generate random floating point values
from random import seed
from random import random
# seed random number generator
seed(1)
# generate random numbers between 0-1
for i in range(10):
    value = random()
    print(value)

Running the example generates and prints each random floating point value.
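Values in [0,1) can be rescaled to another range by multiplying by the width of the range and adding the lower bound. The sketch below shows this by hand and with the standard library uniform() function; the bounds 5.0 and 10.0 are arbitrary values chosen for illustration.

```python
from random import seed, random, uniform

seed(1)
low, high = 5.0, 10.0  # hypothetical bounds for the example

# rescale by hand: [0, 1) becomes [low, high)
scaled = [low + (high - low) * random() for _ in range(5)]
print(scaled)

# uniform() performs the equivalent rescaling for you
direct = [uniform(low, high) for _ in range(5)]
print(direct)
```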

1.4.3 Random Integer Values

Random integer values can be generated with the randint() function. This function takes two arguments: the start and the end of the range for the generated integer values. Random integers are generated within and including the start and end of range values, specifically in the interval [start, end]. Random values are drawn from a uniform distribution. The example below generates 10 random integer values between 0 and 10.

# generate random integer values
from random import seed
from random import randint
# seed random number generator
seed(1)
# generate some integers
for i in range(10):
    value = randint(0, 10)
    print(value)

Running the example generates and prints 10 random integer values.
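The inclusive upper bound of randint() is easy to overlook. As a small check, the sketch below draws many values with randint() and with randrange(), the standard library function whose upper bound is exclusive, and prints the smallest and largest value seen for each.

```python
from random import seed, randint, randrange

seed(1)
# randint includes both endpoints: values come from [0, 10]
inclusive = [randint(0, 10) for _ in range(1000)]
# randrange excludes the stop value: values come from [0, 10)
exclusive = [randrange(0, 10) for _ in range(1000)]

print("randint   min/max:", min(inclusive), max(inclusive))
print("randrange min/max:", min(exclusive), max(exclusive))
```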

1.4.4 Random Gaussian Values

Random floating point values can be drawn from a Gaussian distribution using the gauss() function. This function takes two arguments that correspond to the parameters of the distribution: the mean, which sets its center, and the standard deviation, which sets its spread. The example below generates 10 random values drawn from a Gaussian distribution with a mean of 0.0 and a standard deviation of 1.0. Note that these parameters are not bounds on the values. The spread of the values follows the bell shape of the distribution, so in this case values are equally likely to fall above or below 0.0 and become less likely the further they are from it.

# generate random Gaussian values
from random import seed
from random import gauss
# seed random number generator
seed(1)
# generate some Gaussian values
for i in range(10):
    value = gauss(0, 1)
    print(value)

Running the example generates and prints 10 Gaussian random values.
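With only 10 values it is hard to see that the mean and standard deviation describe the distribution rather than bound it. As a rough check, the sketch below draws a larger sample (10,000 is an arbitrary choice) and summarizes it with the standard library statistics module.

```python
from random import seed, gauss
from statistics import mean, stdev

seed(1)
# draw a large sample from a Gaussian with mean 0.0 and standard deviation 1.0
values = [gauss(0.0, 1.0) for _ in range(10000)]

# the sample statistics should land close to the requested parameters
print("sample mean:  %.3f" % mean(values))
print("sample stdev: %.3f" % stdev(values))
# individual values are not bounded by the parameters
print("largest magnitude: %.3f" % max(abs(v) for v in values))
```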

1.4.5 Randomly Choosing From a List

Random numbers can be used to randomly choose an item from a list. For example, if a list had 10 items with indexes between 0 and 9, then you could generate a random integer between 0 and 9 and use it to randomly select an item from the list. The choice() function implements this behavior for you. Selections are made with a uniform likelihood. The example below generates a list of 20 integers and gives five examples of choosing one random item from the list.

# choose a random element from a list
from random import seed
from random import choice
# seed random number generator
seed(1)
# prepare a sequence
sequence = list(range(20))
print(sequence)

# make choices from the sequence
for i in range(5):
    selection = choice(sequence)
    print(selection)

Running the example first prints the list of integer values, followed by five examples of choosing and printing a random value from the list.
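The manual approach described above, generating a random index and using it to look up an item, can be sketched as follows. It is shown only to illustrate the idea; in practice choice() is the simpler option.

```python
from random import seed, randint

seed(1)
sequence = list(range(20))

# pick a random valid index; randint includes both endpoints
index = randint(0, len(sequence) - 1)
selection = sequence[index]
print(index, selection)
```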

1.4.6 Random Subsample From a List

We may be interested in repeating the random selection of items from a list to create a randomly chosen subset. Importantly, once an item is selected from the list and added to the subset, it should not be added again. This is called selection without replacement because once an item from the list is selected for the subset, it is not added back to the original list (i.e. it is not made available for re-selection). This behavior is provided by the sample() function, which selects a random sample from a list without replacement. The function takes both the list and the size of the subset to select as arguments. Note that items are not actually removed from the original list; the selected items are returned in a new list. The example below demonstrates selecting a subset of five items from a list of 20 integers.

# select a random sample without replacement 
from random import seed
from random import sample
# seed random number generator
seed(1)
# prepare a sequence
sequence = list(range(20))
print(sequence)

# select a subset without replacement
subset = sample(sequence, 5)
print(subset)

Running the example first prints the list of integer values, then the random sample is chosen and printed for comparison.
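The two properties described above, that no item is selected twice and that the original list is left unchanged, can be checked directly. A minimal sketch, with illustrative variable names:

```python
from random import seed, sample

seed(1)
sequence = list(range(20))
original = list(sequence)  # keep a copy for comparison

subset = sample(sequence, 5)
print(subset)
print(len(set(subset)) == len(subset))  # no item was selected twice
print(sequence == original)             # the source list is unchanged
```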

1.4.7 Randomly Shuffle a List

Randomness can be used to shuffle a list of items, like shuffling a deck of cards. The shuffle() function can be used to shuffle a list. The shuffle is performed in place, meaning that the list provided as an argument to the shuffle() function is shuffled rather than a shuffled copy of the list being made and returned. The example below demonstrates randomly shuffling a list of integer values.

# randomly shuffle a sequence
from random import seed
from random import shuffle
# seed random number generator
seed(1)
# prepare a sequence
sequence = list(range(20))
print(sequence)
# randomly shuffle the sequence
shuffle(sequence)
print(sequence)

Running the example first prints the list of integers, then the same list after it has been randomly shuffled.
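Because shuffle() works in place, the original order is lost. If the original list must be kept, one option is to ask sample() for as many items as the list contains, which returns a shuffled copy. A minimal sketch:

```python
from random import seed, sample

seed(1)
sequence = list(range(20))

# sampling every item without replacement yields a shuffled copy
shuffled_copy = sample(sequence, len(sequence))
print(sequence)       # original order is preserved
print(shuffled_copy)  # same items in a random order
```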

1.5 Random Numbers with NumPy

In machine learning, you are likely using libraries such as scikit-learn and Keras. These libraries make use of NumPy under the covers, a library that makes working with vectors and matrices of numbers very efficient. NumPy has its own implementation of a pseudorandom number generator, along with convenience wrapper functions. The NumPy functions used in this section are also based on the Mersenne Twister pseudorandom number generator. Let’s look at a few examples of generating random numbers and using randomness with NumPy arrays.

1.5.1 Seed The Random Number Generator

The NumPy pseudorandom number generator is different from the Python standard library pseudorandom number generator. Importantly, seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator. It must be seeded and used separately. The seed() function can be used to seed the NumPy pseudorandom number generator, taking an integer as the seed value. The example below demonstrates how to seed the generator and how reseeding the generator will result in the same sequence of random numbers being generated.

# seed the pseudorandom number generator
from numpy.random import seed
from numpy.random import rand
# seed random number generator
seed(1)
# generate some random numbers
print(rand(3))
# reset the seed
seed(1)
# generate some random numbers
print(rand(3))

Running the example seeds the pseudorandom number generator, prints a sequence of random numbers, then reseeds the generator showing that the exact same sequence of random numbers is generated.
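The independence of the two generators can be demonstrated with a small experiment. In the sketch below the NumPy generator is seeded with the same value twice while the standard library generator is seeded with two different values; if the generators were linked, the NumPy output would differ between runs.

```python
import random
import numpy as np

# first run: both generators seeded with 1
np.random.seed(1)
random.seed(1)
first = np.random.rand(3)

# second run: same NumPy seed, different standard library seed
np.random.seed(1)
random.seed(999)
second = np.random.rand(3)

print(first)
print(second)
print(np.array_equal(first, second))  # NumPy ignores the standard library seed
```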

1.5.2 Array of Random Floating Point Values

An array of random floating point values can be generated with the rand() NumPy function. If no argument is provided, then a single random value is created, otherwise the size of the array can be specified. The example below creates an array of 10 random floating point values drawn from a uniform distribution.

# generate random floating point values
from numpy.random import seed
from numpy.random import rand
# seed random number generator
seed(1)
# generate random numbers between 0-1
values = rand(10)
print(values)

Running the example generates and prints the NumPy array of random floating point values.
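The rand() function also accepts more than one dimension. A short sketch of the no-argument case and a two-dimensional case follows; the shape (3, 2) is an arbitrary choice for illustration.

```python
from numpy.random import seed, rand

seed(1)
# no argument: a single value is returned rather than an array
single = rand()
print(type(single), single)

# two arguments: an array with 3 rows and 2 columns
matrix = rand(3, 2)
print(matrix.shape)
print(matrix)
```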

1.5.3 Array of Random Integer Values

An array of random integers can be generated using the randint() NumPy function. This function takes three arguments: the lower end of the range, the upper end of the range, and the number of integer values to generate or the size of the array. Random integers will be drawn from a uniform distribution including the lower value and excluding the upper value, i.e. in the interval [lower, upper). The example below demonstrates generating an array of random integers.

# generate random integer values
from numpy.random import seed
from numpy.random import randint
# seed random number generator
seed(1)
# generate some integers
values = randint(0, 10, 20)
print(values)

Running the example generates and prints an array of 20 random integer values between 0 and 9, because the upper value of 10 is excluded.
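The exclusive upper bound is the main difference from the standard library randint(). As a quick check, the sketch below draws a larger array (1,000 values, an arbitrary choice) and prints the smallest and largest value found.

```python
from numpy.random import seed, randint

seed(1)
# draw 1000 integers from the interval [0, 10)
values = randint(0, 10, 1000)

# the upper value 10 itself is never generated
print(values.min(), values.max())
```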

1.5.4 Array of Random Gaussian Values

An array of random Gaussian values can be generated using the randn() NumPy function. This function takes a single argument to specify the size of the resulting array. The Gaussian values are drawn from a standard Gaussian distribution; this is a distribution that has a mean of 0.0 and a standard deviation of 1.0. The example below shows how to generate an array of random Gaussian values.

# generate random Gaussian values
from numpy.random import seed
from numpy.random import randn
# seed random number generator
seed(1)
# generate some Gaussian values
values = randn(10)
print(values)

Running the example generates and prints an array of 10 random values from a standard Gaussian distribution.

Values from a standard Gaussian distribution can be scaled by multiplying the value by the standard deviation and adding the mean from the desired scaled distribution. For example:

scaled_value = mean + value * stdev

Where mean and stdev are the mean and standard deviation for the desired scaled Gaussian distribution and value is the randomly generated value from a standard Gaussian distribution.
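The scaling described above can be applied to a whole NumPy array at once. In the sketch below the target mean of 50.0 and standard deviation of 5.0 are arbitrary values chosen for illustration, and a large sample is used so that the sample statistics land close to the targets.

```python
from numpy.random import seed, randn

seed(1)
mean, stdev = 50.0, 5.0  # hypothetical target distribution

# values from a standard Gaussian (mean 0.0, standard deviation 1.0)
values = randn(10000)

# multiply by the standard deviation, then add the mean
scaled = mean + values * stdev

print("scaled mean:  %.3f" % scaled.mean())
print("scaled stdev: %.3f" % scaled.std())
```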

1.5.5 Shuffle NumPy Array

A NumPy array can be randomly shuffled in-place using the shuffle() NumPy function. The example below demonstrates how to shuffle a NumPy array.

# randomly shuffle a NumPy array
from numpy import arange
from numpy.random import seed
from numpy.random import shuffle
# seed random number generator
seed(1)
# prepare a sequence as a NumPy array
sequence = arange(20)
print(sequence)
# randomly shuffle the sequence in place
shuffle(sequence)
print(sequence)

Running the example first generates an array of 20 integer values, then shuffles and prints the shuffled array.
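When the original order needs to be kept, NumPy's permutation() function returns a shuffled copy instead of shuffling in place. A minimal sketch:

```python
from numpy import arange
from numpy.random import seed, permutation

seed(1)
sequence = arange(20)

# permutation() returns a shuffled copy and leaves the input untouched
shuffled = permutation(sequence)
print(sequence)
print(shuffled)
```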

1.6 When to Seed the Random Number Generator

There are times during a predictive modeling project when you should consider seeding the random number generator. Let’s look at two cases:

Data Preparation. Data preparation may use randomness, such as a shuffle of the data or selection of values. Data preparation must be consistent so that the data is always prepared in the same way during fitting, evaluation, and when making predictions with the final model.
Data Splits. The splits of the data such as for a train/test split or k-fold cross-validation must be made consistently. This is to ensure that each algorithm is trained and evaluated in the same way on the same subsamples of data.

You may wish to seed the pseudorandom number generator once before each task or once before performing the batch of tasks. It generally does not matter which. Sometimes you may want an algorithm to behave consistently, perhaps because it is trained on exactly the same data each time. This may happen if the algorithm is used in a production environment. It may also happen if you are demonstrating an algorithm in a tutorial environment. In that case, it may make sense to initialize the seed prior to fitting the algorithm.
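As an illustration of a consistent data split, the sketch below shuffles row indices with a seeded private generator so that the same seed always gives the same train/test partition. The helper name split_indices and its parameters are hypothetical; libraries such as scikit-learn expose the same idea through their own seed arguments.

```python
from random import Random

def split_indices(n_samples, n_test, seed):
    """Return (train, test) index lists from a seeded shuffle of the row indices."""
    indices = list(range(n_samples))
    Random(seed).shuffle(indices)  # same seed, same shuffle
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(n_samples=10, n_test=3, seed=1)
train_b, test_b = split_indices(n_samples=10, n_test=3, seed=1)

print("train:", train_a)
print("test: ", test_a)
print(train_a == train_b and test_a == test_b)  # the split is repeatable
```

Every algorithm compared with this split sees exactly the same training and test rows.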

1.7 How to Control for Randomness

A stochastic machine learning algorithm will learn slightly differently each time it is run on the same data. This will result in a model with slightly different performance each time it is trained. As mentioned, we can fit the model using the same sequence of random numbers each time. When evaluating a model, this is a bad practice as it hides the inherent uncertainty within the model.

A better approach is to evaluate the algorithm in such a way that the reported performance includes the measured uncertainty in the performance of the algorithm. We can do that by repeating the evaluation of the algorithm multiple times with different sequences of random numbers. The pseudorandom number generator could be seeded once at the beginning of the evaluation or it could be seeded with a different seed at the beginning of each evaluation. There are two aspects of uncertainty to consider here:

Data Uncertainty: Evaluating an algorithm on multiple splits of the data will give insight into how the algorithm's performance varies with changes to the train and test data.
Algorithm Uncertainty: Evaluating an algorithm multiple times on the same splits of data will give insight into how the algorithm's performance varies on its own, due only to its internal randomness.
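A repeated evaluation can be sketched as a loop over seeds that collects one score per run and then reports the mean and standard deviation. The function evaluate_once below is a hypothetical stand-in for fitting and scoring a real stochastic model; the skill level of 0.80 and noise of 0.02 are made-up values used only so that the sketch runs.

```python
from random import Random
from statistics import mean, stdev

def evaluate_once(seed):
    """Hypothetical stand-in for 'fit a stochastic model and score it'."""
    rng = Random(seed)
    # pretend accuracy: a fixed skill level plus run-to-run noise
    return 0.80 + rng.gauss(0.0, 0.02)

# repeat the evaluation with a different seed each time
scores = [evaluate_once(seed) for seed in range(10)]

# report the average performance together with its spread
print("accuracy: %.3f (+/- %.3f)" % (mean(scores), stdev(scores)))
```

Reporting the spread alongside the mean makes the uncertainty in the result visible instead of hiding it behind a single fixed seed.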

1.8 Common Questions

Can I predict random numbers?

You cannot, in practice, learn to predict the sequence of random numbers with a model, even with a deep neural network.

Will real random numbers lead to better results?

As far as I have read, using real randomness does not help in general, unless you are working with simulations of physical processes.

What about the final model?

The final model is the chosen algorithm and configuration trained on all available training data that you can use to make predictions. The performance of this model will fall within the variance of the evaluated model.