Answering "What does the random_state parameter actually mean in machine learning with Python?"

乌黑浓密的技术员

Category: Cross-validation. Tags: python

Article link: https://blog.csdn.net/xigewang_/article/details/119276592

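In machine learning with Python, the random_state parameter of KFold cross-validation controls the randomness of the data split. When it is None, the split changes on every run; when it is a specific integer such as 1 or 2 (and shuffle=True), the split is the same on every run. The value of random_state has no meaning as a quantity; it is a seed that identifies one particular shuffle. Whether you use 42 or 160, the same value always gives the same split, and different values generally give different splits.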

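When doing cross-validation for machine learning in Python, we often run into the random_state parameter, for example in:

KFold(n_splits=5, shuffle=False, random_state=None)

This class performs K-fold cross-validation.

n_splits: number of folds, int, default 5.

shuffle: whether to shuffle the data before splitting it into folds. bool, default False.

random_state: int, RandomState instance or None, default None. Literally, "random state".

random_state only has an effect when shuffle=True.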
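To see why random_state has nothing to do when shuffle=False, here is a minimal sketch, assuming scikit-learn and NumPy are installed. Without shuffling, KFold simply cuts the indices into consecutive blocks, so there is no randomness to control. The helper name `unshuffled_test_folds` is made up for this illustration. As far as I know, recent scikit-learn releases go one step further and raise an error if you pass an integer random_state together with shuffle=False, so the sketch leaves random_state at its default.

```python
import numpy as np
from sklearn.model_selection import KFold


def unshuffled_test_folds(n_samples=25, n_splits=5):
    """Return the test indices of each fold when shuffle=False (hypothetical helper)."""
    data = np.arange(n_samples)
    kf = KFold(n_splits=n_splits, shuffle=False)
    return [test_index.tolist() for _, test_index in kf.split(data)]


folds = unshuffled_test_folds()
for fold in folds:
    print(fold)
# [0, 1, 2, 3, 4]
# [5, 6, 7, 8, 9]
# [10, 11, 12, 13, 14]
# [15, 16, 17, 18, 19]
# [20, 21, 22, 23, 24]
```

The folds come out as consecutive blocks on every run.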

When random_state=None:

KFold(n_splits=5, shuffle=True, random_state=None)

the data is split differently on every run.

Example:

import numpy as np
from sklearn.model_selection import KFold

# 25 samples, indexed 0..24
xx = np.arange(25)
kf = KFold(n_splits=5, shuffle=True, random_state=None)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 0  1  2  3  4  5  6  8  9 10 11 12 14 15 17 18 19 20 21 23],test_index:[ 7 13 16 22 24]
train_index:[ 0  2  4  5  6  7  8 10 11 12 13 14 15 16 17 19 20 21 22 24],test_index:[ 1  3  9 18 23]
train_index:[ 1  2  3  5  7  8  9 10 11 12 13 14 15 16 18 20 21 22 23 24],test_index:[ 0  4  6 17 19]
train_index:[ 0  1  2  3  4  6  7  9 11 12 13 16 17 18 19 20 21 22 23 24],test_index:[ 5  8 10 14 15]
train_index:[ 0  1  3  4  5  6  7  8  9 10 13 14 15 16 17 18 19 22 23 24],test_index:[ 2 11 12 20 21]

Run it again:

kf = KFold(n_splits=5, shuffle=True, random_state=None)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 24],test_index:[16 20 21 22 23]
train_index:[ 1  2  3  4  5  6  7  8  9 11 12 14 15 16 17 20 21 22 23 24],test_index:[ 0 10 13 18 19]
train_index:[ 0  2  4  5  6  7  9 10 11 12 13 14 15 16 18 19 20 21 22 23],test_index:[ 1  3  8 17 24]
train_index:[ 0  1  3  7  8  9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 2  4  5  6 11]
train_index:[ 0  1  2  3  4  5  6  8 10 11 13 16 17 18 19 20 21 22 23 24],test_index:[ 7  9 12 14 15]

As you can see, the two splits differ: the data is reshuffled before every split.
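Where does the randomness come from when random_state=None? As far as I know, scikit-learn then falls back to NumPy's global random generator, so seeding that generator with np.random.seed makes even the None case repeatable. The following is a small sketch under that assumption; `collect_test_folds` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold


def collect_test_folds(kf, data):
    """Collect the test indices of every fold as plain lists (hypothetical helper)."""
    return [test_index.tolist() for _, test_index in kf.split(data)]


xx = np.arange(25)

np.random.seed(0)  # seed NumPy's global generator
first = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=None), xx)

np.random.seed(0)  # put the global generator back into the same state
second = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=None), xx)

print(first == second)  # True
```

Passing an integer to random_state is usually the cleaner choice, because it does not depend on what other code has done to the global generator.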

Now give random_state an integer.

For example, 1:

KFold(n_splits=5, shuffle=True, random_state=1)

The data is then split the same way on every run.

Example:

xx = np.arange(25)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 0  1  2  4  5  6  7  8  9 10 11 12 15 16 18 19 20 22 23 24],test_index:[ 3 13 14 17 21]
train_index:[ 0  1  3  5  6  7  8  9 11 12 13 14 15 16 17 20 21 22 23 24],test_index:[ 2  4 10 18 19]
train_index:[ 0  2  3  4  5  8  9 10 11 12 13 14 15 16 17 18 19 21 23 24],test_index:[ 1  6  7 20 22]
train_index:[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 17 18 19 20 21 22],test_index:[ 0 15 16 23 24]
train_index:[ 0  1  2  3  4  6  7 10 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 5  8  9 11 12]

Run it again:

kf = KFold(n_splits=5, shuffle=True, random_state=1)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 0  1  2  4  5  6  7  8  9 10 11 12 15 16 18 19 20 22 23 24],test_index:[ 3 13 14 17 21]
train_index:[ 0  1  3  5  6  7  8  9 11 12 13 14 15 16 17 20 21 22 23 24],test_index:[ 2  4 10 18 19]
train_index:[ 0  2  3  4  5  8  9 10 11 12 13 14 15 16 17 18 19 21 23 24],test_index:[ 1  6  7 20 22]
train_index:[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 17 18 19 20 21 22],test_index:[ 0 15 16 23 24]
train_index:[ 0  1  2  3  4  6  7 10 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 5  8  9 11 12]

So when random_state is set to the integer 1 in both runs, the two shuffles are identical and the data is split the same way.
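Comparing two long printouts by eye is error-prone, so the same check can be done in code. This is a minimal sketch, assuming scikit-learn and NumPy are installed; `folds_for_seed` is a name invented for the example.

```python
import numpy as np
from sklearn.model_selection import KFold


def folds_for_seed(seed, n_samples=25, n_splits=5):
    """Return the test indices of each fold for one integer seed (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [test_index.tolist() for _, test_index in kf.split(np.arange(n_samples))]


# Two separately constructed KFold objects with the same integer seed
run_a = folds_for_seed(1)
run_b = folds_for_seed(1)
print(run_a == run_b)  # True

# Calling split() twice on one KFold object with an integer seed
kf = KFold(n_splits=5, shuffle=True, random_state=1)
xx = np.arange(25)
call_1 = [test.tolist() for _, test in kf.split(xx)]
call_2 = [test.tolist() for _, test in kf.split(xx)]
print(call_1 == call_2)  # True
```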

So here comes the question: can I use random_state=2? What about random_state=160? What is the difference between random_state=1 and random_state=2? I once saw a blog post that set random_state=42, and I was completely baffled:

What does this 42 stand for?

All right, let's find out.

Set random_state=2 and run it:

kf = KFold(n_splits=5, shuffle=True, random_state=2)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  7  8  9 10 11 12 13 15 16 18 19 20 21 22 24],test_index:[ 0  6 14 17 23]
train_index:[ 0  1  2  4  5  6  7  8 10 11 13 14 15 17 18 19 20 21 23 24],test_index:[ 3  9 12 16 22]
train_index:[ 0  2  3  6  7  8  9 11 12 13 14 15 16 17 18 20 21 22 23 24],test_index:[ 1  4  5 10 19]
train_index:[ 0  1  3  4  5  6  8  9 10 11 12 13 14 15 16 17 19 22 23 24],test_index:[ 2  7 18 20 21]
train_index:[ 0  1  2  3  4  5  6  7  9 10 12 14 16 17 18 19 20 21 22 23],test_index:[ 8 11 13 15 24]

Keep random_state=2 and run it again:

kf = KFold(n_splits=5, shuffle=True, random_state=2)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  7  8  9 10 11 12 13 15 16 18 19 20 21 22 24],test_index:[ 0  6 14 17 23]
train_index:[ 0  1  2  4  5  6  7  8 10 11 13 14 15 17 18 19 20 21 23 24],test_index:[ 3  9 12 16 22]
train_index:[ 0  2  3  6  7  8  9 11 12 13 14 15 16 17 18 20 21 22 23 24],test_index:[ 1  4  5 10 19]
train_index:[ 0  1  3  4  5  6  8  9 10 11 12 13 14 15 16 17 19 22 23 24],test_index:[ 2  7 18 20 21]
train_index:[ 0  1  2  3  4  5  6  7  9 10 12 14 16 17 18 19 20 21 22 23],test_index:[ 8 11 13 15 24]

See? The two runs with random_state=2 give the same split, the two runs with random_state=1 give the same split, and the split for random_state=1 differs from the split for random_state=2.
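Different seeds give different splits, but every one of them is an equally valid 5-fold partition. The sketch below checks that for the seeds used in this post, assuming scikit-learn and NumPy are installed; `folds_for_seed` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold


def folds_for_seed(seed, n_samples=25, n_splits=5):
    """Return the test indices of each fold for one integer seed (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [test_index.tolist() for _, test_index in kf.split(np.arange(n_samples))]


for seed in (1, 2, 42, 160):
    folds = folds_for_seed(seed)
    covered = sorted(i for fold in folds for i in fold)
    # Print the fold sizes and whether the test folds cover 0..24 exactly once
    print(seed, [len(fold) for fold in folds], covered == list(range(25)))

# The seeds only change which sample lands in which fold
print(folds_for_seed(1) != folds_for_seed(2))
```

No seed is better than another here; each one just fixes a different assignment of samples to folds.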

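So the integer given to random_state has no meaning as a quantity. It is the seed of the pseudo-random number generator that does the shuffling, and in effect it works as a label: 1, 2, 42 and 160 are all just labels, none of them bigger or smaller than another in any meaningful sense, and each stands for one particular shuffle.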
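The "seed as a label" idea can be shown with NumPy alone. This is a conceptual sketch, not a copy of how scikit-learn implements KFold: it only demonstrates that a generator started from the same seed always produces the same permutation. The function name `shuffled_indices` is made up for the example.

```python
import numpy as np


def shuffled_indices(seed, n_samples=25):
    """Shuffle 0..n_samples-1 with a generator started from `seed` (illustrative helper)."""
    rng = np.random.RandomState(seed)
    return rng.permutation(n_samples).tolist()


print(shuffled_indices(1) == shuffled_indices(1))        # True: same seed, same shuffle
print(shuffled_indices(1) == shuffled_indices(2))        # False: different seed, different shuffle
print(sorted(shuffled_indices(160)) == list(range(25)))  # True: still the same 25 indices
```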

Appendix:

The appendix shows two runs each with random_state=42 and random_state=160, for readers who want to see it with their own eyes:

First run with random_state=42:

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  6  7  9 10 12 13 14 15 17 18 19 20 21 22 24],test_index:[ 0  8 11 16 23]
train_index:[ 0  2  3  4  6  7  8 10 11 12 14 15 16 17 18 19 20 21 23 24],test_index:[ 1  5  9 13 22]
train_index:[ 0  1  5  6  7  8  9 10 11 13 14 16 17 18 19 20 21 22 23 24],test_index:[ 2  3  4 12 15]
train_index:[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 19 22 23],test_index:[17 18 20 21 24]
train_index:[ 0  1  2  3  4  5  8  9 11 12 13 15 16 17 18 20 21 22 23 24],test_index:[ 6  7 10 14 19]

Second run with random_state=42:

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  6  7  9 10 12 13 14 15 17 18 19 20 21 22 24],test_index:[ 0  8 11 16 23]
train_index:[ 0  2  3  4  6  7  8 10 11 12 14 15 16 17 18 19 20 21 23 24],test_index:[ 1  5  9 13 22]
train_index:[ 0  1  5  6  7  8  9 10 11 13 14 16 17 18 19 20 21 22 23 24],test_index:[ 2  3  4 12 15]
train_index:[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 19 22 23],test_index:[17 18 20 21 24]
train_index:[ 0  1  2  3  4  5  8  9 11 12 13 15 16 17 18 20 21 22 23 24],test_index:[ 6  7 10 14 19]

First run with random_state=160:

kf = KFold(n_splits=5, shuffle=True, random_state=160)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  7  8  9 10 11 12 14 15 16 17 19 20 22 23 24],test_index:[ 0  6 13 18 21]
train_index:[ 0  1  2  3  4  5  6  7  8 11 12 13 15 16 18 19 21 22 23 24],test_index:[ 9 10 14 17 20]
train_index:[ 0  1  2  6  8  9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 3  4  5  7 11]
train_index:[ 0  1  3  4  5  6  7  8  9 10 11 13 14 15 17 18 20 21 22 24],test_index:[ 2 12 16 19 23]
train_index:[ 0  2  3  4  5  6  7  9 10 11 12 13 14 16 17 18 19 20 21 23],test_index:[ 1  8 15 22 24]

Second run with random_state=160:

kf = KFold(n_splits=5, shuffle=True, random_state=160)
for train_index, test_index in kf.split(xx):
    print('train_index:%s,test_index:%s' % (train_index, test_index))

Output:

train_index:[ 1  2  3  4  5  7  8  9 10 11 12 14 15 16 17 19 20 22 23 24],test_index:[ 0  6 13 18 21]
train_index:[ 0  1  2  3  4  5  6  7  8 11 12 13 15 16 18 19 21 22 23 24],test_index:[ 9 10 14 17 20]
train_index:[ 0  1  2  6  8  9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 3  4  5  7 11]
train_index:[ 0  1  3  4  5  6  7  8  9 10 11 13 14 15 17 18 20 21 22 24],test_index:[ 2 12 16 19 23]
train_index:[ 0  2  3  4  5  6  7  9 10 11 12 13 14 16 17 18 19 20 21 23],test_index:[ 1  8 15 22 24]

See? With the same random_state value the split is the same, and with different values the split is different, whether the value is 1, 2, 42 or 160.
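One case from the parameter description was not tried above: random_state can also be a RandomState instance. A minimal sketch of how that behaves, assuming scikit-learn and NumPy are installed (`collect_test_folds` is a hypothetical helper): the generator object keeps advancing, so repeated split() calls on the same KFold give different folds, while a fresh generator with the same seed reproduces the first split.

```python
import numpy as np
from sklearn.model_selection import KFold


def collect_test_folds(kf, data):
    """Collect the test indices of every fold as plain lists (hypothetical helper)."""
    return [test_index.tolist() for _, test_index in kf.split(data)]


xx = np.arange(25)

# One generator object shared across calls: its internal state moves on after each shuffle
rng = np.random.RandomState(1)
kf = KFold(n_splits=5, shuffle=True, random_state=rng)
first_call = collect_test_folds(kf, xx)
second_call = collect_test_folds(kf, xx)
print(first_call == second_call)  # False

# A fresh generator started from the same seed reproduces the first split
fresh = KFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(1))
print(collect_test_folds(fresh, xx) == first_call)  # True
```

If all you need is the same split every time, an integer is the simpler option.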