I would use a Pandas DataFrame and numpy.random.choice for this. That makes it easy to draw random samples and produce groups of equal size. An example:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]
This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy group, you can:
healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]
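If the draw needs to be repeatable, one option is NumPy's seeded Generator in place of the global np.random state. This is only a sketch: the seed value 42 and the name repeat_indices are assumptions for illustration and not part of the example above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]
healthy_indices = data[data.Healthy == 1].index

# A seeded generator; 42 is an arbitrary seed chosen for illustration.
rng = np.random.default_rng(42)
random_indices = rng.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]

# A fresh generator with the same seed reproduces the same draw.
repeat_indices = np.random.default_rng(42).choice(healthy_indices, 2, replace=False)
```

With replace=False the two picked rows are always distinct, and re-running with the same seed selects the same rows.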
To automatically pick a subsample the same size as the non-healthy group, you can do the following:
sample_size = sum(data.Healthy == 0) # Equivalent to len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
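To end up with a single balanced dataset, the sampled healthy rows can be combined with the non-healthy rows. The sketch below uses DataFrame.sample as an alternative to np.random.choice, which is a swapped-in technique and not the approach shown above. The names non_healthy, healthy and balanced, and the random_state value, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

non_healthy = data[data.Healthy == 0]
healthy = data[data.Healthy == 1]

# Draw as many healthy rows as there are non-healthy rows, without replacement.
# random_state=0 is an arbitrary choice that makes the draw repeatable.
healthy_sample = healthy.sample(n=len(non_healthy), random_state=0)

# Stack both groups into one DataFrame with equal class sizes.
balanced = pd.concat([healthy_sample, non_healthy])
```

The result keeps the original row labels, so balanced.index shows which rows were selected.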