Calling a random stratified sampling method in Python: 1:1 stratified sampling per group in Python

Latest recommended article published on 2023-03-30 22:32:49

weixin_39986741

Article tags: calling a random stratified sampling method in Python

How can a 1:1 stratified sampling be performed in python?

Assume the Pandas Dataframe df to be heavily imbalanced. It contains a binary group and multiple columns of categorical sub groups.

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1], 'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

display(df)

display(df[df.group == 1])

display(df[df.group == 0])

df.group.value_counts()

For each member of the main group (group == 1), I need to find a single match from group == 0 with the same sub-category values.

A StratifiedShuffleSplit from scikit-learn will only return a random portion of data, not a 1:1 match.
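Before matching, it can help to see how many rows each group has in every sub-category combination, since a 1:1 match without reusing rows is only possible where group 0 has at least as many rows as group 1. A minimal sketch on the example frame above; the names `counts` and `enough_controls` are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

# rows per (sub_category_1, sub_category_2) combination, one column per group
counts = (df.groupby(['sub_category_1', 'sub_category_2', 'group'])
            .size()
            .unstack('group', fill_value=0))

# True where group 0 has enough rows to match every group 1 row exactly once
counts['enough_controls'] = counts[0] >= counts[1]
print(counts)
```

In this toy data both combinations have enough group 0 rows, so a match without replacement is feasible.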

Solution

If I understood correctly, you could use np.random.permutation:

import numpy as np

import pandas as pd

np.random.seed(42)

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1],

                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

# create new column with an identifier for a combination of categories

columns = ['sub_category_1', 'sub_category_2']

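# join with a separator so that e.g. (1, 12) and (11, 2) do not collide

labels = df.loc[:, columns].apply(lambda x: '_'.join(map(str, x.values)), axis=1)

# pd.factorize returns (codes, uniques); the codes are the per-row integer labels

df['label'] = pd.factorize(labels)[0]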

# build distribution of sub-categories combinations

distribution = df[df.group == 1].label.value_counts().to_dict()

# select from group 0 only those rows that are in the same sub-categories combinations

mask = (df.group == 0) & (df.label.isin(distribution))

# do random sampling

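# group by the column name (not a one-element list) so that name is a scalar key of distribution;

# np.concatenate also handles groups of different sizes, which np.ravel on a ragged list does not

selected = np.concatenate([np.random.permutation(group.index)[:distribution[name]] for name, group in df.loc[mask].groupby('label')])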

# display result (selected holds index labels, hence .loc rather than .iloc)

result = df.drop('label', axis=1).loc[selected]

print(result)

Output

group id sub_category_1 sub_category_2 value

4 0 5 1 1 2

2 0 3 2 2 3
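The same steps can be wrapped in a reusable function. The following is only a sketch under stated assumptions, not the answer's original code: the function name `match_one_to_one` and its parameters are hypothetical, it uses `groupby(...).ngroup()` instead of string labels to identify each sub-category combination, and it uses NumPy's `default_rng` generator instead of the global seed.

```python
import numpy as np
import pandas as pd


def match_one_to_one(df, strata_cols, group_col='group', seed=42):
    """Hypothetical helper: for every sub-category combination, draw as many
    group 0 rows (without replacement) as there are group 1 rows."""
    rng = np.random.default_rng(seed)
    # integer id per combination of the strata columns
    stratum = df.groupby(strata_cols).ngroup()
    is_case = df[group_col] == 1
    is_control = df[group_col] == 0
    # number of group 1 rows per combination
    needed = stratum[is_case].value_counts()
    picked = []
    for label, n in needed.items():
        pool = df.index[is_control & (stratum == label)].to_numpy()
        # slicing a permutation returns at most len(pool) rows
        picked.extend(rng.permutation(pool)[:n])
    return df.loc[picked]


df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

result = match_one_to_one(df, ['sub_category_1', 'sub_category_2'])
print(result)
```

On the example data this returns two group 0 rows: the row with id 3 (the only candidate for combination 2/2) and one of the rows with id 1 or 5.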

Note that this solution assumes that each sub-category combination in group 1 has no more rows than the corresponding combination in group 0; otherwise the slice silently returns fewer rows than requested and the match is no longer 1:1. A more robust version uses np.random.choice with replacement:

selected = np.concatenate([np.random.choice(group.index, distribution[name], replace=True) for name, group in df.loc[mask].groupby('label')])

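The version with choice does not share the assumption of the permutation version, but it still requires at least one group 0 row for each sub-category combination present in group 1. Because it samples with replacement, the same group 0 row can be matched more than once.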
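The two variants can also be combined so that group 0 rows are reused only where the pool is too small. This is a sketch of that idea rather than part of the original answer; the small frame below is made up for illustration (combination 'a' has two group 1 rows but only one group 0 row), and it assumes a `label` column already exists.

```python
import numpy as np
import pandas as pd

# hypothetical data: 'a' has two group 1 rows and one group 0 row
df = pd.DataFrame({'group': [1, 1, 0, 1, 0, 0],
                   'label': ['a', 'a', 'a', 'b', 'b', 'b']})

rng = np.random.default_rng(0)
# number of group 1 rows per label
needed = df[df.group == 1].label.value_counts()

selected = []
for name, pool in df[df.group == 0].groupby('label'):
    n = needed.get(name, 0)
    # sample with replacement only when the pool is smaller than required
    replace = len(pool) < n
    selected.extend(rng.choice(pool.index.to_numpy(), size=n, replace=replace))

result = df.loc[selected]
print(result)
```

Here the single group 0 row of 'a' (index 2) is drawn twice, while 'b' gets one of its two group 0 rows without any reuse.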