How to perform undersampling with Python scikit-learn (the right way)?

I am attempting to undersample the majority class using Python scikit-learn. Currently my code finds the count N of the minority class and then undersamples exactly N rows from the majority class, so both the training and test data end up with this 1:1 distribution. But what I really want is the 1:1 distribution in the training data ONLY, while testing on the original distribution in the test data.

I am not quite sure how to do the latter, because there is a dict-vectorization step in between, which confuses me.

# Perform undersampling of the majority group
minorityN = len(df[df.ethnicity_scan == 1])  # total count of the low-frequency group
minority_indices = df[df.ethnicity_scan == 1].index
minority_sample = df.loc[minority_indices]

majority_indices = df[df.ethnicity_scan == 0].index
# use the low-frequency group count to randomly sample from the high-frequency group
random_indices = np.random.choice(majority_indices, minorityN, replace=False)
majority_sample = df.loc[random_indices]  # was data.loc; 'data' is undefined here

# merge the low-frequency group with the randomly selected high-frequency sample
merged_sample = pd.concat([minority_sample, majority_sample], ignore_index=True)
df = merged_sample
print('Total N after undersampling:', len(df))
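As a sketch of what the question is actually asking for (balanced training data, untouched test data): split first, then undersample only the training portion. The column name ethnicity_scan is taken from the question; the toy DataFrame and everything else here are hypothetical, not the original code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the question's df (hypothetical data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=1000),
    "ethnicity_scan": (rng.random(1000) < 0.2).astype(int),  # ~20% minority
})

# 1) split FIRST, so the test set keeps the original class distribution
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# 2) undersample the majority class in the TRAINING portion only
minority = train_df[train_df.ethnicity_scan == 1]
majority = train_df[train_df.ethnicity_scan == 0]
majority_down = majority.sample(n=len(minority), random_state=42)
train_bal = pd.concat([minority, majority_down], ignore_index=True)

print(train_bal.ethnicity_scan.value_counts())  # balanced 1:1
print(test_df.ethnicity_scan.value_counts())    # original imbalance preserved
```

Because the split happens before any resampling, the test set is never touched and evaluation reflects the real-world distribution.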

# Declaring variables
X = df.raw_f1.values
X2 = df.f2.values
X3 = df.f3.values
X4 = df.f4.values
y = df.outcome.values

# Codes skipped ....

def feature_noNeighborLoc(locString):
    pass

my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]

# Codes skipped ....

# Dict vectorization
all_dict = []
for i in range(len(my_dict)):
    # merge all per-row feature dicts into one dict (Python 3 dict unpacking
    # replaces the Python 2 idiom my_dict[i].items() + my_dict2[i].items() + ...)
    temp_dict = {
        **my_dict[i], **my_dict2[i], **my_dict3[i], **my_dict4[i],
        **my_dict5[i], **my_dict6[i], **my_dict7[i], **my_dict8[i],
        **my_dict9[i], **my_dict10[i], **my_dict11[i], **my_dict12[i],
        **my_dict13[i], **my_dict14[i], **my_dict19[i],
        **my_dict16[i],  # location feature
    }
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=testTrainSplit)

# Fit the model on the training data
classifierUsed2.fit(X_train, y_train)

# Make predictions with the trained model
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)

Solution

You want to subsample the training samples of one of your categories because you want a classifier that treats all labels equally.

Instead of subsampling, you can set the 'class_weight' parameter of your classifier to 'balanced' (or 'auto' for some older classifiers), which does the job you want: sample weights are adjusted inversely proportional to the class frequencies in the training data.

You can read the documentation of the LogisticRegression classifier as an example; notice the description of the 'class_weight' parameter there.

By setting that parameter to 'balanced', you no longer need to do the subsampling, and your test set keeps its original distribution.
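A minimal sketch of this suggestion on synthetic data (not the asker's features): with class_weight='balanced', scikit-learn weights each sample by n_samples / (n_classes * count(class)), so no manual subsampling is needed and the test set keeps its natural distribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# imbalanced synthetic data: roughly 10-15% positives
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# effective weight per class: n_samples / (n_classes * bincount(y))
weights = len(y) / (2 * np.bincount(y))
print(dict(zip([0, 1], weights)))        # minority class gets the larger weight
print(clf.score(X_test, y_test))         # evaluated on the original distribution
```

The minority class receives a proportionally larger weight during fitting, which is equivalent in spirit to balancing the training set, without discarding any data.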
