How to perform undersampling with Python scikit-learn (the right way)?

I am attempting to undersample the majority class using Python scikit-learn. Currently my code finds the count N of the minority class and then undersamples exactly N rows from the majority class, so both the training and test data end up with this 1:1 distribution. But what I really want is the 1:1 distribution in the training data ONLY, while testing on the original distribution in the test data.

I am not quite sure how to do the latter, because there is a dict-vectorization step in between, which confuses me.

# Perform undersampling of the majority group
minorityN = len(df[df.ethnicity_scan == 1])  # total count of the low-frequency group
minority_indices = df[df.ethnicity_scan == 1].index
minority_sample = df.loc[minority_indices]

majority_indices = df[df.ethnicity_scan == 0].index
# use the low-frequency group count to randomly sample from the high-frequency group
random_indices = np.random.choice(majority_indices, minorityN, replace=False)
majority_sample = df.loc[random_indices]  # was data.loc; 'data' is undefined here

# merge the low-frequency group with the randomly selected high-frequency sample
merged_sample = pd.concat([minority_sample, majority_sample], ignore_index=True)
df = merged_sample
print('Total N after undersampling:', len(df))
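As a sketch of what the question is actually asking for (balanced training data, untouched test data): split first, then undersample only the training portion. The column name ethnicity_scan is taken from the question; the toy DataFrame and everything else here are hypothetical, not the original code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the question's df (hypothetical data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=1000),
    "ethnicity_scan": (rng.random(1000) < 0.2).astype(int),  # ~20% minority
})

# 1) split FIRST, so the test set keeps the original class distribution
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# 2) undersample the majority class in the TRAINING portion only
minority = train_df[train_df.ethnicity_scan == 1]
majority = train_df[train_df.ethnicity_scan == 0]
majority_down = majority.sample(n=len(minority), random_state=42)
train_bal = pd.concat([minority, majority_down], ignore_index=True)

print(train_bal.ethnicity_scan.value_counts())  # balanced 1:1
print(test_df.ethnicity_scan.value_counts())    # original imbalance preserved
```

Because the split happens before any resampling, the test set is never touched and evaluation reflects the real-world distribution.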

# Declaring variables
X = df.raw_f1.values
X2 = df.f2.values
X3 = df.f3.values
X4 = df.f4.values
y = df.outcome.values

# Codes skipped ....

def feature_noNeighborLoc(locString):
    pass

my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]

# Codes skipped ....

# Dict vectorization
all_dict = []
for i in range(len(my_dict)):
    # merge all per-row feature dicts into one dict (Python 3 dict unpacking
    # replaces the Python 2 idiom my_dict[i].items() + my_dict2[i].items() + ...)
    temp_dict = {
        **my_dict[i], **my_dict2[i], **my_dict3[i], **my_dict4[i],
        **my_dict5[i], **my_dict6[i], **my_dict7[i], **my_dict8[i],
        **my_dict9[i], **my_dict10[i], **my_dict11[i], **my_dict12[i],
        **my_dict13[i], **my_dict14[i], **my_dict19[i],
        **my_dict16[i],  # location feature
    }
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=testTrainSplit)

# Fit the model on the training data
classifierUsed2.fit(X_train, y_train)

# Make predictions with the trained model
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)

Solution

You want to subsample the training samples of one of your categories because you want a classifier that treats all labels equally.

Instead of subsampling, you can set the 'class_weight' parameter of your classifier to 'balanced' (or 'auto' for some older classifiers), which does the job you want: sample weights are adjusted inversely proportional to the class frequencies in the training data.

You can read the documentation of the LogisticRegression classifier as an example; notice the description of the 'class_weight' parameter there.

By setting that parameter to 'balanced', you no longer need to do the subsampling, and your test set keeps its original distribution.
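A minimal sketch of this suggestion on synthetic data (not the asker's features): with class_weight='balanced', scikit-learn weights each sample by n_samples / (n_classes * count(class)), so no manual subsampling is needed and the test set keeps its natural distribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# imbalanced synthetic data: roughly 10-15% positives
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# effective weight per class: n_samples / (n_classes * bincount(y))
weights = len(y) / (2 * np.bincount(y))
print(dict(zip([0, 1], weights)))        # minority class gets the larger weight
print(clf.score(X_test, y_test))         # evaluated on the original distribution
```

The minority class receives a proportionally larger weight during fitting, which is equivalent in spirit to balancing the training set, without discarding any data.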
