Sampling text data in Python: over-sampling (resampling) imbalanced data with Python

Latest recommended article published on 2023-11-15 17:47:18

weixin_39940957

Tags: sampling text data in Python

The data file to be resampled is in LIBSVM format, for example heart_scale:

+1 1:0.708333 2:1 3:1 4:-0.320755 5:-0.105023 6:-1 7:1 8:-0.419847 9:-1 10:-0.225806 12:1 13:-1

-1 1:0.583333 2:-1 3:0.333333 4:-0.603774 5:1 6:-1 7:1 8:0.358779 9:-1 10:-0.483871 12:-1 13:1

....
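To make the layout of these lines concrete, here is a minimal sketch of how a single LIBSVM-format line breaks down: the first token is the class label and every later token is an `index:value` pair. Features with value zero are simply left out, which is why index 11 is missing from the sample lines above. `parse_libsvm_line` is a hypothetical helper written for illustration only; the script in this post relies on scikit-learn's `load_svmlight_file` instead.

```python
def parse_libsvm_line(line):
    """Split one LIBSVM-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # "+1" -> 1.0, "-1" -> -1.0
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# First sample line from heart_scale, as shown above
line = ("+1 1:0.708333 2:1 3:1 4:-0.320755 5:-0.105023 6:-1 7:1 "
        "8:-0.419847 9:-1 10:-0.225806 12:1 13:-1")
label, features = parse_libsvm_line(line)
print(label, len(features), 11 in features)  # prints: 1.0 12 False
```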

The resampled data is saved to a file, also in LIBSVM format, here heart_scale_balance.txt.

Python code:

import numpy as np
from scipy.sparse import vstack
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.utils import check_random_state


def fit_sample(X, y):
    """Resample the dataset by random over-sampling.

    Each minority class is padded with randomly repeated samples
    until it is as large as the majority class.
    """
    # Count the samples of each class and find the majority class
    stats_c_ = {}
    maj_n = 0
    for i in np.unique(y):
        nk = int(np.sum(y == i))
        stats_c_[i] = nk
        if nk > maj_n:
            maj_n = nk
            maj_c_ = i

    # Keep the samples from the majority class
    X_resampled = X[y == maj_c_]
    y_resampled = list(y[y == maj_c_])

    # Seed the generator once so the result is reproducible
    random_state = check_random_state(42)

    # Loop over the other classes, picking samples at random
    for key in stats_c_:
        # If this is the majority class, skip it
        if key == maj_c_:
            continue
        # Number of extra samples to create for this class
        num_samples = stats_c_[maj_c_] - stats_c_[key]
        # Pick some elements at random (with replacement)
        indx = random_state.randint(low=0, high=stats_c_[key], size=num_samples)
        X_key, y_key = X[y == key], y[y == key]
        # Append the original and the repeated samples of this class
        X_resampled = vstack([X_resampled, X_key, X_key[indx]], format="csr")
        print(np.shape(y_resampled), np.shape(y_key), np.shape(y_key[indx]))
        y_resampled = y_resampled + list(y_key) + list(y_key[indx])
    return X_resampled, y_resampled


X_train, y_train = load_svmlight_file("heart_scale")

# Apply the random over-sampling
X_train, y_train = fit_sample(X_train, y_train)

dump_svmlight_file(X_train, y_train, 'heart_scale_balance.txt', zero_based=False)
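As a quick sanity check once the script has run, the class counts in the output can be tallied: after random over-sampling, every class should have as many lines as the majority class. The following is a minimal sketch using only the standard library. `count_labels` is a hypothetical helper, not part of scikit-learn, and it assumes each line begins with the class label as in the sample lines shown earlier. The in-memory list uses made-up values in place of the real file.

```python
from collections import Counter


def count_labels(lines):
    """Count how many LIBSVM-format lines carry each class label."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        counts[float(line.split()[0])] += 1
    return counts


# Tiny in-memory stand-in for a balanced output file (made-up values)
sample = ["1 1:0.5 2:1", "-1 1:0.2", "-1 2:0.7", "1 1:0.5 2:1"]
print(count_labels(sample))

# To check the real output, pass the open file instead:
# with open("heart_scale_balance.txt") as f:
#     print(count_labels(f))
```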
