python - How to create test and train samples from one dataframe with pandas?

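I have a fairly large dataset in the form of a dataframe, and I was wondering how I could split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

17 answers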

391 votes

scikit-learn's train_test_split is a good one.

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
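
By default the split is reshuffled on every run. A minimal sketch of a reproducible version, assuming a made-up 100-row frame (the column names and the seed 42 are placeholders, not part of the answer above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for the asker's data.
df = pd.DataFrame(np.arange(200).reshape(100, 2), columns=["a", "b"])

# random_state pins the shuffle so the same rows land in each set on every run.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))
```

Both results are DataFrames that keep their original index labels, so no row appears in both sets.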

gobrewers14 answered 2019-03-14T22:09:42Z

219 votes

I would just use numpy's randn:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)

Out[15]: 21

In [16]: len(train)

Out[16]: 79
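
The mask is drawn row by row, so the split is only approximately 80/20 (79/21 here). If exact sizes matter, one possible variant is to shuffle the row positions once and cut at a fixed point. This is a sketch, not part of the answer above; the seed 0 is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))

rng = np.random.default_rng(0)      # seeded generator, seed chosen arbitrarily
order = rng.permutation(len(df))    # row positions 0..99 in random order
n_train = int(0.8 * len(df))        # exactly 80 rows go to training

train = df.iloc[order[:n_train]]
test = df.iloc[order[n_train:]]
```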

Andy Hayden answered 2019-03-14T22:09:18Z

179 votes

pandas' random sample will also work:

train=df.sample(frac=0.8,random_state=200)

test=df.drop(train.index)
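
One caveat: drop() removes rows by index label, so this pattern relies on the index being unique. A small sketch of a guard, assuming a hypothetical frame whose labels repeat:

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (e.g. after concatenating files).
df = pd.DataFrame({"x": range(10)}, index=[0, 1, 2, 3, 4] * 2)

# Give every row a unique label first, so drop() removes exactly the
# sampled rows and nothing else.
df = df.reset_index(drop=True)

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
```

Without the reset_index call, every row sharing a label with a sampled row would also be dropped from the test set.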

PagMax answered 2019-03-14T22:10:06Z

24 votes

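I would use scikit-learn's own train_test_split and generate the split from the index:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

y = df.pop('output')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # returns the training dataframe; .loc because X_train holds index labels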

Napitupulu Jon answered 2019-03-14T22:10:31Z

11 votes

You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split

trainingSet, testSet = train_test_split(df, test_size=0.2)

The test size can vary depending on the percentage of data you want to put in your test and train datasets.

user1775015 answered 2019-03-14T22:11:02Z

6 votes

There are many valid answers. Adding one more to the bunch.

from sklearn.model_selection import train_test_split

# get a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# get the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]

Abhi answered 2019-03-14T22:11:26Z

5 votes

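You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.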

import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
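
For comparison, scikit-learn's train_test_split can do a stratified split through its stratify parameter. A minimal sketch, assuming a made-up frame with a "label" column (the data, column names and seed are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class "a", 20 of class "b".
df = pd.DataFrame({"feature": range(100), "label": ["a"] * 80 + ["b"] * 20})

# stratify=df["label"] keeps the class proportions the same in both sets.
train, test = train_test_split(
    df, train_size=0.7, stratify=df["label"], random_state=0
)

print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```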

Apogentus answered 2019-03-14T22:11:58Z

2 votes

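This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

def make_sets(data_df, test_portion):
    import random as rnd
    tot_ix = range(len(data_df))
    # sample row positions for the test set without replacement
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    # every remaining position goes to the training set
    train_ix = sorted(set(tot_ix) - set(test_ix))
    # .iloc selects by position (.ix was removed in pandas 1.0)
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]
    return train_df, test_df

train_df, test_df = make_sets(data_df, 0.2)
test_df.head()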

Anarcho-Chossid answered 2019-03-14T22:12:23Z

2 votes

import pandas as pd
from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
# test_size is the fraction held out for testing: 0.2 gives the 80/20 split asked for
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)

Pardhu Gopalam answered 2019-03-14T22:12:41Z

1 votes

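You can use df.to_numpy() (the replacement for the removed df.as_matrix()) to create a NumPy array and pass that:

from sklearn.model_selection import train_test_split

Y = df.pop('target')  # 'target' is a placeholder; pop() needs the name of the label column
X = df.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model.fit(x_train, y_train)  # model is assumed to be a scikit-learn estimator
model.score(x_test, y_test)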

kiran6 answered 2019-03-14T22:13:06Z

1 votes

Just select a range of rows from df, like this:

row_count = df.shape[0]

split_point = int(row_count*1/5)

test_data, train_data = df[:split_point], df[split_point:]
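
Slicing by position takes the first fifth of the rows as they are stored, so the result is random only if the rows are already in random order. A sketch that shuffles first, assuming a hypothetical ordered frame (the seed 0 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})   # hypothetical, ordered data

# frac=1 returns every row in random order; random_state makes it repeatable.
shuffled = df.sample(frac=1, random_state=0)

split_point = int(len(shuffled) * 1 / 5)
test_data, train_data = shuffled[:split_point], shuffled[split_point:]
```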

Makio answered 2019-03-14T22:13:31Z

0 votes

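If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

def split_data(df, train_perc=0.8):
    # adds a boolean 'train' column to df marking the rows picked for training
    df['train'] = np.random.rand(len(df)) < train_perc
    train = df[df.train == 1]
    test = df[df.train == 0]
    return {'train': train, 'test': test}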

Johnny V answered 2019-03-14T22:13:55Z

0 votes

If you want to add columns later, I think you also need to take a copy rather than a slice of the dataframe.

msk = np.random.rand(len(df)) < 0.8

train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)

Hakim answered 2019-03-14T22:14:20Z

0 votes

How about this? df is my dataframe.

import math

total_size = len(df)
train_size = math.floor(0.66 * total_size)  # 2/3 of my dataset
# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)

Akash Jain answered 2019-03-14T22:14:43Z

0 votes

If you need to split your data with respect to the label column in your data set, you can use this:

def split_to_train_test(df, label_column, train_frac=0.8):
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)
    # DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat
    return pd.concat(train_parts), pd.concat(test_parts)

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to sample() if you want to control the split randomness or use some global random seed.
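
In recent pandas versions (1.1 and later, as far as I know) the per-label loop can be replaced by sampling within each group. A sketch on a made-up frame; the column names, data and seed are placeholders:

```python
import pandas as pd

# Hypothetical data with two classes of 10 rows each.
data = pd.DataFrame({"value": range(20), "class": ["x"] * 10 + ["y"] * 10})

# Draw 70% of the rows from every class; random_state fixes the draw.
train = data.groupby("class").sample(frac=0.7, random_state=0)

# Everything not sampled becomes the test set (assumes a unique index).
test = data.drop(train.index)
```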

MikeL answered 2019-03-14T22:15:15Z

0 votes

To split into more than two sets, such as training, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs >= 0.7) & (probs < 0.85)
validation_mask = probs >= 0.85

df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This puts roughly 70% of the data in training, 15% in test, and 15% in validation.
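
Because each row is assigned independently, the three sizes fluctuate from run to run. If exact sizes are wanted, one option is to shuffle once and slice by position. A sketch assuming a hypothetical 100-row frame and an arbitrary seed:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})           # hypothetical data

shuffled = df.sample(frac=1, random_state=0)   # all rows, random order
n = len(shuffled)
train_end = n * 70 // 100                      # first 70% -> training
test_end = n * 85 // 100                       # next 15%  -> test

df_training = shuffled.iloc[:train_end]
df_test = shuffled.iloc[train_end:test_end]
df_validation = shuffled.iloc[test_end:]       # remaining 15% -> validation
```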

AHonarmand answered 2019-03-14T22:15:46Z

0 votes

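To my taste, a more elegant way is to create a random column and then split by it. That way we get a split that suits our needs and is random.

def split_df(df, p=[0.8, 0.2]):
    import numpy as np
    # assign each row a part number 0..len(p)-1 with the given probabilities
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # iterate over range(len(p)) so the parts come back in the same order as p
    r = [df[df["rand"] == val] for val in range(len(p))]
    return r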

thebeancounter answered 2019-03-14T22:16:12Z