For more raw data files and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (2)
Exercise
Setting up a train-test split in scikit-learn
Alright, you’ve been patient and awesome. It’s finally time to start training models!
The first step is to split the data into a training set and a test set. Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split.

Feel free to check out the full code for multilabel_train_test_split here.

You'll start with a simple model that uses just the numeric columns of your DataFrame when calling multilabel_train_test_split. The data has been read into a DataFrame df, and a list consisting of just the numeric columns is available as NUMERIC_COLUMNS.
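To see why the min_count guarantee matters, here is a quick sanity check (not from the course): if a label has only 3 positive examples among 100 rows, a plain uniform 20% split misses that label in the test set about half the time.

```python
from math import comb

# 100 rows, one label with only 3 positive examples, 20-row test set:
# probability that a uniform random split puts none of the 3 positives
# into the test set
p_missing = comb(97, 20) / comb(100, 20)
print(round(p_missing, 3))  # → 0.508
```

With min_count enforced per split, this failure mode is ruled out by construction.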
Instructions
- Create a new DataFrame named numeric_data_only by applying the .fillna(-1000) method to the numeric columns (available in the list NUMERIC_COLUMNS) of df.
- Convert the labels (available in the list LABELS) to dummy variables. Save the result as label_dummies.
- In the call to multilabel_train_test_split(), set the size of your test set to be 0.2. Use a seed of 123.
- Fill in the .info() method calls for X_train, X_test, y_train, and y_test.
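The first two preparation steps can be sketched on a toy DataFrame. The df, NUMERIC_COLUMNS, and LABELS below are made-up stand-ins for the exercise's data, and the split call itself is omitted since multilabel_train_test_split is defined further down.

```python
import pandas as pd

# Toy stand-ins for the exercise's df, NUMERIC_COLUMNS, and LABELS
df = pd.DataFrame({
    'FTE': [1.0, None, 0.4],
    'Total': [50000.0, 3000.0, None],
    'Function': ['Teacher', 'Admin', 'Teacher'],
})
NUMERIC_COLUMNS = ['FTE', 'Total']
LABELS = ['Function']

# Step 1: keep only the numeric columns, filling NaNs with -1000
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Step 2: convert the label columns to binary dummy variables
label_dummies = pd.get_dummies(df[LABELS])

print(label_dummies.columns.tolist())  # → ['Function_Admin', 'Function_Teacher']
```

The -1000 fill value is just a sentinel that lies far outside the range of the real numeric data, so the model can treat "missing" as its own signal.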
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer
#### DEFINE SAMPLING UTILITIES
# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """Sample row indices from a binary indicator matrix `y`, guaranteeing
    at least `min_count` examples of each label. If `size` <= 1, it is
    treated as a fraction of the number of rows."""
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])  # cast to int: np.floor above returns a float

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count