# Datacamp Notes & Code: Machine Learning with the Experts: School Budgets, Chapter 3: Improving your model

Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 22 (3)

Exercise

# Instantiate pipeline

In order to make your life easier as you start to work with all of the data in your original DataFrame, df, it’s time to turn to one of scikit-learn’s most useful objects: the Pipeline.

For the next few exercises, you’ll reacquaint yourself with pipelines and train a classifier on some synthetic (sample) data of multiple datatypes before using the same techniques on the main dataset.

The sample data is stored in the DataFrame, sample_df, which has three kinds of feature data: numeric, text, and numeric with missing values. It also has a label column with two classes, a and b.

In this exercise, your job is to instantiate a pipeline that trains using the numeric column of the sample data.

Instruction

• Import Pipeline from sklearn.pipeline.
• Create training and test sets using the numeric data only. Do this by specifying sample_df[['numeric']] in train_test_split().
• Instantiate a pipeline as pl by adding the classifier step. Use a name of 'clf' and the same classifier from Chapter 2: OneVsRestClassifier(LogisticRegression()).
• Fit your pipeline to the training data and compute its accuracy to see it in action! Since this is toy data, you’ll use the default scoring method for now. In the next chapter, you’ll return to log loss scoring.
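As a small sketch of the split step described in the instructions (the frame `toy_df` is a hypothetical stand-in for sample_df, not the course data), double brackets keep the feature selection two-dimensional, and train_test_split holds out 25% of the rows when no test size is given:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for sample_df: one numeric feature, two classes
toy_df = pd.DataFrame({
    'numeric': np.arange(100, dtype=float),
    'label': ['a', 'b'] * 50,
})

# Double brackets select a one-column DataFrame (2-D), not a Series (1-D)
X = toy_df[['numeric']]
y = pd.get_dummies(toy_df['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

print(X.shape)          # (100, 1)
print(X_train.shape)    # (75, 1)
print(X_test.shape)     # (25, 1)
print(list(y.columns))  # ['a', 'b']
```

The two-dimensional shape matters because scikit-learn estimators expect a feature matrix with one row per sample and one column per feature.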

import numpy as np
import pandas as pd

rng = np.random.RandomState(123)

SIZE = 1000

sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}

sample_df = pd.DataFrame(sample_data)

sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan

foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1

val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)

sample_df['label'] = np.where(val > np.median(val), 'a', 'b')


     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a

# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)

Accuracy on sample data - numeric, no nans:  0.62


Exercise

# Preprocessing numeric features

What would have happened if you had included the 'with_missing' column in the last exercise? Without imputing missing values, the pipeline would not be happy (try it and see). So, in this exercise you’ll improve your pipeline a bit by using the Imputer() imputation transformer from scikit-learn to fill in missing values in your sample data. (Imputer was removed in scikit-learn 0.22; on newer versions, SimpleImputer from sklearn.impute plays the same role.)
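To make the "try it and see" concrete, here is a minimal sketch on a tiny, hypothetical feature matrix (not the course data). It assumes a plain LogisticRegression, which rejects missing values during input validation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical feature matrix with one missing value
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and refuses NaN values
    fit_failed = True
    print("Fit failed:", type(err).__name__)
```

An imputation step placed before the classifier removes the NaN values, so the same fit goes through.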

By default, the imputer transformer replaces NaNs with the mean value of the column. That’s a good enough imputation strategy for the sample data, so you won’t need to pass anything extra to the imputer.
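A minimal sketch of mean imputation on a hypothetical 3x2 array follows. It uses SimpleImputer from sklearn.impute as an assumption, since that is the class available in recent scikit-learn releases; its default strategy is also the column mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 8.0]])

# Default strategy='mean': each NaN becomes the mean of its column
imputed = SimpleImputer().fit_transform(X)

print(imputed)
# [[1. 6.]
#  [3. 4.]
#  [5. 8.]]
```

The missing entry in the second column is replaced by 6, the mean of the observed values 4 and 8.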

After importing the transformer, you will edit the steps list used in the previous exercise by inserting a (name, transform) tuple. Recall that steps are processed sequentially, so make sure the new tuple encoding your preprocessing step is put in the right place.

The sample_df is in the workspace, in case you’d like to take another look. Make sure to select both numeric columns; in the previous exercise we couldn’t use with_missing because we had no preprocessing step!

Instruction

• Import Imputer from sklearn.preprocessing.
• Create training and test sets by selecting the correct subset of sample_df: 'numeric' and 'with_missing'.
• Add the tuple ('imp', Imputer()) to the correct position in the pipeline. Pipeline processes steps sequentially, so the imputation step should come before the classifier step.
• Complete the .fit() and .score() methods to fit the pipeline to the data and compute the accuracy.

# Import the Imputer object
from sklearn.preprocessing import Imputer

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
    ('imp', Imputer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)

Accuracy on sample data - all numeric, incl nans:  0.636


Exercise

# Preprocessing text features

Here, you’ll perform a similar preprocessing pipeline step, only this time you’ll use the text column from the sample data.

To preprocess the text, you’ll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (step, transform) tuple to the steps list in your pipeline.
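As a rough illustration of the bag-of-words representation, the sketch below runs a default CountVectorizer over the five distinct text values that appear in the sample data:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['', 'foo', 'bar', 'foo bar', 'bar foo']

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix, one row per document

# Vocabulary terms are assigned column indices in alphabetical order
print(sorted(vec.vocabulary_))    # ['bar', 'foo']
print(counts.toarray())
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]
#  [1 1]]
```

Word order is discarded, so 'foo bar' and 'bar foo' end up with identical rows.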

Make sure you select only the text column for splitting your training and test sets.
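The selection style matters here. In this hypothetical three-row sketch, single brackets give a Series whose values are the documents, while double brackets give a DataFrame, which iterates over its column labels instead:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy_df = pd.DataFrame({'text': ['foo', 'bar', 'foo bar']})

# A Series iterates over its values, so each row is one document
vocab_series = sorted(CountVectorizer().fit(toy_df['text']).vocabulary_)

# A DataFrame iterates over its column labels, so the only "document" is 'text'
vocab_frame = sorted(CountVectorizer().fit(toy_df[['text']]).vocabulary_)

print(vocab_series)  # ['bar', 'foo']
print(vocab_frame)   # ['text']
```

This is why the text-only split uses sample_df['text'], whereas the numeric splits use double brackets.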

As usual, your sample_df is ready and waiting in the workspace.

Instruction

• Import CountVectorizer from sklearn.feature_extraction.text.
• Create training and test sets by selecting the correct subset of sample_df: 'text'.
• Add the CountVectorizer step (with the name 'vec') to the correct position in the pipeline.
• Fit the pipeline to the training data and compute its accuracy.

# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Split out only the text data
X_train, X_test, y_train, y_test = train_test_split(sample_df['text'],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=456)

# Instantiate Pipeline object: pl
pl = Pipeline([
    ('vec', CountVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)


Accuracy on sample data - just text data:  0.808


Exercise

# Multiple types of processing: FeatureUnion

Now that you can separate text and numeric data in your pipeline, you’re ready to perform separate steps on each by nesting pipelines and using FeatureUnion().

These tools will allow you to streamline all preprocessing steps for your model, even when multiple datatypes are involved. Here, for example, you don’t want to impute your text data, and you don’t want to create a bag-of-words with your numeric data. Instead, you want to deal with these separately and then join the results together using FeatureUnion().

In the end, you’ll still have only two high-level steps in your pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data. The results of those pipelines are joined using FeatureUnion().
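The joining step can be sketched on a hypothetical three-row frame. The example assumes SimpleImputer in place of Imputer so that it runs on recent scikit-learn versions; the step names mirror the ones used in this exercise:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

toy_df = pd.DataFrame({
    'numeric': [1.0, 2.0, 3.0],
    'with_missing': [np.nan, 4.0, 6.0],
    'text': ['foo', 'bar', 'foo bar'],
})

# Selectors that hand each nested pipeline only the columns it needs
get_numeric = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)
get_text = FunctionTransformer(lambda x: x['text'], validate=False)

union = FeatureUnion([
    ('numeric_features', Pipeline([('selector', get_numeric), ('imputer', SimpleImputer())])),
    ('text_features', Pipeline([('selector', get_text), ('vectorizer', CountVectorizer())])),
])

features = union.fit_transform(toy_df)
dense = features.toarray()  # the union is sparse because the text branch is sparse

print(features.shape)  # (3, 4): 2 numeric columns + 2 vocabulary terms
print(dense[0])        # first row: 1.0, imputed 5.0, then counts for 'bar' and 'foo'
```

Each branch produces its own block of columns, and FeatureUnion places the blocks side by side in the order the branches are listed.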

Instruction

• In process_and_join_features:
    • Add the steps ('selector', get_numeric_data) and ('imputer', Imputer()) to the 'numeric_features' preprocessing step.
    • Add the equivalent steps for the 'text_features' preprocessing step. That is, use get_text_data and a CountVectorizer step with the name 'vectorizer'.
• Add the transform step process_and_join_features to 'union' in the main pipeline, pl.
• Hit ‘Submit Answer’ to see the pipeline in action!

import numpy as np
import pandas as pd

rng = np.random.RandomState(123)

SIZE = 1000

sample_data = {
    'numeric': rng.normal(0, 10, size=SIZE),
    'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
    'with_missing': rng.normal(loc=3, size=SIZE)
}

sample_df = pd.DataFrame(sample_data)

sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan

foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1

val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)

sample_df['label'] = np.where(val > np.median(val), 'a', 'b')

## Import the pipeline elements from previous exercise
# Import splitting and pipeline objects from sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Import elements for simple pipeline from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import the Imputer object
from sklearn.preprocessing import Imputer

# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Import functional utilities
from sklearn.preprocessing import FunctionTransformer

# Simple selector transforms to be used in FeatureUnion
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)


# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)

## Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
    transformer_list=[
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer())
        ]))
    ]
)

# Instantiate nested pipeline: pl
pl = Pipeline([
    ('union', process_and_join_features),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear')))
])

# Fit pl to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)

Accuracy on sample data - all data:  0.928


Exercise

# Using FunctionTransformer on the main dataset

In this exercise you’re going to use FunctionTransformer on the primary budget data, before instantiating a multiple-datatype pipeline in the next exercise.

Recall from Chapter 2 that you used a custom function combine_text_columns to select and properly format text data for tokenization; it is loaded into the workspace and ready to be put to work in a function transformer!

Concerning the numeric data, you can use NUMERIC_COLUMNS, preloaded as usual, to help design a subset-selecting lambda function.
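Both selectors can be sketched on a hypothetical two-row frame. The text column names and the helper `join_text` (a simplified variant of combine_text_columns) are assumptions made for illustration, not the course's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical two-row stand-in for the budget data
NUMERIC_COLUMNS = ['FTE', 'Total']
toy_df = pd.DataFrame({
    'FTE': [1.0, np.nan],
    'Total': [50000.0, 120.5],
    'Job_Title_Description': ['Teacher', np.nan],
    'Sub_Object_Description': ['Salaries', 'Supplies'],
})

def join_text(data_frame):
    """Simplified, hypothetical variant of combine_text_columns."""
    text_data = data_frame.drop(columns=NUMERIC_COLUMNS).fillna('')
    # Join the text cells of each row with single spaces
    return text_data.apply(lambda row: ' '.join(row), axis=1)

get_text_data = FunctionTransformer(join_text, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

text_out = get_text_data.fit_transform(toy_df)
numeric_out = get_numeric_data.fit_transform(toy_df)

print(text_out.tolist())          # ['Teacher Salaries', ' Supplies']
print(list(numeric_out.columns))  # ['FTE', 'Total']
```

Passing validate=False keeps scikit-learn from trying to convert the DataFrame to a numeric array before the function sees it, which would fail on the text columns.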

You’re all finished with sample data. The original df is back in the workspace, ready to use.

Instruction

• Complete the call to multilabel_train_test_split() by selecting df[NON_LABELS].
• Compute get_text_data by using FunctionTransformer() and passing in combine_text_columns. Be sure to also specify validate=False.
• Use FunctionTransformer() to compute get_numeric_data. In the lambda function, select out the NUMERIC_COLUMNS of x. Like you did when computing get_text_data, also specify validate=False.

# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer

#### DEFINE SAMPLING UTILITIES ####

# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels y and returns
        the indices for a sample of size size if
        size > 1 or size * len(y) if size <= 1.

        The sample is guaranteed to have > min_count of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        # cast to int so the value can be used as a sample count below
        size = int(np.floor(y.shape[0] * size))

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = size - sample_idxs.shape[0]

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])

# Now define multilabel_train_test_split to be used below
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix X and a label matrix Y and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least min_count times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    # boolean row masks for the two splits
    test_set_mask = np.isin(index, test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

####

# Labels
LABELS = ['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']

NUMERIC_COLUMNS = ['FTE', "Total"]

# Convert object to category for LABELS
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'))

# Define combine_text_columns() for use in sklearn.preprocessing.FunctionTransformer
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # replace nans with blanks
    text_data.fillna("", inplace=True)

    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)


# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               seed=123)

# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
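To see what the dummy encoding of several label columns looks like, here is a small sketch on a hypothetical frame that uses two of the label names from LABELS (the label values are made up for illustration):

```python
import pandas as pd

# Hypothetical label values for two of the label columns
toy_labels = pd.DataFrame({
    'Use': ['Instruction', 'Operations', 'Instruction'],
    'Pre_K': ['NO', 'PreK', 'NO'],
})

dummies = pd.get_dummies(toy_labels)

print(list(dummies.columns))
# ['Use_Instruction', 'Use_Operations', 'Pre_K_NO', 'Pre_K_PreK']

# Each row has exactly one active indicator per original label column
print(dummies.astype(int).sum(axis=1).tolist())  # [2, 2, 2]
```

Every row therefore carries several positive indicators at once, which is why the split uses multilabel_train_test_split rather than a plain random split.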


Exercise

# Add a model to the pipeline

You’re about to take everything you’ve learned so far and implement it in a Pipeline that works with the real, DrivenData budget line item data you’ve been exploring.

Surprise! The structure of the pipeline is exactly the same as earlier in this chapter:

• the preprocessing step uses FeatureUnion to join the results of nested pipelines that each rely on FunctionTransformer to select multiple datatypes
• the model step stores the model object

You can then call familiar methods like .fit() and .score() on the Pipeline object pl.

Instruction

• Complete the 'numeric_features' transform with the following steps:
    • get_numeric_data, with the name 'selector'.
    • Imputer(), with the name 'imputer'.
• Complete the 'text_features' transform with the following steps:
    • get_text_data, with the name 'selector'.
    • CountVectorizer(), with the name 'vectorizer'.
• Fit the pipeline to the training data.
• Hit ‘Submit Answer’ to compute the accuracy!

# Complete the pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Accuracy on budget dataset:  0.20384615384615384
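One likely reason the number looks low: when the targets form a label indicator matrix, the default .score() of a scikit-learn classifier reports subset accuracy, so a row counts as correct only if every label column is predicted correctly. The sketch below uses small hypothetical matrices to contrast this with the share of individual cells that match:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical indicator matrices: 4 rows, 3 label columns
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # all three columns correct
                   [0, 1, 1],   # one column wrong
                   [1, 1, 0],   # all three columns correct
                   [0, 1, 1]])  # one column wrong

subset_acc = accuracy_score(y_true, y_pred)  # share of rows that match exactly
per_cell_acc = (y_true == y_pred).mean()     # share of individual cells that match

print(subset_acc)    # 0.5
print(per_cell_acc)  # 10 of 12 cells, about 0.83
```

With many indicator columns per row, as in the budget data, a single wrong column is enough to mark the whole row as a miss.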


Exercise

# Try a different class of model

Now you’re cruising. One of the great strengths of pipelines is how easy they make the process of testing different models.

Until now, you’ve been using the model step ('clf', OneVsRestClassifier(LogisticRegression())) in your pipeline.

But what if you want to try a different model? Do you need to build an entirely new pipeline? New nests? New FeatureUnions? Nope! You just have a simple one-line change, as you’ll see in this exercise.
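Besides editing the steps list by hand, an existing pipeline can have one step replaced through set_params. The sketch below is a hypothetical two-step pipeline in which 'scale' stands in for any preprocessing step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline; 'scale' stands in for any preprocessing
pl = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Replace only the model step; the preprocessing step is untouched
pl.set_params(clf=RandomForestClassifier())

print(type(pl.named_steps['clf']).__name__)    # RandomForestClassifier
print(type(pl.named_steps['scale']).__name__)  # StandardScaler
```

Either way, only the ('clf', ...) entry changes, which is what makes comparing models inside a pipeline cheap.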

In particular, you’ll swap out the logistic-regression model and replace it with a random forest classifier, which uses the statistics of an ensemble of decision trees to generate predictions.
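As a rough illustration of the ensemble idea, the sketch below fits a small forest on synthetic data (hypothetical, not the budget data) and checks that the forest's predicted probabilities are the average of the probabilities from its individual trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class problem
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Average the class probabilities of the individual trees by hand
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
matches = np.allclose(tree_probs, forest.predict_proba(X))

print(len(forest.estimators_))  # 5
print(matches)                  # True
```

Each tree is trained on a bootstrap sample of the rows, so the individual trees disagree and the average is smoother than any single tree.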

Instruction

• Import the RandomForestClassifier from sklearn.ensemble.
• Add a RandomForestClassifier() step named 'clf' to the pipeline.
• Hit ‘Submit Answer’ to fit the pipeline to the training data and compute its accuracy.

# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Edit model step in pipeline
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier())
])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Accuracy on budget dataset:  0.28076923076923077


Exercise

# Can you adjust the model or parameters to improve accuracy?

You just saw a substantial improvement in accuracy by swapping out the model. Pipelines are amazing!

Can you make it better? Try changing the parameter n_estimators of RandomForestClassifier() from its default value (10 in the scikit-learn version used by the course; 100 since version 0.22) to 15.
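A parameter of a step can also be changed on an existing pipeline with the '<step name>__<parameter>' naming convention. This is a minimal sketch with a one-step pipeline rather than the full budget pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pl = Pipeline([('clf', RandomForestClassifier())])

# '<step name>__<parameter>' reaches into a step of the pipeline
pl.set_params(clf__n_estimators=15)

n_trees = pl.get_params()['clf__n_estimators']
print(n_trees)  # 15
```

The same naming convention is what grid-search utilities in scikit-learn use to tune parameters of steps inside a pipeline.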

Instruction

• Import the RandomForestClassifier from sklearn.ensemble.
• Add a RandomForestClassifier() step with n_estimators=15 to the pipeline with a name of 'clf'.
• Hit ‘Submit Answer’ to fit the pipeline to the training data and compute its accuracy.

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Add model step to pipeline: pl
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', CountVectorizer())
            ]))
        ]
    )),
    ('clf', RandomForestClassifier(n_estimators=15))
])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Accuracy on budget dataset:  0.3230769230769231

