More raw data documents and Jupyter notebooks:
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (4)
Exercise
Deciding what’s a word
Before you build up to the winning pipeline, it will be useful to look a little deeper into how the text features will be processed.
In this exercise, you will use CountVectorizer on the training data X_train (preloaded into the workspace) to see the effect of tokenization on punctuation.
Remember, since CountVectorizer expects a vector, you’ll need to use the preloaded function combine_text_columns before fitting to the training data.
Instruction
- Create text_vector by preprocessing X_train using combine_text_columns. This is important, or else you won’t get any tokens!
- Instantiate CountVectorizer as text_features. Specify the keyword argument token_pattern=TOKENS_ALPHANUMERIC.
- Fit text_features to the text_vector.
- Hit ‘Submit Answer’ to print the first 10 tokens.
# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer
#### DEFINE SAMPLING UTILITIES
# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
try:
if (np.unique(y).astype(int) != np.array([0, 1])).all():
raise ValueError()
except (TypeError, ValueError):
raise ValueError('multilabel_sample only works with binary indicator matrices')
if (y.sum(axis=0) < min_count).any():
raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')
if size <= 1:
size = np.floor(y.shape[0] * size)
if y.shape[1] * min_count > size:
msg = "Size less than number of columns * min_count, returning {} items instead of {}."
warn(msg.format(y.shape[1] * min_count, size))
size = y.shape[1] * min_count
rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))
if isinstance(y, pd.DataFrame):
choices = y.index
y = y.values
else:
choices = np.arange(y.shape[0])
sample_idxs = np.array([], dtype=choices.dtype)
# first, guarantee > min_count of each label
for j in range(y.shape[1]):
label_choices = choices[y[:, j] == 1]
label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])
sample_idxs = np.unique(sample_idxs)
# now that we have at least min_count of each, we can just random sample
sample_count = int(size - sample_idxs.shape[0])
# get sample_count indices from remaining choices
remaining_choices = np.setdiff1d(choices, sample_idxs)
remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)
return np.concatenate([sample_idxs, remaining_sampled])
# Now define multilabel_train_test_split to be used below
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])
test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
train_set_idxs = np.setdiff1d(index, test_set_idxs)
test_set_mask = index.isin(test_set_idxs)
train_set_mask = ~test_set_mask
return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])
#### END SAMPLING UTILITIES ####
# Load data
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Labels
LABELS = ['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']
NUMERIC_COLUMNS = ['FTE', "Total"]
# get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]
# Convert object to category for LABELS
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'))
# Define combine_text_columns() for use in sklearn.preprocessing.FunctionTransformer
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
""" Takes the dataset as read in, drops the non-feature, non-text columns and
then combines all of the text columns into a single vector that has all of
the text for a row.
:param data_frame: The data as read in with read_csv (no preprocessing necessary)
:param to_drop (optional): Removes the numeric and label columns by default.
"""
# drop non-text columns that are in the df
to_drop = set(to_drop) & set(data_frame.columns.tolist())
text_data = data_frame.drop(to_drop, axis=1)
# replace nans with blanks
text_data.fillna("", inplace=True)
# joins all of the text items in a row (axis=1)
# with a space in between
return text_data.apply(lambda x: " ".join(x), axis=1)
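# --- Illustrative sketch (not part of the exercise): what combine_text_columns
# returns for a tiny made-up DataFrame; the column names here are hypothetical.
_toy = pd.DataFrame({'Job_Title': ['TEACHER', 'CUSTODIAN'],
                     'Description': [np.nan, 'NIGHT SHIFT'],
                     'FTE': [1.0, 0.5]})
# 'FTE' is in NUMERIC_COLUMNS so it is dropped, the NaN becomes "", and each
# row collapses into one space-joined string of its text columns.
print(combine_text_columns(_toy))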
# TRAIN TEST SPLIT
# get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])
# split into train, test
# X_train, X_test, y_train, y_test = multilabel_train_test_split(
# df[NON_LABELS],
# dummy_labels,
# 0.2,
# min_count=3,
# seed=43)
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
dummy_labels,
size=0.2,
seed=123)
# Load path to pred
PATH_TO_PREDICTIONS = "predictions.csv"
# Load path to holdout
#PATH_TO_HOLDOUT_DATA = "https://s3.amazonaws.com/assets.datacamp.com/production/course_2826/datasets/TestSetSample.csv"
PATH_TO_HOLDOUT_LABELS = "https://s3.amazonaws.com/assets.datacamp.com/production/course_2826/datasets/TestSetLabelsSample.csv"
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2826/datasets/TestSetSample.csv'
from urllib.request import urlretrieve
urlretrieve(fn, 'HoldoutData.csv')
# SCORING UTILITIES
BOX_PLOTS_COLUMN_INDICES = [range(37),
range(37,48),
range(48,51),
range(51,76),
range(76,79),
range(79,82),
range(82,87),
range(87,96),
range(96,104)]
def _multi_multi_log_loss(predicted,
actual,
class_column_indices=BOX_PLOTS_COLUMN_INDICES,
eps=1e-15):
""" Multi class version of Logarithmic Loss metric as implemented on
DrivenData.org
"""
class_scores = np.ones(len(class_column_indices), dtype=np.float64)
# calculate log loss for each set of columns that belong to a class:
for k, this_class_indices in enumerate(class_column_indices):
# get just the columns for this class
preds_k = predicted[:, this_class_indices].astype(np.float64)
# normalize so probabilities sum to one (unless sum is zero, then we clip)
preds_k /= np.clip(preds_k.sum(axis=1).reshape(-1, 1), eps, np.inf)
actual_k = actual[:, this_class_indices]
# shrink predictions away from 0 and 1 so the log never blows up
y_hats = np.clip(preds_k, eps, 1 - eps)
sum_logs = np.sum(actual_k * np.log(y_hats))
class_scores[k] = (-1.0 / actual.shape[0]) * sum_logs
return np.average(class_scores)
def score_submission(pred_path=PATH_TO_PREDICTIONS, holdout_path=PATH_TO_HOLDOUT_LABELS):
# this happens on the backend to get the score
holdout_labels = pd.get_dummies(
pd.read_csv(holdout_path, index_col=0)
.apply(lambda x: x.astype('category'), axis=0)
)
preds = pd.read_csv(pred_path, index_col=0)
# make sure that format is correct
assert (preds.columns == holdout_labels.columns).all()
assert (preds.index == holdout_labels.index).all()
return _multi_multi_log_loss(preds.values, holdout_labels.values)
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create the text vector
text_vector = combine_text_columns(X_train)
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
# Instantiate the CountVectorizer: text_features
text_features = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)
# Fit text_features to the text vector
text_features.fit(text_vector)
# Print the first 10 tokens
print(text_features.get_feature_names()[:10])
['00a', '12', '1st', '2nd', '4th', '5th', '70h', '8', 'a', 'aaps']
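To see what this token pattern is actually doing, here is a small sketch on a made-up string (not the course data), using the same get_feature_names() API as the exercise. Note that the (?=\s+) lookahead only keeps tokens that are followed by whitespace, which is why the sample string ends with a space:
# Sketch: the alphanumeric pattern keeps only tokens followed by whitespace,
# so punctuation-attached pieces like 'PETRO' in 'PETRO-VEND' are dropped.
sample = ['PETRO-VEND FUEL AND FLUIDS ']
default_vec = CountVectorizer().fit(sample)
alnum_vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC).fit(sample)
print(default_vec.get_feature_names())  # ['and', 'fluids', 'fuel', 'petro', 'vend']
print(alnum_vec.get_feature_names())    # ['and', 'fluids', 'fuel', 'vend']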
Exercise
N-gram range in scikit-learn
In this exercise you’ll insert a CountVectorizer instance into your pipeline for the main dataset, and compute multiple n-gram features to be used in the model.
In order to look for n-gram relationships at multiple scales, you will use the ngram_range parameter as Peter discussed in the video.
Special functions: You’ll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the dim_red step following the vectorizer step, and the scale step preceding the clf (classification) step.
These have been added in order to account for the fact that you’re using a reduced-size sample of the full dataset in this course. To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.
The dim_red step uses the scikit-learn function SelectKBest(), applying the chi-squared test to select the K “best” features. The scale step uses the scikit-learn function MaxAbsScaler() to squash the relevant features into the interval -1 to 1.
You won’t need to do anything extra with these functions here, just complete the vectorizing pipeline steps below. However, notice how easy it was to add more processing steps to our pipeline!
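As a rough illustration of what those two steps do, here is a minimal sketch on random toy data (not the course dataset); the variable names are hypothetical:
# Sketch: SelectKBest(chi2) keeps the k highest-scoring columns;
# MaxAbsScaler rescales each column by its maximum absolute value.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

rng_demo = np.random.RandomState(0)
X_toy = rng_demo.randint(0, 5, size=(20, 10))   # non-negative counts (chi2 requires >= 0)
y_toy = rng_demo.randint(0, 2, size=20)         # toy binary target
X_kbest = SelectKBest(chi2, k=3).fit_transform(X_toy, y_toy)
print(X_kbest.shape)                            # (20, 3): only 3 columns survive
X_scaled = MaxAbsScaler().fit_transform(X_kbest)
print(X_scaled.max(axis=0))                     # every column now peaks at 1.0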
Instruction
- Import CountVectorizer from sklearn.feature_extraction.text.
- Add a CountVectorizer step to the pipeline with the name 'vectorizer'.
  - Set the token pattern to be TOKENS_ALPHANUMERIC.
  - Set the ngram_range to be (1, 2).
# Import pipeline
from sklearn.pipeline import Pipeline
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Import other preprocessing modules
from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest
# Select 300 best features
chi_k = 300
# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion
# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
# Instantiate pipeline: pl
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
ngram_range=(1, 2))),
('dim_red', SelectKBest(chi2, k=chi_k))
]))
]
)),
('scale', MaxAbsScaler()),
('clf', OneVsRestClassifier(LogisticRegression()))
])
Exercise
Implement interaction modeling in scikit-learn
It’s time to add interaction features to your model. The PolynomialFeatures object in scikit-learn does just that, but here you’re going to use a custom interaction object, SparseInteractions. Interaction terms are a statistical tool that lets your model express what happens if two features appear together in the same row.
SparseInteractions does the same thing as PolynomialFeatures, but it uses sparse matrices to do so. You can get the code for SparseInteractions at this GitHub Gist.
PolynomialFeatures and SparseInteractions both take the argument degree, which tells them what polynomial degree of interactions to compute.
You’re going to consider interaction terms of degree=2 in your pipeline. You will insert this step after the preprocessing steps you’ve built out so far, but before the classifier step.
Pipelines with interaction terms take a while to train (since you’re making n features into n-squared features!), so as long as you set it up right, we’ll do the heavy lifting and tell you what your score is!
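To make the idea concrete before wiring it into the pipeline, here is a minimal sketch of degree-2 interactions using scikit-learn’s PolynomialFeatures on a made-up two-column matrix; SparseInteractions computes the same pairwise products, just on sparse matrices:
# Sketch: degree-2 interaction terms on a toy 2-feature matrix
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X_toy = np.array([[2., 3.],
                  [0., 1.]])
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interactions.fit_transform(X_toy))
# [[2. 3. 6.]   <- original x1, x2, plus the new x1*x2 column
#  [0. 1. 0.]]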
Instruction
- Add the interaction terms step using SparseInteractions() with degree=2. Give it a name of 'int', and make sure it is after the preprocessing step but before scaling.
# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer
#### DEFINE SPARSE INTERACTIONS CLASS FOR PIPELINE ####
from sklearn.base import BaseEstimator, TransformerMixin
from scipy import sparse
from itertools import combinations
class SparseInteractions(BaseEstimator, TransformerMixin):
def __init__(self, degree=2):
self.degree = degree
def fit(self, X, y=None):
return self
def transform(self, X):
if not sparse.isspmatrix_csc(X):
X = sparse.csc_matrix(X)
if hasattr(X, "columns"):
self.orig_col_names = X.columns
else:
self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])
spi = self._create_sparse_interactions(X)
return spi
def get_feature_names(self):
return self.feature_names
def _create_sparse_interactions(self, X):
out_mat = []
self.feature_names = self.orig_col_names.tolist()
for sub_degree in range(2, self.degree + 1):
for col_ixs in combinations(range(X.shape[1]), sub_degree):
self.feature_names.append("_".join(self.orig_col_names[list(col_ixs)]))
out = X[:, col_ixs[0]]
for j in col_ixs[1:]:
out = out.multiply(X[:, j])
out_mat.append(out)
return sparse.hstack([X] + out_mat)
#### END SPARSE INTERACTIONS CLASS FOR PIPELINE ####
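# --- Illustrative sketch (not part of the exercise): SparseInteractions on a
# tiny toy matrix appends the pairwise product column x1*x2 to the input.
_toy_mat = sparse.csr_matrix(np.array([[2., 3.],
                                       [0., 1.]]))
_si = SparseInteractions(degree=2)
print(_si.fit_transform(_toy_mat).toarray())  # [[2. 3. 6.], [0. 1. 0.]]
print(_si.get_feature_names())                # ['0', '1', '0_1']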
# Instantiate pipeline: pl
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
ngram_range=(1, 2))),
('dim_red', SelectKBest(chi2, k=chi_k))
]))
]
)),
('int', SparseInteractions(degree=2)),
('scale', MaxAbsScaler()),
('clf', OneVsRestClassifier(LogisticRegression()))
])
Exercise
Implementing the hashing trick in scikit-learn
In this exercise you will check out the scikit-learn implementation of HashingVectorizer before adding it to your pipeline later.
As you saw in the video, HashingVectorizer acts just like CountVectorizer in that it can accept token_pattern and ngram_range parameters. The important difference is that it creates hash values from the text, so that we get all the computational advantages of hashing!
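The point of the hashing trick is that a hash function maps each token directly to a column index, so no vocabulary has to be stored and the output width is fixed up front. A minimal sketch on made-up strings (n_features kept tiny for readability):
# Sketch: HashingVectorizer maps tokens to a fixed number of columns via hashing
from sklearn.feature_extraction.text import HashingVectorizer

docs = ['fuel and fluids ', 'fuel for buses ']
hashing_demo = HashingVectorizer(n_features=8)
hashed = hashing_demo.transform(docs)   # stateless: no vocabulary to fit
print(hashed.shape)                     # (2, 8) no matter how many distinct tokens appear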
Instruction
- Import HashingVectorizer from sklearn.feature_extraction.text.
- Instantiate the HashingVectorizer as hashing_vec using the TOKENS_ALPHANUMERIC pattern.
- Fit and transform hashing_vec using text_data. Save the result as hashed_text.
- Hit ‘Submit Answer’ to see some of the resulting hash values.
# Import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
# Get text data: text_data
text_data = combine_text_columns(X_train)
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
# Instantiate the HashingVectorizer: hashing_vec
hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC)
# Fit and transform the Hashing Vectorizer
hashed_text = hashing_vec.fit_transform(text_data)
# Create DataFrame and print the head
hashed_df = pd.DataFrame(hashed_text.data)
print(hashed_df.head())
0
0 -0.160128
1 0.160128
2 -0.480384
3 -0.320256
4 0.160128
Exercise
Build the winning model
You have arrived! This is where all of your hard work pays off. It’s time to build the model that won DrivenData’s competition.
You’ve constructed a robust, powerful pipeline capable of processing training and testing data. Now that you understand the data and know all of the tools you need, you can essentially solve the whole problem in a relatively small number of lines of code. Wow!
All you need to do is add the HashingVectorizer step to the pipeline to replace the CountVectorizer step.
The parameters non_negative=True, norm=None, and binary=False make the HashingVectorizer perform similarly to the default settings on the CountVectorizer, so you can just replace one with the other.
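As a quick sanity check of that claim, here is a sketch comparing the two vectorizers on a toy document. One caveat: non_negative=True belongs to older scikit-learn releases and was later removed; this sketch assumes a newer version and uses alternate_sign=False, its modern equivalent:
# Sketch: with norm=None and no sign flipping, hashed values are plain token counts
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ['fuel fuel and fluids']
count_mat = CountVectorizer().fit_transform(docs)
hash_mat = HashingVectorizer(norm=None, alternate_sign=False).transform(docs)
print(count_mat.sum())   # 4: four tokens counted
print(hash_mat.sum())    # 4.0: same counts, just at hashed column positions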
Instruction
- Import HashingVectorizer from sklearn.feature_extraction.text.
- Add a HashingVectorizer step to the pipeline.
  - Name the step 'vectorizer'.
  - Use the TOKENS_ALPHANUMERIC token pattern.
  - Specify the ngram_range to be (1, 2).
# Import the hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer
# Instantiate the winning model pipeline: pl
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
non_negative=True, norm=None, binary=False,
ngram_range=(1,2))),
('dim_red', SelectKBest(chi2, k=chi_k))
]))
]
)),
('int', SparseInteractions(degree=2)),
('scale', MaxAbsScaler()),
('clf', OneVsRestClassifier(LogisticRegression()))
])