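This post is a set of study notes. It records the code for all assignments in Course Three, Applied Machine Learning in Python, of the Coursera specialization Applied Data Science with Python offered by the University of Michigan. The code has passed the autograder tests.
Notes:
1. The course is currently being updated, and parts of the grading system may be faulty (judging from the code I submitted and from feedback by other learners in the Discussions). As a result, a few code blocks did not pass the tests, although every assignment as a whole met the passing requirement, so the code should still be useful as a reference;
2. The code blocks that did not pass are: (1) Assignment 2 Question 6; (2) Assignment 3 Question 5;
3. These two code blocks have not been hidden; they are marked at the corresponding places below to show the approach taken;
4. Results that are inconvenient to display, such as long outputs (DataFrames, etc.) or classifier objects (KNN, etc.), are not shown; refer to the return value described in the question instead;
5. I will keep following any further feedback from the course staff and will try to update the code promptly. Any updates will also be noted here.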
Contents
Module 1: Fundamentals of Machine Learning - Intro to SciKit Learn
Assignment 1 - Introduction to Machine Learning
Module 2: Supervised Machine Learning - Part 1
Assignment 2
Module 3: Evaluation
Assignment 3
Module 4: Supervised Machine Learning - Part 2
Assignment 4 - Predicting and understanding viewer engagement with educational videos
Module 1: Fundamentals of Machine Learning - Intro to SciKit Learn
Assignment 1 - Introduction to Machine Learning
For this assignment, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below).
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.DESCR) # Print the data set description
Output:
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features. For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995
The object returned by load_breast_cancer() is a scikit-learn Bunch object, which is similar to a dictionary.
cancer.keys()
Output:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
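As a small illustration of the "similar to a dictionary" remark, the sketch below reads the same field both by key and by attribute. The variable names data_by_key and data_by_attr are made up for this example.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# A Bunch supports both dictionary-style and attribute-style access.
data_by_key = cancer["data"]
data_by_attr = cancer.data

print(data_by_key is data_by_attr)  # both names refer to the same array
print(cancer.target_names)          # class names, in label order (0, 1)
```

Attribute access is what the assignment code uses (cancer.data, cancer.target); key access can be handy when a field name is held in a variable.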
Question 0 (Example)
How many features does the breast cancer dataset have?
This function should return an integer.
# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the number of features of the breast cancer dataset, which is an integer.
    # The assignment question description will tell you the general format the autograder is expecting
    # YOUR CODE HERE
    return cancer.data.shape[1]
    # raise NotImplementedError()

answer_zero()
# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
Output:
30
Question 1
Scikit-learn works with lists, numpy arrays, scipy-sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. Using a DataFrame does however help make many things easier such as munging data, so let's practice creating a classifier with a pandas DataFrame.
Convert the sklearn.dataset cancer to a DataFrame.
This function should return a (569, 31) DataFrame with
columns =
['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension',
'target']
and index =
RangeIndex(start=0, stop=569, step=1)
def answer_one():
    # YOUR CODE HERE
    columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
               'mean smoothness', 'mean compactness', 'mean concavity',
               'mean concave points', 'mean symmetry', 'mean fractal dimension',
               'radius error', 'texture error', 'perimeter error', 'area error',
               'smoothness error', 'compactness error', 'concavity error',
               'concave points error', 'symmetry error', 'fractal dimension error',
               'worst radius', 'worst texture', 'worst perimeter', 'worst area',
               'worst smoothness', 'worst compactness', 'worst concavity',
               'worst concave points', 'worst symmetry', 'worst fractal dimension',
               'target']
    # Use a RangeIndex, as the question asks for
    index = pd.RangeIndex(start=0, stop=569, step=1)
    # Append the target as the 31st column
    data = np.column_stack((cancer.data, cancer.target))
    df = pd.DataFrame(data, columns=columns, index=index)
    return df
    # raise NotImplementedError()
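The column names typed out above also appear to be available as cancer.feature_names, so the same frame can be built without hard-coding them. This is a minimal sketch of that idea, not the submitted answer; cancer_to_frame is a hypothetical helper name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

def cancer_to_frame(bunch):
    """Build the (569, 31) DataFrame from the Bunch's own feature names."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    # Store labels as floats so the column matches what np.column_stack produces.
    df["target"] = bunch.target.astype(float)
    return df

cancerdf = cancer_to_frame(cancer)
print(cancerdf.shape)
```

A DataFrame built without an explicit index gets a RangeIndex by default, which is the index the question asks for.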
Question 2
What is the class distribution? (i.e. how many instances of malignant and how many benign?)
This function should return a Series named target of length 2 with integer values and index = ['malignant', 'benign']
def answer_two():
    # YOUR CODE HERE
    df = answer_one()
    target = df["target"]
    # Label 0 is malignant and label 1 is benign
    malignant, benign = 0, 0
    for i in target:
        if i == 0:
            malignant += 1
        else:
            benign += 1
    ds = pd.Series([malignant, benign], index=['malignant', 'benign'])
    ds.name = 'target'
    return ds
    # raise NotImplementedError()
Output:
malignant    212
benign       357
Name: target, dtype: int64
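The counting loop can also be expressed with pandas built-ins. The sketch below is one possible alternative, assuming the label order given by cancer.target_names (0 is malignant, 1 is benign); the names labels and counts are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Map numeric labels (0, 1) to their class names, then count each class.
labels = pd.Series(cancer.target).map(dict(enumerate(cancer.target_names)))
counts = labels.value_counts().reindex(["malignant", "benign"])

# Set the names explicitly, since value_counts naming differs across pandas versions.
counts.name = "target"
counts.index.name = None
print(counts)
```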
Question 3
Split the DataFrame into X (the data) and y (the labels).
This function should return a tuple of length 2: (X, y), where
X has shape (569, 30)
y has shape (569,).
def answer_three():
    # YOUR CODE HERE
    cancerdf = answer_one()
    X = cancerdf.iloc[:, :-1]
    y = cancerdf["target"]
    return (X, y)
    # raise NotImplementedError()
Question 4
Using train_test_split, split X and y into training and test sets (X_train, X_test, y_train, and y_test).
Set the random number generator state to 0 using random_state=0 to make sure your results match the autograder!
This function should return a tuple of length 4: (X_train, X_test, y_train, y_test), where
X_train has shape (426, 30)
X_test has shape (143, 30)
y_train has shape (426,)
y_test has shape (143,)
from sklearn.model_selection import train_test_split

def answer_four():
    # YOUR CODE HERE
    X, y = answer_three()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    return (X_train, X_test, y_train, y_test)
    # raise NotImplementedError()
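The expected shapes (426 and 143 rows) follow from the default split ratio. The sketch below reproduces them on stand-in data with the same number of rows; the arrays here are placeholders, not the cancer features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the cancer dataset.
X = np.arange(569 * 2).reshape(569, 2)
y = np.arange(569)

# With no test_size given, train_test_split holds out 25% of the rows and
# rounds the test count up: ceil(0.25 * 569) = 143, leaving 426 for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```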
Question 5
Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with X_train, y_train and using one nearest neighbor (n_neighbors = 1).
This function should return a sklearn.neighbors.classification.KNeighborsClassifier.
from sklearn.neighbors import KNeighborsClassifier

def answer_five():
    # YOUR CODE HERE
    X_train, X_test, y_train, y_test = answer_four()
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    return knn
    # raise NotImplementedError()
Question 6
Using your knn classifier, predict the class label using the mean value for each feature.
Hint: You can use cancerdf.mean()[:-1].values.reshape(1, -1) which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the predict method of KNeighborsClassifier).
def answer_six():
    # YOUR CODE HERE
    cancerdf = answer_one()
    # Mean of every feature (target column excluded), as a single 2-D sample
    data = cancerdf.mean()[:-1].values.reshape(1, -1)
    knn = answer_five()
    prediction = knn.predict(data)
    return prediction
    # raise NotImplementedError()
Output:
array([1.])
Question 7
Using your knn classifier, predict the class labels for the test set X_test.
This function should return a numpy array with shape (143,) and values either 0.0 or 1.0.
def answer_seven():
    # YOUR CODE HERE
    X_train, X_test, y_train, y_test = answer_four()
    knn = answer_five()
    prediction = knn.predict(X_test)
    return prediction
    # raise NotImplementedError()
Output:
array([1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0.])
Question 8
Find the score (mean accuracy) of your knn classifier using X_test and y_test.
This function should return a float between 0 and 1
def answer_eight():
    # YOUR CODE HERE
    X_train, X_test, y_train, y_test = answer_four()
    knn = answer_five()
    return knn.score(X_test, y_test)
    # raise NotImplementedError()
Output:
0.916083916083916
Module 2: Supervised Machine Learning - Part 1
Assignment 2
In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.
Part 1 - Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def intro():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);

intro()
Question 1
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
The figure above shows the fitted models plotted on top of the original data (using plot_one()).
This function should return a numpy array with shape (4, 100)
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    degree_predictions = np.zeros((4, 100))
    # YOUR CODE HERE
    degrees = [1, 3, 6, 9]
    x_predict = np.linspace(0, 10, 100).reshape(-1, 1)
    for i, degree in enumerate(degrees):
        poly = PolynomialFeatures(degree=degree)
        # Fit the feature expansion on the training data only
        X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
        # Reuse the fitted transformer for the prediction grid
        x_predict_poly = poly.transform(x_predict)
        linreg = LinearRegression().fit(X_train_poly, y_train)
        degree_predictions[i] = linreg.predict(x_predict_poly)
    return degree_predictions
    # raise NotImplementedError()
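The expand-then-fit steps can also be bundled into a scikit-learn pipeline, so that the feature expansion and the regression are always applied together. The following is a sketch of that variant, reusing the data-generation cell from the top of Part 1; predict_with_pipelines is a hypothetical name, and this is not the code that was submitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as in the assignment setup cell.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def predict_with_pipelines(degrees=(1, 3, 6, 9)):
    """Return one row of 100 predictions over [0, 10] per polynomial degree."""
    grid = np.linspace(0, 10, 100).reshape(-1, 1)
    rows = []
    for degree in degrees:
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        model.fit(X_train.reshape(-1, 1), y_train)
        rows.append(model.predict(grid))
    return np.vstack(rows)

predictions = predict_with_pipelines()
print(predictions.shape)
```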
Question 2
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 0 through 9. For each model compute the (coefficient of determination) regression score on the training data as well as the test data, and return both of these arrays in a tuple.
This function should return a tuple of numpy arrays (r2_train, r2_test). Both arrays should have shape (10,)
def answer_two():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score
    r2_train = np.array([])
    r2_test = np.array([])
    # YOUR CODE HERE
    for degree in range(10):
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
        # Reuse the transformer fitted on the training data
        X_test_poly = poly.transform(X_test.reshape(-1, 1))
        linreg = LinearRegression().fit(X_train_poly, y_train)
        # LinearRegression.score returns the R^2 score
        r2_train = np.append(r2_train, linreg.score(X_train_poly, y_train))
        r2_test = np.append(r2_test, linreg.score(X_test_poly, y_test))
    return (r2_train, r2_test)
    # raise NotImplementedError()
Output:
(array([0.        , 0.42924578, 0.4510998 , 0.58719954, 0.91941945,
        0.97578641, 0.99018233, 0.99352509, 0.99637545, 0.99803706]),
 array([-0.47808642, -0.45237104, -0.06856984,  0.00533105,  0.73004943,
         0.87708301,  0.9214094 ,  0.92021504,  0.63247942, -0.64525256]))
Question 3
Based on the scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset?
(Hint: Try plotting the scores from question 2 to visualize the relationship)
This function should return a tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization)
def answer_three():
    # YOUR CODE HERE
    Underfitting, Overfitting, Good_Generalization = 0, 9, 7
    return (Underfitting, Overfitting, Good_Generalization)
    # raise NotImplementedError()
Output:
(0, 9, 7)
Question 4
Training models on high degree polynomial features can result in overfitting. Train two models: a non-regularized LinearRegression model and a Lasso Regression model (with parameters alpha=0.01, max_iter=10000, tol=0.1) on polynomial features of degree 12. Return the score for LinearRegression and Lasso model's test sets.
This function should return a tuple (LinearRegression_R2_test_score, Lasso_R2_test_score)
def answer_four():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    # YOUR CODE HERE
    poly = PolynomialFeatures(degree=12)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    # Reuse the transformer fitted on the training data
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    linreg = LinearRegression().fit(X_train_poly, y_train)
    lasso = Lasso(alpha=0.01, max_iter=10000, tol=0.1).fit(X_train_poly, y_train)
    y_pred1 = linreg.predict(X_test_poly)
    y_pred2 = lasso.predict(X_test_poly)
    LinearRegression_R2_test_score = r2_score(y_test, y_pred1)
    Lasso_R2_test_score = r2_score(y_test, y_pred2)
    return (LinearRegression_R2_test_score, Lasso_R2_test_score)
    # raise NotImplementedError()
Output:
(-4.311967773795515, 0.6051396919570036)
Part 2 - Classification
For this section of the assignment we will be working with the UCI Mushroom Data Set stored in mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:
Attribute Information:
- cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
- cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
- cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
- bruises?: bruises=t, no=f
- odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
- gill-attachment: attached=a, descending=d, free=f, notched=n
- gill-spacing: close=c, crowded=w, distant=d
- gill-size: broad=b, narrow=n
- gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
- stalk-shape: enlarging=e, tapering=t
- stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
- stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- veil-type: partial=p, universal=u
- veil-color: brown=n, orange=o, white=w, yellow=y
- ring-number: none=n, one=o, two=t
- ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
- spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
- population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
- habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables.
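To make the encoding step concrete, here is a toy sketch of what pd.get_dummies does to string columns. The frame toy is invented for illustration and only borrows two attribute names and a few codes from the list above.

```python
import pandas as pd

# Hypothetical toy frame using two of the attribute codes listed above.
toy = pd.DataFrame({"odor": ["a", "n", "f"], "bruises?": ["t", "f", "t"]})

# Each string column becomes one indicator column per distinct value,
# named "<column>_<value>", with the values in sorted order.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

The indicator columns keep the order of the original columns, which is presumably why the setup cell below can separate labels from features by position (iloc[:, 1] for y_mush, iloc[:, 2:] for X_mush).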
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
mush_df = pd.read_csv('assets/mushrooms.csv')
mush_df2 = pd.get_dummies(mush_df)
X_mush = mush_df2.iloc[:,2:]
y_mush = mush_df2.iloc[:,1]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)
Question 5
Using X_train2 and y_train2 from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?
This function should return a list of length 5 of the feature names in descending order of importance.
def answer_five():
    from sklearn.tree import DecisionTreeClassifier
    # YOUR CODE HERE
    feature_importances = {}
    clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
    importances = clf.feature_importances_
    # Map each feature name to its importance
    for i in range(len(importances)):
        feature_importances[X_mush.columns[i]] = importances[i]
    # Sort by importance, largest first, and keep the top five names
    feature_importances_sorted = sorted(feature_importances.items(), key=lambda k: k[1], reverse=True)
    feature_importances_top5 = feature_importances_sorted[:5]
    features = [feature_importance[0] for feature_importance in feature_importances_top5]
    return features
    # raise NotImplementedError()
Output:
['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
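The dictionary-and-sort steps can be shortened with a pandas Series. The sketch below shows the idea on the breast cancer data from Assignment 1, used here only because mushrooms.csv is not bundled with scikit-learn; it is an illustration, not the graded answer.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=0).fit(cancer.data, cancer.target)

# Pair each importance with its feature name, then keep the five largest.
importances = pd.Series(clf.feature_importances_, index=cancer.feature_names)
top5 = importances.nlargest(5).index.tolist()
print(top5)
```

With the mushroom data, the index would be X_train2.columns instead of cancer.feature_names.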
Question 6 (did not pass the autograder)
For this question, use the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values.
Create an SVC with default parameters (i.e. kernel='rbf', C=1) and random_state=0. Recall that the kernel width of the RBF kernel is controlled using the gamma parameter. Explore the effect of gamma on classifier accuracy by using the validation_curve function to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)).
For each level of gamma, validation_curve will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.
Find the mean score across the three models for each level of gamma for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.
e.g. if one of your array of scores is
array([[ 0.5, 0.4, 0.6],
       [ 0.7, 0.8, 0.7],
       [ 0.9, 0.8, 0.8],
       [ 0.8, 0.7, 0.8],
       [ 0.7, 0.6, 0.6],
       [ 0.4, 0.6, 0.5]])
it should then become
array([ 0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])
This function should return a tuple of numpy arrays (training_scores, test_scores) where each array in the tuple has shape (6,).
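The row-wise averaging described in the example can be checked directly with NumPy. This sketch uses the example array from the question text; the names scores and mean_scores are chosen here for illustration.

```python
import numpy as np

scores = np.array([[0.5, 0.4, 0.6],
                   [0.7, 0.8, 0.7],
                   [0.9, 0.8, 0.8],
                   [0.8, 0.7, 0.8],
                   [0.7, 0.6, 0.6],
                   [0.4, 0.6, 0.5]])

# axis=1 averages across the columns (the three fits),
# leaving one value per row (one per gamma level).
mean_scores = scores.mean(axis=1)
print(mean_scores.shape)  # (6,)
```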
def answer_six():
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve
    # YOUR CODE HERE
    svc = SVC(kernel='rbf', C=1, random_state=0)
    param_range = np.logspace(-4, 1, 6)
    # Each returned array has shape (6, 3): 6 gamma values x 3 cross-validation fits
    training_scores, test_scores = validation_curve(svc, X_mush, y_mush, param_name='gamma', param_range=param_range, cv=3)
    # Average over the 3 fits for each gamma value
    training_scores, test_scores = np.mean(training_scores, axis=1), np.mean(test_scores, axis=1)
    return (training_scores, test_scores)
    # raise NotImplementedError()
Output (did not pass the autograder):
(array([0.89838749, 0.98104382, 0.99895372, 1.        , 1.        , 1.        ]),
 array([0.88749385, 0.82951748, 0.84170359, 0.86582964, 0.83616445, 0.51797144]))
Question 7
Based on the scores from question 6, what gamma value corresponds to a model that is underfitting? What gamma value corresponds to a model that is overfitting? What choice of gamma would provide a model with good generalization performance on this dataset?
(Hint: Try plotting the scores from question 6 to visualize the relationship)
This function should return a tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization)
def answer_seven():
    # YOUR CODE HERE
    gamma = np.logspace(-4, 1, 6)
    Underfitting, Overfitting, Good_Generalization = gamma[0], gamma[5], gamma[3]
    return (Underfitting, Overfitting, Good_Generalization)
    # raise NotImplementedError()
Output:
(0.0001, 10.0, 0.1)
Module 3: Evaluation
Assignment 3
import numpy as np
import pandas as pd
Question 1
Import the data from assets/fraud_data.csv. What percentage of the observations in the dataset are instances of fraud?
This function should return a float between 0 and 1.
def answer_one():
    # YOUR CODE HERE
    df = pd.read_csv('assets/fraud_data.csv')
    # Rows labelled 1 in the Class column are instances of fraud
    df_fraud = df[df['Class'] == 1]
    return len(df_fraud)/len(df)
    # raise NotImplementedError()
Output:
0.016410823768035772
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split
df = pd.read_csv('assets/fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Question 2
Using X_train, X_test, y_train, and y_test (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?
This function should return a tuple with two floats, i.e. (accuracy score, recall score).
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score
    # YOUR CODE HERE
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_predict = dummy_majority.predict(X_test)
    # Use distinct variable names so the imported recall_score function is not shadowed
    accuracy = dummy_majority.score(X_test, y_test)
    recall = recall_score(y_test, y_predict)
    return (accuracy, recall)
    # raise NotImplementedError()
Output:
(0.9852507374631269, 0.0)
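The pattern in this result (high accuracy, zero recall) is what a majority-class baseline gives on any imbalanced dataset. A minimal sketch on made-up labels, assuming 95 negatives and 5 positives:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # DummyClassifier ignores the feature values

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

accuracy = accuracy_score(y, y_pred)  # 95 of 100 labels match
recall = recall_score(y, y_pred)      # no positive is ever predicted
print(accuracy, recall)
```

Accuracy simply equals the share of the majority class, which is why recall (or precision, or AUC) is the more informative metric for fraud detection.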
Question 3
Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What is the accuracy, recall, and precision of this classifier?
This function should return a tuple with three floats, i.e. (accuracy score, recall score, precision score).
def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC
    # YOUR CODE HERE
    svm = SVC().fit(X_train, y_train)
    y_predict = svm.predict(X_test)
    # Use distinct variable names so the imported metric functions are not shadowed
    accuracy = svm.score(X_test, y_test)
    recall = recall_score(y_test, y_predict)
    precision = precision_score(y_test, y_predict)
    return (accuracy, recall, precision)
    # raise NotImplementedError()
Output:
(0.9900442477876106, 0.35, 0.9333333333333333)
Question 4
Using the SVC classifier with parameters {'C': 1e9, 'gamma': 1e-07}, what is the confusion matrix when using a threshold of -220 on the decision function. Use X_test and y_test.
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    # YOUR CODE HERE
    svm = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    # Predict the positive class whenever the decision score exceeds -220
    y_pred = svm.decision_function(X_test) > -220
    confusion = confusion_matrix(y_test, y_pred)
    return confusion
    # raise NotImplementedError()
Output:
array([[5320,   24],
       [  14,   66]])
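To see what thresholding the decision function does, the sketch below compares two cut-offs on synthetic data. The dataset from make_classification is a stand-in (an assumption, not the course file), and predict_at is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the fraud data (an assumption, not the course file).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
svm = SVC().fit(X, y)
scores = svm.decision_function(X)

def predict_at(threshold):
    """Label a sample positive when its decision score exceeds the threshold."""
    return (scores > threshold).astype(int)

default_pred = predict_at(0.0)   # 0 is the cut-off that corresponds to predict() for a binary SVC
lenient_pred = predict_at(-1.0)  # a lower threshold can only flag the same samples or more
print(default_pred.sum(), lenient_pred.sum())
```

Lowering the threshold trades precision for recall: more true frauds are caught, at the cost of more false alarms, which is the trade-off the confusion matrix above summarizes for a threshold of -220.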
Question 5 (did not pass the autograder)
Train a logistic regression classifier with default parameters using X_train and y_train.
For the logistic regression classifier, create a precision recall curve and a roc curve using y_test and the probability estimates for X_test (probability it is fraud).
Looking at the precision recall curve, what is the recall when the precision is 0.75?
Looking at the roc curve, what is the true positive rate when the false positive rate is 0.16?
This function should return a tuple with two floats, i.e. (recall, true positive rate).
# YOUR CODE HERE
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve
    logreg = LogisticRegression().fit(X_train, y_train)
    y_proba_lr = logreg.predict_proba(X_test)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)
    # Recall at the point on the curve where precision is 0.75
    recall_at_precision = None
    for pre, rec in zip(precision, recall):
        if np.isclose(pre, 0.75):
            recall_at_precision = rec
    # False positive rate on the ROC curve that is closest to 0.16
    min_diff = 1
    for fpr in fpr_lr:
        if np.abs(fpr - 0.16) < min_diff:
            min_diff = np.abs(fpr - 0.16)
            fpr_close = fpr
    # True positive rate at that false positive rate
    for fpr, tpr in zip(fpr_lr, tpr_lr):
        if fpr == fpr_close:
            tp_rate = tpr
    return (recall_at_precision, tp_rate)
    # raise NotImplementedError()
Output (did not pass the autograder):
(0.825, 0.925)
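Because 0.16 need not coincide with a vertex of the ROC curve, another way to read off a value is linear interpolation along the curve. The sketch below shows this on synthetic data (an assumption; the course uses assets/fraud_data.csv). It is only an illustration of the technique and is not known to produce the value the autograder expects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the course file).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)

# roc_curve returns fpr in non-decreasing order, so the curve can be
# linearly interpolated at an FPR that is not one of its vertices.
tpr_at_016 = float(np.interp(0.16, fpr, tpr))
print(tpr_at_016)
```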
Question 6
Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation. (Using solver='liblinear' is suggested.)
'penalty': ['l1', 'l2']
'C': [0.01, 0.1, 1, 10]
From .cv_results_, create an array of the mean test scores of each parameter combination, i.e.
| | l1 | l2 |
|---|---|---|
| 0.01 | ? | ? |
| 0.1 | ? | ? |
| 1 | ? | ? |
| 10 | ? | ? |
This function should return a 4 by 2 numpy array with 8 floats.
Note: do not return a DataFrame, just the values denoted by ? in a numpy array.
def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    # YOUR CODE HERE
    logreg = LogisticRegression(solver='liblinear')
    grid_values = {'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10]}
    grid_lr = GridSearchCV(logreg, param_grid=grid_values, scoring='recall', cv=3)
    grid_lr.fit(X_train, y_train)
    result = grid_lr.cv_results_
    mean_test_score = result['mean_test_score']
    # Rows correspond to the 4 values of C, columns to the penalties l1 and l2
    mean_test_score_reshape = mean_test_score.reshape(4,2)
    return mean_test_score_reshape
    # raise NotImplementedError()
Output:
array([[0.66666667, 0.76086957],
       [0.80072464, 0.80434783],
       [0.8115942 , 0.8115942 ],
       [0.80797101, 0.8115942 ]])
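The reshape(4, 2) step relies on the order in which the parameter combinations are evaluated. A quick way to inspect that order, sketched here with ParameterGrid (the class GridSearchCV uses to enumerate a dict grid):

```python
from sklearn.model_selection import ParameterGrid

grid_values = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}

# Keys are sorted ("C" before "penalty") and the last key varies fastest,
# so the sequence is (0.01, l1), (0.01, l2), (0.1, l1), (0.1, l2), ...
combos = list(ParameterGrid(grid_values))
for combo in combos:
    print(combo)
```

Since penalty alternates fastest, consecutive pairs of scores share one value of C, which gives rows indexed by C and columns by penalty after reshaping.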
Module 4: Supervised Machine Learning - Part 2
Assignment 4 - Predicting and understanding viewer engagement with educational videos
With the accelerating popularity of online educational experiences, the role of online lectures and other educational video continues to increase in scope and importance. Open access educational repositories such as videolectures.net, as well as Massive Open Online Courses (MOOCs) on platforms like Coursera, have made access to many thousands of lectures and tutorials an accessible option for millions of people around the world. Yet this impressive volume of content has also led to a challenge in how to find, filter, and match these videos with learners. This assignment gives you an example of how machine learning can be used to address part of that challenge.
About the prediction problem
One critical property of a video is engagement: how interesting or "engaging" it is for viewers, so that they decide to keep watching. Engagement is critical for learning, whether the instruction is coming from a video or any other source. There are many ways to define engagement with video, but one common approach is to estimate it by measuring how much of the video a user watches. If the video is not interesting and does not engage a viewer, they will typically abandon it quickly, e.g. only watch 5 or 10% of the total.
A first step towards providing the best-matching educational content is to understand which features of educational material make it engaging for learners in general. This is where predictive modeling can be applied, via supervised machine learning. For this assignment, your task is to predict how engaging an educational video is likely to be for viewers, based on a set of features extracted from the video's transcript, audio track, hosting site, and other sources.
We chose this prediction problem for several reasons:
- It combines a variety of features derived from a rich set of resources connected to the original data;
- The manageable dataset size means the dataset and supervised models for it can be easily explored on a wide variety of computing platforms;
- Predicting popularity or engagement for a media item, especially combined with understanding which features contribute to its success with viewers, is a fun problem but also a practical representative application of machine learning in a number of business and educational sectors.
About the dataset
We extracted training and test datasets of educational video features from the VLE Dataset put together by researcher Sahan Bulathwela at University College London.
We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single educational video, and includes information about diverse properties of the video content as described further below. The target variable is engagement, which was defined as True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
Note: Any extra variables that may be included in the training set are simply for your interest if you want an additional source of data for visualization, or to enable unsupervised and semi-supervised approaches. However, they are not included in the test set and thus cannot be used for prediction. Only the data already included in your Coursera directory can be used for training the model for this assignment.
For this final assignment, you will bring together what you've learned across all four weeks of this course, by exploring different prediction models for this new dataset. In addition, we encourage you to apply what you've learned about model selection to do hyperparameter tuning using training/validation splits of the training data, to optimize the model and further increase its performance. In addition to a basic evaluation of model accuracy, we've also provided a utility function to visualize which features are most and least contributing to the overall model performance.
File descriptions
assets/train.csv - the training set (Use only this data for training your model!)
assets/test.csv - the test set
Data fields
train.csv & test.csv:
title_word_count - the number of words in the title of the video.
document_entropy - a score indicating how varied the topics are covered in the video, based on the transcript. Videos with smaller entropy scores will tend to be more cohesive and more focused on a single topic.
freshness - The number of days elapsed between 01/01/1970 and the lecture published date. Videos that are more recent will have higher freshness values.
easiness - A text difficulty measure applied to the transcript. A lower score indicates more complex language used by the presenter.
fraction_stopword_presence - A stopword is a very common word like 'the' or 'and'. This feature computes the fraction of all words that are stopwords in the video lecture transcript.
speaker_speed - The average speaking rate in words per minute of the presenter in the video.
silent_period_rate - The fraction of time in the lecture video that is silence (no speaking).
train.csv only:
engagement - Target label for training. True if learners watched a substantial portion of the video (see description), or False otherwise.
Evaluation
Your predictions will be given as the probability that the corresponding video will be engaging to learners.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
Your grade will be based on the AUC score computed for your classifier. A model with an AUC (area under ROC curve) of at least 0.8 passes this assignment, and over 0.85 will receive full points.
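As a reminder of what the AUC measures, here is a tiny worked example on made-up labels and scores for four videos (hypothetical values, chosen so the result can be checked by hand).

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for four videos.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs in which the positive
# example receives the higher score: here 3 of the 4 pairs.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Only the ranking of the scores matters for AUC, so the predicted probabilities do not need to be well calibrated to score highly.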
For this assignment, create a function that trains a model to predict significant learner engagement with a video using assets/train.csv. Using this model, return a Pandas Series object of length 2309 with the data being the probability that each corresponding video from assets/test.csv will be engaging (according to a model learned from the 'engagement' label in the training set), and the video index being in the id field.
Example:
id
9240 0.401958
9241 0.105928
9242 0.018572
...
9243 0.208567
9244 0.818759
9245 0.018528
...
Name: engagement, dtype: float32
Hints
- Make sure your code is working before submitting it to the autograder.
- Print out and check your result to see whether there is anything weird (e.g., all probabilities are the same).
- Generally the total runtime should be less than 10 mins.
- Try to avoid global variables. If you have other functions besides engagement_model, you should move those functions inside the scope of engagement_model.
- Be sure to first check the pinned threads in Week 4's discussion forum if you run into a problem you can't figure out.
Extensions
- If this prediction task motivates you to explore further, you can find more details here on the original VLE dataset and others related to video engagement: https://github.com/sahanbull/VLE-Dataset
import warnings
warnings.filterwarnings("ignore")
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # Do not change this value: required to be compatible with solutions generated by the autograder.
def engagement_model():
    rec = None
    # YOUR CODE HERE
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    # Training data: index by video id; the last column is the engagement label
    train_set = pd.read_csv('assets/train.csv')
    train_set = train_set.set_index('id')
    train_set = train_set.drop('normalization_rate', axis=1)
    X_train = train_set.iloc[:, :-1]
    y_train = train_set.iloc[:, -1]
    # Test data: same preprocessing, no label column
    test_set = pd.read_csv('assets/test.csv')
    test_set = test_set.set_index('id')
    X_test = test_set.drop('normalization_rate', axis=1)
    # Fit the scaler on the training data only, then apply it to both sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf = MLPClassifier(hidden_layer_sizes=[20, 20], alpha=5, solver='lbfgs',
                        random_state=0, activation='logistic').fit(X_train_scaled, y_train)
    # Probability of the positive class (engagement == True)
    predict_proba = clf.predict_proba(X_test_scaled)[:, 1]
    test_set['engagement'] = predict_proba
    rec = test_set['engagement']
    return rec
Output:
id
9240     0.014100
9241     0.060007
9242     0.086591
9243     0.933297
9244     0.014579
           ...
11544    0.022987
11545    0.009579
11546    0.014023
11547    0.888931
11548    0.018817
Name: engagement, Length: 2309, dtype: float64
(Note: the result meets the requirement of an AUC above 0.85, i.e. the requirement for full points.)
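Since the test labels are not available locally, one way to estimate the AUC before submitting is cross-validation on the training data. The sketch below assumes synthetic stand-in data from make_classification rather than assets/train.csv, and adds max_iter=500 as an illustrative setting; the other MLPClassifier parameters mirror the model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for assets/train.csv (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Putting the scaler inside the pipeline refits it on each training fold,
# so the held-out fold never influences the scaling.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=[20, 20], alpha=5, solver="lbfgs",
                  activation="logistic", random_state=0, max_iter=500),
)
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(auc_scores.mean())
```

The mean of auc_scores gives a rough local estimate to compare against the 0.8 and 0.85 grading thresholds; the figure printed here applies to the synthetic data only.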