Coursera | Applied Data Science with Python Specialization | Applied Machine Learning in Python

NJ_Xavier

Last modified: 2023-08-16 20:27:54

Category: Python Study Notes. Tags: python, scikit-learn, classification, regression

First published: 2023-01-11 23:48:00

Article link: https://blog.csdn.net/NJ_Xavier/article/details/128647809

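This post is a set of study notes. It records my code for all the assignments in Course Three, Applied Machine Learning in Python, of the Coursera specialization Applied Data Science with Python offered by the University of Michigan. The code has passed the autograder tests.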

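Notes:

1. The course is currently being updated, and parts of the grading system may be faulty (judging from my own submissions and from other learners' feedback in the Discussions). As a result, a few code blocks did not pass the tests, but every assignment as a whole met the passing requirement, so the code should still be useful as a reference.

2. The code blocks that did not pass are: (1) Assignment 2 Question 6; (2) Assignment 3 Question 5.

3. These two code blocks have not been hidden. They are marked at the corresponding places below to show the approach taken.

4. Results that are inconvenient to display, such as long outputs (DataFrames, etc.) or fitted classifier objects (KNN, etc.), are not shown; refer to the return value described in each question.

5. I will keep following feedback from the course staff and try to update the code promptly. Any updates will be noted here.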

Module 1: Fundamentals of Machine Learning - Intro to SciKit Learn

Assignment 1 - Introduction to Machine Learning

For this assignment, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below).

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR) # Print the data set description

Output:

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

The object returned by load_breast_cancer() is a scikit-learn Bunch object, which is similar to a dictionary.

cancer.keys()

Output:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
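As a small illustration of the dictionary-like behaviour, the sketch below builds a toy Bunch by hand (the field values are made up for this example) and reads the same field both by key and by attribute.

```python
from sklearn.utils import Bunch

# Toy Bunch with made-up values; load_breast_cancer() returns the same kind of object
toy = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], target=[0, 1])

by_key = toy['data']    # dictionary-style access
by_attr = toy.data      # attribute-style access
same_object = by_key is by_attr
keys = sorted(toy.keys())
```

Both access styles return the same underlying object, which is why the answers below can write cancer.data and cancer.target directly.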

Question 0 (Example)

How many features does the breast cancer dataset have?

This function should return an integer.

# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the number of features of the breast cancer dataset, which is an integer. 
    # The assignment question description will tell you the general format the autograder is expecting
    
    # YOUR CODE HERE
    return cancer.data.shape[1]
    # raise NotImplementedError()
answer_zero()
# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs

Output:

30

Question 1

Scikit-learn works with lists, numpy arrays, scipy-sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. Using a DataFrame does however help make many things easier such as munging data, so let's practice creating a classifier with a pandas DataFrame.

Convert the sklearn.dataset cancer to a DataFrame.

This function should return a (569, 31) DataFrame with

columns =

['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension',
'target']

and index =

RangeIndex(start=0, stop=569, step=1)

def answer_one():
    # YOUR CODE HERE
    columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
               'mean smoothness', 'mean compactness', 'mean concavity',
               'mean concave points', 'mean symmetry', 'mean fractal dimension',
               'radius error', 'texture error', 'perimeter error', 'area error',
               'smoothness error', 'compactness error', 'concavity error',
               'concave points error', 'symmetry error', 'fractal dimension error',
               'worst radius', 'worst texture', 'worst perimeter', 'worst area',
               'worst smoothness', 'worst compactness', 'worst concavity',
               'worst concave points', 'worst symmetry', 'worst fractal dimension',
               'target']
    # Append the target as the 31st column; the default index is RangeIndex(start=0, stop=569, step=1)
    data = np.column_stack((cancer.data, cancer.target))
    df = pd.DataFrame(data, columns=columns)
    return df
    # raise NotImplementedError()

Question 2

What is the class distribution? (i.e. how many instances of malignant and how many benign?)

This function should return a Series named target of length 2 with integer values and index = ['malignant', 'benign']

def answer_two():
    # YOUR CODE HERE
    df = answer_one()
    target = df["target"]
    malignant, benign = 0, 0
    for i in target:
        if i == 0:
            malignant += 1
        else:
            benign += 1
    ds = pd.Series([malignant, benign], index=['malignant', 'benign'])
    ds.name = 'target'
    return ds
    # raise NotImplementedError()

Output:

malignant    212
benign       357
Name: target, dtype: int64
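The explicit loop above can also be written with pandas counting. The following is only a sketch on a made-up five-row target column; the helper name class_distribution is hypothetical, and it assumes label 0 means malignant and 1 means benign, as in this dataset.

```python
import pandas as pd

def class_distribution(target):
    # Count each label, forcing the order 0 (malignant), 1 (benign)
    counts = target.astype(int).value_counts().reindex([0, 1], fill_value=0)
    counts.index = ['malignant', 'benign']
    counts.name = 'target'
    return counts

# Made-up labels for illustration only
toy_target = pd.Series([0.0, 1.0, 1.0, 0.0, 1.0], name='target')
dist = class_distribution(toy_target)
```

The reindex step fixes the order of the two classes, since value_counts on its own sorts by frequency.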

Question 3

Split the DataFrame into X (the data) and y (the labels).

This function should return a tuple of length 2: (X, y), where

X has shape (569, 30)
y has shape (569,).

def answer_three():
    # YOUR CODE HERE
    X = answer_one().iloc[:, : -1]
    y = answer_one()["target"]
    return (X, y)
    # raise NotImplementedError()

Question 4

Using train_test_split, split X and y into training and test sets (X_train, X_test, y_train, and y_test).

Set the random number generator state to 0 using random_state=0 to make sure your results match the autograder!

This function should return a tuple of length 4: (X_train, X_test, y_train, y_test), where

X_train has shape (426, 30)
X_test has shape (143, 30)
y_train has shape (426,)
y_test has shape (143,)

from sklearn.model_selection import train_test_split

def answer_four():
    # YOUR CODE HERE
    X, y = answer_three()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    return (X_train, X_test, y_train, y_test)
    # raise NotImplementedError()
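The expected shapes follow from the default split proportion. As a quick arithmetic check, assuming scikit-learn's default test fraction of 0.25 with the test count rounded up:

```python
import math

n_samples = 569
test_size = 0.25  # default used when neither test_size nor train_size is given

n_test = math.ceil(test_size * n_samples)   # test count is rounded up
n_train = n_samples - n_test                # the remainder goes to training
```

This gives 143 test rows and 426 training rows, matching the shapes listed in the question.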

Question 5

Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with X_train, y_train and using one nearest neighbor (n_neighbors = 1).

This function should return a sklearn.neighbors.classification.KNeighborsClassifier.

from sklearn.neighbors import KNeighborsClassifier

def answer_five():
    # YOUR CODE HERE
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(answer_four()[0],answer_four()[2])
    return knn
    # raise NotImplementedError()
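To make the idea behind n_neighbors=1 concrete, here is a minimal NumPy sketch of a 1-nearest-neighbor prediction using Euclidean distance. It is an illustration on made-up points, not scikit-learn's implementation, and the function name predict_1nn is hypothetical.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Each query takes the label of its single closest training point
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

# Made-up training points and labels
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_toy = np.array([0, 0, 1])
queries = np.array([[0.2, 0.1], [4.0, 4.5]])
pred = predict_1nn(X_toy, y_toy, queries)
```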

Question 6

Using your knn classifier, predict the class label using the mean value for each feature.

Hint: You can use cancerdf.mean()[:-1].values.reshape(1, -1), which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the predict method of KNeighborsClassifier).

def answer_six():
    # YOUR CODE HERE
    cancerdf = answer_one()
    data = cancerdf.mean()[:-1].values.reshape(1, -1)
    knn = answer_five()
    prediction = knn.predict(data)
    return prediction
    # raise NotImplementedError()

Output:

array([1.])
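The reshape in the hint is needed because predict expects a 2-D array of shape (n_samples, n_features). A short sketch with a stand-in vector (the values are placeholders, not the real feature means):

```python
import numpy as np

# Stand-in for the 30 per-feature means (placeholder values)
feature_means = np.arange(30, dtype=float)

# One row (a single sample) with as many columns as there are features
single_sample = feature_means.reshape(1, -1)

shape_before = feature_means.shape
shape_after = single_sample.shape
```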

Question 7

Using your knn classifier, predict the class labels for the test set X_test.

This function should return a numpy array with shape (143,) and values either 0.0 or 1.0.

def answer_seven():
    # YOUR CODE HERE
    knn = answer_five()
    prediction = knn.predict(answer_four()[1])
    return prediction
    # raise NotImplementedError()

Output:

array([1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1.,
       0., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 0., 1., 0.,
       1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0.,
       1., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0.,
       0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 0.,
       1., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 0., 0., 1., 1., 1., 0.])

Question 8

Find the score (mean accuracy) of your knn classifier using X_test and y_test.

This function should return a float between 0 and 1

def answer_eight():
    # YOUR CODE HERE
    knn = answer_five()
    return knn.score(answer_four()[1], answer_four()[3])
    # raise NotImplementedError()

Output:

0.916083916083916
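Mean accuracy is simply the fraction of predictions that equal the true labels. The sketch below shows this on made-up labels, and then checks that the score above is consistent with 131 correct predictions out of the 143 test samples (an inference from the arithmetic, not a figure stated in the assignment).

```python
import numpy as np

# Made-up labels for illustration
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
accuracy = np.mean(y_true == y_pred)   # 3 of 4 correct

# With 143 test samples the score must be k / 143 for some integer k
implied = 131 / 143
```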

Module 2: Supervised Machine Learning - Part 1

Assignment 2

In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.

Part 1 - Regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10


X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def intro():
    %matplotlib notebook

    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);

intro()

Question 1

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.

The figure above shows the fitted models plotted on top of the original data (using plot_one()).

This function should return a numpy array with shape (4, 100)

def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    
    degree_predictions = np.zeros((4,100))
    
    # YOUR CODE HERE
    degrees = [1, 3, 6, 9]
    x_predict = np.linspace(0, 10, 100).reshape(-1, 1)
    for i, degree in enumerate(degrees):
        poly = PolynomialFeatures(degree=degree)
        # Fit the feature expansion on the training inputs, then reuse it for the prediction grid
        X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
        x_predict_poly = poly.transform(x_predict)
        linreg = LinearRegression().fit(X_train_poly, y_train)
        degree_predictions[i] = linreg.predict(x_predict_poly)
    return degree_predictions
    # raise NotImplementedError()
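For a single input feature, the polynomial expansion amounts to adding the columns 1, x, x^2, ..., x^degree. The NumPy sketch below builds those columns for degree 3 with np.vander; it is shown only to illustrate what the expanded design matrix looks like, on made-up inputs.

```python
import numpy as np

x_toy = np.array([1.0, 2.0, 3.0])   # made-up inputs

# Columns: x^0, x^1, x^2, x^3 (increasing powers)
design = np.vander(x_toy, N=4, increasing=True)
```

A linear regression fitted on these columns is a cubic polynomial in x, which is why higher degrees give more flexible curves.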

Question 2

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the test data, and return both of these arrays in a tuple.

This function should return a tuple of numpy arrays (r2_train, r2_test). Both arrays should have shape (10,)

def answer_two():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score

    r2_train = np.array([])
    r2_test = np.array([])
    
    # YOUR CODE HERE
    for degree in range(10):
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
        # Reuse the transformer fitted on the training data for the test data
        X_test_poly = poly.transform(X_test.reshape(-1, 1))
        linreg = LinearRegression().fit(X_train_poly, y_train)
        r2_train = np.append(r2_train, linreg.score(X_train_poly, y_train))
        r2_test = np.append(r2_test, linreg.score(X_test_poly, y_test))
    return (r2_train, r2_test)
    # raise NotImplementedError()

Output:

(array([0.        , 0.42924578, 0.4510998 , 0.58719954, 0.91941945,
        0.97578641, 0.99018233, 0.99352509, 0.99637545, 0.99803706]),
 array([-0.47808642, -0.45237104, -0.06856984,  0.00533105,  0.73004943,
         0.87708301,  0.9214094 ,  0.92021504,  0.63247942, -0.64525256]))
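The zero and negative values above make sense once $R^2$ is written out as 1 - SS_res / SS_tot. A minimal NumPy sketch on made-up numbers (the helper name r2 is hypothetical): predicting the mean gives exactly 0, and predictions worse than the mean give a negative score.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_toy = np.array([1.0, 2.0, 3.0, 4.0])   # made-up targets

r2_mean_model = r2(y_toy, np.full_like(y_toy, y_toy.mean()))
r2_bad_model = r2(y_toy, np.array([4.0, 3.0, 2.0, 1.0]))
```

A degree-0 model can only predict a constant, so its training score is 0, while its test score is negative because the training mean differs from the test mean.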

Question 3

Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset?

(Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship)

This function should return a tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization)

def answer_three():
    # YOUR CODE HERE
    Underfitting, Overfitting, Good_Generalization = 0, 9, 7
    return (Underfitting, Overfitting, Good_Generalization)
    # raise NotImplementedError()

Output:

(0, 9, 7)

Question 4

Training models on high degree polynomial features can result in overfitting. Train two models: a non-regularized LinearRegression model and a Lasso Regression model (with parameters alpha=0.01, max_iter=10000, tol=0.1) on polynomial features of degree 12. Return the $R^2$ score for LinearRegression and Lasso model's test sets.

This function should return a tuple (LinearRegression_R2_test_score, Lasso_R2_test_score)

def answer_four():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    
    # YOUR CODE HERE
    poly = PolynomialFeatures(degree=12)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    # Reuse the transformer fitted on the training data for the test data
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    linreg = LinearRegression().fit(X_train_poly, y_train)
    lasso = Lasso(alpha=0.01, max_iter=10000, tol=0.1).fit(X_train_poly, y_train)
    y_pred1 = linreg.predict(X_test_poly)
    y_pred2 = lasso.predict(X_test_poly)
    LinearRegression_R2_test_score = r2_score(y_test, y_pred1)
    Lasso_R2_test_score = r2_score(y_test, y_pred2)
    return (LinearRegression_R2_test_score, Lasso_R2_test_score)
    # raise NotImplementedError()

Output:

(-4.311967773795515, 0.6051396919570036)

Part 2 - Classification

For this section of the assignment we will be working with the UCI Mushroom Data Set stored in mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

Attribute Information:

cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


mush_df = pd.read_csv('assets/mushrooms.csv')
mush_df2 = pd.get_dummies(mush_df)

X_mush = mush_df2.iloc[:,2:]
y_mush = mush_df2.iloc[:,1]


X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)
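To see why column 1 of the encoded frame serves as the label and columns 2 onward as the features, here is a sketch on a made-up two-column frame. It assumes, as in the mushroom file, that the first column is a class column with values e and p; the rows themselves are invented.

```python
import pandas as pd

# Made-up rows; the real file has many more columns
toy = pd.DataFrame({'class': ['e', 'p', 'e'],
                    'odor': ['a', 'n', 'n']})

# Each categorical column becomes one indicator column per category
dummies = pd.get_dummies(toy)
column_names = list(dummies.columns)

X_toy = dummies.iloc[:, 2:]   # everything after class_e and class_p
y_toy = dummies.iloc[:, 1]    # class_p: indicator for the second class value
```

The two class indicators come first, so the second one is used as the target and both are excluded from the features.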

Question 5

Using X_train2 and y_train2 from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?

This function should return a list of length 5 of the feature names in descending order of importance.

def answer_five():
    from sklearn.tree import DecisionTreeClassifier
    
    # YOUR CODE HERE
    feature_importances = {}
    clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
    importances = clf.feature_importances_
    for i in range(len(importances)):
        feature_importances[X_mush.columns[i]] = importances[i]
    feature_importances_sorted = sorted(feature_importances.items(), key=lambda k: k[1], reverse=True)
    feature_importances_top5 = feature_importances_sorted[:5]
    features = [feature_importance[0] for feature_importance in feature_importances_top5]
    return features
    # raise NotImplementedError()

Output:

['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
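The dictionary-and-sort step can also be done with a pandas Series. The sketch below uses made-up importance values and feature names purely for illustration; with a fitted tree, the values would come from clf.feature_importances_ and the names from the training columns.

```python
import pandas as pd

# Made-up importances and names, for illustration only
names = ['odor_n', 'odor_a', 'gill-size_b', 'habitat_d']
importances = [0.6, 0.05, 0.3, 0.05]

# Label each importance with its feature name, then keep the largest two
top2 = pd.Series(importances, index=names).nlargest(2).index.tolist()
```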

Question 6 (did not pass the autograder)

For this question, use the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values.

Create an SVC with default parameters (i.e. kernel='rbf', C=1) and random_state=0. Recall that the kernel width of the RBF kernel is controlled using the gamma parameter. Explore the effect of gamma on classifier accuracy by using the validation_curve function to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)).

For each level of gamma, validation_curve will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.

Find the mean score across the three models for each level of gamma for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.

e.g.

if one of your array of scores is

array([[ 0.5,  0.4,  0.6],
       [ 0.7,  0.8,  0.7],
       [ 0.9,  0.8,  0.8],
       [ 0.8,  0.7,  0.8],
       [ 0.7,  0.6,  0.6],
       [ 0.4,  0.6,  0.5]])

it should then become

array([ 0.5,  0.73333333,  0.83333333,  0.76666667,  0.63333333, 0.5])

This function should return a tuple of numpy arrays (training_scores, test_scores) where each array in the tuple has shape (6,).
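The averaging step can be checked directly on the example array given in the question. This sketch also prints out the six gamma values produced by np.logspace(-4, 1, 6).

```python
import numpy as np

# Example score array from the question: 6 gamma levels x 3 fits
scores = np.array([[0.5, 0.4, 0.6],
                   [0.7, 0.8, 0.7],
                   [0.9, 0.8, 0.8],
                   [0.8, 0.7, 0.8],
                   [0.7, 0.6, 0.6],
                   [0.4, 0.6, 0.5]])

# axis=1 averages across the 3 fits, leaving one value per gamma level
mean_scores = scores.mean(axis=1)

# Six logarithmically spaced gamma values from 1e-4 to 1e1
gammas = np.logspace(-4, 1, 6)
```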

def answer_six():
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve
    # YOUR CODE HERE
    svc = SVC(kernel='rbf', C=1, random_state=0)
    param_range = np.logspace(-4, 1, 6)
    training_scores, test_scores = validation_curve(svc, X_mush, y_mush, param_name='gamma', param_range=param_range,cv=3)
    training_scores, test_scores = np.mean(training_scores, axis=1), np.mean(test_scores, axis=1)
    return (training_scores, test_scores)
    # raise NotImplementedError()

Output (did not pass the autograder):

(array([0.89838749, 0.98104382, 0.99895372, 1.        , 1.        ,
        1.        ]),
 array([0.88749385, 0.82951748, 0.84170359, 0.86582964, 0.83616445,
        0.51797144]))

Question 7

Based on the scores from question 6, what gamma value corresponds to a model that is underfitting? What gamma value corresponds to a model that is overfitting? What choice of gamma would provide a model with good generalization performance on this dataset?

(Hint: Try plotting the scores from question 6 to visualize the relationship)

This function should return a tuple with the gamma values in this order: (Underfitting, Overfitting, Good_Generalization)

def answer_seven():
    # YOUR CODE HERE
    gamma = np.logspace(-4, 1, 6)
    Underfitting, Overfitting, Good_Generalization = gamma[0], gamma[5], gamma[3]
    return (Underfitting, Overfitting, Good_Generalization)
    # raise NotImplementedError()

Output:

(0.0001, 10.0, 0.1)

Module 3: Evaluation

Assignment 3

import numpy as np
import pandas as pd

Question 1

Import the data from assets/fraud_data.csv. What percentage of the observations in the dataset are instances of fraud?

This function should return a float between 0 and 1.

def answer_one():
    # YOUR CODE HERE
    df = pd.read_csv('assets/fraud_data.csv')
    df_fraud = df[df['Class'] == 1]
    return len(df_fraud)/len(df)
    # raise NotImplementedError()

Output:

0.016410823768035772

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('assets/fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Question 2

Using X_train, X_test, y_train, and y_test (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

This function should return a tuple with two floats, i.e. (accuracy score, recall score).

def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score

    # YOUR CODE HERE
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_predict = dummy_majority.predict(X_test)
    accuracy = dummy_majority.score(X_test, y_test)
    # Use a distinct name so the imported recall_score function is not shadowed
    recall = recall_score(y_test, y_predict)
    return (accuracy, recall)
    # raise NotImplementedError()

Output:

(0.9852507374631269, 0.0)
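A majority-class classifier gets high accuracy on imbalanced data while finding no fraud at all. The sketch below reproduces the two numbers with plain NumPy, assuming the test set holds 5344 non-fraud and 80 fraud rows (the row sums of the confusion matrix reported for Question 4 in this post).

```python
import numpy as np

# Assumed test-set composition: 5344 negatives, 80 positives
y_test_toy = np.concatenate([np.zeros(5344, dtype=int), np.ones(80, dtype=int)])

# The majority-class dummy predicts 0 (not fraud) for every row
y_pred = np.zeros_like(y_test_toy)

accuracy = np.mean(y_pred == y_test_toy)
true_pos = np.sum((y_pred == 1) & (y_test_toy == 1))
recall = true_pos / np.sum(y_test_toy == 1)
```

Accuracy equals the share of the majority class, and recall is zero because no positive is ever predicted.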

Question 3

Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What is the accuracy, recall, and precision of this classifier?

This function should return a tuple with three floats, i.e. (accuracy score, recall score, precision score).

def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC
    
    # YOUR CODE HERE
    svm = SVC().fit(X_train, y_train)
    y_predict = svm.predict(X_test)
    accuracy_score = svm.score(X_test, y_test)
    recall_score = recall_score(y_test, y_predict)
    precision_score = precision_score(y_test, y_predict)
    return (accuracy_score, recall_score, precision_score)
    # raise NotImplementedError()

Output:

(0.9900442477876106, 0.35, 0.9333333333333333)

Question 4

Using the SVC classifier with parameters {'C': 1e9, 'gamma': 1e-07}, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.

This function should return a confusion matrix, a 2x2 numpy array with 4 integers.

def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    
    # YOUR CODE HERE
    svm = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    y_scores = svm.decision_function(X_test) > -220
    confusion = confusion_matrix(y_test, y_scores)
    return confusion
    # raise NotImplementedError()

Output:

array([[5320,   24],
       [  14,   66]])
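Thresholding the decision function just turns each score into a 0/1 prediction, which is then tallied against the true labels. A minimal NumPy sketch on made-up scores, using the same layout as scikit-learn's confusion matrix (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Made-up decision scores and true labels
scores = np.array([-300.0, -250.0, -100.0, 50.0, -230.0, 10.0])
y_true = np.array([0, 0, 0, 1, 1, 1])

# Predict the positive class whenever the score exceeds the threshold
y_pred = (scores > -220).astype(int)

# Tally: cm[true_label, predicted_label]
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

(tn, fp), (fn, tp) = cm
recall = tp / (tp + fn)
precision = tp / (tp + fp)
```

Applied to the matrix reported above, the same formulas give recall 66 / (66 + 14) = 0.825 and precision 66 / (66 + 24), roughly 0.733.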

Question 5 (did not pass the autograder)

Train a logistic regression classifier with default parameters using X_train and y_train.

For the logistic regression classifier, create a precision recall curve and a roc curve using y_test and the probability estimates for X_test (probability it is fraud).

Looking at the precision recall curve, what is the recall when the precision is 0.75?

Looking at the roc curve, what is the true positive rate when the false positive rate is 0.16?

This function should return a tuple with two floats, i.e. (recall, true positive rate).

# YOUR CODE HERE
def answer_five():        
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve
    
    logreg = LogisticRegression().fit(X_train, y_train)
    y_proba_lr = logreg.predict_proba(X_test)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)
    
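    # Recall at the curve point(s) where precision is 0.75 (the last match is kept)
    recall_at_precision = None
    for pre, rec in zip(precision, recall):
        if np.isclose(pre, 0.75):
            recall_at_precision = rec

    # FPR value on the curve closest to 0.16, then the TPR of the last point with that FPR
    fpr_close = fpr_lr[np.argmin(np.abs(fpr_lr - 0.16))]
    tp_rate = tpr_lr[fpr_lr == fpr_close][-1]

    return (recall_at_precision, tp_rate)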
    # raise NotImplementedError()

Output (did not pass the autograder):

(0.825, 0.925)
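Another way to read a value off a ROC curve is linear interpolation between neighbouring curve points. The sketch below uses made-up curve points and np.interp; it is offered only as an alternative way of reading the curve, with no claim that it matches what the autograder expects.

```python
import numpy as np

# Made-up ROC points; fpr must be increasing for np.interp
fpr_toy = np.array([0.0, 0.1, 0.2, 1.0])
tpr_toy = np.array([0.0, 0.8, 0.9, 1.0])

# Linearly interpolate the TPR at FPR = 0.16 (between the points at 0.1 and 0.2)
tpr_at_016 = np.interp(0.16, fpr_toy, tpr_toy)
```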

Question 6

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and the default 3-fold cross validation. (Suggestion: use solver='liblinear'.)

'penalty': ['l1', 'l2']

'C':[0.01, 0.1, 1, 10]

From .cv_results_, create an array of the mean test scores of each parameter combination. i.e.

        l1    l2
0.01    ?     ?
0.1     ?     ?
1       ?     ?
10      ?     ?

This function should return a 4 by 2 numpy array with 8 floats.

Note: do not return a DataFrame, just the values denoted by ? in a numpy array.

def answer_six():    
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # YOUR CODE HERE
    logreg = LogisticRegression(solver='liblinear')
    grid_values = {'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10]}
    grid_lr = GridSearchCV(logreg, param_grid=grid_values, scoring='recall', cv=3)
    grid_lr.fit(X_train, y_train)
    result = grid_lr.cv_results_
    mean_test_score = result['mean_test_score']
    mean_test_score_reshape = mean_test_score.reshape(4,2)
    return mean_test_score_reshape
    # raise NotImplementedError()

Output:

array([[0.66666667, 0.76086957],
       [0.80072464, 0.80434783],
       [0.8115942 , 0.8115942 ],
       [0.80797101, 0.8115942 ]])
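The reshape(4, 2) works because of the order in which the parameter combinations are enumerated. The sketch below lists that order with scikit-learn's ParameterGrid for the same grid, which sorts the parameter names and varies the last one fastest.

```python
from sklearn.model_selection import ParameterGrid

grid = list(ParameterGrid({'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]}))

# One (C, penalty) pair per combination, in enumeration order
order = [(params['C'], params['penalty']) for params in grid]
```

C changes every two entries and penalty alternates, so reshaping the 8 mean test scores to (4, 2) puts one C value per row and l1, l2 in the two columns.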

Module 4: Supervised Machine Learning - Part 2

Assignment 4 - Predicting and understanding viewer engagement with educational videos

With the accelerating popularity of online educational experiences, the role of online lectures and other educational video continues to increase in scope and importance. Open access educational repositories such as videolectures.net, as well as Massive Open Online Courses (MOOCs) on platforms like Coursera, have made access to many thousands of lectures and tutorials an accessible option for millions of people around the world. Yet this impressive volume of content has also led to a challenge in how to find, filter, and match these videos with learners. This assignment gives you an example of how machine learning can be used to address part of that challenge.

About the prediction problem

One critical property of a video is engagement: how interesting or "engaging" it is for viewers, so that they decide to keep watching. Engagement is critical for learning, whether the instruction is coming from a video or any other source. There are many ways to define engagement with video, but one common approach is to estimate it by measuring how much of the video a user watches. If the video is not interesting and does not engage a viewer, they will typically abandon it quickly, e.g. only watch 5 or 10% of the total.

A first step towards providing the best-matching educational content is to understand which features of educational material make it engaging for learners in general. This is where predictive modeling can be applied, via supervised machine learning. For this assignment, your task is to predict how engaging an educational video is likely to be for viewers, based on a set of features extracted from the video's transcript, audio track, hosting site, and other sources.

We chose this prediction problem for several reasons:

It combines a variety of features derived from a rich set of resources connected to the original data;
The manageable dataset size means the dataset and supervised models for it can be easily explored on a wide variety of computing platforms;
Predicting popularity or engagement for a media item, especially combined with understanding which features contribute to its success with viewers, is a fun problem but also a practical representative application of machine learning in a number of business and educational sectors.

About the dataset

We extracted training and test datasets of educational video features from the VLE Dataset put together by researcher Sahan Bulathwela at University College London.

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single educational video, and includes information about diverse properties of the video content as described further below. The target variable is engagement, which is True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
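To make the label definition above concrete, here is a minimal sketch of how such a target could be derived from per-viewer data. The `views` table, its column names, and its numbers are made up for illustration; the dataset's actual preprocessing is not described in the assignment.

```python
import pandas as pd

# Hypothetical per-viewer watch fractions for three videos (made-up numbers).
views = pd.DataFrame({
    "video_id":    [1, 1, 1, 2, 2, 3, 3, 3],
    "pct_watched": [0.10, 0.45, 0.60, 0.05, 0.20, 0.30, 0.30, 0.90],
})

# Median fraction watched per video, then the 30% threshold from the description.
median_watched = views.groupby("video_id")["pct_watched"].median()
engagement = (median_watched >= 0.30).rename("engagement")
print(engagement)
```

In this toy table, videos 1 and 3 come out as engaging and video 2 does not, because its median watch fraction falls below 30%.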

Note: Any extra variables that may be included in the training set are simply for your interest if you want an additional source of data for visualization, or to enable unsupervised and semi-supervised approaches. However, they are not included in the test set and thus cannot be used for prediction. Only the data already included in your Coursera directory can be used for training the model for this assignment.

For this final assignment, you will bring together what you've learned across all four weeks of this course, by exploring different prediction models for this new dataset. In addition, we encourage you to apply what you've learned about model selection to do hyperparameter tuning using training/validation splits of the training data, to optimize the model and further increase its performance. In addition to a basic evaluation of model accuracy, we've also provided a utility function to visualize which features contribute most and least to the overall model performance.
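The hyperparameter tuning suggested above could look roughly like the following sketch. It uses synthetic data from `make_classification` as a stand-in for train.csv, and a logistic regression with an arbitrary grid of `C` values; both are assumptions chosen to keep the example short, not the setup used in the solution further down.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the real dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scaling sits inside the pipeline so each fold is scaled using only its own training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid; step names come from make_pipeline's lowercase class names.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

# Score with ROC AUC, the metric the assignment is graded on.
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler in the pipeline avoids leaking validation-fold statistics into training, which would otherwise make the cross-validated AUC look slightly better than it is.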

File descriptions

assets/train.csv - the training set (Use only this data for training your model!)

assets/test.csv - the test set

Data fields

train.csv & test.csv:

title_word_count - the number of words in the title of the video.

document_entropy - a score indicating how varied the topics covered in the video are, based on the transcript. Videos with smaller entropy scores will tend to be more cohesive and more focused on a single topic.

freshness - The number of days elapsed between 01/01/1970 and the lecture's publication date. Videos that are more recent will have higher freshness values.
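Since freshness counts days from 01/01/1970, it can be converted back to a calendar date if that helps with exploration. A minimal sketch, using made-up freshness values rather than rows from the dataset:

```python
import pandas as pd

# Made-up freshness values (days since 1970-01-01), not taken from train.csv.
freshness = pd.Series([14000, 15500, 17000], name="freshness")

# unit="D" tells pandas to treat the integers as days since the Unix epoch.
published = pd.to_datetime(freshness, unit="D")
print(published.dt.date.tolist())
```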

easiness - A text difficulty measure applied to the transcript. A lower score indicates more complex language used by the presenter.

fraction_stopword_presence - A stopword is a very common word like 'the' or 'and'. This feature computes the fraction of all words that are stopwords in the video lecture transcript.
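As an illustration of the definition above, the fraction could be computed along these lines. The function name, the tiny stopword set, and the whitespace tokenization are assumptions for the sketch; the dataset's actual stopword list and tokenizer are not given here.

```python
def stopword_fraction(transcript, stopwords):
    """Fraction of whitespace-separated words that are in the stopword set."""
    words = transcript.lower().split()
    if not words:
        return 0.0  # avoid dividing by zero on an empty transcript
    return sum(word in stopwords for word in words) / len(words)

# Tiny illustrative stopword set (an assumption, not the list used for the dataset).
STOPWORDS = {"the", "and", "a", "of", "is", "to", "in"}

fraction = stopword_fraction("The gradient of the loss is small", STOPWORDS)
print(fraction)
```

In this sentence 4 of the 7 words ("the", "of", "the", "is") are stopwords, so the result is 4/7.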

speaker_speed - The average speaking rate in words per minute of the presenter in the video.

silent_period_rate - The fraction of time in the lecture video that is silence (no speaking).

train.csv only:

engagement - Target label for training. True if learners watched a substantial portion of the video (see description), or False otherwise.

Evaluation

Your predictions will be given as the probability that the corresponding video will be engaging to learners.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC (area under ROC curve) of at least 0.8 passes this assignment, and over 0.85 will receive full points.
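For reference, AUC can be computed from true labels and predicted probabilities with scikit-learn's `roc_auc_score`. The four labels and scores below are toy values, not predictions from this assignment.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs where the positive is scored higher.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because AUC depends only on how the scores rank the examples, it should be given probabilities (or other continuous scores) rather than hard True/False predictions.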

For this assignment, create a function that trains a model to predict significant learner engagement with a video using assets/train.csv. Using this model, return a Pandas Series object of length 2309 containing the probability that each corresponding video from assets/test.csv will be engaging (according to a model learned from the 'engagement' label in the training set), indexed by the video's id field.

Example:

id
   9240    0.401958
   9241    0.105928
   9242    0.018572
             ...
   9243    0.208567
   9244    0.818759
   9245    0.018528
         ...
   Name: engagement, dtype: float32

Hints

Make sure your code is working before submitting it to the autograder.
Print out and check your result to see whether there is anything weird (e.g., all probabilities are the same).
Generally the total runtime should be less than 10 mins.
Try to avoid global variables. If you have other functions besides engagement_model, you should move those functions inside the scope of engagement_model.
Be sure to first check the pinned threads in Week 4's discussion forum if you run into a problem you can't figure out.
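The "print out and check your result" hint above can be turned into a small helper. This is only a sketch: `check_predictions` is a hypothetical function, not something the course provides, and the example Series use made-up values.

```python
import pandas as pd

def check_predictions(rec, expected_len):
    """Return a list of problems found in a Series of predicted probabilities."""
    problems = []
    if len(rec) != expected_len:
        problems.append(f"expected {expected_len} rows, got {len(rec)}")
    if rec.isna().any():
        problems.append("contains NaN")
    if ((rec < 0) | (rec > 1)).any():
        problems.append("values outside [0, 1]")
    if rec.nunique() <= 1:
        problems.append("all probabilities are identical")
    return problems

# Made-up examples: one plausible result and one degenerate result.
good = pd.Series([0.1, 0.8, 0.3], index=[9240, 9241, 9242], name="engagement")
bad = pd.Series([0.5, 0.5, 0.5], index=[9240, 9241, 9242], name="engagement")

print(check_predictions(good, 3))  # []
print(check_predictions(bad, 3))   # ['all probabilities are identical']
```

For the real submission the call would be something like `check_predictions(engagement_model(), 2309)`, since the assignment asks for a Series of length 2309.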

Extensions

If this prediction task motivates you to explore further, you can find more details here on the original VLE dataset and others related to video engagement: https://github.com/sahanbull/VLE-Dataset

import warnings
warnings.filterwarnings("ignore")

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)   # Do not change this value: required to be compatible with solutions generated by the autograder.

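def engagement_model():
    # Imports are kept inside the function, following the hint about avoiding globals.
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    # Training data: index by video id and drop normalization_rate,
    # which this solution leaves out of the feature set.
    train_set = pd.read_csv('assets/train.csv').set_index('id')
    train_set = train_set.drop('normalization_rate', axis=1)
    X_train = train_set.iloc[:, :-1]   # every column except the last
    y_train = train_set.iloc[:, -1]    # last column, taken to be the engagement label

    # Test data: same index and the same dropped column.
    test_set = pd.read_csv('assets/test.csv').set_index('id')
    X_test = test_set.drop('normalization_rate', axis=1)

    # Fit the scaler on the training features only, then apply it to both sets.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    clf = MLPClassifier(hidden_layer_sizes=[20, 20], alpha=5, solver='lbfgs',
                        random_state=0, activation='logistic').fit(X_train_scaled, y_train)

    # Column 1 of predict_proba is the probability of the positive class (engagement == True).
    proba = clf.predict_proba(X_test_scaled)[:, 1]

    # Return the probabilities as a Series indexed by video id.
    return pd.Series(proba, index=X_test.index, name='engagement')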

Output:

id
9240     0.014100
9241     0.060007
9242     0.086591
9243     0.933297
9244     0.014579
           ...   
11544    0.022987
11545    0.009579
11546    0.014023
11547    0.888931
11548    0.018817
Name: engagement, Length: 2309, dtype: float64

(Note: the result meets the AUC > 0.85 requirement, i.e. the threshold for full marks.)
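Because test.csv carries no labels, the AUC can only be estimated locally from the training data before submitting. One possible approach is cross-validation, sketched below with the same classifier settings as the solution above. The data here is synthetic (`make_classification` with 7 features is an assumption standing in for train.csv), so the printed score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Same scaler and classifier settings as engagement_model, wrapped in a pipeline
# so that scaling is refit on the training part of every fold.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=[20, 20], alpha=5, solver='lbfgs',
                  random_state=0, activation='logistic'),
)

# Five-fold cross-validated ROC AUC.
scores = cross_val_score(pipe, X, y, scoring='roc_auc', cv=5)
print(scores.mean())
```

With the real data, `X` and `y` would be replaced by the features and engagement label loaded from assets/train.csv.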