Applied Machine Learning in Python Week-2

Assignment 2

Part 2 - Classification

Here’s an application of machine learning that could save your life! For this section of the assignment we will be working with the UCI Mushroom Data Set stored in mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

Attribute Information:

cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

The data in the mushrooms dataset is currently encoded as strings. These values must be converted to numeric form to work with sklearn. We’ll use pd.get_dummies to convert the categorical variables into indicator variables.
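
As a rough illustration of what pd.get_dummies does, here is a minimal sketch on a toy frame. The three rows are invented for illustration and only reuse two attribute names and codes from the list above; the real data comes from mushrooms.csv.

```python
import pandas as pd

# Toy rows (made up for illustration), using codes from the attribute list:
# odor: pungent=p, almond=a, anise=l; gill-size: narrow=n, broad=b
toy = pd.DataFrame({
    "odor": ["p", "a", "l"],
    "gill-size": ["n", "b", "b"],
})

# Each string column becomes one indicator column per distinct value,
# named <column>_<value>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['odor_a', 'odor_l', 'odor_p', 'gill-size_b', 'gill-size_n']
```

Each original column expands into as many indicator columns as it has distinct values, which is why the encoded mushroom data ends up much wider than the raw table.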

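import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# adspy_shared_utilities is the helper module distributed with the course notebooks
from adspy_shared_utilities import load_crime_dataset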

cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])

# synthetic dataset for simple regression
from sklearn.datasets import make_regression
plt.figure()  # start a new figure so the plots do not overlay each other
plt.title('Sample regression problem with one input variable')
X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,
                            n_informative=1, bias = 150.0,
                            noise = 30, random_state=0)
plt.scatter(X_R1, y_R1, marker= 'o', s=50)

# synthetic dataset for more complex regression
from sklearn.datasets import make_friedman1
plt.figure()  # start a new figure so the plots do not overlay each other
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples = 100,
                           n_features = 7, random_state=0)

plt.scatter(X_F1[:, 2], y_F1, marker= 'o', s=50)

# synthetic dataset for classification (binary) 
plt.figure()  # start a new figure so the plots do not overlay each other
plt.title('Sample binary classification problem with two informative features')
X_C2, y_C2 = make_classification(n_samples = 100, n_features=2,
                                n_redundant=0, n_informative=2,
                                n_clusters_per_class=1, flip_y = 0.1,
                                class_sep = 0.5, random_state=0)
plt.scatter(X_C2[:, 0], X_C2[:, 1], c=y_C2,
           marker= 'o', s=50, cmap=cmap_bold)

# more difficult synthetic dataset for classification (binary) 
# with classes that are not linearly separable
X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2, centers = 8,
                       cluster_std = 1.3, random_state = 4)
y_D2 = y_D2 % 2
plt.figure()  # start a new figure so the plots do not overlay each other
plt.title('Sample binary classification problem with non-linearly separable classes')
plt.scatter(X_D2[:,0], X_D2[:,1], c=y_D2,
           marker= 'o', s=50, cmap=cmap_bold)

# Breast cancer dataset for classification
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

# Communities and Crime dataset
(X_crime, y_crime) = load_crime_dataset()

Question 5

Using X_train2 and y_train2 from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?

As a reminder, the feature names are available in the X_train2.columns property, and the order of the features in X_train2.columns matches the order of the feature importance values in the classifier’s feature_importances_ property.

This function should return a list of length 5 containing the feature names in descending order of importance.
Note: remember that you also need to set random_state in the DecisionTreeClassifier.

def answer_five():
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
    top_five = clf.feature_importances_.argsort()[::-1][:5]   
    result = list(X_train2.columns[top_five])

    return result


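  1. clf.feature_importances_ is an np.ndarray.
  2. clf.feature_importances_.argsort() returns the indices that would sort the array in ascending order, i.e. for each value in sorted order, its position in the original array.
  3. clf.feature_importances_.argsort()[::-1] reverses that index array. We want the five largest values, so ascending order becomes descending order.
  4. clf.feature_importances_.argsort()[::-1][:5] takes the first five entries, which are the indices of the five largest values.
  5. top_five therefore holds the positions of those values in the original array, and we need to return a list.
  6. Hence result = list(X_train2.columns[top_five]).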
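
The indexing steps above can be checked on a small array. This is a minimal sketch: the importance values and the feature names f0 to f5 are hypothetical, and the top three are taken instead of the top five to keep it short.

```python
import numpy as np

# Hypothetical importances and feature names, for illustration only
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.00, 0.15])
names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])

order = importances.argsort()   # indices in ascending order of importance
top_three = order[::-1][:3]     # reverse to descending, keep the first three
result = list(names[top_three])

print(order)    # [4 0 2 5 3 1]
print(result)   # f1, f3, f5 in that order
```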

Question 6

For this question, we’re going to use the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values. Recall that the validation_curve function, in addition to taking an initialized unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.

Because creating a validation curve requires fitting multiple models, for performance reasons this question will use just a subset of the original mushroom dataset: please use the variables X_subset and y_subset as input to the validation curve function (instead of X_mush and y_mush) to reduce computation time.

The initialized unfitted classifier object we’ll be using is a Support Vector Classifier with radial basis kernel. So your first step is to create an SVC object with default parameters (i.e. kernel='rbf', C=1) and random_state=0. Recall that the kernel width of the RBF kernel is controlled using the gamma parameter.

With this classifier, and the dataset in X_subset, y_subset, explore the effect of gamma on classifier accuracy by using the validation_curve function to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)). Recall that you can specify what scoring metric you want validation_curve to use by setting the “scoring” parameter. In this case, we want to use “accuracy” as the scoring metric.

For each level of gamma, validation_curve will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.

Find the mean score across the three models for each level of gamma for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.


For example, if one of your arrays of scores is

array([[ 0.5, 0.4, 0.6],
       [ 0.7, 0.8, 0.7],
       [ 0.9, 0.8, 0.8],
       [ 0.8, 0.7, 0.8],
       [ 0.7, 0.6, 0.6],
       [ 0.4, 0.6, 0.5]])

it should then become

array([ 0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])

This function should return one tuple of numpy arrays (training_scores, test_scores) where each array in the tuple has shape (6,).
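
The row-wise averaging in the example above can be reproduced directly with NumPy. This sketch reuses the example array from the question; the variable names are illustrative.

```python
import numpy as np

# The example 6x3 score array from the question (6 gamma levels x 3 fits)
scores = np.array([[0.5, 0.4, 0.6],
                   [0.7, 0.8, 0.7],
                   [0.9, 0.8, 0.8],
                   [0.8, 0.7, 0.8],
                   [0.7, 0.6, 0.6],
                   [0.4, 0.6, 0.5]])

# axis=1 averages across the three fits in each row, one mean per gamma level
per_gamma = scores.mean(axis=1)
print(per_gamma.shape)  # (6,)
print(per_gamma)
```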

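def answer_six():
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve

    # Unfitted RBF-kernel SVC; validation_curve clones and fits it internally
    clf = SVC(kernel='rbf', C=1, random_state=0)
    param_range = np.logspace(-4, 1, 6)  # six gamma values from 0.0001 to 10
    # Vary gamma over param_range using 3-fold cross-validation
    train_scores, test_scores = validation_curve(clf, X_subset, y_subset,
                                            param_name='gamma', param_range=param_range,
                                            cv=3, scoring="accuracy")
    # Average over the 3 folds (axis=1) to get one score per gamma value
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    return (train_scores_mean, test_scores_mean)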


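  1. First, create an SVC object: sklearn.svm.SVC(C=1.0, kernel='rbf', random_state=None). C, the first important SVM parameter, is set to 1 (which is also its default), and the kernel is RBF.
  2. Next, consider gamma. The question asks for values from 0.0001 to 10, which np.logspace produces easily:
    numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
    0.0001 = base**start, so start = -4; 10 = base**stop, so stop = 1; we need six values, so num = 6 (exponents [-4, -3, -2, -1, 0, 1]).
  3. Then use validation_curve, as the question requires:
    sklearn.model_selection.validation_curve(estimator, X, y, param_name, param_range, cv='warn', scoring=None)
    estimator takes a classifier object; param_name is 'gamma', the parameter being varied; cv defaulted to 3-fold in the scikit-learn version this signature comes from (newer releases default to 5-fold, so pass cv=3 explicitly); scoring is set to 'accuracy'.
  4. numpy.mean(a, axis=None, dtype=None, keepdims=<no value>)
    a is the array; axis=0 averages down each column, axis=1 averages across each row.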
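
To see how these pieces fit together outside the assignment, here is a minimal sketch that runs the same call pattern on synthetic data. The make_classification dataset and the names X_demo and y_demo are stand-ins chosen for illustration; the assignment itself uses X_subset and y_subset, so the scores here say nothing about the mushroom data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the mushroom subset)
X_demo, y_demo = make_classification(n_samples=150, n_features=5,
                                     random_state=0)

gammas = np.logspace(-4, 1, 6)  # 1e-4, 1e-3, ..., 1, 10

# One row per gamma value, one column per cross-validation fold
train_scores, test_scores = validation_curve(
    SVC(kernel="rbf", C=1, random_state=0), X_demo, y_demo,
    param_name="gamma", param_range=gammas, cv=3, scoring="accuracy")

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
print(train_scores.shape, train_mean.shape)  # (6, 3) (6,)
```

Passing cv=3 explicitly keeps the 6x3 shape described in the question regardless of the installed scikit-learn version's default.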

Question 7

Based on the scores from question 6, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)?

Hint: Try plotting the scores from question 6 to visualize the relationship between gamma and accuracy. Remember to comment out the import matplotlib line before submission.

This function should return one tuple with the gamma values in this order: (Underfitting, Overfitting, Good_Generalization). Please note there is only one correct solution.

def answer_seven():
    param_range = np.logspace(-4, 1, 6)
    # Read in the results of answer_six
    training_scores, test_scores = answer_six()
    # Sort the scores
    train_scores_sorted = np.sort(training_scores)
    test_scores_sorted = np.sort(test_scores)
    # Initialize the values
    Underfitting = 0
    Overfitting = 0
    Good_Generalization = 0
    # Extremes of each score array (only the max_* values are used below)
    min_train_scores = np.min(training_scores)
    max_train_scores = np.max(training_scores)
    min_test_scores = np.min(test_scores)
    max_test_scores = np.max(test_scores)
    for gam, data in zip(param_range, zip(training_scores, test_scores)):
        if data[0] <= train_scores_sorted[1] and data[1] <= test_scores_sorted[1]:
            Underfitting = gam
        if data[0] > train_scores_sorted[1] and data[1] <= test_scores_sorted[1]:
            Overfitting = gam
        if data[0] == max_train_scores and data[1] == max_test_scores:
            Good_Generalization = gam
    return Underfitting, Overfitting, Good_Generalization
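
An alternative way to express the same selection is with argmin and argmax instead of a loop. This is only a sketch under stated assumptions: the score arrays below are hypothetical numbers invented to show the mechanics, not results from the mushroom data, and the three selection rules (lowest training score, largest train-test gap, highest test score) are one possible reading of the question rather than the grader's definition.

```python
import numpy as np

param_range = np.logspace(-4, 1, 6)

# Hypothetical mean scores per gamma, made up for illustration
train = np.array([0.57, 0.93, 0.99, 1.00, 1.00, 1.00])
test = np.array([0.56, 0.93, 0.99, 1.00, 0.99, 0.52])

# Underfitting: the model cannot even fit the training data well
underfitting = param_range[np.argmin(train)]
# Overfitting: training accuracy far above test accuracy
overfitting = param_range[np.argmax(train - test)]
# Good generalization: best accuracy on held-out data
good_generalization = param_range[np.argmax(test)]

print(underfitting, overfitting, good_generalization)
```

With real scores it is worth plotting both curves against gamma first, as the hint suggests, to confirm that these simple rules pick the values you would choose by eye.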


