Coursera | Applied Machine Learning in Python(University of Michigan)| Assignment2

   Links to all assignments:
  Coursera | Applied Machine Learning in Python(University of Michigan)| Assignment1
  Coursera | Applied Machine Learning in Python(University of Michigan)| Assignment2
  Coursera | Applied Machine Learning in Python(University of Michigan)| Assignment3
  Coursera | Applied Machine Learning in Python(University of Michigan)| Assignment4
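   I will put all the code on GitHub when I have time (or if there is demand).
   Also, a quick plug for my own blog: from now on my CSDN articles will be posted there as well.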

You are currently looking at version 1.3 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.


Assignment 2

In this assignment you’ll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.

Part 1 - Regression

First, run the following block to set up the variables needed for later sections.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


# You can use this function to help you visualize the dataset by
# plotting a scatterplot of the data points
# in the training and test sets.
def part1_scatter():
    import matplotlib.pyplot as plt
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4)


# NOTE: Uncomment the function below to visualize the data, but be sure
# to **re-comment it before submitting this assignment to the autograder**.
part1_scatter()

Question 1

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.

[Figure: polynomial fits of degree 1, 3, 6, and 9 plotted over the training and test data]

The figure above shows the fitted models plotted on top of the original data (using plot_one()).


This function should return a numpy array with shape (4, 100)

def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    result = np.zeros((4, 100))
    test = np.linspace(0, 10, 100)
    for i, degree in enumerate([1, 3, 6, 9]):
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X_train.reshape(len(X_train), 1))
        linreg = LinearRegression().fit(X_poly, y_train)
        # Reuse the already-fitted transformer; only transform() is needed here
        y_pred = linreg.predict(poly.transform(test.reshape(len(test), 1)))
        result[i, :] = y_pred

    return result
# feel free to use the function plot_one() to replicate the figure
# from the prompt once you have completed question one
def plot_one(degree_predictions):
    import matplotlib.pyplot as plt
    %matplotlib notebook
    plt.figure(figsize=(10, 5))
    plt.plot(X_train, y_train, 'o', label='training data', markersize=10)
    plt.plot(X_test, y_test, 'o', label='test data', markersize=10)
    for i, degree in enumerate([1, 3, 6, 9]):
        plt.plot(np.linspace(0, 10, 100),
                 degree_predictions[i],
                 alpha=0.8,
                 lw=2,
                 label='degree={}'.format(degree))
    plt.ylim(-1, 2.5)
    plt.legend(loc=4)


plot_one(answer_one())
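
The fit and transform steps in answer_one() can also be chained into a single estimator. The following is a minimal sketch, not part of the assignment: it regenerates the same data as the setup block, and the helper name predict_with_pipeline is hypothetical. It relies on make_pipeline from sklearn.pipeline so that the polynomial expansion is fitted once and then reused for prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data generation as the setup block at the top of Part 1
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


def predict_with_pipeline(degrees=(1, 3, 6, 9)):
    """Hypothetical helper: one row of 100 predictions per polynomial degree."""
    grid = np.linspace(0, 10, 100).reshape(-1, 1)
    rows = []
    for degree in degrees:
        # The pipeline expands the features and fits the regression in one call
        model = make_pipeline(PolynomialFeatures(degree=degree),
                              LinearRegression())
        model.fit(X_train.reshape(-1, 1), y_train)
        rows.append(model.predict(grid))
    return np.vstack(rows)


pipeline_predictions = predict_with_pipeline()
print(pipeline_predictions.shape)
```

Because the pipeline owns the transformer, there is no separate fit_transform call to repeat on the prediction grid.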

Question 2

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the test data, and return both of these arrays in a tuple.

This function should return one tuple of numpy arrays (r2_train, r2_test). Both arrays should have shape (10,)

def answer_two():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score  # not needed below: score() already returns R^2

    r2_test = np.zeros(10)
    r2_train = np.zeros(10)
    for degree in range(10):
        #train polynomial linear regression
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train.reshape(len(X_train), 1))
        linreg = LinearRegression().fit(X_train_poly, y_train)

        #evaluate the polynomial linear regression
        r2_train[degree] = linreg.score(X_train_poly, y_train)

        X_test_poly = poly.transform(X_test.reshape(len(X_test), 1))
        r2_test[degree] = linreg.score(X_test_poly, y_test)

    return (r2_train, r2_test)
answer_two()
(array([0.        , 0.42924578, 0.4510998 , 0.58719954, 0.91941945,
        0.97578641, 0.99018233, 0.99352509, 0.99637545, 0.99803706]),
 array([-0.47808642, -0.45237104, -0.06856984,  0.00533105,  0.73004943,
         0.87708301,  0.9214094 ,  0.92021504,  0.63247942, -0.64525453]))
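
The degree-0 entries above (exactly 0 on the training data, negative on the test data) follow from how the coefficient of determination is defined: 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the evaluated targets. A small sketch on made-up numbers, where r2_manual and the toy arrays are hypothetical and only r2_score is a real scikit-learn function:

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed by hand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


toy_train = np.array([1.0, 2.0, 3.0, 4.0])  # made-up "training" targets
toy_test = np.array([2.0, 4.0, 6.0])        # made-up "test" targets

# A degree-0 model can only predict a constant: the training mean
pred_on_train = np.full_like(toy_train, toy_train.mean())
pred_on_test = np.full_like(toy_test, toy_train.mean())

r2_on_train = r2_manual(toy_train, pred_on_train)
r2_on_test = r2_manual(toy_test, pred_on_test)
print(r2_on_train, r2_on_test)
```

Predicting the training mean scores 0 on the training set by construction, and scores below 0 on any other set whose own mean would have been a better constant guess.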

Question 3

Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset? Note: there may be multiple correct solutions to this question.

(Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship between degree level and $R^2$)

This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization)

def plot_answer_three():
    import matplotlib.pyplot as plt
    r2_train, r2_test = answer_two()
    degrees = np.arange(0, 10)
    plt.figure()
    plt.plot(degrees, r2_train, label='training R^2')
    plt.plot(degrees, r2_test, label='test R^2')
    plt.xlabel('degree')
    plt.legend()


plot_answer_three()
def answer_three():
    Underfitting, Overfitting, Good_Generalization = 0, 9, 7
    return (Underfitting, Overfitting, Good_Generalization)
answer_three()
(0, 9, 7)
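
The same choice can be read off the numbers instead of the plot. The sketch below copies the scores from the answer_two() output shown earlier and applies three simple heuristics. These heuristics are my own assumptions rather than the course's grading rule: lowest training score for underfitting, largest gap between training and test score for overfitting, and highest test score for good generalization.

```python
import numpy as np

# Scores copied from the answer_two() output earlier in this post
r2_train = np.array([0.0, 0.42924578, 0.4510998, 0.58719954, 0.91941945,
                     0.97578641, 0.99018233, 0.99352509, 0.99637545,
                     0.99803706])
r2_test = np.array([-0.47808642, -0.45237104, -0.06856984, 0.00533105,
                    0.73004943, 0.87708301, 0.9214094, 0.92021504,
                    0.63247942, -0.64525453])

underfitting = int(np.argmin(r2_train))           # lowest training score
overfitting = int(np.argmax(r2_train - r2_test))  # largest train/test gap
good_generalization = int(np.argmax(r2_test))     # highest test score

print((underfitting, overfitting, good_generalization))
```

Under these heuristics the best test score falls on degree 6, with degree 7 only about 0.001 behind, which is consistent with the note that several answers are acceptable.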

Question 4

Training models on high degree polynomial features can result in overly complex models that overfit, so we often use regularized versions of the model to constrain model complexity, as we saw with Ridge and Lasso linear regression.
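
For reference, the two objectives being compared can be written as follows. This is a sketch based on the objective functions given in the scikit-learn documentation, where $X$ is the matrix of polynomial features, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the Lasso alpha parameter (intercept omitted for brevity):

```latex
\text{LinearRegression:}\quad
  \min_{w}\; \lVert y - Xw \rVert_2^2

\text{Lasso:}\quad
  \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
```

The extra $\alpha \lVert w \rVert_1$ term penalizes large coefficients and can drive some of them to exactly zero, which is what constrains the complexity of a high-degree polynomial fit.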

For this question, train two models: a non-regularized LinearRegression model (default parameters) and a regularized Lasso Regression model (with parameters alpha=0.01, max_iter=10000) on polynomial features of degree 12. Return the $R^2$ score for both the LinearRegression and Lasso model’s test sets.

This function should return one tuple (LinearRegression_R2_test_score, Lasso_R2_test_score)

def answer_four():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score  # not needed below: score() already returns R^2

    poly = PolynomialFeatures(12)
    X_poly = poly.fit_transform(X_train.reshape(len(X_train), 1))
    # Reuse the fitted transformer on the test data
    X_test_poly = poly.transform(X_test.reshape(len(X_test), 1))

    linreg = LinearRegression().fit(X_poly, y_train)
    LinearRegression_R2_test_score = linreg.score(X_test_poly, y_test)

    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_poly, y_train)
    Lasso_R2_test_score = linlasso.score(X_test_poly, y_test)

    return (LinearRegression_R2_test_score, Lasso_R2_test_score)
answer_four()
(-4.311988870193713, 0.8406625614749974)

Part 2 - Classification

Here’s an application of machine learning that could save your life! For this section of the assignment we will be working with the UCI Mushroom Data Set stored in mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

Attribute Information:

  1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
  2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
  3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
  4. bruises?: bruises=t, no=f
  5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
  6. gill-attachment: attached=a, descending=d, free=f, notched=n
  7. gill-spacing: close=c, crowded=w, distant=d
  8. gill-size: broad=b, narrow=n
  9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
  10. stalk-shape: enlarging=e, tapering=t
  11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
  12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
  13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
  14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  16. veil-type: partial=p, universal=u
  17. veil-color: brown=n, orange=o, white=w, yellow=y
  18. ring-number: none=n, one=o, two=t
  19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
  20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
  21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
  22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We’ll use pd.get_dummies to convert the categorical variables into indicator variables.
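
To see what the encoding step produces and why the setup code slices columns the way it does, here is a small sketch on a made-up two-column frame. The toy values are hypothetical, and it assumes (as the slicing below suggests) that the first column of mushrooms.csv is a label column named class with values e and p.

```python
import pandas as pd

# Made-up rows in the same single-letter encoding as the mushroom data
toy_df = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'odor': ['f', 'n', 'a', 'f'],
})

# Every string column becomes one indicator column per distinct value
toy_dummies = pd.get_dummies(toy_df)
print(list(toy_dummies.columns))

# Columns 0 and 1 are class_e and class_p, so column 1 is the "poisonous"
# label and everything from column 2 onward is a feature
toy_X = toy_dummies.iloc[:, 2:]
toy_y = toy_dummies.iloc[:, 1]
```

Indicator columns are named column_value, which matches feature names such as odor_n that appear later in this post.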

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

mush_df = pd.read_csv('mushrooms.csv')
mush_df2 = pd.get_dummies(mush_df)

X_mush = mush_df2.iloc[:, 2:]
y_mush = mush_df2.iloc[:, 1]

# use the variables X_train2, y_train2 for Question 5
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush,
                                                        y_mush,
                                                        random_state=0)

# For performance reasons in Questions 6 and 7, we will create a smaller version of the
# entire mushroom dataset for use in those questions.  For simplicity we'll just re-use
# the 25% test split created above as the representative subset.
#
# Use the variables X_subset, y_subset for Questions 6 and 7.
X_subset = X_test2
y_subset = y_test2

Question 5

Using X_train2 and y_train2 from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?

As a reminder, the feature names are available in the X_train2.columns property, and the order of the features in X_train2.columns matches the order of the feature importance values in the classifier’s feature_importances_ property.

This function should return a list of length 5 containing the feature names in descending order of importance.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
features = []

clf.feature_importances_.argsort()[::-1][:5]
array([ 27,  53,  55, 100,  25], dtype=int32)
def answer_five():
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
    # Indices of the five largest importances, in descending order
    ind = clf.feature_importances_.argsort()[::-1][:5]

    return X_train2.columns[ind].tolist()
answer_five()
['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
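
An alternative to sorting indices by hand is to label the importances with the column names first. A minimal sketch on made-up data (the toy frame and variable names are hypothetical), where the label is a copy of column a so that a single split on a separates the classes perfectly:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up indicator features; the label is copied from column 'a'
toy_X = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1],
    'b': [0, 1, 0, 1, 1, 0],
    'c': [1, 1, 1, 0, 0, 0],
})
toy_y = toy_X['a']

toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Pair each importance with its feature name, then take the largest ones
importances = pd.Series(toy_clf.feature_importances_, index=toy_X.columns)
top_features = importances.nlargest(2).index.tolist()
print(top_features)
```

On the mushroom data the equivalent call would be importances.nlargest(5) with X_train2.columns as the index.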

Question 6

For this question, we’re going to use the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values. Recall that the validation_curve function, in addition to taking an initialized unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.

Because creating a validation curve requires fitting multiple models, for performance reasons this question will use just a subset of the original mushroom dataset: please use the variables X_subset and y_subset as input to the validation curve function (instead of X_mush and y_mush) to reduce computation time.

The initialized unfitted classifier object we’ll be using is a Support Vector Classifier with radial basis kernel. So your first step is to create an SVC object with default parameters (i.e. kernel='rbf', C=1) and random_state=0. Recall that the kernel width of the RBF kernel is controlled using the gamma parameter.
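
As a reminder of what gamma does, the RBF kernel scores the similarity of two points as exp(-gamma * squared distance). The sketch below evaluates that formula directly with NumPy for the six gamma values used in this question. The two points and the helper name rbf_kernel_value are made up for illustration.

```python
import numpy as np


def rbf_kernel_value(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))


point_a = [0.0, 0.0]
point_b = [2.0, 0.0]  # squared distance to point_a is 4

gammas = np.logspace(-4, 1, 6)
similarities = [rbf_kernel_value(point_a, point_b, g) for g in gammas]

for g, s in zip(gammas, similarities):
    print(g, s)
```

With a very small gamma every pair of points looks almost identical, which tends toward underfitting. With a very large gamma each point is similar only to itself, which tends toward memorizing the training set.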

With this classifier, and the dataset in X_subset, y_subset, explore the effect of gamma on classifier accuracy by using the validation_curve function to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)). Recall that you can specify what scoring metric you want validation_curve to use by setting the “scoring” parameter. In this case, we want to use “accuracy” as the scoring metric.

For each level of gamma, validation_curve will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.
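
The shape of the returned arrays can be checked without the mushroom file. The following sketch runs validation_curve on a small synthetic dataset from make_classification (a stand-in chosen for illustration, not the assignment data) and passes cv=3 explicitly so that three models are fitted per gamma value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Small synthetic stand-in for X_subset / y_subset
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

toy_gammas = np.logspace(-4, 1, 6)
toy_train_scores, toy_test_scores = validation_curve(
    SVC(kernel='rbf', C=1, random_state=0),
    X_toy,
    y_toy,
    param_name='gamma',
    param_range=toy_gammas,
    scoring='accuracy',
    cv=3,  # three folds, hence three columns in each score array
)

print(toy_train_scores.shape, toy_test_scores.shape)
```

Each row corresponds to one gamma value and each column to one cross-validation fold.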

Find the mean score across the three models for each level of gamma for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.

e.g.

if one of your array of scores is

array([[ 0.5,  0.4,  0.6],
       [ 0.7,  0.8,  0.7],
       [ 0.9,  0.8,  0.8],
       [ 0.8,  0.7,  0.8],
       [ 0.7,  0.6,  0.6],
       [ 0.4,  0.6,  0.5]])

it should then become

array([ 0.5,  0.73333333,  0.83333333,  0.76666667,  0.63333333, 0.5])
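
The worked example above is a row-wise mean, which NumPy computes with axis=1. A short sketch using the same numbers (the variable names are arbitrary):

```python
import numpy as np

# The 6x3 example array from the prompt: one row per gamma, one column per fit
example_scores = np.array([[0.5, 0.4, 0.6],
                           [0.7, 0.8, 0.7],
                           [0.9, 0.8, 0.8],
                           [0.8, 0.7, 0.8],
                           [0.7, 0.6, 0.6],
                           [0.4, 0.6, 0.5]])

# axis=1 averages across the three fits within each row
example_means = example_scores.mean(axis=1)
print(example_means)
```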

This function should return one tuple of numpy arrays (training_scores, test_scores) where each array in the tuple has shape (6,).

def answer_six():
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve

    svc = SVC(kernel='rbf', C=1, random_state=0)
    train_scores, test_scores = validation_curve(svc,
                                                 X_subset,
                                                 y_subset,
                                                 param_name='gamma',
                                                 param_range=np.logspace(
                                                     -4, 1, 6),
                                                 scoring='accuracy',
                                                 cv=3)

    train_mscores = train_scores.mean(axis=1)
    test_mscores = test_scores.mean(axis=1)

    return (train_mscores, test_mscores)
answer_six()
(array([0.56646972, 0.93106844, 0.990645  , 1.        , 1.        ,
        1.        ]),
 array([0.56720827, 0.9300837 , 0.98966027, 1.        , 0.99458395,
        0.52240276]))

Question 7

Based on the scores from question 6, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)? Note: there may be multiple correct solutions to this question.

(Hint: Try plotting the scores from question 6 to visualize the relationship between gamma and accuracy.)

This function should return one tuple with the gamma values in this order: (Underfitting, Overfitting, Good_Generalization)

train_scores, test_scores = answer_six()
def plot_answer_seven():
    import matplotlib.pyplot as plt
    gamma = np.logspace(-4, 1, 6)
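    plt.figure()
    # gamma spans five orders of magnitude, so use a logarithmic x-axis
    plt.semilogx(gamma, train_scores, 'b--.', label='train_scores')
    plt.semilogx(gamma, test_scores, 'g-*', label='test_scores')
    plt.xlabel('gamma')
    plt.ylabel('accuracy')
    plt.legend()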


plot_answer_seven()
def answer_seven():

    param_range = np.logspace(-4, 1, 6)
    Underfitting, Overfitting, Good_Generalization = param_range[
        0], param_range[5], param_range[3]
    return (Underfitting, Overfitting, Good_Generalization)
answer_seven()
(0.0001, 10.0, 0.1)