cp3 sTourOfMLClassifiers_stratify_bincount_likelihood_logistic regression_odds ratio_decay_L2_sigmoi

     In this chapter, we will take a tour through a selection of popular and powerful machine learning algorithms that are commonly used in academia as well as in industry. While learning about the differences between several supervised learning algorithms for classification, we will also develop an intuitive appreciation of their individual strengths and weaknesses. Also, we will take our first steps with the scikit-learn library, which offers a user-friendly interface for using those algorithms efficiently and productively.

     The topics that we will learn about throughout this chapter are as follows:

  • Introduction to robust and popular algorithms for classification, such as logistic regression, support vector machines, and decision trees
     
  • Examples and explanations using the scikit-learn machine learning library, which provides a wide variety of machine learning algorithms via a user-friendly Python API
     
  • Discussions about the strengths and weaknesses of classifiers with linear and non-linear decision boundaries

Choosing a classification algorithm 

     Choosing an appropriate classification algorithm for a particular problem task requires practice: each algorithm has its own quirks and is based on certain assumptions. To restate the "No Free Lunch" theorem: no single classifier works best across all possible scenarios. In practice, it is always recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem; problems may differ in the number of features or samples, the amount of noise in a dataset, and whether the classes are linearly separable or not.

     Ultimately, the performance of a classifier, in terms of both computational cost and predictive power, depends heavily on the underlying data that are available for learning. The five main steps that are involved in training a machine learning algorithm can be summarized as follows:

  1. Selection of features.
  2. Choosing a performance metric.
  3. Choosing a classifier (or loss function) and optimization algorithm.
  4. Evaluating the performance of the model.
  5. Tuning the algorithm.

     Since the approach of this book is to build machine learning knowledge step by step, we will mainly focus on the main concepts of the different algorithms in this chapter and revisit topics such as feature selection and preprocessing, performance metrics, and hyperparameter tuning for more detailed discussions later in this book.

First steps with scikit-learn

     In Chapter 2, Training Machine Learning Algorithms for Classification, you learned about two related learning algorithms for classification: the perceptron rule and Adaline (ADAptive LInear NEuron), which we implemented in Python by ourselves.
##############################################
The perceptron rule

Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps:
     1. Initialize the weights to 0 or small random numbers.
     2. For each training sample $x^{(i)}$, perform the following steps:
          1. Compute the output value $\hat{y}$.
          2. Update the weights.
Note: The perceptron updates all weights after each individual training sample, before moving on to the next sample.

Adaline

Note: In Adaline, the weight update is calculated based on all samples in the training set, that is, each weight is updated using the sum of the accumulated errors over all samples $x^{(i)}$.
##############################################
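The difference between the two update schemes summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Chapter 2 implementation: the function names `perceptron_epoch` and `adaline_epoch`, the toy data, and the learning rates are assumptions chosen for this example, and the class labels are taken to be -1 and 1.

```python
import numpy as np

def perceptron_epoch(w, X, y, eta=0.1):
    """One pass of the perceptron rule; labels y are -1 or 1."""
    w = w.copy()
    for xi, target in zip(X, y):
        y_hat = 1 if (w[0] + xi @ w[1:]) >= 0.0 else -1
        update = eta * (target - y_hat)
        w[1:] += update * xi  # weights change after every single sample
        w[0] += update
    return w

def adaline_epoch(w, X, y, eta=0.01):
    """One batch gradient descent step of Adaline; labels y are -1 or 1."""
    w = w.copy()
    errors = y - (w[0] + X @ w[1:])  # identity activation, all samples at once
    w[1:] += eta * X.T @ errors      # a single update from the accumulated errors
    w[0] += eta * errors.sum()
    return w

# Hypothetical toy data: two linearly separable classes in 2D
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w_p = np.zeros(3)
w_a = np.zeros(3)
for _ in range(10):
    w_p = perceptron_epoch(w_p, X, y)
    w_a = adaline_epoch(w_a, X, y)

pred_p = np.where(w_p[0] + X @ w_p[1:] >= 0.0, 1, -1)
pred_a = np.where(w_a[0] + X @ w_a[1:] >= 0.0, 1, -1)
print('Perceptron predictions:', pred_p)
print('Adaline predictions:   ', pred_a)
```

The perceptron loop touches the weights once per sample, whereas the Adaline step computes the errors for the whole training set first and then applies one combined update.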
     Now we will take a look at the scikit-learn API, which combines a user-friendly interface with a highly optimized implementation of several classification algorithms. However, the scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models. We will discuss this in more detail together with the underlying concepts in Chapter 4, Building Good Training Sets – Data Preprocessing, and Chapter 5, Compressing Data via Dimensionality Reduction.

Training a perceptron via scikit-learn

     To get started with the scikit-learn library, we will train a perceptron model similar to the one that we implemented in Chapter 2, Training Simple Machine Learning Algorithms for Classification. For simplicity, we will use the already familiar Iris dataset throughout the following sections. Conveniently, the Iris dataset is already available via scikit-learn, since it is a simple yet popular dataset that is frequently used for testing and experimenting with algorithms. Also, we will only use two features from the Iris flower dataset for visualization purposes.

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
iris.data[:5]

len(iris.data)  # 150samples

X = iris.data[:, [2, 3]]  # petal length and petal width
X[:5]

y = iris.target
set(y)

print('Class labels:', np.unique(y))

# np.unique(y) returns the three unique class labels stored in iris.target. The Iris flower class
# names Iris-setosa, Iris-versicolor, and Iris-virginica are already stored as integers (here: 0, 1, 2).
#
# Using integer labels is a recommended approach to avoid technical glitches and to improve
# computational performance due to a smaller memory footprint.

     To evaluate how well a trained model performs on unseen data, we will further split the dataset into separate training and test datasets. Later in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will discuss the best practices around model evaluation in more detail:

from sklearn.model_selection import train_test_split

# Randomly split the X and y arrays into 30 percent test data (45 samples)
# and 70 percent training data (105 samples).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

     Using the train_test_split function from scikit-learn's model_selection module, we randomly split the X and y arrays into 30 percent test data (45 samples) and 70 percent training data (105 samples).

     Note that the train_test_split function already shuffles the dataset internally before splitting; otherwise, all class 0 and class 1 samples would have ended up in the training set, and the test set would consist of 45 samples from class 2. Via the random_state parameter, we provided a fixed random seed (random_state=1) for the internal pseudo-random number generator that is used for shuffling the dataset prior to splitting. Using such a fixed random_state ensures that our results are reproducible.

     Lastly, we took advantage of the built-in support for stratification via stratify=y. In this context, stratification means that the train_test_split function returns training and test subsets that have the same proportions of class labels as the input dataset. We can use NumPy's bincount function, which counts the number of occurrences of each value in an array, to verify that this is indeed the case:

print('Labels counts in y:', np.bincount(y))

print('Labels counts in y_train:', np.bincount(y_train))

print('Labels counts in y_test:', np.bincount(y_test))
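To see what stratification buys us, the following sketch compares a stratified split with an unstratified one on the same data. It only uses calls that already appear in this section; the variable names with the `_s` and `_r` suffixes are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# With stratify=y, both subsets keep the 1:1:1 class ratio of the full dataset
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Without stratification, the class counts in each subset depend on the shuffle
_, _, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.3, random_state=1)

print('Stratified test counts:  ', np.bincount(y_test_s))
print('Unstratified test counts:', np.bincount(y_test_r))
```

Since the Iris dataset contains 50 samples per class, the stratified test set holds 15 samples of each class, while the unstratified counts are generally uneven.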

     Many machine learning and optimization algorithms also require feature scaling for optimal performance, as we remember from the gradient descent example in Chapter 2, Training Machine Learning Algorithms for Classification. Here, we will standardize the features using the StandardScaler class from scikit-learn's preprocessing module.

     Standardization shifts each feature so that it is centered at zero and rescales it so that it has a standard deviation of 1, which helps gradient descent learning to converge more quickly. This gives each feature the mean and variance of a standard normal distribution, although it does not change the shape of the feature's distribution. Here, μ denotes the mean (the arithmetic average, which marks the center of a normal distribution), and σ denotes the standard deviation (the square root of the variance), which measures how widely the data are spread around the mean.

     In the code below, we load the StandardScaler class from the preprocessing module and initialize a new StandardScaler object that we assign to the sc variable. Using the fit method, StandardScaler estimates the parameters μ (sample mean) and σ (standard deviation) for each feature dimension from the training data. By calling the transform method, we then standardize the training data using those estimated parameters. Note that we use the same scaling parameters to standardize the test set so that the values in the training and test datasets are comparable to each other:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # estimate mu and sigma from the training set only


X_train_std = sc.transform(X_train) #standardization
X_test_std = sc.transform(X_test)#standardization
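As a sanity check on what fit and transform do, the sketch below reproduces the standardization by hand, (x - μ) / σ, and compares it with the StandardScaler output. The small arrays are hypothetical toy data made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training samples and 2 test samples with 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 0.0]])

sc = StandardScaler().fit(X_train)

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler uses

# The test set is scaled with the training-set mu and sigma
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

print(np.allclose(sc.transform(X_train), X_train_manual))
print(np.allclose(sc.transform(X_test), X_test_manual))
```

Note that the standardized test set does not have exactly zero mean and unit variance, because its parameters come from the training data.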

     Having standardized the training data, we can now train a perceptron model. Most algorithms in scikit-learn already support multiclass classification by default via the One-versus-Rest (OvR) method, also known as One-versus-All (OvA), which allows us to feed the three flower classes to the perceptron all at once. The code is as follows:

from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
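One way to see the One-versus-Rest scheme at work is to inspect the fitted attributes: for three classes, the model stores one weight vector and one bias per class. The sketch below is self-contained and, for brevity, standardizes the full dataset instead of using a train/test split; that shortcut is an assumption of this example only.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data[:, [2, 3]])
y = iris.target

ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X, y)

print(ppn.classes_)          # the three class labels
print(ppn.coef_.shape)       # one weight vector per class: (n_classes, n_features)
print(ppn.intercept_.shape)  # one bias unit per class: (n_classes,)

# Each class gets its own score; the predicted label is the class with the largest one
scores = ppn.decision_function(X)
manual_pred = ppn.classes_[scores.argmax(axis=1)]
print((manual_pred == ppn.predict(X)).all())
```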

     The scikit-learn interface reminds us of our perceptron implementation in Chapter 2, Training Simple Machine Learning Algorithms for Classification: after loading the Perceptron class from the linear_model module, we initialized a new Perceptron object and trained the model via the fit method. Here, the model parameter eta0 is equivalent to the learning rate eta that we used in our own perceptron implementation, and the max_iter parameter defines the maximum number of epochs (passes over the training set). As we remember from Chapter 2, finding an appropriate learning rate requires some experimentation. If the learning rate is too large, the algorithm will overshoot the global cost minimum.

     If the learning rate is too small, the algorithm requires more epochs until convergence, which can make the learning slow, especially for large datasets. Also, we used the random_state parameter to ensure the reproducibility of the shuffling of the training dataset after each epoch.

     Having trained a model in scikit-learn, we can make predictions via the predict method. The code is as follows:

y_pred = ppn.predict(X_test_std)
print( 'Misclassified samples: %d' % (y_test != y_pred).sum() )

     Executing the code, we see that the perceptron misclassifies one out of the 45 flower samples (the exact count can vary with the scikit-learn version). Thus, the misclassification error on the test dataset is approximately 0.022 or 2.2 percent (= 1/45).

Note

     Instead of the misclassification error, many machine learning practitioners report the classification accuracy of a model, which is simply calculated as follows:

1 - error = 0.98 or 98 percent

The scikit-learn library also implements a large variety of different performance metrics that are available via the metrics module. For example, we can calculate the classification accuracy of the perceptron on the test set as follows:

from sklearn.metrics import accuracy_score

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred) )#1-error(0.02) = 0.98 or 98 percent

     Here, y_test are the true class labels and y_pred are the class labels that we predicted previously. Alternatively, each classifier in scikit-learn has a score method, which computes a classifier's prediction accuracy by combining the predict call with accuracy_score as shown here:

print('Accuracy: %.2f' % ppn.score(X_test_std, y_test))
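The relationship between the misclassification error and the accuracy can be checked on a tiny example. The label arrays below are hypothetical and unrelated to the Iris results; they only illustrate that accuracy_score is the same quantity as 1 - error.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for 10 samples
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_hat = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 2])

n_wrong = (y_true != y_hat).sum()  # number of misclassified samples
error = n_wrong / len(y_true)      # misclassification error
accuracy = 1.0 - error

print('Misclassified samples: %d' % n_wrong)
print('Error: %.2f' % error)
print('Accuracy: %.2f' % accuracy)
print('accuracy_score: %.2f' % accuracy_score(y_true, y_hat))
```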

Note: Overfitting means that the model captures the patterns in the training data well, but fails to generalize well to unseen data.

     Finally, we can use our plot_decision_regions function from Chapter 2, Training Simple Machine Learning Algorithms for Classification, to plot the decision regions of our newly trained perceptron model and visualize how well it separates the different flower samples. However, let's add a small modification to highlight the samples from the test dataset via small circles:

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap( colors[:len(np.unique(y))] )

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 #feature 0
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 #feature 1
    # np.arange(x1_min, x1_max, resolution) : feature0 array({min-1, ..., max+1})
    # np.arange(x2_min, x2_max, resolution) : feature1 array({min-1, ..., max+1})
    
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    #xx1, xx2 are both two dimension array with same shape
    #ravel(): Return a contiguous flattened array(one dimension)
            #feature0 array({min-1, ..., max+1,..,min-1, ..., max+1})
            #feature1 array({min-1, ..., max+1,..,min-1, ..., max+1})
    #np.array([xx1.ravel(), xx2.ravel()]): two dimension array(features, samples)
    Z = classifier.predict( np.array([xx1.ravel(), xx2.ravel()]).T )#(samples, features)
    Z = Z.reshape(xx1.shape)
    
                #axis,axis,height
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)): #cl: class label in dataset X
        plt.scatter(x=X[y == cl, 0], #selection
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx], 
                    label=cl, 
                    edgecolor='black')

    # highlight test examples
    if test_idx is not None:
        # select the test examples by index
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',  # hollow circles; recent Matplotlib versions reject c=''
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100, 
                    label='test set')

     With the slight modification that we made to the plot_decision_regions function, we can now specify the indices of the samples that we want to mark on the resulting plots. The code is as follows:

Training a perceptron model using the standardized training data:

X_combined_std = np.vstack((X_train_std, X_test_std)) #since 2D
y_combined = np.hstack((y_train, y_test)) #since 1D

plot_decision_regions(X=X_combined_std, y=y_combined,
                      classifier=ppn, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()

As we can see in the resulting plot, the three flower classes cannot be perfectly separated by a linear decision boundary.


     Remember from our discussion in Chapter 2, Training Simple Machine Learning Algorithms for Classification, that the perceptron algorithm never converges on datasets that aren't perfectly linearly separable, which is why the use of the perceptron algorithm is typically not recommended in practice. In the following sections, we will look at more powerful linear classifiers that converge to a cost minimum even if the classes are not perfectly linearly separable.

     The Perceptron class, like other scikit-learn functions and classes, often has additional parameters that we omit for clarity. You can read more about those parameters using the help function in Python (for instance, help(Perceptron)) or by going through the excellent scikit-learn online documentation at http://scikit-learn.org/stable/.

Modeling class probabilities via logistic regression

     Although the perceptron rule offers a nice and easygoing introduction to machine learning algorithms for classification, its biggest disadvantage is that it never converges if the classes are not perfectly linearly separable. The classification task in the previous section would be an example of such a scenario. Intuitively, the reason is that the weights are continuously being updated, since there is always at least one misclassified sample present in each epoch. Of course, you can change the learning rate and increase the number of epochs, but be warned that the perceptron will never converge on this dataset. To make better use of our time, we will now take a look at another simple yet more powerful algorithm for linear and binary classification problems: logistic regression. Note that, in spite of its name, logistic regression is a model for classification, not regression.

Logistic regression intuition and conditional probabilities

     Logistic regression is a classification model that is very easy to implement but performs very well on linearly separable classes. It is one of the most widely used algorithms for classification in industry. Similar to the perceptron and Adaline, the
logistic regression model in this chapter is also a linear model for binary classification that can be extended to multiclass classification via the OvR technique.

     To explain the idea behind logistic regression as a probabilistic model, let's first introduce the odds ratio, which is the odds in favor of a particular event. The odds ratio can be written as $\frac{p}{1-p}$, where p stands for the probability of the positive event. The term positive event does not necessarily mean good, but refers to the event that we want to predict, for example, the probability that a patient has a certain disease; we can think of the positive event as class label y = 1. We can then further define the logit function, which is simply the logarithm of the odds ratio (log-odds):

$$logit(p) = \log\frac{p}{1-p}$$

     Note that log refers to the natural logarithm, as it is the common convention in computer science. The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real number range, which we can use to express a linear relationship between feature values and the log-odds:

$$logit(p(y=1|x)) = w_0x_0 + w_1x_1 + \dots + w_mx_m = \sum_{i=0}^{m} w_ix_i = w^Tx$$

Here, $p(y=1|x)$ is the conditional probability that a particular sample belongs to class 1 given its features x.
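To make the odds and the log-odds concrete, here is a small numeric sketch. The helper names `odds` and `logit` are chosen for this example and are not part of any library used in this chapter.

```python
import numpy as np

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds (log-odds)."""
    return np.log(odds(p))

for p in (0.1, 0.5, 0.8):
    print(f'p={p:.1f}  odds={odds(p):.3f}  logit={logit(p):+.3f}')
```

A probability of 0.5 corresponds to even odds and a logit of 0; probabilities below 0.5 map to negative log-odds, and probabilities above 0.5 map to positive log-odds.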

     Now what we are actually interested in is predicting the probability that a certain sample belongs to a particular class, which is the inverse form of the logit function (the inverse of the logarithm is the exponential function). It is also called the logistic sigmoid function, sometimes simply abbreviated to sigmoid function due to its characteristic S-shape:

$$logit(p) = z \;\Rightarrow\; \frac{p}{1-p} = e^{z} \;\Rightarrow\; p = \phi(z) = \frac{1}{1+e^{-z}}$$

Here z is the net input, the linear combination of weights and sample features, $z = w^Tx = w_0x_0 + w_1x_1 + \dots + w_mx_m$.
     Note that $w_0$ refers to the bias unit, the weight of an additional input value $x_0$ that we provide, which is set equal to 1.
Now let us simply plot the sigmoid function for some values in the range -7 to 7 to see how it looks:

φ (phi, pronounced [faɪ]) is the 21st letter of the Greek alphabet.

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(z):
    return 1.0 / ( 1.0+np.exp(-z) )

# z is the net input; phi(z) is the conditional probability that a
# particular sample belongs to class 1 given its features x
z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)  # the logistic function is the inverse of the logit function

plt.plot(z, phi_z)
plt.axvline(0.0, color='k')  # vertical line at z = 0
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi(z)$')
#y axis ticks and gridline
plt.yticks([0.0, 0.5, 1.0])
ax=plt.gca()
ax.yaxis.grid(True)
plt.show()

As a result of executing the previous code example, we should now see the S-shaped (sigmoidal) curve: 


     We can see that $\phi(z)$ approaches 1 if z goes towards infinity ($z \to \infty$), since $e^{-z}$ becomes very small for large values of z. Similarly, $\phi(z)$ goes towards 0 for $z \to -\infty$ as the result of an increasingly large denominator. Thus, we conclude that this sigmoid function takes real number values as input and transforms them to values in the range (0, 1) with an intercept at $\phi(z) = 0.5$ (reached at z = 0).
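The claims about the sigmoid function can be verified numerically. The sketch below redefines `sigmoid` as in the plotting code and adds a `logit` helper (a name chosen for this example) to check that the two functions are inverses of each other.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Mapping probabilities to log-odds and back recovers the probabilities
p = np.array([0.01, 0.2, 0.5, 0.8, 0.99])
z = logit(p)
print(np.allclose(sigmoid(z), p))

# Intercept at z = 0 and saturation for large |z|
print(sigmoid(0.0))
print(sigmoid(np.array([-30.0, 30.0])))
```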

     To build some intuition for the logistic regression model, we can relate it to our previous Adaline implementation in Chapter 2, Training Simple Machine Learning Algorithms for Classification. In Adaline, we used the identity function $\phi(z) = z$ as the activation function. In logistic regression, this activation function simply becomes the sigmoid function that we defined earlier, which is illustrated in the following figure:

Note: the weight update is calculated based on all samples in the training set 

     The output of the sigmoid function is then interpreted as the probability of a particular sample belonging to class 1, $\phi(z) = P(y=1|x;w)$, given its features x parameterized by the weights w. For example, if we compute $\phi(z) = 0.8$ for a particular flower sample, it means that the chance that this sample is an Iris-versicolor flower is 80 percent. Therefore, the probability that this flower is an Iris-setosa flower can be calculated as $P(y=0|x;w) = 1 - P(y=1|x;w) = 0.2$ or 20 percent.

The predicted probability $\phi(z)$ can then simply be converted into a binary outcome via a threshold function:

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

If we look at the preceding sigmoid plot, this is equivalent to the following:

$$\hat{y} = \begin{cases} 1 & \text{if } z \ge 0.0 \\ 0 & \text{otherwise} \end{cases}$$
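The equivalence of the two threshold rules can be checked directly: thresholding the probability at 0.5 gives the same labels as thresholding the net input at 0. The net input values below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.2, 4.0])  # hypothetical net inputs

labels_from_prob = np.where(sigmoid(z) >= 0.5, 1, 0)  # threshold on phi(z)
labels_from_z = np.where(z >= 0.0, 1, 0)              # threshold on z

print(labels_from_prob)
print(labels_from_z)
```

In practice, this means a class label can be predicted from the net input alone, without evaluating the exponential.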

     In fact, there are many applications where we are not only interested in the predicted class labels, but where estimating the class-membership probability is particularly useful. Logistic regression is used in weather forecasting, for example, to not
only predict if it will rain on a particular day but also to report the chance of rain. Similarly, logistic regression can be used to predict the chance that a patient has a particular disease given certain symptoms, which is why logistic regression enjoys
wide popularity in the field of medicine.

Learning the weights of the logistic cost function

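     You learned how we could use the logistic regression model to predict probabilities and class labels; now, let us briefly talk about how we fit the parameters of the model, for instance the weights w. In the previous chapter, we defined the sum-squared-error cost function as follows:

$$J(w) = \sum_{i}\frac{1}{2}\left(\phi(z^{(i)}) - y^{(i)}\right)^2$$

Note: With class labels y ∈ {0, 1}, the same goal can be expressed in terms of probabilities. Because the sigmoid function takes values in (0, 1), excluding 0 and 1, its output $\phi(z)$ can be regarded as the posterior probability $P(y=1|x)$ of class 1, which expresses how likely it is that the point x belongs to class 1; consequently, $P(y=0|x) = 1 - \phi(z)$.
if y = 1, we want $\phi(z)$ to be close to 1; then $(1-\phi(z))^{1-y} = 1$ ==> maximize $\phi(z)^{y}$
if y = 0, we want $\phi(z)$ to be close to 0; then $\phi(z)^{y} = 1$ ==> maximize $(1-\phi(z))^{1-y}$
==> for one instance x, maximize $P(y|x;w) = \phi(z)^{y}\,(1-\phi(z))^{1-y}$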

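     We minimized the sum-squared-error function in order to learn the weights w for our Adaline classification model. To explain how we can derive the cost function for logistic regression, let's first define the likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another. The formula is as follows:

$$L(w) = P(y|x;w) = \prod_{i=1}^{n} P\left(y^{(i)}|x^{(i)};w\right) = \prod_{i=1}^{n}\left(\phi(z^{(i)})\right)^{y^{(i)}}\left(1-\phi(z^{(i)})\right)^{1-y^{(i)}}$$

This is the product of the per-sample probabilities over all instances (from i = 1 to n).

     In practice, it is easier to maximize the (natural) log of this equation, which is called the log-likelihood function:

$$l(w) = \log L(w) = \sum_{i=1}^{n}\left[y^{(i)}\log\left(\phi(z^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right)\right]$$

The logarithm converts the product into a sum.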

     Firstly, applying the log function reduces the potential for numerical underflow, which can occur if the likelihoods are very small. Secondly, we can convert the product of factors into a summation of factors, which makes it easier to obtain the derivative of this function via the addition trick, as you may remember from calculus.
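The underflow problem is easy to reproduce. In the sketch below, 2,000 hypothetical samples are each assigned a probability of 0.5; the product of these values is far smaller than the smallest positive double-precision number, whereas the sum of the logarithms is perfectly representable.

```python
import numpy as np

# Hypothetical per-sample probabilities: 2,000 samples, each with probability 0.5
probs = np.full(2000, 0.5)

likelihood = np.prod(probs)             # underflows to 0.0 in double precision
log_likelihood = np.sum(np.log(probs))  # stays finite

print('Likelihood:    ', likelihood)
print('Log-likelihood:', log_likelihood)
```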

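     Now we could use an optimization algorithm such as gradient ascent to maximize this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost function J that can be minimized using gradient descent as in Chapter 2, Training Machine Learning Algorithms for Classification:

$$J(w) = \sum_{i=1}^{n}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right)\right]$$

     To get a better grasp on this cost function, let's take a look at the cost that we calculate for one single-sample instance:

$$J(\phi(z), y; w) = -y\log\left(\phi(z)\right) - (1-y)\log\left(1-\phi(z)\right)$$

     Looking at the preceding equation, we can see that the first term becomes zero if y = 0, and the second term becomes zero if y = 1, respectively:

$$J(\phi(z), y; w) = \begin{cases} -\log\left(\phi(z)\right) & \text{if } y = 1 \\ -\log\left(1-\phi(z)\right) & \text{if } y = 0 \end{cases}$$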

     Let's write a short code snippet to create a plot that illustrates the cost of classifying a single-sample instance for different values of $\phi(z)$:

def sigmoid(z):
    return 1.0 / ( 1.0+np.exp(-z) )

def cost_1(z):
    return - np.log(sigmoid(z))     # if y=1

def cost_0(z):
    return - np.log(1 - sigmoid(z)) # if y=0

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z) #max=0.9999498278353162 or 1; min= 4.5397868702434395e-05 or 0

c1 = [cost_1(x) for x in z] # if y=1
plt.plot(phi_z, c1, label='J(w) if y=1')

c0 = [cost_0(x) for x in z] # if y=0
plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')

plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel(r'$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

     The resulting plot shows the sigmoid activation $\phi(z)$ on the x axis, in the range 0 to 1 (the inputs to the sigmoid function were z values in the range -10 to 10), and the associated logistic cost on the y axis:

   
     We can see that the cost approaches 0 (solid line) if we correctly predict that a sample belongs to class 1. Similarly, we can see on the y axis that the cost also approaches 0 if we correctly predict y = 0 (dashed line). However, if the prediction is wrong, the cost goes towards infinity. The moral is that we penalize wrong predictions with an increasingly larger cost.
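A few concrete numbers make the shape of the curves easier to read. The helper `logistic_cost` below is a name chosen for this sketch; it evaluates the single-sample cost for a given activation value and true label.

```python
import numpy as np

def logistic_cost(phi_z, y):
    """Logistic cost of one sample with activation phi_z and true label y (0 or 1)."""
    return -y * np.log(phi_z) - (1 - y) * np.log(1 - phi_z)

for phi_z in (0.9, 0.5, 0.1, 0.01):
    print(f'phi(z)={phi_z:<5}  cost if y=1: {logistic_cost(phi_z, 1):.3f}'
          f'  cost if y=0: {logistic_cost(phi_z, 0):.3f}')
```

For a sample with y = 1, a confident correct prediction such as φ(z) = 0.9 costs little, while a confident wrong prediction such as φ(z) = 0.01 is penalized heavily.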

Converting an Adaline implementation into an algorithm for logistic regression

     If we were to implement logistic regression ourselves, we could simply substitute the cost function J in our Adaline implementation from Chapter 2, Training Simple Machine Learning Algorithms for Classification, with the new cost function:

$$J(w) = -\sum_{i=1}^{n}\left[y^{(i)}\log\left(\phi(z^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right)\right]$$

     We use this to compute the cost of classifying all training samples per epoch. Also, we need to swap the linear activation function with the sigmoid activation and change the threshold function to return class labels 0 and 1 instead of -1 and 1. If we make those three changes to the Adaline code, we would end up with a working logistic regression implementation, as shown here:

class LogisticRegressionGD(object):
    """Logistic Regression Classifier using gradient descent.
    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
    random_state : int
        Random number generator seed for random weight
        initialization.
        
    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Logistic cost function value in each epoch.
    """
    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state
    
    def net_input(self, X):
        #calculate net input
        return np.dot(X, self.w_[1:]) + self.w_[0] # z = w^T*X = w_0 + w_1*x_1 +... + w_m*x_m
    
    def activation(self, z):
        # compute logistic sigmoid activation
        # numpy.clip(a, a_min, a_max, out=None)
        return 1. / ( 1. + np.exp( -np.clip(z, -250, 250) ) )
        
    def fit(self, X, y):
        """ Fit training data.
        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of
            samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.
        Returns
        -------
        self : object
        """
        
        rgen = np.random.RandomState(self.random_state)
                               # mean,    std. dev.,   bias + number of features
        self.w_ = rgen.normal( loc=0.0, scale=0.01, size=1+X.shape[1])
        self.cost_ =[]
        
        for i in range(self.n_iter):
            net_input = self.net_input(X) # z = w^T*X = w_0 + w_1*x_1 +... + w_m*x_m
            output = self.activation(net_input) #logistic sigmoid function
            errors = (y-output)
            #Note: the weight update is calculated based on all samples in the training set 
            # X.T : (n_features, n_samples) #single column matrix
            self.w_[1:] += self.eta* X.T.dot(errors)
            self.w_[0] += self.eta*errors.sum() # self.eta* 1*errors.sum()
            
            # note that we compute the logistic 'cost' now
            # instead of the sum of squared errors cost
            cost = (-y.dot(np.log(output))
                    - ((1 - y).dot(np.log(1 - output)))
                   )  # note: dot already includes the summation
            self.cost_.append(cost)
            
        return self
    
    def predict(self, X):  
        #OR# np.where( self.activation( self.net_input(X) )>=0.5, 1,0 )
        return np.where( self.net_input(X)>=0.0, 1, 0)#return class label after unit step

     When we fit this logistic regression model, we have to keep in mind that our implementation only works for binary classification tasks. So, let us consider only Iris-setosa and Iris-versicolor flowers (classes 0 and 1) and check that our implementation of logistic regression works:

# use the standardized features so that the axis labels below are accurate
X_train_01_subset = X_train_std[(y_train == 0) | (y_train == 1)]
y_train_01_subset = y_train[(y_train == 0) | (y_train == 1)]
lrgd = LogisticRegressionGD(eta=0.05, n_iter=1000, random_state=1)
lrgd.fit(X_train_01_subset, y_train_01_subset)

plot_decision_regions(X=X_train_01_subset, y=y_train_01_subset, classifier=lrgd)
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()

The resulting decision region plot looks as follows:

Note
The gradient descent learning algorithm for logistic regression

     Using calculus, we can show that the weight update in logistic regression via gradient descent is equal to the equation that we used in Adaline in Chapter 2, Training Simple Machine Learning Algorithms for Classification. However, please note that the following derivation of the gradient descent learning rule is intended for readers who are interested in the mathematical concepts behind the gradient descent learning rule for logistic regression. It is not essential for following the rest of this chapter.

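Let's start by calculating the partial derivative of the log-likelihood function with respect to the jth weight (shown for a single sample; the same expression holds for each sample index i):

$$\frac{\partial}{\partial w_j} l(w) = \left(y\frac{1}{\phi(z)} - (1-y)\frac{1}{1-\phi(z)}\right)\frac{\partial}{\partial w_j}\phi(z)$$

Before we continue, let's also calculate the partial derivative of the sigmoid function:

$$\frac{\partial}{\partial z}\phi(z) = \frac{\partial}{\partial z}\frac{1}{1+e^{-z}} = \frac{1}{\left(1+e^{-z}\right)^2}e^{-z} = \frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right) = \phi(z)\left(1-\phi(z)\right)$$

     Now, we can re-substitute $\frac{\partial}{\partial z}\phi(z) = \phi(z)(1-\phi(z))$ and $\frac{\partial}{\partial w_j}z = x_j$ in our first equation to obtain the following:

$$\left(y\frac{1}{\phi(z)} - (1-y)\frac{1}{1-\phi(z)}\right)\frac{\partial}{\partial w_j}\phi(z) = \left(y\frac{1}{\phi(z)} - (1-y)\frac{1}{1-\phi(z)}\right)\phi(z)\left(1-\phi(z)\right)\frac{\partial}{\partial w_j}z = \left(y\left(1-\phi(z)\right) - (1-y)\phi(z)\right)x_j = \left(y-\phi(z)\right)x_j$$

     Remember that the goal is to find the weights that maximize the log-likelihood, so that we perform the update for each weight as follows:

$$w_j := w_j + \eta\sum_{i=1}^{n}\left(y^{(i)} - \phi(z^{(i)})\right)x_j^{(i)}$$

     Since we update all weights simultaneously, we can write the general update rule as follows:

$$w := w + \Delta w$$

We define $\Delta w$ as follows:

$$\Delta w = \eta\nabla l(w)$$

     Since maximizing the log-likelihood is equal to minimizing the cost function J that we defined earlier, we can write the gradient descent update rule as follows:

$$\Delta w_j = -\eta\frac{\partial J}{\partial w_j} = \eta\sum_{i=1}^{n}\left(y^{(i)} - \phi(z^{(i)})\right)x_j^{(i)}$$

$$w := w + \Delta w, \quad \Delta w = -\eta\nabla J(w)$$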

This is equal to the gradient descent rule for Adaline in Chapter 2, Training Simple Machine Learning Algorithms for Classification.
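The result of the derivation, that the gradient of the log-likelihood with respect to the weights is the sum of (y - phi(z)) x over the training examples, can be checked numerically. The following is a minimal sketch on random toy data, not part of the original text; the helper names `log_likelihood` and `analytic_gradient` are hypothetical, and the bias unit is assumed to be absorbed into the weight vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # l(w) = sum_i [ y_i * log(phi(z_i)) + (1 - y_i) * log(1 - phi(z_i)) ]
    phi = sigmoid(X @ w)
    return np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))

def analytic_gradient(w, X, y):
    # gradient from the derivation: sum_i (y_i - phi(z_i)) * x_i
    return X.T @ (y - sigmoid(X @ w))

# random toy data (an assumption made purely for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# central finite differences, one weight at a time
eps = 1e-6
numeric_gradient = np.array([
    (log_likelihood(w + eps * e, X, y) - log_likelihood(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

max_abs_diff = np.max(np.abs(numeric_gradient - analytic_gradient(w, X, y)))
print(max_abs_diff)  # should be tiny if the derivation is right
```

If the two gradients agree up to finite-difference error, the closed-form expression can be used in the update rule with some confidence.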

Training a logistic regression model with scikit-learn

     We just went through useful coding and math exercises in the previous subsection, which helped illustrate the conceptual differences between Adaline and logistic regression. Now, let's learn how to use scikit-learn's more optimized implementation of logistic regression that also supports multiclass settings off the shelf (OvR by default in older scikit-learn releases; since version 0.22 the default is a multinomial formulation unless the liblinear solver is chosen). In the following code example, we will use the sklearn.linear_model.LogisticRegression class as well as the familiar fit method to train the model on all three classes in the standardized flower training dataset:

from sklearn.linear_model import LogisticRegression

# C: inverse of regularization strength; must be a positive float.
# As in support vector machines, smaller values specify stronger
# regularization.
lr = LogisticRegression(C=100.0, random_state=1)
lr.fit(X_train_std, y_train)

# X_combined_std = np.vstack((X_train_std, X_test_std))
# y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined_std, y_combined, classifier=lr, test_idx = range(105,150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()

After fitting the model on the training data, we plotted the decision regions, training samples and test samples, as shown here: 

     Looking at the preceding code that we used to train the LogisticRegression model, you might now be wondering, "What is this mysterious parameter C?" We will discuss this parameter in the next subsection, where we first introduce the concepts of overfitting and regularization. However, before we move on to those topics, let's finish our discussion of class-membership probabilities.

     The probability that training examples belong to a certain class can be computed using the predict_proba method. For example, we can predict the probabilities of the first three samples in the test set as follows:

lr.predict_proba(X_test_std[:3,:])

This code snippet returns the following array: 


     The first row corresponds to the class-membership probabilities of the first flower, the second row corresponds to the class-membership probabilities of the second flower, and so forth. Notice that the values in each row sum to one, as expected (you can confirm this by executing lr.predict_proba(X_test_std[:3, :]).sum(axis=1)). The highest value in the first row is approximately 0.996, which means that the first sample belongs to class three (Iris-virginica) with a predicted probability of 99.6 percent. So, as you may have already noticed, we can get the predicted class labels by identifying the column with the largest value in each row, for example, using NumPy's argmax function:

lr.predict_proba(X_test_std[:3,:]).argmax(axis=1)

The returned class indices are shown here (they correspond to Iris-virginica, Iris-setosa, and Iris-setosa): 

     Obtaining the class labels from the preceding conditional probabilities is, of course, just a manual approach to calling the predict method directly, which we can quickly verify as follows:

lr.predict(X_test_std[:3,:])
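
The equivalence between taking the argmax over predict_proba and calling predict can also be checked end to end. The following is a self-contained sketch, not the author's exact setup: it assumes the Iris data is loaded with load_iris, that petal length and petal width are the two features, and that a stratified 70/30 split with random_state=1 is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# assumed setup: petal length and petal width, stratified 70/30 split
iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# standardize with statistics estimated on the training set only
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

lr = LogisticRegression(C=100.0, random_state=1).fit(X_train_std, y_train)

proba = lr.predict_proba(X_test_std[:3, :])
row_sums = proba.sum(axis=1)               # one sum per sample (row)
labels_from_proba = proba.argmax(axis=1)   # index of the most probable class
labels_from_predict = lr.predict(X_test_std[:3, :])

print(row_sums)
print(labels_from_proba, labels_from_predict)
```

Because the class labels here are the integers 0, 1, and 2, the argmax column index coincides with the label; with other label encodings, lr.classes_ would be needed to map indices back to labels.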

 
     Lastly, a word of caution if you want to predict the class label of a single flower sample: scikit-learn expects a two-dimensional array as data input; thus, we have to convert a single row slice into such a format first. One way to convert a single row entry into a two-dimensional data array is to use NumPy's reshape method to add a new dimension, as demonstrated here:

lr.predict( X_test_std[0,:].reshape(1,-1) )

 

X_test_std[0,:].shape , X_test_std[0,:].reshape(1,-1).shape


Tackling overfitting via regularization

     Overfitting is a common problem in machine learning, where a model performs well on training data but does not generalize well to unseen data (test data). If a model suffers from overfitting, we also say that the model has a high variance, which can be caused by having too many parameters that lead to a model that is too complex given the underlying data. Similarly, our model can also suffer from underfitting (high bias), which means that our model is not complex enough to capture the pattern in the training data well and therefore also suffers from low performance on unseen data.

     Although we have only encountered linear models for classification so far, the problem of overfitting and underfitting can be best illustrated by comparing a linear decision boundary to more complex, nonlinear decision boundaries as shown in the following figure:

Note

     Variance measures the consistency (or variability) of the model prediction for a particular sample instance if we were to retrain the model multiple times, for example, on different subsets of the training dataset. We can say that the model is sensitive to the randomness in the training data. In contrast, bias measures how far off the predictions are from the correct values in general if we rebuild the model multiple times on different training datasets; bias is the measure of the systematic error that is not due to randomness.
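
The definitions of bias and variance in the note can be made concrete with a small simulation. The sketch below is an illustrative assumption, not taken from the original text: it uses a one-dimensional regression toy problem (noisy samples of a sine curve) instead of a classifier, retrains a simple model (degree 1) and a flexible model (degree 9) on many freshly drawn training sets, and compares their predictions at one fixed point.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 30)   # fixed inputs, fresh noise per training set
x0 = 0.25                             # the "particular sample instance"

def true_f(x):
    return np.sin(2 * np.pi * x)

def predictions_at_x0(degree, n_repeats=200, noise=0.3):
    """Retrain a polynomial model on n_repeats noisy training sets
    and collect its prediction at x0 each time."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y_train = true_f(x_train) + rng.normal(0.0, noise, size=x_train.shape)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[r] = model(x0)
    return preds

simple = predictions_at_x0(degree=1)     # too rigid: expect high bias
flexible = predictions_at_x0(degree=9)   # very flexible: expect high variance

# bias: systematic offset of the average prediction from the true value
bias_simple = simple.mean() - true_f(x0)
bias_flexible = flexible.mean() - true_f(x0)
# variance: spread of the predictions across retrained models
var_simple = simple.var()
var_flexible = flexible.var()

print(f"degree 1: bias={bias_simple:.3f}, variance={var_simple:.4f}")
print(f"degree 9: bias={bias_flexible:.3f}, variance={var_flexible:.4f}")
```

In this setup the straight line is consistently wrong in the same direction (bias), whereas the degree-9 polynomial is right on average but changes more from one training set to the next (variance).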

     One way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization. Regularization is a very useful method to handle collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter (weight) values. The most common form of regularization is so-called L2 regularization (sometimes also called L2 shrinkage or weight decay), which can be written as follows:

$$\frac{\lambda}{2}\lVert\mathbf{w}\rVert^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2$$

Here, $\lambda$ is the so-called regularization parameter.
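
The name "weight decay" becomes clear when looking at the gradient of the L2 term: the derivative of (lambda / 2) * ||w||^2 is lambda * w, so a gradient descent step on the penalty alone multiplies every weight by (1 - eta * lambda). A minimal sketch with made-up numbers (the values of `w`, `lam`, and `eta` are assumptions for illustration; `w` holds only the feature weights, not the bias unit):

```python
import numpy as np

def l2_penalty(w, lam):
    # (lambda / 2) * sum_j w_j^2
    return 0.5 * lam * np.dot(w, w)

w = np.array([3.0, -4.0])
lam, eta = 0.1, 0.5

penalty = l2_penalty(w, lam)         # 0.5 * 0.1 * (9 + 16) = 1.25

# one gradient descent step on the penalty term only:
# w := w - eta * (lam * w) = (1 - eta * lam) * w
w_after = w - eta * lam * w

print(penalty)   # 1.25
print(w_after)   # [ 2.85 -3.8 ]
```

Each step shrinks the weights toward zero by a constant factor, which is exactly the "decay".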

Note

     Regularization is another reason why feature scaling such as standardization is important. For regularization to work properly, we need to ensure that all our features are on comparable scales.
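
A small arithmetic example shows why the scale of a feature matters for the penalty. The numbers below are hypothetical: the same petal lengths are expressed once in centimeters and once in millimeters. To produce identical net inputs, the weight for the millimeter feature must be ten times smaller, so the L2 penalty charges it a hundred times less.

```python
import numpy as np

x_cm = np.array([1.4, 4.7, 6.0])   # hypothetical petal lengths in cm
w_cm = 2.0                         # hypothetical weight for the cm feature

x_mm = x_cm * 10.0                 # same measurements in mm
w_mm = w_cm / 10.0                 # weight that yields the same net input

# both parametrizations describe the same model ...
same_net_input = np.allclose(w_cm * x_cm, w_mm * x_mm)

# ... but the L2 penalty (lambda / 2) * w^2 treats them very differently
lam = 1.0
penalty_cm = 0.5 * lam * w_cm ** 2
penalty_mm = 0.5 * lam * w_mm ** 2

print(same_net_input)              # True
print(penalty_cm, penalty_mm)
```

Without standardization, the regularizer would therefore penalize features unevenly, depending only on their units.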

     The cost function for logistic regression can be regularized by adding a simple regularization term, which will shrink the weights during model training:

$$J(\mathbf{w}) = \sum_{i=1}^{n}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right)\right] + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$$

     Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength. (The training goal is still to decrease the cost, that is, the difference between the predictions and the target values. A larger $\lambda$ makes large weights more expensive, so the optimizer accepts a somewhat larger training error in exchange for smaller weights, which lowers the variance and helps to avoid overfitting.)

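     The parameter C that is implemented for the LogisticRegression class in scikit-learn comes from a convention in support vector machines, which will be the topic of the next section. The term C is directly related to the regularization parameter $\lambda$, which is its inverse:

$$C = \frac{1}{\lambda}$$

So we can rewrite the regularized cost function of logistic regression as follows:

$$J(\mathbf{w}) = C\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\phi(z^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\phi(z^{(i)})\right)\right] + \frac{1}{2}\lVert\mathbf{w}\rVert^2$$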
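
The two ways of writing the regularized cost, one with lambda on the penalty and one with C on the data term, differ only by a constant factor when C = 1 / lambda, so they have the same minimizer. The following sketch checks this on a tiny made-up dataset; the function names and all numbers are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def data_loss(w, b, X, y):
    # unregularized logistic cost, summed over all samples
    phi = sigmoid(X @ w + b)
    return np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))

def cost_lambda(w, b, X, y, lam):
    # data term + (lambda / 2) * ||w||^2
    return data_loss(w, b, X, y) + 0.5 * lam * np.dot(w, w)

def cost_C(w, b, X, y, C):
    # C * data term + (1 / 2) * ||w||^2
    return C * data_loss(w, b, X, y) + 0.5 * np.dot(w, w)

# tiny made-up dataset and parameter values
X = np.array([[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w, b = np.array([0.8, -0.4]), 0.1

lam = 0.01
C = 1.0 / lam

j_lambda = cost_lambda(w, b, X, y, lam)
j_C = cost_C(w, b, X, y, C)

# the C form is the lambda form multiplied by the constant C
print(np.isclose(j_C, C * j_lambda))  # True
```

A small C therefore plays the same role as a large lambda: the penalty dominates the data term and the weights are pushed toward zero.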

     Consequently, decreasing the value of the inverse regularization parameter C means that we are increasing the regularization strength (the penalty on large weights gains influence relative to the data term), which we can visualize by plotting the L2-regularization path for the two weight coefficients:

weights, params = [], []
for c in np.arange(-5, 5):
    # solver='lbfgs' (the default) is limited-memory BFGS, a quasi-Newton
    # method that iteratively optimizes the loss using an approximation of
    # the matrix of second derivatives (the Hessian).
    # 'sag' uses Stochastic Average Gradient descent (for large datasets,
    # e.g. more than 100k samples); 'saga' is its improved, unbiased variant.
    # Further reading:
    # https://blog.csdn.net/Linli522362242/article/details/104280075
    # https://blog.csdn.net/Linli522362242/article/details/104403372
    # http://www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
    lr = LogisticRegression(C=10.**c, random_state=1,
                            solver='lbfgs',
                            # one binary classifier per class (this argument
                            # is deprecated in recent scikit-learn releases)
                            multi_class='ovr')
    lr.fit(X_train_std, y_train)
    # coef_ has shape (n_classes, n_features); keep only the weights of the
    # classifier for the class with index 1 (the second class) vs. all others
    weights.append(lr.coef_[1])
    params.append(10.**c)
# Convert
# [array([9.45923160e-05, 5.76506032e-05]),
#  array([0.00094278, 0.0005734 ]),
#  ...
#  array([ 2.4424029 , -2.10629411])]
# To
# array([[ 9.45923160e-05,  5.76506032e-05],
#       [ 9.42782871e-04,  5.73401558e-04],
# ...
# [ 2.44240290e+00, -2.10629411e+00]])
weights = np.array(weights)    
plt.plot(params, weights[:, 0],
         label = "petal length")
plt.plot(params, weights[:, 1], linestyle='--',
         label = "petal width")
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')  # C spans ten orders of magnitude, so use a log axis
plt.show()

     By executing the preceding code, we fitted ten logistic regression models with different values for the inverse-regularization parameter C. For the purposes of illustration, we only collected the weight coefficients of the classifier for the class with index 1 (the second class in the dataset, Iris-versicolor) versus all other classes. Remember that we are using the OvR technique for multiclass classification.

     As we can see in the resulting plot, the weight coefficients shrink if we decrease the parameter C (that is, increase $\lambda$), in other words, if we increase the regularization strength:

Note

     Since an in-depth coverage of the individual classification algorithms exceeds the scope of this book, I strongly recommend Dr. Scott Menard's Logistic Regression: From Introductory to Advanced Concepts and Applications (Sage Publications, 2009) to readers who want to learn more about logistic regression.

Machine Learning
Before building this model, recall that our objective is to minimize the cost function in regularized logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

     The regularization parameter $\lambda$ is a control on your fitting parameters. As the magnitudes of the fitting parameters increase, there will be an increasing penalty on the cost function. This penalty is dependent on the squares of the parameters as well as the magnitude of $\lambda$. Also, notice that the summation after $\lambda$ does not include $\theta_{0}^{2}$.

In this exercise, we will assign $x$ to be all monomials (meaning polynomial terms) of $u$ and $v$ up to the sixth power:

$$x = \begin{bmatrix} 1 & u & v & u^2 & uv & v^2 & u^3 & \cdots & uv^5 & v^6 \end{bmatrix}^T$$

To clarify this notation: we have made a 28-feature vector $x$ where $x_0 = 1, x_1 = u, x_2 = v, \ldots, x_{27} = v^6$.
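
The feature mapping can be sketched in a few lines of NumPy. The helper name `map_feature` and the sample values are hypothetical; the column order assumed here is by increasing total degree, with the powers of $u$ decreasing within each degree, which matches the listing $1, u, v, u^2, uv, v^2, \ldots, v^6$:

```python
import numpy as np

def map_feature(u, v, degree=6):
    """Return all monomials u^a * v^b with a + b <= degree, one row per
    (u, v) pair, ordered as 1, u, v, u^2, u*v, v^2, ..., v^degree."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    v = np.atleast_1d(np.asarray(v, dtype=float))
    columns = []
    for total in range(degree + 1):      # total degree of the monomial
        for j in range(total + 1):       # power of v within that degree
            columns.append(u ** (total - j) * v ** j)
    return np.column_stack(columns)

# two made-up (u, v) pairs
X_mapped = map_feature([0.5, -1.0], [2.0, 0.25])
print(X_mapped.shape)   # (2, 28)
```

The count of 28 follows from 1 + 2 + ... + 7 monomials for the total degrees 0 through 6, so with zero-based indexing the last feature is $x_{27}$.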

     Notice that the regularized cost function above looks like the cost function for unregularized logistic regression, except that there is a regularization term at the end. We will now minimize this function using Newton's method.

Newton's method 


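Recall that the Newton's method update rule is

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_{\theta}J$$

     This is the same rule that you used for unregularized logistic regression in Exercise 4. But because you are now implementing regularization, the gradient $\nabla_{\theta}J$ and the Hessian $H$ have different forms:

$$\nabla_{\theta}J = \begin{bmatrix} \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\ \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)} + \frac{\lambda}{m}\theta_1 \\ \vdots \\ \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_n^{(i)} + \frac{\lambda}{m}\theta_n \end{bmatrix}$$

$$H = \frac{1}{m}\left[\sum_{i=1}^{m} h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x^{(i)}\left(x^{(i)}\right)^T\right] + \frac{\lambda}{m}\begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$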

     Notice that if you substitute $\lambda = 0$ into these expressions, you will see the same formulas as unregularized logistic regression. Also, remember that in these formulas,

1. $x^{(i)}$ is your feature vector, which is a 28x1 vector in this exercise.

2. $\nabla_{\theta}J$ is a 28x1 vector.

3. $x^{(i)}(x^{(i)})^T$ and $H$ are 28x28 matrices.

4. $y^{(i)}$ and $h_\theta(x^{(i)})$ are scalars.

5. The matrix following $\frac{\lambda}{m}$ in the Hessian formula is a 28x28 diagonal matrix with a zero in the upper left and ones on every other diagonal entry.
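
The update rule, gradient, and Hessian described above can be put together in a short NumPy sketch. This is an illustration under stated assumptions rather than the exercise's reference solution: the function names are hypothetical, the data is a synthetic two-class problem with an intercept and two raw features (not the 28 mapped features), and a fixed number of iterations is used instead of a convergence test.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y, lam):
    m, n = X.shape
    mask = np.ones(n)
    mask[0] = 0.0                       # theta_0 is not regularized
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m + (lam / m) * mask * theta

def hessian(theta, X, lam):
    m, n = X.shape
    mask = np.ones(n)
    mask[0] = 0.0                       # zero in the upper-left corner
    h = sigmoid(X @ theta)
    # (1/m) * sum_i h_i * (1 - h_i) * x_i x_i^T, plus the diagonal term
    return (X.T * (h * (1 - h))) @ X / m + (lam / m) * np.diag(mask)

def newton_logreg(X, y, lam, n_iter=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # theta := theta - H^{-1} grad, solved without forming the inverse
        theta = theta - np.linalg.solve(hessian(theta, X, lam),
                                        gradient(theta, X, y, lam))
    return theta

# synthetic, overlapping two-class data (an assumption for illustration)
rng = np.random.default_rng(0)
m = 200
features = np.vstack([rng.normal(-1.0, 1.0, size=(m // 2, 2)),
                      rng.normal(+1.0, 1.0, size=(m // 2, 2))])
X = np.column_stack([np.ones(m), features])     # x_0 = 1 for the intercept
y = np.concatenate([np.zeros(m // 2), np.ones(m // 2)])

theta_unreg = newton_logreg(X, y, lam=0.0)
theta_reg = newton_logreg(X, y, lam=10.0)

print(np.linalg.norm(gradient(theta_reg, X, y, 10.0)))   # close to zero
print(np.linalg.norm(theta_unreg[1:]), np.linalg.norm(theta_reg[1:]))
```

Setting lam=0.0 reduces both functions to the unregularized formulas, and a larger lam yields smaller non-intercept weights, in line with the shrinkage behavior discussed earlier.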

Maximum margin classification with support vector machines

Continued in: cp3 ML Classifiers_2_support vector_Maximum margin_soft margin_C~slack_kernel_Gini_pydot+_Infor Gai (LIQING LIN's blog, CSDN)
