cp7_SelectModel_Ensemble Learning_MajorityVoteClassifier_weight_logistic_get_params_bagging_transAxe

     We must note that the bagging algorithm can be an effective approach to reducing the variance (that is, overfitting) of a model. However, bagging is ineffective in reducing model bias, that is, the systematic error of models that are too simple to capture the trend in the data well. This is why we want to perform bagging on an ensemble of classifiers with low bias, for example, unpruned decision trees.

      Boosting can lead to a decrease in bias as well as variance compared to bagging models. In practice, however, boosting algorithms such as AdaBoost are also known for their high variance, that is, the tendency to overfit the training data (An Improvement of AdaBoost to Avoid Overfitting, ...).

     In the previous chapter, we focused on the best practices for tuning and evaluating different models for classification. In this chapter, we will build upon those techniques and explore different methods for constructing a set of classifiers that can often have a better predictive performance than any of its individual members.
We will learn how to do the following:

  • Make predictions based on majority voting
  • Use bagging to reduce overfitting by drawing random combinations of the training dataset with repetition
  • Apply boosting to build powerful models from weak learners that learn from their mistakes

Learning with ensembles

     The goal of ensemble methods is to combine different classifiers into a metaclassifier that has better generalization performance than each individual classifier alone. For example, assuming that we collected predictions from 10 experts, ensemble methods would allow us to strategically combine those predictions by the 10 experts to come up with a prediction that was more accurate and robust than the predictions by each individual expert. As you will see later in this chapter, there are several different approaches for creating an ensemble of classifiers. This section will introduce a basic explanation of how ensembles work and why they are typically recognized for yielding a good generalization performance.

     We will focus on the most popular ensemble methods that use the majority voting principle. Majority voting simply means that we select the class label that has been predicted by the majority of classifiers, that is, received more than 50 percent of the votes. Strictly speaking, the term "majority vote" refers to binary class settings only. However, it is easy to generalize the majority voting principle to multiclass settings, which is called plurality voting. Here, we select the class label that received the most votes (the mode). The following diagram illustrates the concept of majority and plurality voting for an ensemble of 10 classifiers, where each unique symbol (triangle, square, and circle) represents a unique class label:
[Figure: majority and plurality voting for an ensemble of 10 classifiers]

     Using the training dataset, we start by training m different classifiers (C_1, ..., C_m). Depending on the technique, the ensemble can be built from different classification algorithms (this increases the chance that they will make very different types of errors, improving the ensemble's accuracy), for example, decision trees, support vector machines, logistic regression classifiers, and so on. Alternatively, we can also use the same base classification algorithm, fitting different subsets of the training dataset. One prominent example of this approach is the random forest algorithm, which combines different decision tree classifiers. The following figure illustrates the concept of a general ensemble approach using majority voting (this majority-vote classifier is called a hard voting classifier):
[Figure: a general ensemble approach using majority voting] See also: https://blog.csdn.net/Linli522362242/article/details/104771157
     To predict a class label via simple majority or plurality voting, we can combine the predicted class labels of each individual classifier, C_j, and select the class label, ŷ, that received the most votes:

ŷ = mode{ C_1(x), C_2(x), ..., C_m(x) }

(In statistics, the mode is the most frequent event or result in a set. For example, mode{1, 2, 1, 1, 2, 4, 5, 4} = 1.)
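As a quick check (my own snippet, not from the book), the mode can be computed in Python with np.bincount and np.argmax, the same pattern used for majority voting later in this chapter:

import numpy as np

np.argmax( np.bincount([1, 2, 1, 1, 2, 4, 5, 4]) ) # 1, since the value 1 occurs most often (3 times)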

     For example, in a binary classification task where class1 = –1 and class2 = +1, we can write the majority vote prediction as follows:

C(x) = sign[ Σ_j C_j(x) ] = 1 if Σ_j C_j(x) ≥ 0, and –1 otherwise

     To illustrate why ensemble methods can work better than individual classifiers alone, let's apply the simple concepts of combinatorics. For the following example, we will make the assumption that all n base classifiers for a binary classification task have an equal error rate, ε. Furthermore, we will assume that the classifiers are independent and the error rates are not correlated (if all classifiers were trained on the same data, they would make correlated errors; such classifiers are likely to make the same types of errors, so there would be many majority votes for the wrong class, reducing the ensemble's accuracy; see https://blog.csdn.net/Linli522362242/article/details/104771157). Under those assumptions, we can simply express the error probability of an ensemble of base classifiers as a probability mass function of a binomial distribution:

P(y ≥ k) = Σ_{k}^{n} (n choose k) · ε^k · (1 − ε)^{n−k} = ε_ensemble,  where the sum starts at k = ⌈n/2⌉, that is, a majority of the n classifiers must be wrong.

     Here, (n choose k) is the binomial coefficient. In other words, we compute the probability that the prediction of the ensemble is wrong. Now, let's take a look at a more concrete example of 11 base classifiers (n = 11, k = 6), where each classifier has an error rate of 0.25 (ε = 0.25):

The binomial coefficient

     The binomial coefficient refers to the number of ways we can choose subsets of k unordered elements from a set of size n; thus, it is often called "n choose k." Since the order does not matter here, the binomial coefficient is also sometimes referred to as combination or combinatorial number, and in its unabbreviated form, it is written as follows:

(n choose k) = n! / ( (n − k)! · k! )

Here, the symbol (!) stands for factorial—for example,
3! = 3 ∙ 2 ∙ 1 = 6 .
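As a small sanity check (my own snippet), we can compare the factorial formula with scipy's comb function for "11 choose 6", the term that appears in the ensemble error computation below:

import math
from scipy.special import comb

n, k = 11, 6
math.factorial(n) // ( math.factorial(k) * math.factorial(n-k) ) # 462, from the factorial formula
comb(n, k, exact=True)                                           # 462, same result from scipy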

     As you can see, the error rate of the ensemble (0.034) is much lower than the error rate of each individual classifier (0.25) if all the assumptions are met. Note that, in this simplified illustration, a 50-50 split by an even number of classifiers, n, is treated as an error, whereas this is only true half of the time. To compare such an idealistic ensemble classifier to a base classifier over a range of different base error rates, let's implement the probability mass function in Python:

from scipy.special import comb
import math

def ensemble_error(n_classifier, error_rate):
    k_start = int( math.ceil(n_classifier/2.) ) #if n_classifier=11, then k_start=6
    probs = [ comb(n_classifier, k) * error_rate**k * (1-error_rate)**(n_classifier-k)
              for k in range(k_start, n_classifier+1) ]
    return sum(probs)

ensemble_error(n_classifier=11, error_rate=0.25)

     After we have implemented the ensemble_error function, we can compute the ensemble error rates for a range of different base errors from 0.0 to 1.0 to visualize the relationship between ensemble and base errors in a line graph:

import numpy as np

error_range = np.arange(0.0, 1.01, 0.01) # base errors from 0.0 to 1.0
ens_errors = [ ensemble_error(n_classifier=11, error_rate=error) 
               for error in error_range ]

import matplotlib.pyplot as plt

plt.plot(error_range, ens_errors, label="Ensemble error", lw=2)

plt.plot(error_range, error_range, ls='--', label="Base error", lw=2)

plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid(alpha=0.5)
plt.show()

     As you can see in the resulting plot, the error probability of an ensemble is always better than the error of an individual base classifier, as long as the base classifiers perform better than random guessing (ε < 0.5). Note that the y axis depicts the base error (dashed line) as well as the ensemble error (continuous line):
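To verify this crossover numerically (my own quick check, reusing the ensemble_error function defined above), we can evaluate a few base error rates on either side of 0.5:

print( ensemble_error(n_classifier=11, error_rate=0.25) ) # ~0.034, much lower than 0.25
print( ensemble_error(n_classifier=11, error_rate=0.49) ) # below 0.49: the ensemble still helps
print( ensemble_error(n_classifier=11, error_rate=0.60) ) # above 0.60: the ensemble is now worse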

Combining classifiers via majority vote

     After the short introduction to ensemble learning in the previous section, let's start with a warm-up exercise and implement a simple ensemble classifier for majority voting in Python.

Plurality voting

    Although the majority voting algorithm that we will discuss in this section also generalizes to multiclass settings via plurality voting, the term "majority voting" will be used for simplicity, as is often the case in the literature.

Implementing a simple majority vote classifier

     The algorithm that we are going to implement in this section will allow us to combine different classification algorithms associated with individual weights for confidence. Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more precise mathematical terms, we can write the weighted majority vote as follows:

ŷ = argmax_i Σ_{j=1}^{m} w_j · χ_A( C_j(x) = i )

     Here, w_j is a weight associated with a base classifier, C_j; ŷ is the predicted class label of the ensemble; A is the set of unique class labels; χ_A (Greek chi) is the characteristic function or indicator function, which returns 1 if the predicted class of the jth classifier matches i (C_j(x) = i), and 0 otherwise. For equal weights, we can simplify this equation and write it as follows:

ŷ = mode{ C_1(x), C_2(x), ..., C_m(x) }

     To better understand the concept of weighting, we will now take a look at a more concrete example. Let's assume that we have an ensemble of three base classifiers, C_1, C_2, and C_3, and we want to predict the class label, ŷ, of a given example, x. Two out of three base classifiers, C_1 and C_2, predict the class label 0, and one, C_3, predicts that the example belongs to class 1. If we weight the predictions of each base classifier equally, the majority vote predicts that the example belongs to class 0:

C_1(x) → 0,  C_2(x) → 0,  C_3(x) → 1
ŷ = mode{0, 0, 1} = 0

Now, let's assign a weight of 0.6 to C_3, and let's weight C_1 and C_2 by a coefficient of 0.2:

ŷ = argmax_i [ 0.2 × χ_A(C_1(x) = i) + 0.2 × χ_A(C_2(x) = i) + 0.6 × χ_A(C_3(x) = i) ] = 1

that is, the weighted majority vote now predicts that the example belongs to class 1.

More simply, since 3 × 0.2 = 0.6, we can say that the prediction made by C_3 has three times more weight than the predictions by C_1 or C_2, which we can write as follows:

ŷ = mode{0, 0, 1, 1, 1} = 1

To translate the concept of the weighted majority vote into Python code, we can use NumPy's convenient argmax and bincount functions:

         # np.bincount( [0,0,1], weights=[0.2,0.2,0.6] ) ==> array([0.4, 0.6])
np.argmax(np.bincount([0,0,1], weights = [0.2,0.2,0.6]))


     As you will remember from the discussion on logistic regression in Cp3, A Tour of Machine Learning Classifiers Using scikit-learn (https://blog.csdn.net/Linli522362242/article/details/96480059), certain classifiers in scikit-learn can also return the probability of a predicted class label via the predict_proba method. Using the predicted class probabilities instead of the class labels for majority voting can be useful if the classifiers in our ensemble are well calibrated. The modified version of the majority vote for predicting class labels from probabilities can be written as follows:

ŷ = argmax_i Σ_{j=1}^{m} w_j · p_{ij}

Here, p_{ij} is the predicted probability of the jth classifier for class label i.

     To continue with our previous example, let's assume that we have a binary classification problem with class labels i ∈ {0, 1} and an ensemble of three classifiers, C_1, C_2, and C_3. Let's assume that the classifiers return the following class membership probabilities for a particular example, x:

C_1(x) → [0.9, 0.1],  C_2(x) → [0.8, 0.2],  C_3(x) → [0.4, 0.6]

Using the same weights as previously (0.2, 0.2, and 0.6), we can then calculate the individual class probabilities as follows:

p(i_0 | x) = 0.2 × 0.9 + 0.2 × 0.8 + 0.6 × 0.4 = 0.58
p(i_1 | x) = 0.2 × 0.1 + 0.2 × 0.2 + 0.6 × 0.6 = 0.42
ŷ = argmax_i [ p(i_0 | x), p(i_1 | x) ] = 0

To implement the weighted majority vote based on class probabilities, we can again make use of NumPy, using np.average and np.argmax:

ex = np.array([[0.9, 0.1], # *0.2
               [0.8, 0.2], # *0.2
               [0.4, 0.6]  # *0.6
              ])

# OR # if weights not None, ex.T.dot( [0.2,0.2,0.6] ) ==> array([0.58, 0.42])
#    # if weights=None, ex.T.dot( [1,1,1] )/3 ==> array([0.7, 0.3])
    # np.average( ex, axis=0, weights=None)   ==> array([0.7, 0.3])
p = np.average( ex, axis=0, weights=[0.2,0.2,0.6]) #default weights=None
p

np.argmax(p)

Putting everything together, let's now implement MajorityVoteClassifier in Python:

     I've added a lot of comments to the code to explain the individual parts. However, before we implement the remaining methods, let's take a quick break and discuss some of the code that may look confusing at first. We used the BaseEstimator and ClassifierMixin parent classes to get some base functionality for free, including the get_params and set_params methods to set and return the classifier's parameters, as well as the score method to calculate the prediction accuracy.
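To illustrate what these mixins contribute (a minimal toy sketch of my own, not part of the MajorityVoteClassifier code), consider a trivial estimator that always predicts a constant class:

from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np

class ConstantClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, constant=0):
        self.constant = constant # stored under the same name so get_params can discover it

    def fit(self, X, y):
        return self

    def predict(self, X):
        return np.full(len(X), self.constant)

clf = ConstantClassifier(constant=1)
clf.get_params()                            # {'constant': 1}, provided by BaseEstimator
clf.score(np.zeros((4, 2)), [1, 1, 0, 1])   # 0.75, accuracy provided by ClassifierMixin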

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator

class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. 
      Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; 
      Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):
        self.classifiers = classifiers
        # estimators = ['a', 'a', 'b' ]
        # _name_estimators(estimators) #  [('a-1', 'a'), ('a-2', 'a'), ('b', 'b')]
        self.named_classifiers = { key:value for key,value in _name_estimators(classifiers) }
        self.vote = vote
        self.weights = weights
    
    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Matrix of training examples.

        y : array-like, shape = [n_examples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)" % self.vote) 
            # %r is a general-purpose format specifier: it prints the argument's repr(), which includes type information
        
        if self.weights and len(self.weights)!=len(self.classifiers):
            raise ValueError("Number of classifiers and weights must be equal"
                             "; got %d num_weights, %d num_classifier"
                             % ( len(self.weights), len(self.classifiers) )
                            )
        # Use LabelEncoder to ensure class labels(indices) start with 0, which
        # is important for np.argmax call in self.predict
        self.labelEncoder_ = LabelEncoder()
        self.labelEncoder_.fit(y)
        self.classes_ = self.labelEncoder_.classes_ # the sorted unique class labels of y
        # for examples
        # le = preprocessing.LabelEncoder()
        # le.fit([1, 2, 2, 6]) ==> LabelEncoder()
        # le.classes_ ==> array([1, 2, 6])
        # le.transform([1, 1, 2, 6]) ==> array([0, 0, 1, 2]...)
        # le.inverse_transform([0, 0, 1, 2]) ==> array([1, 1, 2, 6])
        self.classifiers_ = [] # fitted classifiers
        for clf in self.classifiers:
            # Clone does a deep copy of the model in an estimator without actually copying 
            # attached data. It yields a new estimator with the "same parameters" that has not 
            # been fit on any data.
            fitted_clf = clone(clf).fit( X, self.labelEncoder_.transform(y) ) # training
            self.classifiers_.append(fitted_clf)
        return self

     Next, we will add the predict method to predict the class label via a majority vote based on the class labels (hard voting) if we initialize a new MajorityVoteClassifier object with vote='classlabel'. Alternatively, we will be able to initialize the ensemble classifier with vote='probability'  (This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.) to predict the class label based on the class membership probabilities. Furthermore, we will also add a predict_proba method to return the averaged probabilities, which is useful when computing the receiver operating characteristic area under the curve (ROC AUC):

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Matrix of training examples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_examples]
            Predicted class labels.
            
        """
        if self.vote =='probability':
            # self.predict_proba(X) : shape[n_examples, num_classLabels_probabilities]
            maj_vote = np.argmax( self.predict_proba(X), axis=1 ) #axis=1 : column or features
            # return a class_label_index for each new instance from X # a list
        else: # self.vote == 'classlabel'
            #  Collect results from clf.predict calls
            predictions = np.asarray([ clf.predict(X) 
                                       for clf in self.classifiers_ ]).T
            # prediction : a class_label list for each new instance from X
            # [ num_classifiers, n_examples ].T ==> predictions.shape: [ n_examples, num_classifiers]
            maj_vote = np.apply_along_axis( lambda x: np.argmax( 
                                                         np.bincount(x, weights=self.weights) 
                                                      ), 
                                            axis=1, 
                                            arr=predictions # Input array.
                                          ) # return a list of class_label_indx
        maj_vote = self.labelEncoder_.inverse_transform(maj_vote) #decode ==> class_label
        return maj_vote
            
    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Training vectors, where n_examples is the number of examples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_examples, n_classes]
            Weighted average probability for each class per example.

        """
        probas = np.asarray([ clf.predict_proba(X) for clf in self.classifiers_ ]) 
        # probas : shape[num_classifiers, n_examples, num_classLabels_probabilities]
         
        avg_proba = np.average( probas, axis=0, weights=self.weights )#along axis_0 * w, sum element_wise
        # avg_proba : shape[n_examples, num_classLabels_probabilities]
        # OR
        #    # probas.shape: [num_classifiers, n_examples, num_classLabels_probabilities]
        #    # ==> [n_examples, num_classLabels_probabilities, num_classifiers]
        #    # via # np.transpose(probas, (1,2,0)) # why (1,2,0) since avg_proba : shape[n_examples, num_classLabels_probabilities]
#         if self.weights :
#             avg_proba = np.transpose(probas, (1,2,0)).dot( self.weights ) 
#         else:
#             avg_proba = np.transpose(probas, (1,2,0)).dot(np.ones(len(self.classifiers)))/len(self.classifiers)
        return avg_proba
    
    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        # sklearn.base.BaseEstimator
        #                          deep:bool, default=True
        #                               If True, will return the parameters for this estimator 
        #                               and contained subobjects that are estimators.
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep=True).items():
                    out['%s__%s' % (name, key)]=value
            return out

     Also, note that we defined our own modified version of the get_params method to use the _name_estimators function to access the parameters of individual classifiers in the ensemble; this may look a little bit complicated at first, but it will make perfect sense when we use grid search for hyperparameter tuning in later sections.

VotingClassifier in scikit-learn

     Although the MajorityVoteClassifier implementation is very useful for demonstration purposes, we implemented a more sophisticated version of this majority vote classifier in scikit-learn. The ensemble classifier is available as sklearn.ensemble.VotingClassifier in scikit-learn version 0.17 and newer.
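As a rough sketch of the built-in alternative (my own example, with illustrative base estimators rather than the ones we train later in this chapter), the same idea looks like this with VotingClassifier:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

voting_clf = VotingClassifier(
    estimators=[ ('lr', LogisticRegression()),
                 ('dt', DecisionTreeClassifier()),
                 ('knn', KNeighborsClassifier()) ],
    voting='soft',  # 'hard' corresponds to vote='classlabel', 'soft' to vote='probability'
    weights=None )  # optional per-classifier weights, as in our own implementation
# voting_clf.fit(X_train, y_train) and voting_clf.predict(X_test) then work as usual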


Using the majority voting principle to make predictions

     Now it is time to put the MajorityVoteClassifier that we implemented in the previous section into action. But first, let's prepare a dataset that we can test it on. Since we are already familiar with techniques to load datasets from CSV files, we will take a shortcut and load the Iris dataset from scikit-learn's datasets module. Furthermore, we will only select two features, sepal width and petal length, to make the classification task more challenging for illustration purposes. Although our MajorityVoteClassifier generalizes to multiclass problems, we will only classify flower examples from the Iris-versicolor and Iris-virginica classes, with which we will compute the ROC AUC later. The code is as follows:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris


... ...


X,y = iris.data[50:, [1,2]], iris.target[50:] # [1,2] : sepal width and petal length
le= LabelEncoder()
y = le.fit_transform(y) #==>y_label_index_list

Actually, we don't strictly need the LabelEncoder to fit_transform(y) here, since iris.target[50:] is already an array of integer class labels (it simply maps the labels {1, 2} to {0, 1}).
#####################################################################

Class membership probabilities from decision trees

     Note that scikit-learn uses the predict_proba method (if applicable) to compute the ROC AUC score. In Cp3, A Tour of Machine Learning Classifiers Using scikit-learn (https://blog.csdn.net/Linli522362242/article/details/96480059), we saw how the class probabilities are computed in logistic regression models: logistic regression is a linear model for binary classification that can be extended to multiclass classification via the OvR technique; the sigmoid φ(z) maps the net input to a class-membership probability, and the weights are learned by maximizing the likelihood (taking the logarithm converts the product into a sum), which is equivalent to minimizing the logistic cost function.
     In decision trees, the probabilities are calculated from a frequency vector that is created for each node at training time. The vector collects the frequency values of each class label computed from the class label distribution at that node. Then, the frequencies are normalized so that they sum up to 1. Similarly, the class labels of the k-nearest neighbors are aggregated to return the normalized class label frequencies in the k-nearest neighbors algorithm. Although the normalized probabilities returned by both the decision tree and k-nearest neighbors classifier may look similar to the probabilities obtained from a logistic regression model, we have to be aware that these are actually not derived from probability mass functions.
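To make the frequency-based nature of these "probabilities" concrete, here is a tiny sketch on toy data (my own example, not from the book); a depth-1 tree simply returns the normalized class frequencies of the leaf that an example falls into:

from sklearn.tree import DecisionTreeClassifier
import numpy as np

X_toy = np.array([ [1.], [2.], [3.], [10.], [11.], [12.] ])
y_toy = np.array([ 0, 0, 1, 1, 1, 0 ])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
# The stump splits between x=2 and x=3; the right leaf contains the labels [1, 1, 1, 0],
# so its "probabilities" are just the normalized class frequencies 1/4 and 3/4.
stump.predict_proba([[11.0]]) # array([[0.25, 0.75]])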
#####################################################################

Next, we will split the Iris examples into 50 percent training and 50 percent test data:

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.5,
                                                    random_state=1,
                                                    stratify=y)
y

Using the training dataset, we now will train three different classifiers:

  • Logistic regression classifier
  • Decision tree classifier
  • k-nearest neighbors classifier

We will then evaluate the model performance of each classifier via 10-fold cross-validation on the training dataset before we combine them into an ensemble classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

clf1 = LogisticRegression(penalty='l2',
                          C=0.001,
                          solver='lbfgs',
                          random_state=1)

clf2 = DecisionTreeClassifier(max_depth=1,
                              criterion='entropy',
                              random_state=0)

                                # https://blog.csdn.net/Linli522362242/article/details/107843678
clf3 = KNeighborsClassifier(n_neighbors=1,
                            p=2,               # The default metric is minkowski, and with p=2 
                            metric='minkowski') # is equivalent to the standard Euclidean metric.

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]
                 ])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]
                 ])

clf_labels=['Logistic regression', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,#roc_curve() ==> fpr(FP/(FP+TN), tpr(TP/(TP+FN),thresholds
                             cv=10,             # roc: Receiver operating characteristic
                             scoring='roc_auc') # auc: area under the curve
                             # https://blog.csdn.net/Linli522362242/article/details/109560084
    print('ROC AUC: %0.2f (+/- %0.2f) [%s]' % (scores.mean(), scores.std(), label))

The output that we receive, as shown in the following snippet, shows that the predictive performances of the individual classifiers are almost equal:


      You may be wondering why we trained the logistic regression and k-nearest neighbors classifier as part of a pipeline. The reason behind it is that, as discussed in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, both the logistic regression and k-nearest neighbors algorithms (using the Euclidean distance metric) are not scale-invariant, in contrast to decision trees. Although the Iris features are all measured on the same scale (cm), it is a good habit to work with standardized features.

Now, let's move on to the more exciting part and combine the individual classifiers for majority rule voting in our MajorityVoteClassifier:

# Majority Rule (hard) Voting

mv_clf = MajorityVoteClassifier( classifiers=[pipe1, clf2, pipe3] )

clf_labels += ['Majority voting'] # clf_labels=['Logistic regression', 'Decision tree', 'KNN']
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]" % ( scores.mean(), scores.std(), label ) )

(Output: ROC AUC scores on the training dataset)

As you can see, the performance of MajorityVoteClassifier has improved over the individual classifiers in the 10-fold cross-validation evaluation.

Evaluating and tuning the ensemble classifier

     In this section, we are going to compute the ROC curves from the test dataset to check that MajorityVoteClassifier generalizes well with unseen data. We must remember that the test dataset is not to be used for model selection; its purpose is merely to report an unbiased estimate of the generalization performance of a classifier system:

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

colors = ['black', 'orange', 'blue', 'red']
linestyles = [':', '--', '-.', '-']
                             # all_clf = [pipe1, clf2, pipe3, mv_clf]
for clf, label, clr, ls in zip(all_clf, clf_labels, colors, linestyles):
    # assuming the label of the positive class is 1
                      # predict_proba : shape[n_examples, num_classLabels_probabilities]
    y_pred = clf.fit(X_train, y_train).predict_proba(X_test)[:,1]
    
    fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_pred)
    
    roc_auc = auc(x=fpr, y=tpr)
    
    plt.plot( fpr, tpr, 
              color=clr, linestyle=ls, label='%s (auc = %0.2f)' % (label, roc_auc) )
    
plt.legend(loc='lower right')
plt.plot( [0,1], [0,1],
          ls='--', c='gray', lw=2
        )
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid(alpha=0.5)
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')

plt.show()

     As you can see in the resulting ROC, the ensemble classifier also performs well on the test dataset (ROC AUC = 0.95). However, you can see that the logistic regression classifier performs similarly well on the same dataset, which is probably due to the high variance (in this case, sensitivity of how we split the dataset) given the small size of the dataset:

Since we only selected two features for the classification examples, it would be interesting to see what the decision region of the ensemble classifier actually looks like.

     Although it is not necessary to standardize the training features prior to model fitting, because our logistic regression and k-nearest neighbors pipelines will automatically take care of it, we will standardize the training dataset so that the decision regions of the decision tree will be on the same scale for visual purposes. The code is as follows:

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
from itertools import product

all_clf = [pipe1, clf2, pipe3, mv_clf]

x_min = X_train_std[:,0].min()-1
x_max = X_train_std[:,0].max()+1
y_min = X_train_std[:,1].min()-1
y_max = X_train_std[:,1].max()+1

xx,yy = np.meshgrid( np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1) 
                   )
fig, axes = plt.subplots( nrows=2, ncols=2,
                         sharex='col', sharey='row',
                         figsize=(7,5) )
                      # product([0,1], [0,1]) ==> (0,0) , (0,1) , (1,0) , (1,1)
for idx, clf, clf_label in zip(product([0,1], [0,1]),
                        all_clf, clf_labels):
    clf.fit( X_train_std, y_train)
    
    Z = clf.predict( np.c_[ xx.ravel(), yy.ravel() ] )
    Z = Z.reshape( xx.shape )
    
    axes[ idx[0], idx[1] ].contourf(xx, yy, Z, alpha=0.3)
    axes[ idx[0], idx[1] ].scatter( X_train_std[y_train==0, 0], X_train_std[y_train==0, 1],
                                    c='blue', marker='^', s=50)
    axes[ idx[0], idx[1] ].scatter( X_train_std[y_train==1, 0], X_train_std[y_train==1, 1],
                                    c='green', marker='o', s=50)
    axes[ idx[0], idx[1] ].set_title(clf_label)
    
plt.text( -3.5, -5., s='Sepal width [standardized]', ha='center', va='center', fontsize=12)
plt.text( -12.5, 4.5, s='Petal length [standardized]', ha='center', va='center', fontsize=12,
          rotation=90)
plt.show()

     Interestingly, but also as expected, the decision regions of the ensemble classifier seem to be a hybrid of the decision regions from the individual classifiers. At first glance, the majority vote decision boundary looks a lot like the decision of the decision tree stump, which is orthogonal to the y axis for sepal width ≥ 1. However, you can also notice the non-linearity from the k-nearest neighbor classifier mixed in:

     Before we tune the individual classifier's parameters for ensemble classification, let's call the get_params method to get a basic idea of how we can access the individual parameters inside a GridSearchCV object:

mv_clf.get_params()

{'pipeline-1': Pipeline(steps=[('sc', StandardScaler()),
                               ['clf', LogisticRegression(C=0.001, random_state=1)]]),
 'decisiontreeclassifier': DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=0),
 'pipeline-2': Pipeline(steps=[('sc', StandardScaler()),
                               ['clf', KNeighborsClassifier(n_neighbors=1)]]),
 'pipeline-1__memory': None,
 'pipeline-1__steps': [('sc', StandardScaler()),
                       ['clf', LogisticRegression(C=0.001, random_state=1)]],
 'pipeline-1__verbose': False,
 'pipeline-1__sc': StandardScaler(),
 'pipeline-1__clf': LogisticRegression(C=0.001, random_state=1),
 'pipeline-1__sc__copy': True,
 'pipeline-1__sc__with_mean': True,
 'pipeline-1__sc__with_std': True,
 'pipeline-1__clf__C': 0.001,
 'pipeline-1__clf__class_weight': None,
 'pipeline-1__clf__dual': False,
 'pipeline-1__clf__fit_intercept': True,
 'pipeline-1__clf__intercept_scaling': 1,
 'pipeline-1__clf__l1_ratio': None,
 'pipeline-1__clf__max_iter': 100,
 'pipeline-1__clf__multi_class': 'auto',
 'pipeline-1__clf__n_jobs': None,
 'pipeline-1__clf__penalty': 'l2',
 'pipeline-1__clf__random_state': 1,
 'pipeline-1__clf__solver': 'lbfgs',
 'pipeline-1__clf__tol': 0.0001,
 'pipeline-1__clf__verbose': 0,
 'pipeline-1__clf__warm_start': False,
 'decisiontreeclassifier__ccp_alpha': 0.0,
 'decisiontreeclassifier__class_weight': None,
 'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 1,
 'decisiontreeclassifier__max_features': None,
 'decisiontreeclassifier__max_leaf_nodes': None,
 'decisiontreeclassifier__min_impurity_decrease': 0.0,
 'decisiontreeclassifier__min_impurity_split': None,
 'decisiontreeclassifier__min_samples_leaf': 1,
 'decisiontreeclassifier__min_samples_split': 2,
 'decisiontreeclassifier__min_weight_fraction_leaf': 0.0,
 'decisiontreeclassifier__presort': 'deprecated',
 'decisiontreeclassifier__random_state': 0,
 'decisiontreeclassifier__splitter': 'best',
 'pipeline-2__memory': None,
 'pipeline-2__steps': [('sc', StandardScaler()),
                       ['clf', KNeighborsClassifier(n_neighbors=1)]],
 'pipeline-2__verbose': False,
 'pipeline-2__sc': StandardScaler(),
 'pipeline-2__clf': KNeighborsClassifier(n_neighbors=1),
 'pipeline-2__sc__copy': True,
 'pipeline-2__sc__with_mean': True,
 'pipeline-2__sc__with_std': True,
 'pipeline-2__clf__algorithm': 'auto',
 'pipeline-2__clf__leaf_size': 30,
 'pipeline-2__clf__metric': 'minkowski',
 'pipeline-2__clf__metric_params': None,
 'pipeline-2__clf__n_jobs': None,
 'pipeline-2__clf__n_neighbors': 1,
 'pipeline-2__clf__p': 2,
 'pipeline-2__clf__weights': 'uniform'}
################################################################################################

Note on GridSearchCV's iid parameter (deprecated; scheduled for removal in scikit-learn 0.24): iid indicates whether the data is assumed to be identically distributed across the folds.

If iid=True, the returned score is the average across folds weighted by the number of samples in each test set; the data is assumed to be identically distributed across the folds, and (during training) the quantity minimized is the total loss per sample.

If iid=False, the quantity minimized is the mean loss across the folds, and the returned score is the unweighted average across the folds.

################################################################################################

     Based on the values returned by the get_params method, we now know how to access the individual classifier's attributes. Let's now tune the inverse regularization parameter, C, of the logistic regression classifier and the decision tree depth via a grid search for demonstration purposes:

from sklearn.model_selection import GridSearchCV

params = {'decisiontreeclassifier__max_depth': [1,2],
          'pipeline-1__clf__C': [0.001, 0.1, 100.0]}

grid = GridSearchCV( estimator=mv_clf, 
                     param_grid=params, 
                     cv=10, 
                     iid=False, # bool, default=False # will be removed in version 0.24
                     scoring='roc_auc')
# iid=False, # bool, default=False  # iid: if (the data) is identically distributed across the folds
# If True, return the average score across folds, weighted by the number of samples in each test set. 
#          In this case, the data is assumed to be identically distributed across the folds, 
#          and the loss minimized is the total loss per sample, and not the mean loss across the folds.
grid.fit(X_train, y_train)

grid.cv_results_

{'mean_fit_time': array([0.0197011 , 0.01340079, 0.01670096, 0.01440084, 0.01340072, 0.01780105]),
 'std_fit_time': array([0.00836764, 0.00066333, 0.00045829, 0.00101983, 0.00066337, 0.00240022]),
 'mean_score_time': array([0.00590038, 0.00470026, 0.00420022, 0.00450025, 0.00470026, 0.00440023]),
 'std_score_time': array([0.00186816, 0.00045825, 0.00040001, 0.00050006, 0.00064036, 0.0004899 ]),
 'param_decisiontreeclassifier__max_depth': masked_array(data=[1, 1, 1, 2, 2, 2],
                                                          mask=[False, False, False, False, False, False],
                                                          fill_value='?', dtype=object),
 'param_pipeline-1__clf__C': masked_array(data=[0.001, 0.1, 100.0, 0.001, 0.1, 100.0],
                                          mask=[False, False, False, False, False, False],
                                          fill_value='?', dtype=object),
 'params': [{'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.001},
            {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.1},
            {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 100.0},
            {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.001},
            {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.1},
            {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 100.0}],

 'split0_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split1_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split2_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split3_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split4_test_score': array([0.83333333, 0.83333333, 0.66666667, 0.83333333, 0.83333333, 0.66666667]),
 'split5_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split6_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split7_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split8_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split9_test_score': array([1., 1., 1., 1., 1., 1.]),
 'mean_test_score': array([0.98333333, 0.98333333, 0.96666667, 0.98333333, 0.98333333, 0.96666667]),
 'std_test_score': array([0.05, 0.05, 0.1 , 0.05, 0.05, 0.1 ]),
 'rank_test_score': array([1, 1, 5, 1, 1, 5])
}

     After the grid search has completed, we can print the different hyperparameter value combinations and the average ROC AUC scores computed via 10-fold cross-validation as follows:

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r" % (grid.cv_results_['mean_test_score'][r],
                                  grid.cv_results_['std_test_score'][r]/2.0,
                                  grid.cv_results_['params'][r]
                                 ))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

     As you can see, we get the best cross-validation results when we choose a stronger regularization (C = 0.001; C is the inverse of the regularization strength, so smaller values of C mean stronger regularization), whereas the tree depth does not seem to affect the performance at all, suggesting that a decision stump is sufficient to separate the data.
For example, with C = 0.001, both tree depths give identical results:
{'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.001} ==> mean_test_score 0.98333333, std_test_score 0.05
{'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.001} ==> mean_test_score 0.98333333, std_test_score 0.05

Note
     By default, refit in GridSearchCV is set to True (i.e., GridSearchCV(..., refit=True)), which means that we can use the fitted GridSearchCV estimator to make predictions via the predict method, for example:

grid = GridSearchCV(estimator=mv_clf, 
                    param_grid=params, 
                    cv=10, 
                    scoring='roc_auc')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

In addition, the "best" estimator can directly be accessed via the best_estimator_ attribute.

grid.best_estimator_

grid.best_estimator_.classifiers

import sklearn
sklearn.__version__ 

Default parameter settings (for scikit-learn 0.22):

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

... check the scikit-learn documentation for the other classifiers' default parameter settings.

# ** : The parameter list accepts a dictionary mapping
# grid.best_estimator_.get_params() ==>return a dict # similar to above mv_clf.get_params()
# set_params(**params) : Set the parameters of this estimator.
#                        The method works on simple estimators as well as on nested objects (such as pipelines). 
#                        The latter(nested objects (such as pipelines)) have parameters of the form 
#                        <component>__<parameter> so that it’s possible to update each component of a nested object.
# set the parameters based on grid.best_estimator_.get_params()
# and each parameter is corresponding to the key of the dict( grid.best_estimator_.get_params() )
# and each parameter is assigned with the value of the dict( grid.best_estimator_.get_params() )

mv_clf = grid.best_estimator_
mv_clf.set_params(**grid.best_estimator_.get_params())

mv_clf


     To remind ourselves why it is a bad practice to use the test dataset more than once for model evaluation: if we reuse the same test dataset over and over again during model selection, it becomes part of our training data and the model will be more likely to overfit. A better way of using the holdout method for model selection is to separate the data into three parts: a training dataset, a validation dataset, and a test dataset. The training dataset is used to fit the different models (for example, repeated runs of the same learning algorithm with different hyperparameters), the performance on the validation dataset is then used for model selection (picking the hyperparameter values with the best performance), and the advantage of a test dataset that the model has not seen during training and model selection is that it yields a less biased estimate of the model's ability to generalize to new data (cp6_Model Eval_Confusion_Hyperpara Tuning_pipeline_variance_bias_validation_learning curve_strength, https://blog.csdn.net/Linli522362242/article/details/109560084). If we want to select among different machine learning algorithms, another recommended approach is nested cross-validation: an outer k-fold cross-validation loop (for example, via the cross_val_score function) splits the data into training and test folds, and an inner loop selects the model using k-fold cross-validation on the training fold (https://blog.csdn.net/Linli522362242/article/details/109560084). Keeping this in mind, we are not going to estimate the generalization performance of the tuned hyperparameters in this section. We will move on swiftly to an alternative approach for ensemble learning: bagging.
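As a hedged sketch of the nested cross-validation idea mentioned above (my own illustration, reusing the mv_clf and params objects defined in this section; the fold counts are arbitrary):

from sklearn.model_selection import GridSearchCV, cross_val_score
import numpy as np

# inner loop: hyperparameter tuning; outer loop: a less biased performance estimate
gs = GridSearchCV(estimator=mv_clf, param_grid=params, scoring='roc_auc', cv=2)
outer_scores = cross_val_score(gs, X_train, y_train, scoring='roc_auc', cv=5)
print( 'Nested CV ROC AUC: %.3f +/- %.3f' % (np.mean(outer_scores), np.std(outer_scores)) )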
#################################################################https://blog.csdn.net/Linli522362242/article/details/104771157

Building ensembles using stacking

     The majority vote approach we implemented in this section is not to be confused with stacking. The stacking algorithm can be understood as a two-level ensemble, where the first level consists of individual classifiers ### mv_clf = MajorityVoteClassifier( classifiers=[pipe1, clf2, pipe3] ) ### that feed their predictions to the second level, where another classifier (typically logistic regression; note that logistic regression can also be generalized to support multiple classes directly via softmax regression, also called multinomial logistic regression, https://blog.csdn.net/Linli522362242/article/details/104124771) is fit to the level-one classifier predictions (as input) to make the final predictions. The stacking algorithm has been described in more detail by David H. Wolpert in Stacked generalization, Neural Networks, 5(2):241–259, 1992. Unfortunately, an implementation of this algorithm was not available in scikit-learn at the time of writing; however, this feature is underway. In the meantime, you can find scikit-learn-compatible implementations of stacking at http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/ and http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/.
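For example, a two-level stacking ensemble built from the classifiers used in this chapter could look roughly like the following sketch (my own code based on mlxtend's documented API, not the book's):

# pip install mlxtend
from mlxtend.classifier import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# level one: pipe1, clf2, pipe3 (defined earlier); level two: a logistic regression
# meta-classifier fit on the level-one predictions
stack_clf = StackingClassifier( classifiers=[pipe1, clf2, pipe3],
                                meta_classifier=LogisticRegression() )
scores = cross_val_score(stack_clf, X_train, y_train, cv=10, scoring='roc_auc')
print( 'ROC AUC: %.2f (+/- %.2f) [Stacking]' % (scores.mean(), scores.std()) )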
#################################################################

Bagging – building an ensemble of classifiers from bootstrap samples

     Bagging is an ensemble learning technique that is closely related to the MajorityVoteClassifier that we implemented in the previous section. However, instead of using the same training dataset to fit the individual classifiers in the ensemble, we draw bootstrap samples (random samples with replacement) from the initial training dataset, which is why bagging is also known as bootstrap aggregating. The concept of bagging is summarized in the following diagram:

     When sampling is performed without replacement, it is called pasting. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression (https://blog.csdn.net/Linli522362242/article/details/104771157).
In the following subsections, we will work through a simple example of bagging by hand and use scikit-learn for classifying wine examples.

Bagging in a nutshell

     To provide a more concrete example of how the bootstrap aggregating of a bagging classifier works, let's consider the example shown in the following figure. Here, we have seven different training instances (denoted as indices 1-7) that are sampled randomly with replacement in each round of bagging. Each bootstrap sample is then used to fit a classifier, C_j, which is most typically an unpruned decision tree:

     As you can see from the previous illustration, each classifier receives a random subset of examples from the training dataset. We denote these random samples obtained via bagging as Bagging round 1, Bagging round 2, and so on. Each subset contains a certain portion of duplicates and some of the original examples don't appear in a resampled dataset at all due to sampling with replacement. Once the individual classifiers are fit to the bootstrap samples, the predictions are combined using majority voting.
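
     To see the resampling in action, the small NumPy sketch below draws one bagging round from seven training indices (1-7); which indices are duplicated or left out depends on the random seed, which is an arbitrary choice here:

import numpy as np

rng = np.random.RandomState(1)
indices = np.arange(1, 8)                      # seven training examples, indexed 1-7
boot = rng.choice(indices, size=indices.size,  # one bagging round:
                  replace=True)                # sample WITH replacement
print('Bootstrap sample :', boot)
print('Left out this round:', np.setdiff1d(indices, boot))  # examples not drawn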

     Note that bagging is also related to the random forest classifier that we introduced in Cp3, A Tour of Machine Learning Classifiers Using scikit-learn. In fact, random forests are a special case of bagging where we also use random feature subsets when fitting the individual decision trees.
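
     Roughly speaking (this is a hedged sketch, not an exact equivalence), a random forest behaves like a bagging ensemble of decision trees that additionally restrict the features considered at each split:

# Sketch: bagged trees that also subsample features per split behave much like a random forest.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

bag_of_trees = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_features='sqrt',  # random feature subset at each split
                                          random_state=1),
    n_estimators=500, bootstrap=True, random_state=1)

forest = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=1)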

Model ensembles using bagging
     Bagging was first proposed by Leo Breiman in a technical report in 1994; he also showed that bagging can improve the accuracy of unstable models and decrease the degree of overfitting. I highly recommend that you read about his research in Bagging predictors, L. Breiman, Machine Learning, 24(2):123–140, 1996, which is freely available online, to learn more details about bagging.

Applying bagging to classify examples in the Wine dataset

     To see bagging in action, let's create a more complex classification problem using the Wine dataset that was introduced in Chapter 4, Building Good Training Datasets – Data Preprocessing. Here, we will only consider the Wine classes 2 and 3, and we will select two features, Alcohol and OD280/OD315 of diluted wines:
https://archive.ics.uci.edu/ml/datasets/wine
The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

import pandas as pd

df_wine = pd.read_csv( 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None) # the file has no header row
df_wine.columns = ['Class label', 
                   'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 
                   'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 
                   'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df_wine.head(n=100)

# df_wine['Class label'].unique() ==> array([1, 2, 3], dtype=int64)
# drop 1 class 
df_wine = df_wine[ df_wine['Class label']!=1 ] 

y = df_wine['Class label'].values
X = df_wine[ ['Alcohol', 'OD280/OD315 of diluted wines'] ].values # 2D array
X.shape, y.shape

Next, we will encode the class labels into binary format and split the dataset into 80 percent training and 20 percent test datasets, respectively:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

le = LabelEncoder()
y = le.fit_transform(y) # le.transform([2,3]) ==> array([0, 1], dtype=int64)

X_train, X_test, y_train, y_test = train_test_split( X,y, test_size=0.2, random_state=1, stratify=y)

     A BaggingClassifier algorithm is already implemented in scikit-learn, which we can import from the ensemble submodule. Here, we will use an unpruned decision tree as the base classifier and create an ensemble of 500 decision trees fit on different bootstrap samples of the training dataset:

from sklearn.tree import DecisionTreeClassifier  # base estimator used below
from sklearn.ensemble import BaggingClassifier

# max_depth=None : nodes are expanded until all leaves are pure or until all leaves
#                  contain fewer than min_samples_split samples
# min_samples_split : int or float, default=2
tree = DecisionTreeClassifier(criterion = 'entropy', random_state=1, max_depth=None)
bag = BaggingClassifier( base_estimator=tree, n_estimators=500,
                         max_samples=1.0,  # The number of samples to draw from X to train each base estimator
                         max_features=1.0, # The number of features to draw from X to train each base estimator
                         bootstrap=True,   # Whether samples are drawn with replacement. #  default=True
                         bootstrap_features= False, # Whether features are drawn with replacement. # default=False
                         n_jobs=1,         # The number of jobs to run in parallel for both fit and predict. 
                         random_state=1
                       )

######################################
NOTE
     The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.
###################################### 
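
     As a small illustration of that soft vote, the averaged class probabilities can be inspected via predict_proba; this sketch fits a clone of the ensemble so that the code below is unaffected (the exact values depend on the fitted trees):

# The soft vote averages the trees' class-probability estimates.
from sklearn.base import clone

bag_demo = clone(bag).fit(X_train, y_train)
print(bag_demo.predict_proba(X_test[:3]).round(3))   # one probability row per example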

     Next, we will calculate the accuracy score of the prediction on the training and test datasets to compare the performance of the bagging classifier to the performance of a single unpruned decision tree:

from sklearn.metrics import accuracy_score # 1–error

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)

print( 'Decision tree train/test accuracies %0.3f/%0.3f' % (tree_train, tree_test) )

     Based on the accuracy values that we printed here, the unpruned decision tree predicts all the class labels of the training examples correctly; however, the substantially lower test accuracy indicates high variance (overfitting) of the model:

bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)
bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print('Bagging train/test accuracies %.3f/%.3f' % (bag_train, bag_test))


     Although the training accuracies of the decision tree and bagging classifier are similar on the training dataset (both 100 percent), we can see that the bagging classifier has a slightly better generalization performance, as estimated on the test dataset.

Next, let's compare the decision regions between the decision tree and the bagging classifier:

x1_min = X_train[:, 0].min() -1
x1_max = X_train[:, 0].max() +1
x2_min = X_train[:, 1].min() -1
x2_max = X_train[:, 1].max() +1

xx, yy = np.meshgrid( np.arange(x1_min, x1_max, 0.1),
                      np.arange(x2_min, x2_max, 0.1) )

fig, axes = plt.subplots( nrows=1, ncols=2, sharex='col', sharey='row', figsize=(8,3) )
for idx, clf, clf_label in zip( [0,1], [tree, bag], ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)
    
    Z = clf.predict( np.c_[xx.ravel(), yy.ravel()] ) # flatten
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf( xx, yy, Z, alpha=0.3 )
    axes[idx].scatter( X_train[y_train==0, 0], 
                       X_train[y_train==0, 1], c='blue', marker='^')
    axes[idx].scatter( X_train[y_train==1, 0], 
                       X_train[y_train==1, 1], c='green', marker='o' )
    
    axes[idx].set_title(clf_label)
    
axes[0].set_ylabel('Alcohol', fontsize=12)
plt.text(0, -0.2,
         s='OD280/OD315 of diluted wines',
         ha='center', va='center',
         rotation=0, #default
         transform = axes[1].transAxes # interpret (0, -0.2) in the axes[1] coordinate system, i.e. just below that subplot
        )

plt.show()

     As we can see in the resulting plot, the piece-wise linear decision boundary of the depth-7 unpruned decision tree looks smoother in the bagging ensemble.

     To compare their variance, the post at https://blog.csdn.net/Linli522362242/article/details/104771157 compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees (n_estimators=500, as in the preceding code), both trained on the moons dataset. As you can see, the ensemble's predictions (Decision Tree with bagging) will likely generalize much better than the single Decision Tree's predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

     Bootstrapping introduces a bit more diversity in the subsets (training instances randomly sampled from the training set with replacement) that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that the predictors (classifiers) end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.
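
     A hedged sketch of that comparison, evaluating the existing bag ensemble against a pasting variant with 10-fold cross-validation on the training data (the 80 percent subsample fraction for the pasting variant is our own illustrative choice):

from sklearn.model_selection import cross_val_score
from sklearn.base import clone

# Pasting variant of the same ensemble: 80% of the data drawn without replacement.
pasting_variant = clone(bag).set_params(bootstrap=False, max_samples=0.8)

for name, clf in [('Bagging', bag), ('Pasting', pasting_variant)]:
    scores = cross_val_score(clf, X_train, y_train, cv=10, n_jobs=1)
    print('%s CV accuracy: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))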

tree.get_depth()

     We only looked at a very simple bagging example in this section. In practice, more complex classification tasks and a dataset's high dimensionality can easily lead to overfitting in single decision trees, and this is where the bagging algorithm can really play to its strengths. Finally, we must note that the bagging algorithm can be an effective approach to reducing the variance(~overfitting) of a model. However, bagging is ineffective in reducing model bias, that is, models that are too simple to capture the trend in the data well. This is why we want to perform bagging on an ensemble of classifiers with low bias, for example, unpruned decision trees.
#################################################  https://blog.csdn.net/Linli522362242/article/details/104542381

Now, let's try a pruned tree with max_depth=3 and compare it to the same bagging ensemble:

tree = DecisionTreeClassifier(criterion = 'entropy', random_state=1, max_depth=3)
x1_min = X_train[:, 0].min() -1
x1_max = X_train[:, 0].max() +1
x2_min = X_train[:, 1].min() -1
x2_max = X_train[:, 1].max() +1

xx, yy = np.meshgrid( np.arange(x1_min, x1_max, 0.1),
                      np.arange(x2_min, x2_max, 0.1) )

fig, axes = plt.subplots( nrows=1, ncols=2, sharex='col', sharey='row', figsize=(8,3) )
for idx, clf, clf_label in zip( [0,1], [tree, bag], ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)
    
    Z = clf.predict( np.c_[xx.ravel(), yy.ravel()] ) # flatten
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf( xx, yy, Z, alpha=0.3 )
    axes[idx].scatter( X_train[y_train==0, 0], 
                       X_train[y_train==0, 1], c='blue', marker='^')
    axes[idx].scatter( X_train[y_train==1, 0], 
                       X_train[y_train==1, 1], c='green', marker='o' )
    
    axes[idx].set_title(clf_label)
    
axes[0].set_ylabel('Alcohol', fontsize=12)
plt.text(0, -0.2,
         s='OD280/OD315 of diluted wines',
         ha='center', va='center',
         rotation=0, #default
         transform = axes[1].transAxes, # interpret (0, -0.2) in the axes[1] coordinate system, i.e. just below that subplot
        )

plt.show()

#tree = DecisionTreeClassifier(criterion = 'entropy', random_state=1, max_depth=3)
#tree = tree.fit(X_train, y_train)

# pip3 install graphviz
from graphviz import Source
from sklearn.tree import export_graphviz
import os
os.environ["PATH"] += os.pathsep + "C:/Graphviz2.38/bin" # directory where you installed Graphviz
 
export_graphviz(
    tree,
    out_file = os.path.join( "tree.dot"),
    feature_names = ['Alcohol', 'OD280/OD315 of diluted wines'], ###
    class_names = [str(c) for c in le.classes_], # original class labels (2 and 3) as a list of strings
    rounded = True,
    filled = True
)
 
Source.from_file("tree.dot")

 

tree.get_depth()

############################################################ transform = axes[1].transAxes

#...
axes[0].set_ylabel('Alcohol', fontsize=12)
plt.text(0, 0,
         s='0',
         ha='center', va='center',
         rotation=0, #default
         transform = axes[1].transAxes, # interpret (0, 0) in the axes[1] coordinate system, i.e. the lower-left corner of that subplot
         c='blue',
         fontsize=12 
        )

plt.show()

axes[0].set_ylabel('Alcohol', fontsize=12)
plt.text(0, -0.2,
         s='0',
         ha='center', va='center',
         rotation=0, #default
         transform = axes[1].transAxes, # interpret (0, -0.2) in the axes[1] coordinate system, i.e. just below that subplot
         c='blue',
         fontsize=12 
        )

plt.show()


############################################################

Leveraging weak learners via adaptive boosting

     In this last section about ensemble methods, we will discuss boosting, with a special focus on its most common implementation: Adaptive Boosting (AdaBoost).

     In boosting, the ensemble consists of very simple base classifiers, also often referred to as weak learners, which often only have a slight performance advantage over random guessing; a typical example of a weak learner is a decision tree stump. The key concept behind boosting is to focus on training examples that are hard to classify, that is, to let the weak learners subsequently learn from misclassified training examples to improve the performance of the ensemble. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

The following subsections will introduce the algorithmic procedure behind the general concept of boosting and AdaBoost. Lastly, we will use scikit-learn for a practical classification example.

How boosting works

     In contrast to bagging, the initial formulation of the boosting algorithm uses random subsets of training examples drawn from the training dataset without replacement; the original boosting procedure can be summarized in the following four key steps:

  • 1. Draw a random subset (sample) of training examples, d1, without replacement from the training dataset, D, to train a weak learner, C1.
  • 2. Draw a second random training subset, d2, without replacement from the training dataset and add 50 percent of the examples that were previously misclassified to train a weak learner, C2.
  • 3. Find the training examples, d3, in the training dataset, D, on which C1 and C2 disagree (that is, draw from D the examples for which the two classifiers give different predictions), to train a third weak learner, C3.
  • 4. Combine the weak learners C1, C2, and C3 via majority voting.

     As discussed by Leo Breiman (Bias, variance, and arcing classifiers, L. Breiman, 1996), boosting can lead to a decrease in bias as well as variance compared to bagging models. In practice, however, boosting algorithms such as AdaBoost are also known for their high variance, that is, the tendency to overfit the training data (An improvement of AdaBoost to avoid overfitting, G. Raetsch, T. Onoda, and K. R. Mueller. Proceedings of the International Conference on Neural Information Processing, CiteSeer, 1998).

     In contrast to the original boosting procedure described here, AdaBoost uses the complete training dataset to train the weak learners, where the training examples are reweighted in each iteration to build a strong classifier that learns from the mistakes of the previous weak learners in the ensemble.

     Before we dive deeper into the specific details of the AdaBoost algorithm, let's take a look at the following figure to get a better grasp of the basic concept behind AdaBoost:
<==>Figure 7-7. AdaBoost sequential training with instance weight updates https://blog.csdn.net/Linli522362242/article/details/104771157 

     To walk through the AdaBoost illustration step by step, we will start with subfigure 1, which represents a training dataset for binary classification where all training examples are assigned equal weights. Based on this training dataset, we train a decision stump (shown as a dashed line) that tries to classify the examples of the two classes (triangles and circles), as well as possibly minimizing the cost function (or the impurity score in the special case of decision tree ensembles).

     For the next round (subfigure 2), we assign a larger weight to the two previously misclassified examples (circles) ###focusing more and more on the hard cases###. Furthermore, we lower the weight of the correctly classified examples. The next decision stump will now be more focused on the training examples that have the largest weights—the training examples that are supposedly hard to classify. The weak learner shown in subfigure 2 misclassifies three different examples from the circle class, which are then assigned a larger weight, as shown in subfigure 3.

     Assuming that our AdaBoost ensemble only consists of three rounds of boosting, we then combine the three weak learners trained on different reweighted training subsets by a weighted majority vote, as shown in subfigure 4.

     Now that we have a better understanding of the basic concept of AdaBoost, let's take a more detailed look at the algorithm using pseudo code. For clarity, we will denote element-wise multiplication by the cross symbol ( × ) and the dot-product between two vectors by a dot symbol ( ∙ ):

  1.  Set the weight vector, 𝒘, to uniform weights, where Σ_i w_i = 1.
  2.  For j in m boosting rounds, do the following:

      a. Train a weighted weak learner: C_j = train(𝑿, 𝒚, 𝒘).

      b. Predict class labels: 𝒚̂ = predict(C_j, 𝑿).

      c. Compute the weighted error rate: 𝜀 = 𝒘 ∙ (𝒚̂ ≠ 𝒚).

      d. Compute the coefficient: 𝛼_j = 0.5 log( (1 − 𝜀) / 𝜀 ).

      e. Update the weights: 𝒘 := 𝒘 × exp( −𝛼_j × (𝒚̂ × 𝒚) ).

      f. Normalize the weights to sum to 1: 𝒘 := 𝒘 / Σ_i w_i.

  3. Compute the final prediction: 𝒚̂ = ( Σ_{j=1..m} ( 𝛼_j × predict(C_j, 𝑿) ) > 0 ).

     Note that the expression (𝒚̂ ≠ 𝒚) in step 2.c. refers to a binary vector consisting of 1s and 0s, where a 1 is assigned if the prediction is incorrect and 0 is assigned otherwise.
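
     To make the pseudo code tangible, here is a minimal NumPy/scikit-learn sketch of those steps. It assumes class labels encoded as -1 and 1 and ignores edge cases such as a zero or 0.5 error rate; scikit-learn's AdaBoostClassifier, used in the next subsection, is what you would use in practice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m_rounds=10):
    """Sketch of the pseudo code above; y must be in {-1, 1} and 0 < eps < 1 is assumed."""
    n = y.shape[0]
    w = np.full(n, 1.0 / n)                       # 1. uniform weights summing to 1
    stumps, alphas = [], []
    for _ in range(m_rounds):                     # 2. m boosting rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # 2a. train a weighted weak learner
        y_hat = stump.predict(X)                  # 2b. predict class labels
        eps = w.dot(y_hat != y)                   # 2c. weighted error rate
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # 2d. coefficient
        w = w * np.exp(-alpha * y_hat * y)        # 2e. update weights
        w = w / w.sum()                           # 2f. normalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    # 3. weighted vote of the weak learners, thresholded at 0
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(agg > 0, 1, -1)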

     Although the AdaBoost algorithm seems to be pretty straightforward, let's walk through a more concrete example using a training dataset consisting of 10 training examples, as illustrated in the following table:

  1. The first column of the table depicts the indices of training examples 1 to 10.
  2. In the second column, you can see the feature values of the individual samples, assuming this is a one-dimensional dataset.
  3. The third column shows the true class label, y_i, for each training sample, x_i, where y_i ∈ {1, −1}.

  4. The initial weights are shown in the fourth column; we initialize the weights uniformly (assigning the same constant value) and normalize them to sum to 1. In the case of the 10-sample training dataset, we therefore assign 0.1 to each weight, w_i, in the weight vector, 𝒘.

  5. The predicted class labels, 𝒚̂ , are shown in the fifth column, assuming that our splitting criterion is 𝑥 ≤ 3.0 .

  6. The last column of the table then shows the updated weights based on the update rules that we defined in the pseudo code.

     Since the computation of the weight updates may look a little bit complicated at first, we will now follow the calculation step by step. We will start by computing the weighted error rate, 𝜀, as described in step 2c:
𝜀 = 𝒘 ∙ (𝒚̂ ≠ 𝒚) = 3 × 0.1 = 0.3 (a 1 is assigned if the prediction is incorrect and a 0 otherwise; three of the ten examples are misclassified by the 𝑥 ≤ 3.0 stump)

     Next, we will compute the coefficient, 𝛼_j = 0.5 log( (1 − 𝜀) / 𝜀 ) = 0.5 log( 0.7 / 0.3 ) ≈ 0.424 (shown in step 2d), which will later be used in step 2e to update the weights, as well as for the weights in the majority vote prediction (step 3):

     After we have computed the coefficient, 𝛼_j, we can now update the weight vector using the following equation (step 2e):
𝒘 := 𝒘 × exp( −𝛼_j × (𝒚̂ × 𝒚) )

     Here, 𝒚̂ × 𝒚 is an element-wise multiplication between the vectors of the predicted and true class labels, respectively. Thus, if a prediction, ŷ_i, is correct, ŷ_i × y_i will have a positive sign, so we decrease the ith weight, since 𝛼_j is a positive number as well:
0.1 × exp( −0.424 × 1 × 1 ) ≈ 0.065
     Similarly, we will increase the ith weight if ŷ_i predicted the label incorrectly, like this:
0.1 × exp( −0.424 × 1 × (−1) ) ≈ 0.153
     Alternatively, it's like this:
0.1 × exp( −0.424 × (−1) × 1 ) ≈ 0.153

     After we have updated each weight in the weight vector, we normalize the weights so that they sum up to 1 (step 2f):
𝒘 := 𝒘 / Σ_i w_i

     Here, Σ_i w_i = 7 × 0.065 + 3 × 0.153 = 0.914.

     Thus, each weight that corresponds to a correctly classified example will be reduced from the initial value of 0.1 to 0.065/0.914 ≈ 0.071 for the next round of boosting. Similarly, the weights of the incorrectly classified examples will increase from 0.1 to 0.153/0.914 ≈ 0.167 .
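
     These numbers can be checked with a few lines of NumPy, using the counts from above (seven correctly and three incorrectly classified examples, each starting at weight 0.1); small differences from the text come from rounding the intermediate values:

import numpy as np

w = np.full(10, 0.1)                              # step 1: uniform weights
correct = np.array([1]*7 + [0]*3, dtype=bool)     # 7 correct, 3 misclassified predictions
eps = w[~correct].sum()                           # step 2c: 3 * 0.1 = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)             # step 2d: ~0.424
w_new = np.where(correct, w * np.exp(-alpha),     # step 2e: 0.1 -> ~0.065 (correct)
                          w * np.exp(+alpha))     #          0.1 -> ~0.153 (incorrect)
w_norm = w_new / w_new.sum()                      # step 2f: denominator ~0.914 with rounded values
print(eps, round(float(alpha), 3), w_norm.round(3))  # ~0.071 and ~0.167 after normalization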

Applying AdaBoost using scikit-learn

     The previous subsection introduced AdaBoost in a nutshell. Skipping to the more practical part, let's now train an AdaBoost ensemble classifier via scikit-learn. We will use the same Wine subset that we used in the previous section to train the bagging meta-classifier. Via the base_estimator attribute, we will train the AdaBoostClassifier on 500 decision tree stumps:

from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier( criterion='entropy',
                               max_depth=1,
                               random_state=1 )
ada = AdaBoostClassifier( base_estimator=tree,
                          n_estimators=500,
                          learning_rate=0.1,
                          random_state=1 )

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score( y_train, y_train_pred )
tree_test = accuracy_score( y_test, y_test_pred )
print( 'Decision tree train/test accuracies %.3f/%.3f' % (tree_train, tree_test) )

ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print( 'AdaBoost train/test accuracies %.3f/%.3f' % (ada_train, ada_test) )


     As you can see, the decision tree stump (max_depth=1) seems to underfit the training data, in contrast to the unpruned decision tree (max_depth=None) that we saw in the previous section:

     Here, you can see that the AdaBoost model predicts all class labels of the training dataset correctly and also shows a slightly improved test dataset performance compared to the decision tree stump. However, you can also see that we introduced additional variance by our attempt to reduce the model bias, reflected in a greater gap between training and test performance (1 − 0.917 = 0.083 versus 0.916 − 0.875 = 0.041).

     Although we used another simple example for demonstration purposes, we can see that the performance of the AdaBoost classifier is slightly improved compared to the decision stump and achieved very similar accuracy scores as the bagging classifier that we trained in the previous section. However, we must note that it is considered bad practice to select a model based on the repeated usage of the test dataset. The estimate of the generalization performance may be over-optimistic, which we discussed in more detail in Cp6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.

Lastly, let's check what the decision regions look like:

x1_min, x1_max = X_train[:,0].min()-1, X_train[:,0].max()+1
x2_min, x2_max = X_train[:,1].min()-1, X_train[:,1].max()+1

xx, yy = np.meshgrid( np.arange(x1_min, x1_max, 0.1),
                      np.arange(x2_min, x2_max, 0.1)
                    )

fig, axes = plt.subplots(1,2, sharex='col', sharey='row', figsize=(8,3) )

for idx, clf, clf_label in zip([0,1], [tree, ada], ['Decision tree', 'AdaBoost']):
    clf.fit(X_train, y_train)
    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3)
    axes[idx].scatter( X_train[y_train==0, 0],
                       X_train[y_train==0, 1],
                       c='blue', marker='^' )
    axes[idx].scatter( X_train[y_train==1, 0],
                       X_train[y_train==1, 1],
                       c='green', marker='o')
    axes[idx].set_title(clf_label)
    
axes[0].set_ylabel('Alcohol', fontsize=12)

plt.tight_layout()
plt.text( 0, -0.2, 
          s='OD280/OD315 of diluted wines', 
          ha='center', va='center', fontsize=12, 
          transform=axes[1].transAxes )
plt.show()

     By looking at the decision regions, you can see that the decision boundary of the AdaBoost model is substantially more complex than the decision boundary of the decision stump. In addition, note that the AdaBoost model separates the feature space very similarly to the bagging classifier that we trained in the previous section: 

     As concluding remarks about ensemble techniques, it is worth noting that ensemble learning increases the computational complexity compared to individual classifiers. In practice, we need to think carefully about whether we want to pay the price of increased computational costs for an often relatively modest improvement in predictive performance.

     An often-cited example of this tradeoff is the famous $1 million Netflix Prize, which was won using ensemble techniques. The details about the algorithm were published in The BigChaos Solution to the Netflix Grand Prize by A. Toescher, M. Jahrer, and R. M. Bell, Netflix Prize documentation, 2009, which is available at http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf. The winning team received the $1 million grand prize money; however, Netflix never implemented their model due to its complexity, which made it infeasible for a real-world application:

     "We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html

Gradient boosting https://blog.csdn.net/Linli522362242/article/details/105046444

     Another popular variant of boosting is gradient boosting. AdaBoost and gradient boosting share the main overall concept: boosting weak learners (such as decision tree stumps) to strong learners. The two approaches, adaptive and gradient boosting, differ mainly with regard to how the weights are updated and how the (weak) classifiers are combined. If you are familiar with gradient-based optimization and interested in gradient boosting, I recommend reading Jerome Friedman's work (Greedy function approximation: a gradient boosting machine. Jerome Friedman. Annals of Statistics 2001,
pp.1189-1232) and the more recent paper on XGBoost, which is essentially a computationally efficient implementation of the original gradient boost algorithm (XGBoost: A scalable tree boosting system. Tianqi Chen and Carlos Guestrin. Proceeding of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM 2016, pp. 785-794). Note that next to the GradientBoostingClassifier implementation in scikit-learn, scikit-learn now also includes a substantially faster version of gradient boosting in version 0.21, HistGradientBoostingClassifier, which is even faster than XGBoost. For more information about the GradientBoostingClassifier and HistGradientBoostingClassifier in scikit-learn, you can read the documentation at https://scikit-learn.org/stable/modules/ensemble.html#gradienttree-boosting. Also, a short, general explanation of gradient boosting can be found in the lecture notes at https://sebastianraschka.com/pdf/lecture-notes/stat479fs19/07-ensembles__notes.pdf
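
     As a hedged, minimal sketch (the hyperparameter values are arbitrary illustrations, not tuned settings), both gradient boosting estimators can be dropped onto the same Wine subset used in this chapter:

from sklearn.ensemble import GradientBoostingClassifier
# In scikit-learn 0.21-0.23, HistGradientBoostingClassifier is still experimental and must
# be enabled first; in newer releases the extra import below is unnecessary:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=1, random_state=1)
hgb = HistGradientBoostingClassifier(learning_rate=0.1, random_state=1)

for name, clf in [('GradientBoosting', gbrt), ('HistGradientBoosting', hgb)]:
    clf.fit(X_train, y_train)
    print('%s train/test accuracies %.3f/%.3f'
          % (name, clf.score(X_train, y_train), clf.score(X_test, y_test)))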

Summary

     In this chapter, we looked at some of the most popular and widely used techniques for ensemble learning. Ensemble methods combine different classification models to cancel out their individual weaknesses, which often results in stable and well performing models that are very attractive for industrial applications as well as machine learning competitions.

    At the beginning of this chapter, we implemented MajorityVoteClassifier in Python, which allows us to combine different algorithms for classification. We then looked at bagging, a useful technique for reducing the variance of a model by drawing random bootstrap samples (random samples with replacement) from the training dataset and combining the individually trained classifiers via majority vote. Lastly, we learned about AdaBoost, which is an algorithm based on weak learners that subsequently learn from their mistakes.

     Throughout the previous chapters, we learned a lot about different learning algorithms, tuning, and evaluation techniques. In the next chapter, we will look at a particular application of machine learning, sentiment analysis, which has become an interesting topic in the internet and social media era.
