cp3 A Tour of ML Classifiers_3_ random forests_knn

cp3 A Tour of ML Classifiers_stratify_bincount_likelihood_logistic regression_odds ratio_weight decay_L2
https://blog.csdn.net/Linli522362242/article/details/96480059

cp3  A Tour of ML Classifiers_2_support vector_Maximum margin_soft margin_C~slack_kernel_information gain_Gini_entropy_pydotplus_graphviz_decision tree
https://blog.csdn.net/Linli522362242/article/details/107755405

Combining multiple decision trees via random forests

06_Decision Trees_01_graphviz_Gini_Entropy_Decision Tree_CART_prune_Regression_Classification_Tree
https://blog.csdn.net/Linli522362242/article/details/104542381 (Exercise 8_base on _Exercise 7_1000 decision trees)

     Random forests have gained huge popularity in applications of machine learning during the last decade due to their good classification performance, scalability, and ease of use. Intuitively, a random forest can be considered as an ensemble of decision trees. The idea behind a random forest is to average multiple (deep) decision trees that individually suffer from high variance, to build a more robust model that has a better generalization performance and is less susceptible to overfitting. The random forest algorithm can be summarized in four simple steps:

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:

  • a. Randomly select d features without replacement.
  • b. Split the node using the feature that provides the best split according to the objective function, for instance, maximizing the information gain.

3. Repeat the steps 1-2 k times.
4. Aggregate the prediction by each tree to assign the class label by majority vote. Majority voting will be discussed in more detail in Chapter 7, Combining Different Models for Ensemble Learning.

     We should note one slight modification in step 2 when we are training the individual decision trees: instead of evaluating all features to determine the best split at each node, we only consider a random subset of those.
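
     To make these four steps concrete, here is a minimal from-scratch sketch (not the scikit-learn implementation used later; the helper names simple_random_forest and forest_predict are made up for illustration). For simplicity it fixes the d random features once per tree, whereas a true random forest re-draws d features at every node:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, k=25, d=None, random_state=1):
    # Minimal sketch of steps 1-4: grow k trees, each on a bootstrap sample
    # and restricted to d randomly chosen features.
    rng = np.random.RandomState(random_state)
    n, m = X.shape
    if d is None:
        d = max(1, int(np.sqrt(m)))                  # a common default for d
    trees = []
    for _ in range(k):
        idx = rng.choice(n, size=n, replace=True)    # step 1: bootstrap sample of size n
        feats = rng.choice(m, size=d, replace=False) # step 2a: d features without replacement
        tree = DecisionTreeClassifier(criterion='gini',  # step 2b: best split by Gini impurity
                                      random_state=rng)
        tree.fit(X[idx][:, feats], y[idx])           # note: scikit-learn re-samples d features
        trees.append((tree, feats))                  #       at every node via max_features
    return trees

def forest_predict(trees, X):
    # step 4: majority vote over the k individual tree predictions
    preds = np.array([tree.predict(X[:, feats]) for tree, feats in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])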

07_Ensemble Learning and Random Forests_Bagging_Out-of-Bag_Random Forests_Extra-Trees极端随机树_Boosting
https://blog.csdn.net/Linli522362242/article/details/104771157

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=2,  # each base tree limited to at most 2 leaf nodes
                           random_state=42, criterion='gini'),
    n_estimators=25,
    max_samples=1.0,  # fraction of X drawn to train each base estimator (max_samples * X.shape[0] samples)
    bootstrap=True,   # samples are drawn with replacement
    random_state=1,
    n_jobs=-1         # -1: use all available CPU cores
)
 
bag_clf.fit(X_train, y_train)

plot_decision_regions(X_combined, y_combined, 
                      classifier=bag_clf, test_idx=range(105, 150))

plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
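
     As a quick sanity check, here is a sketch that assumes the X_test and y_test arrays from the chapter's earlier train_test_split are still in scope; it compares the accuracy of a single constrained tree against the bagging ensemble trained above:

from sklearn.tree import DecisionTreeClassifier

single_tree = DecisionTreeClassifier(splitter="random", max_leaf_nodes=2,
                                     criterion='gini', random_state=42)
single_tree.fit(X_train, y_train)

print('Single constrained tree accuracy:', single_tree.score(X_test, y_test))
print('Bagging ensemble accuracy:       ', bag_clf.score(X_test, y_test))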

 

Note

     In case you are not familiar with the terms sampling with and without replacement, let's walk through a simple thought experiment. Let's assume we are playing a lottery game where we randomly draw numbers from an urn. We start with an urn that holds five unique numbers, 0, 1, 2, 3, and 4, and we draw exactly one number each turn. In the first round, the chance of drawing a particular number from the urn would be 1/5. Now, in sampling without replacement, we do not put the number back into the urn after each turn. Consequently, the probability of drawing a particular number from the set of remaining numbers in the next round depends on the previous round. For example, if we have a remaining set of numbers 0, 1, 2, and 4, the chance of drawing number 0 would become 1/4 in the next turn.

     However, in random sampling with replacement, we always return the drawn number to the urn, so the probability of drawing a particular number at each turn does not change; we can draw the same number more than once. In other words, in sampling with replacement, the samples (numbers) are independent and have a covariance of zero. For example, the results from five rounds of drawing random numbers could look like this:

  • Random sampling without replacement: 2, 1, 3, 4, 0
  • Random sampling with replacement: 1, 3, 3, 4, 1
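
     The two draw sequences above are easy to reproduce with NumPy; this is just a sketch of the urn thought experiment, not part of the chapter's classifier code:

import numpy as np

rng = np.random.RandomState(1)
urn = np.array([0, 1, 2, 3, 4])

# without replacement: every number appears at most once (a permutation of the urn)
print(rng.choice(urn, size=5, replace=False))

# with replacement: each draw keeps a probability of 1/5 per number, so duplicates are possible
print(rng.choice(urn, size=5, replace=True))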

     Although random forests don't offer the same level of interpretability as decision trees, a big advantage of random forests is that we don't have to worry so much about choosing good hyperparameter values. We typically don't need to prune the random forest since the ensemble model is quite robust to noise from the individual decision trees. The only parameter that we really need to care about in practice is the number of trees k (step 3) that we choose for the random forest. Typically, the larger the number of trees, the better the performance of the random forest classifier at the expense of an increased computational cost.

     Although it is less common in practice, other hyperparameters of the random forest classifier that can be optimized—using techniques we will discuss in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning—are the size n of the bootstrap sample (step 1) and the number of features d that is randomly chosen for each split (step 2a).

Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest.

     Decreasing the size of the bootstrap sample increases the diversity among the individual trees, since the probability that a particular training sample is included in the bootstrap sample is lower. Thus, shrinking the size of the bootstrap samples may increase the randomness of the random forest, and it can help to reduce the effect of overfitting. However, smaller bootstrap samples typically result in a lower overall performance of the random forest: the gap between training and test performance becomes small, but the test performance itself is lower overall.

     Conversely, increasing the size of the bootstrap sample may increase the degree of overfitting(we decrease the randomness). Because the bootstrap samples, and consequently the individual decision trees, become more similar to each other, they learn to fit the original training dataset more closely.

     In most implementations, including the RandomForestClassifier implementation in scikit-learn, the size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff. For the number of features d at each split, we want to choose a value that is smaller than the total number of features in the training set. A reasonable default that is used in scikit-learn and other implementations is d = √m, where m is the number of features in the training set.
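
     Jumping ahead to the scikit-learn implementation introduced below, these two knobs surface directly as parameters: the bootstrap size n corresponds to max_samples (available in scikit-learn >= 0.22) and the feature-subset size d to max_features. A sketch:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=25,      # k: the number of trees (step 3)
    max_samples=None,     # n: bootstrap size; None means the same size as the training set
    max_features='sqrt',  # d: features considered at each split; 'sqrt' means sqrt(m)
    bootstrap=True,       # draw the bootstrap samples with replacement (step 1)
    random_state=1)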

     Conveniently, we don't have to construct the random forest classifier from individual decision trees by ourselves because there is already an implementation in scikit-learn that we can use:

Combining weak to strong learners via random forests

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='gini', 
                                n_estimators=25, 
                                random_state=1, 
                                n_jobs=2)#to parallelize the model training using multiple cores of our computer (here two cores)
forest.fit(X_train, y_train)
plot_decision_regions(X_combined, y_combined, classifier=forest, test_idx=range(105, 150))

plt.xlabel('Petal length [cm]')
plt.ylabel('Petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

     After executing the preceding code, we should see the decision regions formed by the ensemble of trees in the random forest, as shown in the following figure: 

     Using the preceding code, we trained a random forest from 25 decision trees via the n_estimators parameter and used the gini criterion as an impurity measure to split the nodes. Although we are growing a very small random forest from a very small training dataset, we used the n_jobs parameter for demonstration purposes, which allows us to parallelize the model training using multiple cores of our computer (here two cores).

K-nearest neighbors – a lazy learning algorithm

     The last supervised learning algorithm that we want to discuss in this chapter is the k-nearest neighbor (KNN) classifier, which is particularly interesting because it is fundamentally different from the learning algorithms that we have discussed so far.

     KNN is a typical example of a lazy learner. It is called lazy not because of its apparent simplicity, but because it doesn't learn a discriminative function from the training data, but memorizes the training dataset instead.

Note

Parametric versus nonparametric models

     Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.

     KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.

The KNN algorithm itself is fairly straightforward and can be summarized by the following steps (a minimal sketch follows the list):
1. Choose the number of neighbors, k, and a distance metric.
2. Find the k nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.
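
     A minimal NumPy sketch of these three steps (the function name knn_predict is made up for illustration and is not the scikit-learn implementation used below):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # step 1 (distance metric): Euclidean distance between the query point and every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # step 2: indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # step 3: majority vote among the class labels of the k neighbors
    return np.bincount(y_train[nearest]).argmax()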

     The following figure illustrates how a new data point (?) is assigned the triangle class label based on majority voting among its five nearest neighbors.

     Based on the chosen distance metric, the KNN algorithm finds the k samples in the training dataset that are closest (most similar) to the point that we want to classify. The class label of the new data point is then determined by a majority vote among its k nearest neighbors.

     The main advantage of such a memory-based approach is that the classifier immediately adapts as we collect new training data. However, the downside is that the computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario, unless the dataset has very few dimensions (features) and the algorithm has been implemented using efficient data structures such as KD-trees (An Algorithm for Finding Best Matches in Logarithmic Expected Time, J. H. Friedman, J. L. Bentley, and R. A. Finkel, ACM Transactions on Mathematical Software (TOMS), 3(3): 209–226, 1977). Furthermore, we can't discard training samples since no training step is involved. Thus, storage space can become a challenge if we are working with large datasets.
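
     In scikit-learn, the data structure used for the neighbor search can be chosen via the algorithm parameter of KNeighborsClassifier (the default 'auto' lets the library decide). This is only an aside; the main example below keeps the default:

from sklearn.neighbors import KNeighborsClassifier

# force a KD-tree for the neighbor search instead of a brute-force scan
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')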

     By executing the following code, we will now implement a KNN model in scikit-learn using a Euclidean distance metric:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5,
                           p=2,                # The default metric is minkowski,
                           metric='minkowski') #  and with p=2 is equivalent to the standard Euclidean metric.
knn.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined,
                      classifier=knn, test_idx=range(105,150))

plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

By specifying five neighbors in the KNN model for this dataset, we obtain a relatively smooth decision boundary, as shown in the following figure: 
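
     Once the model is fitted, we can also inspect the neighbors directly via the kneighbors method, which returns the distances to, and the training-set indices of, the k nearest samples. A small sketch using the first sample of the test range (index 105 in X_combined_std):

distances, indices = knn.kneighbors(X_combined_std[105].reshape(1, -1))
print(distances)  # distances to the 5 nearest training samples
print(indices)    # their row indices in the data the model was fitted on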

Note

     In the case of a tie, the scikit-learn implementation of the KNN algorithm will prefer the neighbors with a closer distance to the sample. If the neighbors have similar distances, the algorithm will choose the class label that comes first in the training dataset.

     The right choice of k is crucial to find a good balance between overfitting and underfitting. We also have to make sure that we choose a distance metric that is appropriate for the features in the dataset. Often, a simple Euclidean distance measure is used for real-valued samples, for example, the flowers in our Iris dataset, which have features measured in centimeters. However, if we are using a Euclidean distance measure, it is also important to standardize the data so that each feature contributes equally to the distance. The Minkowski distance that we used in the previous code is just a generalization of the Euclidean and Manhattan distance, which can be written as follows:

d(x^(i), x^(j)) = ( Σ_k | x_k^(i) − x_k^(j) |^p )^(1/p)
     It becomes the Euclidean distance if we set the parameter p=2, or the Manhattan distance if we set p=1. Many other distance metrics are available in scikit-learn and can be provided to the metric parameter. They are listed at http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html.
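
     A short sketch verifying the two special cases with NumPy (the helper function minkowski below is written out for illustration; scipy.spatial.distance.minkowski could be used instead):

import numpy as np

def minkowski(a, b, p):
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(minkowski(a, b, p=2), np.linalg.norm(a - b))  # p=2: matches the Euclidean distance
print(minkowski(a, b, p=1), np.abs(a - b).sum())    # p=1: matches the Manhattan distance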

Note

The curse of dimensionality

     It is important to mention that KNN is very susceptible to overfitting due to the curse of dimensionality. The curse of dimensionality describes the phenomenon where the feature space becomes increasingly sparse for an increasing number of dimensions of a fixed-size training dataset. Intuitively, we can think of even the closest neighbors being too far away in a high-dimensional space to give a good estimate. (Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality https://blog.csdn.net/Linli522362242/article/details/105139547)
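
     A small simulation makes this sparsity concrete (a sketch, not from the book): for a fixed number of uniformly random points in the unit hypercube, the distance from one point to its nearest neighbor grows quickly as the number of dimensions increases, so the "nearest" neighbors stop being near:

import numpy as np

rng = np.random.RandomState(1)
n_points = 1000
for dims in (2, 10, 100, 1000):
    X = rng.rand(n_points, dims)
    # distance from the first point to its nearest neighbor among the remaining points
    nearest = np.sqrt(((X[1:] - X[0]) ** 2).sum(axis=1)).min()
    print(dims, 'dimensions: nearest-neighbor distance =', round(nearest, 2))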

     We have discussed the concept of regularization in the section about logistic regression as one way to avoid overfitting. However, in models where regularization is not applicable, such as decision trees and KNN, we can use feature selection and dimensionality reduction techniques to help us avoid the curse of dimensionality. This will be discussed in more detail in the next chapter.

Summary

     In this chapter, you learned about many different machine learning algorithms that are used to tackle linear and nonlinear problems. We have seen that decision trees are particularly attractive if we care about interpretability. Logistic regression is not only a useful model for online learning via stochastic gradient descent, but also allows us to predict the probability of a particular event. Although support vector machines are powerful linear models that can be extended to nonlinear problems via the kernel trick, they have many parameters that have to be tuned in order to make good predictions. In contrast, ensemble methods such as random forests don't require much parameter tuning and don't overfit as easily as decision trees, which makes them attractive models for many practical problem domains. The KNN classifier offers an alternative approach to classification via lazy learning that allows us to make predictions without any model training, but with a more computationally expensive prediction step.

     However, even more important than the choice of an appropriate learning algorithm is the available data in our training dataset. No algorithm will be able to make good predictions without informative and discriminatory features.

     In the next chapter, we will discuss important topics regarding the preprocessing of data, feature selection, and dimensionality reduction, which we will need to build powerful machine learning models. Later in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will see how we can evaluate and compare the performance of our models and learn useful tricks to fine-tune the different algorithms.
