【Chapter 7】Zero-Code Development of the XGBoost Algorithm_Sentosa_DSML

 1、Algorithm Concept

        XGBoost (eXtreme Gradient Boosting), also known as Extreme Gradient Boosting Tree, is an ensemble machine learning algorithm based on decision trees. It uses a gradient boosting framework and is suitable for classification and regression problems.

        XGBoost is a general tree boosting algorithm in the family represented by the Gradient Boosting Decision Tree (GBDT). Its principle is as follows: first, train a tree on the training set and the sample ground-truth values (specifically a CART regression tree, a binary tree whose root and internal splits are chosen by minimizing a split criterion; classification trees use the Gini index, while regression trees minimize squared error). Then use this tree to predict the training set, obtaining a predicted value for each sample. Because the predicted value deviates from the true value, subtracting the two yields the "residual". Next, train a second tree using the residuals, rather than the true values, as the regression targets. Once the two trees are trained, the residual of each sample can be computed again, and a third tree can be trained, and so on. The total number of trees can be specified manually, or training can be stopped by monitoring some metric (such as the error on a validation set).

        When predicting a new sample, each tree produces an output value; summing these outputs yields the sample's final predicted value.
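        The following is a minimal sketch of the residual-fitting loop described above, using scikit-learn's DecisionTreeRegressor as a stand-in for the CART base learner; the synthetic data and parameter values are illustrative assumptions, not XGBoost's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

n_trees, lr = 50, 0.1      # number of trees and shrinkage (learning rate)
pred = np.zeros_like(y)    # running prediction of the ensemble
trees = []

for _ in range(n_trees):
    residual = y - pred                       # residuals become the new targets
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                     # fit the next tree to the residuals
    pred += lr * tree.predict(X)              # add this tree's (shrunken) output
    trees.append(tree)

# Predicting a new sample: sum the output values of all trees
def predict(X_new):
    return sum(lr * t.predict(X_new) for t in trees)

print("train MSE:", np.mean((y - pred) ** 2))
```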

Compared with the GBDT algorithm, XGBoost's improvements are as follows:

        a. It introduces a regularization term, which acts as pre-pruning and prevents the model from overfitting;

        b. It approximates the objective function with a second-order Taylor expansion, which is faster and more efficient, and it supports custom objective functions as long as they are twice differentiable;

        c. XGBoost handles missing values by treating them as a separate group and assigning them to the left or right child node according to which direction yields the greater improvement during node splitting;

        d. It supports parallel computation. XGBoost pre-sorts the feature values and saves the result in block structures (which can also be stored out-of-core on disk); during tree splitting it uses multiple threads to process multiple feature variables at once, greatly improving computation speed. In addition, unlike GBDT, which uses an exact greedy algorithm to evaluate the gain of every candidate split one by one, XGBoost computes quantiles of each feature as candidate split points, weighted by the split statistics of the corresponding feature values, and uses this approximate greedy algorithm during splitting to reduce computational complexity and improve efficiency. A parameter-level sketch of these points follows this list.
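        As a rough illustration of how these improvements surface as parameters in the open-source XGBoost Python package, the sketch below tags each setting with the corresponding item above; the dataset and parameter values are assumptions for demonstration, not recommendations.

```python
import numpy as np
import xgboost as xgb

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X[::20, 2] = np.nan          # (c) missing values are handled natively

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    reg_lambda=1.0,          # (a) L2 regularization on leaf weights
    gamma=0.1,               # (a) minimum gain required to keep a split (pre-pruning)
    tree_method="hist",      # (d) quantile/histogram-based approximate splitting
    n_jobs=4,                # (d) parallel computation across features
)
model.fit(X, y)              # (b) internally uses a second-order Taylor expansion of the loss
print(model.predict(X[:5]))
```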

2、Algorithm Principle

        First, be clear about the algorithm's objective: build K regression trees such that the predictions of the tree ensemble are as close to the true values as possible (accuracy) while generalizing as well as possible. Mathematically, this is a functional optimization problem. The objective function of XGBoost is:

        $$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

        Here $i$ indexes the samples, $l(y_i, \hat{y}_i)$ denotes the prediction error of the $i$-th sample (the smaller the error, the better), and $\Omega(f_k)$ is a function measuring the complexity of the $k$-th tree (the lower the complexity, the stronger the generalization ability), with the expression:

        $$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

        Here $T$ is the number of leaf nodes and $w_j$ is the value (weight) of leaf $j$; this is what regression trees produce, while for classification trees the leaves correspond to categories.

A general objective function consists of these two terms, a loss term and a regularization term:

        $$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$

        The error/loss function $L$ encourages the model to fit the training data as closely as possible, so that the final model has relatively low bias. The regularization term $\Omega$ encourages simpler models: when the model is simple, the results fitted from finite data carry less randomness, so the model is less prone to overfitting and its predictions are more stable.

         Intuitively speaking, the objective function requires the prediction error to be as small as possible, the number of leaf nodes to be as few as possible, and the leaf values to be as non-extreme as possible. What does that mean? Suppose a sample's true value is 4. One pair of regression trees might predict 3 and 1; another pair might predict 2 and 2. The objective prefers the latter, because in the former case the first tree has learned too much and gotten too close to 4 on its own, which carries a greater risk of overfitting.

         How is this achieved? Through a greedy strategy plus optimization (here, quadratic optimization).

         The greedy splitting method is a brute-force search: traverse every feature and every value of that feature, compute the gain before and after the split, and select the feature value with the highest gain as the split point (a node of the tree).

        How is the greedy strategy applied here? At the beginning, all samples are placed in a single node, so $T = 1$. What is $w$? It is unknown and must be computed. At this point, the predicted value of every sample is $w$ (the leaves of a decision tree represent categories; the leaves of a regression tree represent predicted values). Substituting the samples' predicted values, the objective becomes:

        $$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, w) + \gamma \cdot 1 + \frac{1}{2}\lambda w^2$$

       If the error $l$ here is the squared error, then the function above is a quadratic function of $w$ to be minimized: the minimizer is the predicted value of this node, and the minimum function value is the minimum loss.

       The equation above turns the objective function into a quadratic optimization problem; if the objective is not quadratic, Taylor's formula is used to approximate it by a quadratic. With the objective function determined, the next step is to choose a feature and split the node into two, growing a weak learner. To do this, we need to:

       (1) Select the feature to split on. The simplest approach is brute-force enumeration (traverse all features) and choose the split that improves the loss function the most;

      (2) Determine the node's predicted value and the minimum loss function by setting the derivative of the quadratic function to zero.
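       As a worked form of step (2): applying the second-order Taylor expansion with $g_i$ and $h_i$ denoting the first and second derivatives of the loss, and writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ over the samples $I_j$ in leaf $j$, setting the derivative with respect to $w_j$ to zero gives the standard XGBoost result:

       $$\mathrm{Obj} \approx \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T, \qquad w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T$$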

       When a node splits, the loss function is affected only by the samples in that node. Therefore, the gain of each candidate split (the decrease in the loss function) can be computed by looking only at the samples of the node about to be split. Splitting continues in this way to form one tree, then another, each time performing the optimal further split/build on top of the previous predictions.

       Node splitting stops when any of the following situations occurs:

      (1) When the gain brought by a split is below a threshold, the split is cut off; in other words, a split is kept only if it decreases the overall loss function by enough, which amounts to a form of pre-pruning;

      (2) When the tree reaches its maximum depth, tree building stops. This is controlled by the hyperparameter max_depth, because a tree that is too deep learns local patterns of the samples and overfits easily;

      (3) When the sum of sample weights falls below a set threshold, tree building stops. This involves a hyperparameter, the minimum child weight min_child_weight, which is similar to GBM's min_child_leaf parameter but not exactly the same. The general idea is that a leaf node with too few samples will end up overfitting. A sketch of the gain rule and these stopping checks follows below.
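      The following is a minimal sketch of the split-gain rule and the three stopping criteria above, using the $G$/$H$ statistics from the derivation; the function names and example numbers are illustrative assumptions.

```python
def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.1):
    """Gain of splitting a node into (left, right). The gamma term is the
    per-leaf penalty, so the split is worthwhile only when gain > 0."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

def should_split(gain, depth, H_left, H_right,
                 max_depth=6, min_child_weight=1.0):
    if gain <= 0:                                 # (1) gain below the threshold
        return False
    if depth >= max_depth:                        # (2) maximum depth reached
        return False
    if min(H_left, H_right) < min_child_weight:   # (3) child weight sum too small
        return False
    return True

# Example: evaluate one candidate split from its summed gradient statistics
gain = split_gain(G_left=-8.0, H_left=10.0, G_right=6.0, H_right=9.0)
print(gain, should_split(gain, depth=3, H_left=10.0, H_right=9.0))
```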

3、Implementation of XGBoost Classification with Sentosa_DSML Community Edition

      Following the model-construction process, the machine learning algorithm is implemented with Sentosa_DSML Community Edition.

(1) Data Loading

(2) Sample Partitioning

      Connect the type operator and the sample partitioning operator to split the data into training and test sets.

      First, in the sample partitioning operator, choose the ratio for splitting the data into training and test sets.

       Right-click Preview to see the results of the data partitioning.

       Second, in the type operator, set the model type of the Specifications column to Label.

(3) Model Training

        After the sample partitioning is complete, connect the XGBoost Classification operator and double-click it to configure the model properties on the right.
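        For readers who want a code-level picture of what this operator flow does, here is a rough equivalent in the open-source xgboost and scikit-learn Python APIs (split, train, evaluate); the dataset, column roles, and parameter values are assumptions for illustration, not the platform's actual settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Stand-in dataset; in the platform, the Label column is set via the type operator
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Roughly what the XGBoost Classification operator's properties configure
clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)

# Roughly what the evaluation operator reports for each partition
print("training set accuracy:", accuracy_score(y_tr, clf.predict(X_tr)))
print("test set accuracy:   ", accuracy_score(y_te, clf.predict(X_te)))
```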

(4) Evaluation

        Evaluate the model using the Evaluation operator.

        Training set evaluation results

        Test set evaluation results

(5) Model Visualization

4、Implementation of XGBoost Regression

(1) Data Loading and Sample Partitioning

        Data loading and sample partitioning are the same as above.

(2) Model Training

        After the sample partitioning is complete, connect the XGBoost Regression operator, configure the model properties, and run it to obtain the XGBoost regression model.
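        As with classification, a rough code-level counterpart of this regression flow is sketched below; the dataset and parameter values are illustrative assumptions, not the platform's actual settings.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb

# Stand-in regression dataset
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

reg = xgb.XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1)
reg.fit(X_tr, y_tr)

# Metrics of the kind an evaluation operator would report
pred = reg.predict(X_te)
print("test RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("test R2:  ", r2_score(y_te, pred))
```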

(3) Evaluation

        Evaluate the model using the Evaluation operator.

        Training set evaluation results

       Test set evaluation results

(4) Model Visualization

        Right-click the XGBoost regression model to view the model information:

5、Summary

        Compared to traditional coding approaches, completing a machine learning workflow with Sentosa_DSML Community Edition is more efficient and more automated. Traditional methods require writing large amounts of code by hand for data cleaning, feature engineering, model training, and evaluation. In Sentosa_DSML Community Edition, these steps are simplified through a visual interface, pre-built modules, and automated pipelines, effectively lowering the technical barrier. Non-professional developers can build applications through drag-and-drop and configuration, reducing dependence on professional developers.
        Sentosa_DSML Community Edition provides an easy-to-configure operator flow, reducing the time spent writing and debugging code and improving the efficiency of model development and deployment. Because the resulting application has a clearer structure, maintenance and updates become easier, and the platform provides version control and update features that make continuous improvement of applications more convenient.

        Sentosa Data Science and Machine Learning Platform (Sentosa_DSML) is a one-stop AI development, deployment, and application platform with full intellectual property rights owned by Liwei Intelligent Connectivity. It supports both no-code "drag-and-drop" and notebook interactive development, aiming to assist customers in developing, evaluating, and deploying AI algorithm models through low-code methods. Combined with a comprehensive data asset management model and ready-to-use deployment support, it empowers enterprises, cities, universities, research institutes, and other client groups to achieve AI inclusivity and simplify complexity.

        The Sentosa_DSML product consists of one main platform and three functional platforms: the Data Cube Platform (Sentosa_DC) as the main management platform, and the three functional platforms including the Machine Learning Platform (Sentosa_ML), Deep Learning Platform (Sentosa_DL), and Knowledge Graph Platform (Sentosa_KG). With this product, Liwei Intelligent Connectivity has been selected as one of the "First Batch of National 5A-Grade Artificial Intelligence Enterprises" and has led important topics in the Ministry of Science and Technology's 2030 AI Project, while serving multiple "Double First-Class" universities and research institutes in China.

         To give back to society and promote the realization of AI inclusivity for all, we are committed to lowering the barriers to AI practice and making the benefits of AI accessible to everyone to create a smarter future together. To provide learning, exchange, and practical application opportunities in machine learning technology for teachers, students, scholars, researchers, and developers, we have launched a lightweight and completely free Sentosa_DSML Community Edition software. This software includes most of the functions of the Machine Learning Platform (Sentosa_ML) within the Sentosa Data Science and Machine Learning Platform (Sentosa_DSML). It features one-click lightweight installation, permanent free use, video tutorial services, and community forum exchanges. It also supports "drag-and-drop" development, aiming to help customers solve practical pain points in learning, production, and life through a no-code approach.

         This software is an AI-based data analysis tool that possesses capabilities such as mathematical statistics and analysis, data processing and cleaning, machine learning modeling and prediction, as well as visual chart drawing. It empowers various industries in their digital transformation and boasts a wide range of applications, with examples including the following fields:
        1.Finance: It facilitates credit scoring, fraud detection, risk assessment, and market trend prediction, enabling financial institutions to make more informed decisions and enhance their risk management capabilities.
        2.Healthcare: In the medical field, it aids in disease diagnosis, patient prognosis, and personalized treatment recommendations by analyzing patient data.
        3.Retail: By analyzing consumer behavior and purchase history, the tool helps retailers understand customer preferences, optimize inventory management, and personalize marketing strategies.
        4.Manufacturing: It enhances production efficiency and quality control by predicting maintenance needs, optimizing production processes, and detecting potential faults in real-time.
        5.Transportation: The tool can optimize traffic flow, predict traffic congestion, and improve transportation safety by analyzing transportation data.
        6.Telecommunications: In the telecommunications industry, it aids in network optimization, customer behavior analysis, and fraud detection to enhance service quality and user experience.
        7.Energy: By analyzing energy consumption patterns, the software helps utilities optimize energy distribution, reduce waste, and improve sustainability.
        8.Education: It supports personalized learning by analyzing student performance data, identifying learning gaps, and recommending tailored learning resources.
        9.Agriculture: The tool can monitor crop growth, predict harvest yields, and detect pests and diseases, enabling farmers to make more informed decisions and improve crop productivity.
        10.Government and Public Services: It aids in policy formulation, resource allocation, and crisis management by analyzing public data and predicting social trends.

        Welcome to the official website of the Sentosa_DSML Community Edition at https://sentosa.znv.com/. Download and experience it for free. Additionally, we have technical discussion blogs and application case shares on platforms such as Bilibili, CSDN, Zhihu, and cnBlog. Data analysis enthusiasts are welcome to join us for discussions and exchanges.

        Sentosa_DSML Community Edition: Reinventing the New Era of Data Analysis. Unlock the deep value of data with a simple touch through visual drag-and-drop features. Elevate data mining and analysis to the realm of art, unleash your thinking potential, and focus on insights for the future.

Official Download Site: https://sentosa.znv.com/
Official Community Forum: http://sentosaml.znv.com/
GitHub: https://github.com/Kennethyen/Sentosa_DSML
Bilibili: https://space.bilibili.com/3546633820179281
CSDN: https://blog.csdn.net/qq_45586013?spm=1000.2115.3001.5343
Zhihu: https://www.zhihu.com/people/kennethfeng-che/posts
CNBlog: https://www.cnblogs.com/KennethYuen
