Project source: https://www.kaggle.com/c/home-credit-default-risk/overview
1. Data mining objective and exploratory analysis
Many people find it difficult to obtain loans because of insufficient or non-existent credit histories. To evaluate whether clients without bank accounts or credit records are qualified applicants, Home Credit, an international consumer finance provider, collects alternative data on their social and economic status to predict their repayment behavior.
To gain an intuitive understanding of the dataset, we selected representative variables from the four categories summarized in our one-page proposal: personal information, property conditions, conditions of the living area, and application conditions.
First, in terms of personal information, the data includes clients' education and employment. Taking education as an example, applicants with the Secondary / Secondary special education type tend to have worse credit.
Second, from the perspective of property conditions, we chose the variable 'FLAG_OWN_CAR', which indicates whether a client owns a car, and examined its distribution across the target variable. It is plausible that people who do not own a car have more difficulty repaying.
Third, we selected 'REGION_RATING_CLIENT' to see whether clients' locations influence the outcome. People living in region 2 tend to have worse credit than people in other regions.
Fourth, among the information clients provided with this loan application, there appeared to be a relationship between 'FLAG_EMP_PHONE' and 'TARGET': surprisingly, those who provided a work phone seem more likely to default.
2. Data Preprocessing and Model Building
(1) Data preprocessing
The training set is highly imbalanced, with 92% of instances in class 0 and only 8% in class 1, and many variables contain a large number of missing values.
First, variables with more than 60% missing values are removed, since such variables contribute little to the model.
Second, the remaining categorical variables are one-hot encoded into dummy variables; a missing value in a categorical variable simply becomes its own dummy column during the transformation. The filling step therefore applies only to numerical and binary variables: after all remaining variables have been converted to numerical form, missing values are filled with the column mean.
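The preprocessing steps above can be sketched with pandas on a toy frame; the column names are illustrative stand-ins for the real application table, not the actual schema.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for application_train; column names are illustrative.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100_000, np.nan, 150_000, 120_000, np.nan],
    "MOSTLY_MISSING":   [np.nan, np.nan, np.nan, np.nan, 1.0],
    "NAME_EDUCATION_TYPE": ["Higher education", "Secondary special", None,
                            "Higher education", "Secondary special"],
})

# 1) Drop columns with more than 60% missing values.
df = df.loc[:, df.isna().mean() <= 0.60]

# 2) One-hot encode categoricals; dummy_na=True turns a missing value
#    into its own indicator column, so no filling is needed here.
df = pd.get_dummies(df, columns=["NAME_EDUCATION_TYPE"], dummy_na=True)

# 3) Mean-fill the remaining numeric missing values.
df = df.fillna(df.mean(numeric_only=True))

print(df.isna().sum().sum())  # no missing values remain
```

Here `MOSTLY_MISSING` (80% missing) is dropped in step 1, and the `None` education entry becomes its own dummy column rather than being imputed.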
(2) Model selection and building
Model performance is evaluated on the AUC score and the recall of class 1. AUC remains informative even on imbalanced data, and recall matters because, in practice, misclassifying a default loan as a good loan costs more than the opposite error. Model building generally proceeds in the following steps.
- Building models on the preprocessed data. Since the dataset is quite imbalanced, models built from it are not reliable, yet an initial evaluation is still performed on them.
- Splitting the dataset into a training set and a test set at a ratio of 3:7.
- Applying four classifiers out of the box, i.e. LogisticRegression, RandomForestClassifier, XGBClassifier and GradientBoostingClassifier, to the training set, and evaluating on the test set. The high accuracies and low recalls indicate a serious overfitting problem caused by the imbalance in the dataset.
- With AUC scores around 0.74, GBC and XGB perform best in this case, while LogisticRegression scores only 0.62. In general: GBC ≈ XGB > RF >> LR.
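A minimal sketch of this baseline loop, using synthetic data with roughly the same 92/8 imbalance in place of the real application table (the scores it prints are for the toy data, not the report's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data, ~92% class 0 / 8% class 1.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    # xgboost.XGBClassifier would slot in here with the same fit/predict API.
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]  # probability of class 1 for AUC
    scores[name] = (roc_auc_score(y_te, proba),
                    recall_score(y_te, clf.predict(X_te)))
    print(name, scores[name])
```

On imbalanced data like this, the pattern the report describes (high accuracy, low class-1 recall) typically shows up in the recall column.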
(3) Merging datasets
There are several other useful datasets in this competition, including "bureau", "pos_cash", "credit_card", etc. However, these tables do not share a uniform primary key, so we aggregate each table with methods chosen according to our understanding of its contents.
The general steps are as follows:
- Matching new data set with base dataset “application_test” on the primary key
- Dealing with missing values
- Encoding the categorical variables
- Aggregating columns by client ID using appropriate methods (e.g. sum, count or mean)
3. Model improvement
We introduced resampling, feature selection, hyperparameter tuning and merging with other datasets to improve model performance.
(1) Resampling
We then applied downsampling to the dataset to balance it. Upsampling the minority class was not adopted here because of the machine's limited capacity for large datasets. After downsampling, the number of instances fell from more than 300,000 to about 30,000, saving a great deal of computation at the cost of discarding some samples. The following curve shows performance before resampling.
The following curve shows performance after resampling; the AUC improved.
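Downsampling the majority class to the size of the minority class can be sketched directly in pandas (toy data below; the real target column is `TARGET` as in the competition):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy imbalanced frame: roughly 92% class 0, 8% class 1.
df = pd.DataFrame({"TARGET": (rng.random(10_000) < 0.08).astype(int),
                   "x": rng.normal(size=10_000)})

# Keep every minority instance; sample an equal number of majority rows.
minority = df[df["TARGET"] == 1]
majority = df[df["TARGET"] == 0].sample(n=len(minority), random_state=0)

# Concatenate and shuffle so the classes are interleaved.
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
print(balanced["TARGET"].value_counts())
```

The result is an exactly 50/50 split at a fraction of the original size, which is what buys the computational savings described above.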
(2) Feature selection and hyperparameter tuning
To improve performance, we performed feature selection and hyperparameter tuning. First, we ran RFE (recursive feature elimination) to find the best number of features to include, then used those features to make predictions on the test data. To tune the hyperparameters, we used RandomizedSearchCV with 10-fold cross-validation to find the best parameter combination for each model.
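The RFE step looks roughly like this (toy data; a small forest and feature count keep the sketch fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Keep the top 10 features as ranked by the forest's importances,
# eliminating 2 at a time.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10, step=2)
selector.fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)  # (500, 10)
```

To search over the *number* of features as the report describes, `RFECV` performs the same elimination with cross-validated scoring at each feature count.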
XGBoost and Random Forest performed best before feature selection and hyperparameter tuning, so these two algorithms were selected for both steps. Since Random Forest and XGBoost are both tree-based, their important hyperparameters include the number of trees and the number of features considered when splitting a node. We tuned the following hyperparameters, randomly choosing 100 combinations to train on:
a) n_estimators: number of trees in the forest
b) max_features: max number of features considered for splitting a node
c) max_depth: max number of levels in each decision tree
d) min_samples_split: min number of data points placed in a node before the node is split
e) min_samples_leaf: min number of data points allowed in a leaf node
f) bootstrap: whether data points are sampled with replacement when building each tree
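A compact sketch of the randomized search over these Random Forest hyperparameters; the value ranges are illustrative, and `n_iter`/`cv` are shrunk from the report's 100 combinations and 10 folds so the sketch runs quickly:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# The hyperparameters listed above; the ranges here are illustrative.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_features": ["sqrt", "log2", None],
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

# The report used n_iter=100 with 10-fold CV; kept small here for speed.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Scoring with `roc_auc` matches the evaluation metric used throughout the report.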
Using Random Forest Classifier for further exploration, we also found that:
- After feature selection, the 199 features were reduced to 90, and the model performed slightly better than the baseline AUC of 0.7418.
- Hyperparameter tuning, however, changed the AUC only marginally, to 0.7225.
Using XGBClassifier for further exploration, we also found that:
- Feature selection showed that the number of features included seemed to make no difference in our XGB models; RFE-based feature selection may simply have little effect on XGBClassifier for datasets with these features and instances. The AUC was 0.7443.
- After tuning hyperparameters, the AUC changed only marginally, to 0.7434.
(3) Merging with other datasets
To improve the AUC score, we tried adding more features from other datasets, using XGBoost in this part.
First we added the table "previous application.csv". In this dataset, each person can have more than one previous application, so some aggregation is needed.
To process this table, we first generated new variables by subtraction or division, guided by the meaning of the variables. For example, we calculated the annuity difference between the current application and previous ones. We also took each client's last 1, 3, 5 and 10 applications and computed mean values over them. We tried both dummy variables and LabelEncoder for the categorical variables and found that LabelEncoder performed better. After merging with the original dataset, the AUC improved to 0.7566.
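One of these derived features, the annuity difference against the mean of a client's last three applications, can be sketched as follows (toy rows; `DAYS_DECISION` is the competition's recency column, where values closer to 0 are more recent):

```python
import pandas as pd

# Toy stand-in for previous_application: several prior loans per client.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "AMT_ANNUITY": [10.0, 12.0, 8.0, 20.0, 22.0],
    "DAYS_DECISION": [-100, -400, -900, -50, -300],  # closer to 0 = more recent
})
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_ANNUITY": [11.0, 25.0]})

# Mean annuity over each client's last (most recent) 3 applications.
prev_sorted = prev.sort_values("DAYS_DECISION", ascending=False)
last3 = (prev_sorted.groupby("SK_ID_CURR").head(3)
         .groupby("SK_ID_CURR")["AMT_ANNUITY"].mean()
         .rename("PREV_ANNUITY_MEAN_LAST3"))

merged = app.merge(last3, on="SK_ID_CURR", how="left")
# Difference between the current annuity and the historical mean.
merged["ANNUITY_DIFF"] = merged["AMT_ANNUITY"] - merged["PREV_ANNUITY_MEAN_LAST3"]
print(merged)
```

Repeating the `head(k)` step for k = 1, 5 and 10 yields the other last-k mean features described above.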
Second, we added the table "bureau.csv". Similarly, each person can have more than one credit record in this dataset. For example, we counted the number of previous records per person, and took the mean, sum or count of different variables according to their meaning. The AUC rose relatively sharply, to 0.7643.
Third, we added the table "credit_card_balance.csv" and processed it similarly; for example, we took the mean number of past-due days on credit cards. This time the AUC improved only slightly, to 0.7655.
Then we applied hyperparameter tuning with randomized grid search to the combined dataset.
The AUC improved to 0.7672, and the recall for class 1 improved to 0.70.
The candidate hyperparameter grid and the best hyperparameters found were:
"learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
"max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
"min_child_weight": [1, 3, 5, 7],
"gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
"colsample_bytree": [0.3, 0.4, 0.5, 0.7]
Best hyperparameters: {"min_child_weight": 1, "max_depth": 12, "learning_rate": 0.05, "gamma": 0.3, "colsample_bytree": 0.5}
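The grid above is sampled randomly rather than exhaustively; sklearn's `ParameterSampler` shows the mechanism RandomizedSearchCV uses internally (each sampled dict would be passed to `xgboost.XGBClassifier(**params)` and scored by cross-validated AUC; xgboost itself is not needed to see the sampling):

```python
from sklearn.model_selection import ParameterSampler

# The grid reported above, verbatim.
param_grid = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Draw random combinations, as RandomizedSearchCV does internally.
samples = list(ParameterSampler(param_grid, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

The full grid has 6 × 8 × 4 × 5 × 4 = 3,840 combinations, so a randomized search over a subset of them is much cheaper than an exhaustive grid search.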
The performance improvement for this part is summarised below: