1. Data preprocessing
Surprisingly, AutoGluon needs no manual handling of string-typed features; for simplicity, only the yes/no (and gender) columns are mapped to integers here. Nor does the label column need to be split out beforehand — it is simply named later when training.
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import pandas as pd

df1 = pd.read_csv('/Users/johnny/Downloads/CreditMaster/heart_2020_cleaned.csv')

# The dataset is heavily imbalanced, so upsample the minority (positive) class
# with replacement until it matches the majority class in size.
healthy = df1[df1['HeartDisease'] == 'No']
unhealthy = df1[df1['HeartDisease'] == 'Yes']
up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy))
df_new = pd.concat([healthy, up_sampled])

# Map the binary string columns to integers.
df_new = df_new.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})

train, test = train_test_split(df_new, test_size=0.1, random_state=0)
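The effect of the upsampling step above can be sanity-checked on a toy frame. The column name mirrors the real dataset, but the data here is made up for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with the same kind of 'No'/'Yes' imbalance as the real dataset.
df = pd.DataFrame({'HeartDisease': ['No'] * 8 + ['Yes'] * 2})

healthy = df[df['HeartDisease'] == 'No']
unhealthy = df[df['HeartDisease'] == 'Yes']

# Resample the minority class with replacement up to the majority count.
up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy), random_state=0)
df_new = pd.concat([healthy, up_sampled])

print(df_new['HeartDisease'].value_counts().to_dict())  # {'No': 8, 'Yes': 8}
```

After concatenation, both classes contribute the same number of rows, which is why the 700k+ training rows in the log below exceed the original ~320k-row dataset.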
2. Model training
Specify the heart-disease column as the prediction label, then just call fit.
label = 'HeartDisease'
# Note: presets is an argument of fit(), not of the TabularPredictor constructor.
predictor = TabularPredictor(label=label).fit(train, presets='best_quality')

# What happened during training?
results = predictor.fit_summary()
No path specified. Models will be saved in: "AutogluonModels/ag-20220418_130233/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220418_130233/"
AutoGluon Version: 0.2.0
Train Data Rows: 526359
Train Data Columns: 17
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 0]
If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 3317.47 MB
Train Data (Original) Memory Usage: 172.08 MB (5.2% of available memory)
Warning: Data size prior to feature transformation consumes 5.2% of available memory. Consider increasing memory or subsampling the data to avoid instability.
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
('object', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
2.5s = Fit runtime
17 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 56.85 MB (1.7% of available memory)
Data preprocessing and feature engineering runtime = 2.87s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric argument of fit()
Automatically generating train/validation split with holdout_frac=0.01, Train Rows: 521095, Val Rows: 5264
Fitting model: KNeighborsUnif ...
0.8575 = Validation accuracy score
210.89s = Training runtime
0.8s = Validation runtime
Fitting model: KNeighborsDist ...
0.8733 = Validation accuracy score
222.96s = Training runtime
0.54s = Validation runtime
Fitting model: LightGBMXT ...
0.762 = Validation accuracy score
2.54s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBM ...
[1000] train_set's binary_error: 0.199513 valid_set's binary_error: 0.206117
[2000] train_set's binary_error: 0.177611 valid_set's binary_error: 0.18712
[3000] train_set's binary_error: 0.159547 valid_set's binary