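Day 10: A Full Machine Learning Workflow in Python

In machine learning practice, data preprocessing and model building are critical steps. This article reviews the full preprocessing workflow, then uses the processed data for simple modeling and evaluation. Hyperparameter tuning is left out for now.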

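I. Review of the Preprocessing Workflow

The success of machine learning depends heavily on high-quality data. The standard preprocessing workflow is as follows:

  1. Import libraries: bring in the Python libraries needed for data processing, analysis, visualization, and modeling.
  2. Read and understand the data: load the dataset and use the info() and head() methods to get a first look at its basic information and structure.
  3. Missing value handling: identify and handle missing values in the data.
  4. Outlier handling: detect and handle abnormal data points.
  5. Categorical value handling: convert categorical (discrete) data into a format suited to models.
  6. Feature engineering: feature scaling, deriving new features, feature selection, and similar operations.
  7. Split the dataset: divide the data into a training set and a test set for model training and evaluation.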

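1.1 Importing the required packages

import pandas as pd  # data processing and analysis for tabular data
import numpy as np   # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting of all kinds of charts
import seaborn as sns  # higher-level plotting library built on matplotlib, for nicer statistical graphics

# Set a font that supports Chinese characters (fixes Chinese text rendering in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei is a common font on Windows
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly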

1.2 Inspecting the data

data = pd.read_csv('data.csv')    # read the data
print("Basic data information:")
data.info()
print("\nPreview of the first 5 rows:")
print(data.head())

Basic data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

Preview of the first 5 rows

   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0   
1   1       Own Home      1025487.0            10+ years        0.0   
2   2  Home Mortgage       751412.0              8 years        0.0   
3   3       Own Home       805068.0              6 years        0.0   
4   4           Rent       776264.0              8 years        0.0   

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0   
1                     15.0                     15.3            1181730.0   
2                     11.0                     35.0            1182434.0   
3                      8.0                     22.5             147400.0   
4                     13.0                     13.6             385836.0   

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0   
1                        0.0                           NaN           0.0   
2                        0.0                           NaN           0.0   
3                        1.0                           NaN           1.0   
4                        1.0                           NaN           0.0   

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0   
1  debt consolidation   Long Term             264968.0   
2  debt consolidation  Short Term           99999999.0   
3  debt consolidation  Short Term             121396.0   
4  debt consolidation  Short Term             125840.0   

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default  
0                 47386.0        7914.0         749.0               0  
1                394972.0       18373.0         737.0               1  
2                308389.0       13651.0         742.0               0  
3                 95855.0       11338.0         694.0               0  
4                 93309.0        7180.0         719.0               0  

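1.3 Handling missing values

  • Annual Income: 1,557 missing values; these can be filled with the mean income of a related group, for example grouped by "Home Ownership".
  • Years in current job: 371 missing values; convert the strings to numbers first, then fill with the mode or median.
  • Months since last delinquent: many missing values (4,081, more than half of the 7,500 rows); depending on how strongly the feature affects the target, choose multiple imputation or drop the rows with missing values.
  • Credit Score: 1,557 missing values; handle in the same way as "Annual Income".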
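
The group-wise fill suggested for "Annual Income" can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the method used later in this article (which fills with the overall median); the fallback to the overall mean is an added assumption for groups that have no observed income at all.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Own Home", "Own Home", "Rent"],
    "Annual Income": [400000.0, np.nan, 900000.0, np.nan, 600000.0],
})

# Mean income within each Home Ownership group, aligned to the original rows
group_mean = df.groupby("Home Ownership")["Annual Income"].transform("mean")

# Overall mean, used only if an entire group has no observed income (assumption)
overall_mean = df["Annual Income"].mean()

# Fill missing entries with the group mean, then with the overall mean
df["Annual Income"] = df["Annual Income"].fillna(group_mean).fillna(overall_mean)
print(df)
```

Using transform keeps the result the same length as the original column, so it can be passed straight to fillna without a merge.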

1.4 Data type conversion

  • Years in current job: convert from string to numeric.
  • Home Ownership, Purpose, Term: choose one-hot encoding or label encoding depending on the nature of each feature.
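
One way to turn the "Years in current job" strings into numbers is to parse the digits directly. The sketch below is only an illustration: the helper name years_to_number is hypothetical, and treating "< 1 year" as 0 and "10+ years" as 10 is an assumption that differs from the 1 to 11 dictionary mapping used in Part II.

```python
import numpy as np
import pandas as pd

# Sample values in the format seen in the dataset preview
s = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", np.nan])

def years_to_number(value):
    """Hypothetical helper: '< 1 year' -> 0, 'N years' -> N, '10+ years' -> 10."""
    if pd.isna(value):
        return np.nan          # keep missing values missing for later imputation
    if value.startswith("<"):
        return 0.0             # assumption: less than one year counts as 0
    digits = "".join(ch for ch in value if ch.isdigit())
    return float(digits)

years = s.map(years_to_number)
print(years)
```

Either encoding preserves the ordering of the categories, which is what matters for most models.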

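1.5 Handling outliers

For numeric features such as "Annual Income" and "Current Loan Amount", outliers can be detected with a box plot, and whether to treat them is decided case by case.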
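
A box plot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles, and the same rule can be applied numerically. The sketch below uses a few "Current Loan Amount" values from the preview plus one made-up value (300000.0); the value 99999999.0 appears twice in the preview and may be a placeholder rather than a real amount, which is an assumption worth checking against the data source.

```python
import pandas as pd

# Values partly taken from the data preview; 300000.0 is made up
loan = pd.Series(
    [121396.0, 125840.0, 264968.0, 300000.0, 99999999.0],
    name="Current Loan Amount",
)

# Quartiles and interquartile range
q1, q3 = loan.quantile(0.25), loan.quantile(0.75)
iqr = q3 - q1

# Standard box-plot fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = loan[(loan < lower) | (loan > upper)]
print(outliers)
```

Whether flagged points are dropped, capped, or replaced with a missing-value marker depends on what they mean in the business context.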

1.6 Feature scaling

Apply Min-Max normalization or Z-score standardization to numeric features so that they share a comparable range of values.
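
Min-Max normalization computes x' = (x - min) / (max - min), and Z-score standardization computes z = (x - mean) / std. The sketch below applies both with plain pandas on a few "Monthly Debt" values from the preview; the train/test assignment of the rows is made up. It assumes the statistics are computed on the training rows only and then reused on the test rows, which is also how scikit-learn's MinMaxScaler and StandardScaler are meant to be used.

```python
import pandas as pd

# Monthly Debt values from the preview; the train/test assignment is made up
train = pd.DataFrame({"Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0]})
test = pd.DataFrame({"Monthly Debt": [7180.0]})

# Min-Max normalization with statistics from the training rows only
col_min, col_max = train.min(), train.max()
train_minmax = (train - col_min) / (col_max - col_min)
test_minmax = (test - col_min) / (col_max - col_min)

# Z-score standardization (ddof=0 gives the population standard deviation)
mean, std = train.mean(), train.std(ddof=0)
train_z = (train - mean) / std
test_z = (test - mean) / std

print(train_minmax)
print(test_minmax)  # test values can fall outside [0, 1]
```

Tree-based models are largely insensitive to scaling, while distance-based and gradient-based models such as KNN, SVM, and logistic regression usually benefit from it.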

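1.7 Feature engineering

  • Deriving new features: for example, computing a debt-to-income ratio.
  • Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.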
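
Both ideas can be sketched on the five preview rows. The ratio below is defined as annualized monthly debt divided by annual income, which is an assumption (the article does not give a formula), and the column name "Debt to Income" is hypothetical. Five rows are far too few for meaningful correlations; the point is only the mechanics.

```python
import pandas as pd

# The five rows shown in the data preview (selected columns)
df = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Score": [749.0, 737.0, 742.0, 694.0, 719.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature: annualized debt relative to income (assumed definition)
df["Debt to Income"] = df["Monthly Debt"] * 12 / df["Annual Income"]

# Absolute Pearson correlation of each feature with the target, strongest first
corr = (
    df.corr()["Credit Default"]
    .drop("Credit Default")
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

Pearson correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless.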

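II. Data Preprocessing in Practice

2.1 Handling object-type variables

# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)

# Show the unique values of each string column and their counts
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())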

Processing results

  • Home Ownership: label encoding
mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}

data['Home Ownership'] = data['Home Ownership'].map(mapping)
data.head()
  • Years in current job: label encoding (ordered mapping)
years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)
  • Purpose: one-hot encoding
data = pd.get_dummies(data, columns=['Purpose'])
# Convert the bool columns produced by one-hot encoding to integers
for col in data.columns:
    if 'Purpose' in col:
        data[col] = data[col].astype(int)
  • Term: 0-1 mapping
term_mapping = {
    'Short Term': 0,
    'Long Term': 1
}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)

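2.2 Handling numeric variables

# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fill missing values with the median of each column
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back; fillna(..., inplace=True) on a selected column is
    # deprecated chained assignment in recent pandas versions
    data[feature] = data[feature].fillna(median_value)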
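
The loop above computes each median on the full dataset, before the train/test split, so the test rows influence the fill value. A stricter variant computes the median on the training rows only and reuses it for the test rows. The sketch below shows this on made-up "Credit Score" values with a fixed positional split; both are assumptions for illustration, not the article's procedure.

```python
import numpy as np
import pandas as pd

# Made-up Credit Score values with gaps
df = pd.DataFrame(
    {"Credit Score": [749.0, np.nan, 742.0, 694.0, 719.0, np.nan, 737.0, np.nan]}
)

# Fixed positional split for illustration: first 6 rows train, last 2 rows test
train = df.iloc[:6].copy()
test = df.iloc[6:].copy()

# The median is learned from the training rows only
train_median = train["Credit Score"].median()

# The same value fills both parts, so no test information leaks into training
train["Credit Score"] = train["Credit Score"].fillna(train_median)
test["Credit Score"] = test["Credit Score"].fillna(train_median)

print(train_median)
print(test)
```

For a quick baseline the difference is usually small, but the habit matters once scaling or target-dependent encodings enter the pipeline.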

Data information after processing:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   int64  
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64  
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64  
 17  Purpose_business loan         7500 non-null   int32  
 18  Purpose_buy a car             7500 non-null   int32  
 19  Purpose_buy house             7500 non-null   int32  
 20  Purpose_debt consolidation    7500 non-null   int32  
 21  Purpose_educational expenses  7500 non-null   int32  
 22  Purpose_home improvements     7500 non-null   int32  
 23  Purpose_major purchase        7500 non-null   int32  
 24  Purpose_medical bills         7500 non-null   int32  
 25  Purpose_moving                7500 non-null   int32  
 26  Purpose_other                 7500 non-null   int32  
 27  Purpose_renewable energy      7500 non-null   int32  
 28  Purpose_small business        7500 non-null   int32  
 29  Purpose_take a trip           7500 non-null   int32  
 30  Purpose_vacation              7500 non-null   int32  
 31  Purpose_wedding               7500 non-null   int32  
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB

III. Machine Learning Modeling and Evaluation

3.1 Splitting the data

from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features
y = data['Credit Default']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")

Result

Training set shape: (6000, 31), test set shape: (1500, 31)

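3.2 Model training and evaluation

Several common classification models are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.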

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")

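# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
# Remaining metrics, following the same pattern as the SVM and KNN blocks
print(f"Precision: {precision_score(y_test, logreg_pred):.4f}")
print(f"Recall: {recall_score(y_test, logreg_pred):.4f}")
print(f"F1 score: {f1_score(y_test, logreg_pred):.4f}")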
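
The per-model blocks above repeat the same fit, predict, and report steps, so they can be folded into one loop. The sketch below is one possible arrangement, not the article's code: the evaluate helper is hypothetical, and synthetic data from make_classification stands in for X_train and X_test so that the block runs on its own. Only scikit-learn models are included to keep dependencies small; the scikit-learn style classifiers from XGBoost and LightGBM could presumably be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data (assumption, for a self-contained run)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Models to compare, keyed by display name
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

def evaluate(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit one model and return its test-set metrics."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

results = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in models.items()
}

for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Collecting the scores in a dictionary also makes it easy to build a comparison table later, for example with pd.DataFrame(results).T.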

@浙大疏锦行
