An Introduction to Logistic Regression

Latest recommended article published on 2024-09-14 22:52:49

张焚雪

Tags: logistic regression, algorithm, machine learning

Article link: https://blog.csdn.net/2301_79096986/article/details/141965486

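In this article I introduce a classification algorithm: logistic regression. Although its name contains the word "regression", it is a classification algorithm. I will cover its concept, its applications, its strengths and weaknesses, and a concrete example.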

1. Concept and origin

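Logistic regression is a widely used statistical model for classification problems, especially binary classification. Its most important ingredient is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

We start from the linear regression model:

$$y = w^\top x + b$$

Now suppose we want to perform a binary classification task, as in the figure below:

[Figure: illustration of the binary classification task]

(The figure is slightly off: at x = 0.0 there should not be a vertical line; instead there is a single point at y = 0.5.)

From the figure we can see that the sigmoid function mentioned above can help us achieve this kind of classification, so we introduce it and the model becomes:

$$P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

Here y is the binary class label (usually 0 or 1), x is the input feature vector, w is the weight vector, and b is the bias term.

This gives us the logistic regression model.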
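The model can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names and the values of `w` and `b` are hypothetical and chosen by hand, not learned from data.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """P(y = 1 | x) for a single feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part: w^T x + b
    return sigmoid(z)


def predict(x, w, b, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters, for illustration only
w = [2.0, -1.0]
b = 0.5

p_boundary = predict_proba([0.0, 0.5], w, b)  # w^T x + b = 0, on the decision boundary
p_positive = predict_proba([3.0, 1.0], w, b)  # w^T x + b = 5.5, far on the positive side
print(p_boundary, p_positive)
```

A point with w^T x + b = 0 gets probability 0.5, which is why 0.5 is the usual threshold: the decision boundary is the hyperplane w^T x + b = 0.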

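2. Applications

Logistic regression is mainly used for yes/no decisions: in medicine, for example, to predict whether a patient will develop hypertension, and in finance to predict whether a borrower will default.

It can also be used in some image-processing tasks, such as distinguishing cats from dogs.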

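3. Strengths and weaknesses

3.1 Advantages of logistic regression

Simple and intuitive: the model has a simple form and is easy to understand and explain.

Efficient: compared with more complex models, logistic regression trains quickly and predicts efficiently.

Interpretable: each coefficient describes how the log-odds of the positive class change per unit increase in the corresponding feature, so it can be read as an indication of that feature's importance (when the features are on comparable scales).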
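To make the interpretability point concrete, here is a small sketch on synthetic data (the data-generating weights are made up for illustration). Exponentiating a coefficient gives an odds ratio: the factor by which the odds of the positive class are multiplied when that feature increases by one unit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first feature matters, the second is pure noise
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
true_w = np.array([1.5, 0.0])  # assumed "true" weights, illustration only
true_b = -0.5
p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = (rng.random(n) < p).astype(int)

# A large C means very weak regularization, so the estimates stay close
# to the plain maximum-likelihood fit
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

coef = model.coef_[0]
odds_ratios = np.exp(coef)
print("coefficients:", coef)
print("odds ratios: ", odds_ratios)
```

The fitted coefficient of the first feature should land near 1.5 and that of the noise feature near 0, which corresponds to an odds ratio close to 1.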

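3.2 Limitations of logistic regression

Assumes a linear relationship: logistic regression assumes that the log-odds of the output are a linear function of the features, so its ability to handle nonlinear relationships is limited.

Sensitive to outliers: the model is fairly sensitive to outliers, which can distort the fit and may lead to overfitting.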
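The linearity limitation can be demonstrated on a toy dataset. The sketch below uses scikit-learn's `make_circles` (two concentric rings, which no straight line can separate) and compares plain logistic regression with the same model on degree-2 polynomial features. The dataset and the polynomial workaround are illustrative choices and are not part of the flu-shot example later in this article.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: the classes are not linearly separable
X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Plain logistic regression: the decision boundary is a straight line
linear_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Same model on squared and cross terms: the boundary can now be a circle
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
).fit(X_train, y_train)

acc_linear = linear_clf.score(X_test, y_test)
acc_poly = poly_clf.score(X_test, y_test)
print(f"linear features:   {acc_linear:.2f}")
print(f"degree-2 features: {acc_poly:.2f}")
```

The model itself stays linear in its inputs. What changes is the feature space, so nonlinear structure has to be supplied through feature engineering.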

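4. Python example

In this example I did not implement logistic regression from scratch in Python; I call the scikit-learn implementation directly. Because there are two prediction targets, I put the prediction step into a custom function so that it can be reused. Finally, because some features in the data are categorical and cannot be used directly, one-hot encoding is required.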

The full code is as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler


def pre_logistic(X_encoded, y,submition_data):
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_encoded, y,
                                                        test_size=0.2, random_state=42)

    '''import model and predict'''
    # Instantiate the logistic regression model
    logistic_reg = LogisticRegression(random_state=42, max_iter=1000)
    # Fit the model
    logistic_reg.fit(X_train, y_train)

    # Predict
    y_pred = logistic_reg.predict(X_test)
    pre_result = logistic_reg.predict(submition_data).astype('float64')

    # Evaluate
    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
    # Print the classification report
    report = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(report)
    # Print the confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(conf_matrix)

    # Cross-validation
    # Evaluate the model with 5-fold cross-validation
    cv_scores = cross_val_score(logistic_reg, X_encoded, y, cv=5)
    print(f"Cross-Validation Scores: {cv_scores}")
    print(f"Mean Cross-Validation Score: {cv_scores.mean():.2f}")
    # Adjust the C parameter (inverse regularization strength)
    logistic_reg_tuned = LogisticRegression(random_state=42, max_iter=1000, C=10)
    logistic_reg_tuned.fit(X_train, y_train)
    # Evaluate the model again
    y_pred_tuned = logistic_reg_tuned.predict(X_test)
    pre_result_tuned = logistic_reg_tuned.predict(submition_data).astype('float64')
    accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
    print(f"Tuned Accuracy: {accuracy_tuned:.2f}")

    # Return the array of predictions for the submission data
    return pre_result_tuned


'''data preprocessing'''
# Load the data
data = pd.read_csv(r'C:\Users\20349\Desktop\Flu Shot Learning\training_set_features.csv')
#submition = pd.read_csv(r'C:\Users\20349\Desktop\Flu Shot Learning\submission_format.csv') # 26707 ~ 53414
submition_data = pd.read_csv(r"C:\Users\20349\Desktop\Flu Shot Learning\test_set_features.csv")
# Target variables
y_raw = pd.read_csv(r"C:\Users\20349\Desktop\Flu Shot Learning\training_set_labels.csv")
y_h1n1 = y_raw.iloc[:,1]
y_seasonal = y_raw.iloc[:,2]

# Handle missing values by filling each column with its mode
data.fillna(data.mode().iloc[0],inplace=True)
submition_data.fillna(submition_data.mode().iloc[0],inplace=True)

# One-hot encoding
#print(data.iloc[0,:],"\n",data.iloc[1,:])
# Labels
# Select the categorical features to encode
categorical_features = [
    'race', 'sex', 'age_group', 'education', 'income_poverty',
    'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region',
    'census_msa', 'employment_industry', 'employment_occupation'
]
# Select the numeric features
numeric_features = [
    'h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
    'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
    'behavioral_large_gatherings', 'behavioral_outside_home',
    'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
    'chronic_med_condition', 'child_under_6_months', 'health_worker',
    'health_insurance'
]
X_categorical = data[categorical_features]
submition_categorical = submition_data[categorical_features]
X_numeric = data[numeric_features]
submition_numeric = submition_data[numeric_features]

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='if_binary', handle_unknown='ignore')
# Fit on the training data and transform the categorical features
X_categorical_encoded = encoder.fit_transform(X_categorical)
# Only transform the submission data: refitting here could change the encoded columns
submition_categorical_encoded = encoder.transform(submition_categorical)
# Get the names of the encoded features
feature_names = encoder.get_feature_names_out(categorical_features)
# Convert the encoded features to DataFrames
X_categorical_df = pd.DataFrame(X_categorical_encoded, columns=feature_names)
submition_categorical_df = pd.DataFrame(submition_categorical_encoded,columns=feature_names)
# Combine the encoded data with the numeric features
X_encoded = pd.concat([X_numeric, X_categorical_df], axis=1)
submition_encoded = pd.concat([submition_numeric,submition_categorical_df],axis=1)
# Inspect the encoded data
#print(X_encoded.head())

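# Standardization
# Initialize the scaler
scaler = StandardScaler()
# scaler = MinMaxScaler()
# Standardize the numeric features (the scaler is fitted on the training data only)
X_numeric_scaled = scaler.fit_transform(X_numeric)
submition_numeric_scaled = scaler.transform(submition_numeric)
# Merge the standardized numeric features with the one-hot encoded features
X_encoded = pd.concat([pd.DataFrame(X_numeric_scaled,
                                    columns=numeric_features), X_categorical_df], axis=1)
# Apply the same scaling to the submission data so it matches the training features
submition_encoded = pd.concat([pd.DataFrame(submition_numeric_scaled,
                                            columns=numeric_features), submition_categorical_df], axis=1)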

# Predict
pre_h1n1 = pre_logistic(X_encoded,y_h1n1,submition_encoded)
#print(pre_h1n1)
pre_seasonal = pre_logistic(X_encoded,y_seasonal,submition_encoded)
#print(pre_seasonal)
result = list(zip(pre_h1n1,pre_seasonal))
#print(result)


'''write the result into the file'''
index = 0
with open(r'C:\Users\20349\Desktop\Flu Shot Learning\submission_format.csv','w',encoding='utf-8') as submation_file:
    submation_file.write("respondent_id,h1n1_vaccine,seasonal_vaccine\n")
    for i in range(26707,53415):
        submation_file.write(str(i))
        submation_file.write(",")
        submation_file.write(','.join(map(str, result[index])) + "\n")
        index += 1
    print("Writing completed!!!")
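As an alternative layout, the preprocessing steps can be bundled with the model in a single scikit-learn `Pipeline`, so that imputation, scaling and encoding are always fitted on training data only, including inside each cross-validation fold. The following is a sketch on a small synthetic table, not the code used for the results in this article: the three column names are borrowed from the dataset, but the rows, labels and missing values are generated at random purely so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real table (illustration only)
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "h1n1_concern": rng.integers(0, 4, n).astype(float),
    "doctor_recc_h1n1": rng.integers(0, 2, n).astype(float),
    "age_group": rng.choice(["18 - 34 Years", "35 - 44 Years", "65+ Years"], n),
})
logit = 0.8 * toy["h1n1_concern"] + 1.5 * toy["doctor_recc_h1n1"] - 2.0
toy_y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
# Knock out a few values to mimic missing survey answers
toy.loc[::25, "h1n1_concern"] = np.nan
toy.loc[::40, "age_group"] = np.nan

numeric_cols = ["h1n1_concern", "doctor_recc_h1n1"]
categorical_cols = ["age_group"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode, as in the article
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every fold refits the preprocessing on its own training part
cv_scores = cross_val_score(clf, toy, toy_y, cv=5, scoring="roc_auc")
print(f"Mean ROC AUC: {cv_scores.mean():.2f}")

clf.fit(toy, toy_y)
proba = clf.predict_proba(toy.head(5))[:, 1]  # P(y = 1) rather than hard 0/1 labels
print(proba)
```

`predict_proba` returns class probabilities, which is useful when a ranking metric such as ROC AUC matters more than plain accuracy.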

In addition, here is the matplotlib code I used to visualize the data:

import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file
data = pd.read_csv(r'C:\Users\20349\Desktop\Flu Shot Learning\training_set_features.csv')

# Handle missing values by filling each column with its mode
data.fillna(data.mode().iloc[0], inplace=True)

# Get the column names
columns = data.columns.tolist()

# Split the columns into numeric and categorical features
numeric_features = [col for col in columns if data[col].dtype in ['int64', 'float64']]
categorical_features = [col for col in columns if col not in numeric_features]


def plot_features(features, title, plot_type='boxplot'):
    num_plots = len(features)
    rows = (num_plots + 2) // 3  # at most 3 subplots per row
    # squeeze=False keeps axes two-dimensional even when there is only one row
    fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 5), squeeze=False)

    for idx, feature in enumerate(features):
        row_idx = idx // 3
        col_idx = idx % 3
        ax = axes[row_idx, col_idx]

        if plot_type == 'boxplot':
            data[feature].plot(kind='box', vert=False, ax=ax)
        elif plot_type == 'bar':
            value_counts = data[feature].value_counts()
            value_counts.plot(kind='bar', ax=ax)

        ax.set_title(f'{title} of {feature}')
        ax.set_xlabel('Value')
        ax.set_ylabel('')  # hide the y-axis label

    # Remove the unused subplots
    for idx in range(num_plots, rows * 3):
        fig.delaxes(axes[idx // 3, idx % 3])

    plt.tight_layout()
    plt.show()


# Draw box plots of the numeric features
plot_features(numeric_features, 'Boxplot')

# Draw bar charts of the categorical features
plot_features(categorical_features, 'Bar Chart', plot_type='bar')

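The resulting plots are shown below:

[Figure: box plots of the numeric features and bar charts of the categorical features]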