[Data + Code] Binary Classification with the Random Forest Algorithm

小Z的科研日常

Last modified: 2023-06-27 09:53:57

Tags: python, data analysis, classification algorithms, machine learning

First published: 2023-06-27 09:52:39

Article link: https://blog.csdn.net/weixin_46287760/article/details/131410442

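This article uses the breast cancer dataset to build a random forest model for binary classification, with data preprocessing and feature normalization, and reaches an accuracy of 97.66%. Scatter plots compare the features of malignant and benign tumors, and a confusion matrix shows the model's performance.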

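1. Introduction

This article covers four topics: data cleaning and exploration, feature normalization, splitting the data into training and test sets, and building the model.

The dataset is the breast cancer dataset. A random forest classifies each tumor as benign or malignant, which makes this a binary classification task.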

[Figure: dashboard summarizing the main results of this article]

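The data and code for this post have been uploaded to Baidu Netdisk. If you need them, follow the official account 【小Z的科研日常】 and reply with the keyword [随机森林] in the account's chat to get them.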

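2. Data Preprocessing

First, read the data and inspect it. The head() and info() functions show the first few rows and the summary information of the DataFrame, respectively. Then drop() removes the columns that are not needed. Finally, describe() computes descriptive statistics, and the T attribute transposes the result so that it is easier to read.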

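# Import pandas and read the dataset
import pandas as pd
df = pd.read_csv("C:/Users/asus/Desktop/data.csv")
# Look at the first few rows and the summary information
df.head()
df.info()
# Drop the columns that are not needed
df = df.drop(columns=["Unnamed: 32", "id"])
# Descriptive statistics, transposed for readability
df.describe().T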
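
For readers who do not have data.csv at hand, the sketch below rebuilds a similar table from the copy of the Wisconsin Diagnostic Breast Cancer data bundled with scikit-learn. It assumes that data.csv is this dataset (column names such as radius_mean suggest so, but the article does not confirm it). The name df_alt is made up for this illustration. Note that scikit-learn names the columns differently (for example "mean radius") and encodes malignant as 0 and benign as 1, so the sketch maps the target back to "M" and "B".

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: data.csv is the Wisconsin Diagnostic Breast Cancer dataset,
# which scikit-learn also ships; its column names differ from the CSV.
raw = load_breast_cancer(as_frame=True)
df_alt = raw.data.copy()

# scikit-learn encodes the target as 0 = malignant, 1 = benign.
df_alt.insert(0, "diagnosis", raw.target.map({0: "M", 1: "B"}))

print(df_alt.shape)
print(df_alt["diagnosis"].value_counts())
```

With this layout the diagnosis column comes first and the numeric features follow, which matches the column order that the later slicing step relies on.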

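Next comes visual analysis of the data. We use pandas to split the data by diagnosis and seaborn (which builds on matplotlib) to draw scatter plots that compare malignant and benign breast tumors. The left plot shows the relationship between mean radius and mean area, and the right plot shows the relationship between mean radius and mean texture.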

import matplotlib.pyplot as plt
import seaborn as sns

# Split the data by diagnosis: M = malignant, B = benign
M = df[df["diagnosis"]=="M"]
B = df[df["diagnosis"]=="B"]

# Set plotting parameters
plt.figure(figsize=(12, 6))
sns.set_palette('husl') # color palette
sns.set_style('ticks') # plot style
# Scatter plot 1: mean radius vs. mean area
ax1 = plt.subplot(121)
ax1.set_title('Radius Mean vs. Area Mean', fontsize=16)
sns.scatterplot(x='radius_mean', y='area_mean', data=M, color='#8F2D56', label='malignant', s=60, alpha=0.7)
sns.scatterplot(x='radius_mean', y='area_mean', data=B, color='#44817B', label='benign', s=60, alpha=0.7)
plt.xlabel('Radius Mean', fontsize=14)
plt.ylabel('Area Mean', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=12)
# Scatter plot 2: mean radius vs. mean texture
ax2 = plt.subplot(122)
ax2.set_title('Radius Mean vs. Texture Mean', fontsize=16)
sns.scatterplot(x='radius_mean', y='texture_mean', data=M, color='#8F2D56', label='malignant', s=60, alpha=0.7)
sns.scatterplot(x='radius_mean', y='texture_mean', data=B, color='#44817B', label='benign', s=60, alpha=0.7)
plt.xlabel('Radius Mean', fontsize=14)
plt.ylabel('Texture Mean', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=12)
# Adjust the spacing and position of the subplots
plt.subplots_adjust(wspace=0.3, left=0.08, right=0.96, bottom=0.12, top=0.88)
# Show the figure
plt.show()

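Next comes feature engineering: the characters "M" and "B" in the diagnosis column are converted to numbers, with "M" (malignant tumor) encoded as 1 and "B" (benign tumor) encoded as 0. With numeric labels, the data can be passed directly to the binary classifier trained later.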

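# Encode the labels: M (malignant) -> 1, B (benign) -> 0
df.diagnosis = [1 if each == "M" else 0 for each in df.diagnosis]
# Features are every column after diagnosis; the target is diagnosis
X = df.iloc[:,1:].values
y = df.diagnosis.values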
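
As a possible alternative for the label encoding, the sketch below uses Series.map with an explicit dictionary on a small made-up DataFrame (toy, X_toy and y_toy are hypothetical names, and the four rows are illustrative values rather than the real data). One difference worth knowing: map leaves NaN for any label that is not in the dictionary, whereas an if/else expression would silently turn every non-"M" value into 0.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
})

# Explicit mapping: malignant -> 1, benign -> 0
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# Separate the feature matrix from the target vector
X_toy = toy.drop(columns="diagnosis").to_numpy()
y_toy = toy["diagnosis"].to_numpy()

print(X_toy.shape, y_toy)
```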

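Normalize the data with min-max scaling so that each feature lies in the range [0, 1]:

import numpy as np
X = (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))  # axis=0: scale each column separately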
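
The small sketch below illustrates min-max scaling on a made-up 3x2 array (X_demo is hypothetical, not the article's data). It contrasts taking the minimum and maximum per column with taking them over the whole array: on a plain NumPy array, np.min and np.max without an axis argument return a single global value, so features with small magnitudes end up squeezed near zero.

```python
import numpy as np

# Hypothetical data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

# Per-feature scaling: each column is mapped to [0, 1] on its own
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_scaled = (X_demo - col_min) / (col_max - col_min)

# Global scaling: one min and one max for the whole array
X_global = (X_demo - X_demo.min()) / (X_demo.max() - X_demo.min())

print(X_scaled)
print(X_global)
```

For tree-based models such as random forests the choice matters little, because splits depend only on the ordering of values within a feature, but it does matter for distance-based or gradient-based models.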

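Next, split the dataset into a training set and a test set. This experiment uses 30% of the data as the test set and the rest as the training set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)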
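
To make the effect of test_size=0.3 concrete, the sketch below splits placeholder arrays. It assumes the dataset has 569 rows with 212 malignant and 357 benign cases, as in the Wisconsin Diagnostic Breast Cancer data; the article does not state these numbers, and X_demo and y_demo are made-up stand-ins. The stratify argument is an optional addition not used in the article; it keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with an assumed 569 rows
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.array([1] * 212 + [0] * 357)

# 30% test split; stratify keeps the malignant/benign ratio similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Under this assumption the test set has 171 rows, so each misclassified test sample changes the accuracy by roughly 0.58 percentage points.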

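3. Building the Model

Next, we build a random forest model for binary classification on this dataset. Note that the random_state and test_size settings affect the measured accuracy, and the effect differs from dataset to dataset.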

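from sklearn.ensemble import RandomForestClassifier

# Train a forest of 100 trees and evaluate it on the held-out test set
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train,y_train)
prediction = rfc.predict(X_test)
print(f"{rfc} , Accuracy score: {rfc.score(X_test,y_test)}")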

Running the code above gives a binary classification accuracy of 97.66% on the breast cancer dataset.
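
Since the measured accuracy depends on random_state, a single number from one split is best read as one sample. The sketch below repeats the split and the training with several seeds to show the spread. It is only an illustration: it assumes that the breast cancer data bundled with scikit-learn can stand in for data.csv, it skips normalization, and the seed range 0 to 4 is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled breast cancer data stands in for data.csv
X_all, y_all = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(5):
    # A different seed gives a different train/test partition and forest
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y_all, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print([round(s, 4) for s in scores])
print(f"mean={sum(scores) / len(scores):.4f}, "
      f"min={min(scores):.4f}, max={max(scores):.4f}")
```

Reporting the mean and range over several seeds, or using cross-validation, usually gives a steadier picture than one split.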

Next, a confusion matrix visualizes the results:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

def plot_confusion_matrix(y_true, y_pred):
    # Set the seaborn theme before creating the figure so that it takes effect
    sns.set(font_scale=1.4, style='whitegrid', palette='pastel')
    fig, ax = plt.subplots(figsize=(8, 6))
    cm = confusion_matrix(y_true, y_pred)
    # Cell order of cm.flatten() for labels 0/1: TN, FP, FN, TP
    group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cm / np.sum(cm), annot=labels, fmt="", cmap="Blues_r", cbar=False, ax=ax)
    ax.set_xlabel('Predicted Class', fontsize=16)
    ax.set_ylabel('Actual Class', fontsize=16)
    ax.set_xticklabels(['Negative', 'Positive'], fontsize=14)
    ax.set_yticklabels(['Negative', 'Positive'], fontsize=14)
    ax.set_title('Confusion Matrix - Random Forest Classification', fontsize=16)
    plt.show()

plot_confusion_matrix(y_test, prediction)
print(f"Classification report:\n{classification_report(y_test, prediction)}")
print("")
print("_"*12)
print("")
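
To connect the confusion matrix with the figures in the classification report, the sketch below derives accuracy, precision and recall by hand from the four cells. The labels y_true_demo and y_pred_demo are made-up values for illustration, not the article's predictions; 1 stands for malignant (positive) and 0 for benign (negative).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = malignant (positive), 0 = benign (negative)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all correct predictions
precision = tp / (tp + fp)                   # predicted positives that are right
recall = tp / (tp + fn)                      # actual positives that are found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

In a tumor-screening setting, recall on the malignant class is often the number to watch, because a false negative means a malignant case was labeled benign.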