2024 Zhongqing Cup Problem B: Detailed Analysis and Baseline

Misty_Gangnam

Last modified 2024-08-03 15:42:55

Category: Mathematical Modeling. Tags: mathematical modeling, big data, random forest, experience sharing, data structures

First published 2024-08-03 03:16:25

Article link: https://blog.csdn.net/Misty_Gangnam/article/details/140883120

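Disclaimer: this is my first time writing a baseline from a paper and its code, so some parts may lack detail.

Problem statement: 2024 Zhongqing Cup Problem B: Drug Property Prediction

2024 Zhongqing Cup Problem B source files

The links to the accompanying resources will be added to this post later; they are still under review.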

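I. Approach of the Paper

In the previous post I did a theoretical data-mining analysis based on the problem requirements (it was not rigorously checked against the paper and is for reference only), so it may differ from this post in places. Since this is a baseline, I give a condensed, tidied-up account based on my paper; the attached code is a skeleton that you need to complete yourself.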

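1.1 Problem 1:

The attachment contains drug molecule data (graph data). Use traditional methods to build a classification model for the drug molecules, and report the classification accuracy together with an analysis of the results.

Keywords: graph data, traditional methods, classification model, classification accuracy, result analysis.

Graph data: the data points are not isolated; they are tightly linked. Understanding those relationships and analyzing as much of the data as you can (this is a personal call, since analyzing too much strains both compute and time) clearly helps the later classification model and its accuracy.
Traditional methods: classical machine learning such as SVM, random forest and decision trees. Grid search can tune their hyperparameters to improve accuracy; the classification model is built on these methods.
Classification accuracy: obtained directly by comparing the model's predictions with the true labels.
Result analysis: alongside accuracy, compute recall and precision. Plotting these numbers makes the analysis clearer and more presentable.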
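
Traditional classifiers need one fixed-length vector per molecule, so each graph has to be summarized first. The sketch below is one possible way to do that, not the method used in the paper: the function name `graph_descriptor`, the choice of descriptors and the toy molecule are all assumptions made for illustration.

```python
import numpy as np

def graph_descriptor(num_nodes, edges):
    """Turn one undirected graph into a fixed-length feature vector.

    num_nodes: number of atoms (nodes)
    edges: list of (u, v) pairs, each bond listed once
    """
    degree = np.zeros(num_nodes, dtype=float)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    num_edges = len(edges)
    # Density = actual edges / possible edges in a simple undirected graph.
    possible = num_nodes * (num_nodes - 1) / 2
    density = num_edges / possible if possible > 0 else 0.0
    return np.array([
        num_nodes,
        num_edges,
        degree.mean() if num_nodes else 0.0,
        degree.max() if num_nodes else 0.0,
        density,
    ])

# Hypothetical toy molecule: a 4-atom chain 0-1-2-3.
features = graph_descriptor(4, [(0, 1), (1, 2), (2, 3)])
print(features)
```

Stacking one such vector per molecule gives a feature matrix that SVM, random forest or KNN can consume; richer descriptors (atom-type counts, ring counts and so on) can be appended in the same way.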

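Summary: for Problem 1, first analyze the structure of the drug molecule data (graph data), then build classification models with the traditional methods SVM, RandomForest and KNN, and finally evaluate them by accuracy, precision, recall and F1 score, choosing the best-performing traditional method as the final classification model.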

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess the graph data
def load_data(file_path):
    # Data loading: assumes the graph features were already exported to a CSV file
    data = pd.read_csv(file_path)
    return data

def preprocess_data(data):
    # Preprocessing step: fill in your own feature extraction from the graph data here
    # Assumes the label column is named 'target'
    X = data.drop(columns=['target'])
    y = data['target']
    return X, y

# Evaluate model performance
def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)  # classification accuracy
    # zero_division=0 avoids warnings when a class is never predicted
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)  # precision
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)  # recall
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)  # F1 score
    return accuracy, precision, recall, f1

# Plot the confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(title)
    plt.show()

# Define the models, train them and tune their hyperparameters
def train_models(X_train, y_train):
    models = {
        'SVM': SVC(),
        'RandomForest': RandomForestClassifier(),
        'KNN': KNeighborsClassifier()
    }

    params = {
        'SVM': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
        'RandomForest': {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]},
        'KNN': {'n_neighbors': [3, 5, 7]}
    }

    best_models = {}
    for model_name in models:
        # 5-fold grid search over the parameter grid, scored by accuracy
        grid_search = GridSearchCV(models[model_name], params[model_name], cv=5, scoring='accuracy')
        grid_search.fit(X_train, y_train)
        best_models[model_name] = grid_search.best_estimator_

    return best_models

# Main function
def main(file_path):
    data = load_data(file_path)
    X, y = preprocess_data(data)
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    best_models = train_models(X_train, y_train)  # model training

    for model_name, model in best_models.items():
        y_pred = model.predict(X_test)
        accuracy, precision, recall, f1 = evaluate_model(y_test, y_pred)
        print(f'{model_name} - Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}')
        plot_confusion_matrix(y_test, y_pred, f'{model_name} Confusion Matrix')

if __name__ == '__main__':
    file_path = 'path_to_your_data.csv'
    main(file_path)

Apart from the confusion matrix heatmap, the code above does no plotting; add histograms or other charts as needed.
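
For the written result analysis it can also help to see how precision, recall and F1 fall out of the confusion matrix itself. The following is a small sketch under assumptions: `per_class_metrics` is a hypothetical helper, and the 2-class matrix is made-up data, not a result from the competition dataset.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy plus per-class precision, recall and F1 from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j
    (the same layout that sklearn's confusion_matrix returns).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how many samples each class really has
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

# Hypothetical 2-class confusion matrix, for illustration only.
cm = [[8, 2],
      [1, 9]]
accuracy, precision, recall, f1 = per_class_metrics(cm)
print(accuracy, precision, recall, f1)
```

Reporting the per-class values next to the weighted averages makes it easier to spot a class that the model systematically misses.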

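1.2 Problem 2:

Traditional drug molecule classification relies on complex chemical property analysis and biological experiments, which is time-consuming and labor-intensive and scales poorly to large molecular datasets. Developing an efficient, accurate molecule classification method has therefore become an active research topic. Meanwhile, some researchers have applied neural networks to drug molecule mining and proposed graph neural networks, which can be optimized end to end and considerably improve graph classification accuracy. Propose a graph neural network model to classify the data in the attachment, and report the classification accuracy with an analysis of the results.

Condensed task: propose a graph neural network model to classify the data in the attachment, and report the classification accuracy with an analysis of the results.

New keyword: graph neural network, a method that can be optimized end to end.

Graph Neural Network (GNN): a class of deep learning models designed specifically for graph data. Whereas conventional neural networks operate on vectors or matrices, a GNN works directly on graphs such as social networks, biological networks and transportation networks. In graph data, nodes represent entities (people, objects or events) and edges represent the relationships or connections between them. The goal of a GNN is to learn representations of nodes and edges for tasks such as node classification, graph classification and link prediction.

Graph neural networks include, but are not limited to, the following:

1. Graph Convolutional Network (GCN): one of the most basic and widely used GNNs. It learns node representations by performing convolution over the graph structure and can be implemented with libraries such as PyTorch Geometric or DGL. Concretely, a GCN stacks several graph convolution layers to learn node representations, which are then used for classification.

2. Graph Attention Network (GAT): GAT uses an attention mechanism to learn weights for the relationships between nodes, which lets the model capture those relationships more flexibly and may help with classifying molecular data. Like GCN, GAT learns node representations through stacked graph layers; unlike GCN, it introduces attention during aggregation, dynamically assigning different weights to different neighboring nodes.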
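
To make "convolution over the graph structure" concrete, here is a minimal NumPy sketch of a single GCN layer using the standard propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W). The function name `gcn_layer`, the 3-node graph and the identity weight matrix are assumptions chosen so the arithmetic is easy to follow; a real model would learn W.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    deg = a_hat.sum(axis=1)                     # degree including the self-loop
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt    # symmetric normalization
    return np.maximum(a_norm @ features @ weight, 0.0)

# Hypothetical 3-node path graph 0-1-2 with 2 features per node.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
w = np.eye(2)  # identity weights: the output is the normalized neighborhood sum
h = gcn_layer(adj, x, w)
print(h)
```

Each output row mixes a node's own features with those of its neighbors, which is what `GCNConv` does with a learned weight matrix.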

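Note: data processing is extremely important. In mathematical modeling most of the time goes into preprocessing and processing data; the modeling that follows can be borrowed or adapted from existing work. What follows is only a hypothetical example.

Based on the analysis above, a rough baseline looks like this: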

import pandas as pd
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess the graph data
def load_graph_data(file_path):
    # Read the raw file; turning it into nodes, edges and labels is left for you to fill in
    raw = pd.read_csv(file_path)
    # Build a graph data object from hypothetical placeholder values
    edge_index = torch.tensor([[0, 1], [1, 0]], dtype=torch.long)  # hypothetical edges
    x = torch.tensor([[1], [2]], dtype=torch.float)  # hypothetical node features
    y = torch.tensor([0, 1], dtype=torch.long)  # hypothetical labels (one per node)
    data = Data(x=x, edge_index=edge_index.t().contiguous(), y=y)
    return data

# GCN model
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GAT model
class GAT(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, heads=1):
        super(GAT, self).__init__()
        self.conv1 = GATConv(input_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, output_dim, heads=1, concat=False)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# Model training and evaluation
def train_and_evaluate(model, data, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    # The models already output log-probabilities, so NLLLoss is the matching criterion
    criterion = torch.nn.NLLLoss()

    # Training
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        out = model(data)
        loss = criterion(out, data.y)
        loss.backward()
        optimizer.step()

    # Evaluation (on the same data used for training; use a held-out split in a real experiment)
    model.eval()
    with torch.no_grad():
        pred = model(data).argmax(dim=1)
    y_true, y_pred = data.y.cpu().numpy(), pred.cpu().numpy()
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1, y_pred

# Plot the confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(title)
    plt.show()

# Main function
def main(file_path):
    data = load_graph_data(file_path)
    input_dim = data.num_node_features
    hidden_dim = 16
    output_dim = int(data.y.max()) + 1
    y_true = data.y.cpu().numpy()

    gcn_model = GCN(input_dim, hidden_dim, output_dim)
    gat_model = GAT(input_dim, hidden_dim, output_dim)

    print("Training GCN model...")
    gcn_accuracy, gcn_precision, gcn_recall, gcn_f1, gcn_pred = train_and_evaluate(gcn_model, data)
    print(f'GCN - Accuracy: {gcn_accuracy:.4f}, Precision: {gcn_precision:.4f}, Recall: {gcn_recall:.4f}, F1 Score: {gcn_f1:.4f}')
    plot_confusion_matrix(y_true, gcn_pred, 'GCN Confusion Matrix')

    print("Training GAT model...")
    gat_accuracy, gat_precision, gat_recall, gat_f1, gat_pred = train_and_evaluate(gat_model, data)
    print(f'GAT - Accuracy: {gat_accuracy:.4f}, Precision: {gat_precision:.4f}, Recall: {gat_recall:.4f}, F1 Score: {gat_f1:.4f}')
    plot_confusion_matrix(y_true, gat_pred, 'GAT Confusion Matrix')

if __name__ == '__main__':
    file_path = 'path_to_your_data.csv'
    main(file_path)
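
The baseline above predicts one label per node of a single placeholder graph, whereas the task asks for one label per molecule. The usual bridge is a readout step that pools node embeddings into one vector per graph. The NumPy sketch below shows mean pooling under assumptions: the embeddings and the `batch` assignment are made-up values, and the function is written by hand for illustration (PyTorch Geometric provides a function of the same name, `global_mean_pool`, in `torch_geometric.nn`).

```python
import numpy as np

def global_mean_pool(node_embeddings, batch, num_graphs):
    """Average node embeddings per graph.

    batch[i] is the index of the graph that node i belongs to.
    """
    dim = node_embeddings.shape[1]
    sums = np.zeros((num_graphs, dim))
    np.add.at(sums, batch, node_embeddings)     # scatter-add rows into their graph
    counts = np.bincount(batch, minlength=num_graphs).reshape(-1, 1)
    return sums / np.maximum(counts, 1)         # guard against empty graphs

# Hypothetical batch: graph 0 owns nodes 0-1, graph 1 owns nodes 2-4.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 3.0], [6.0, 0.0]])
batch = np.array([0, 0, 1, 1, 1])
graph_emb = global_mean_pool(emb, batch, num_graphs=2)
print(graph_emb)
```

A linear layer on top of the pooled vectors then yields one prediction per molecule, with the labels stored per graph rather than per node.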

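1.3 Problem 3:

Existing graph neural network models struggle with graph-structured data whose node features are sparse and carry redundant information, which limits their effectiveness in complex network analysis. Try to propose a new drug molecule classification method that overcomes this limitation, present experimental results, and analyze and discuss them.

Keywords: graph-structured data with sparse node features and redundant information

For Problem 3, to address the difficulty existing GNNs have with sparse node features and redundant information, we extend the previous network: weight pruning removes unnecessary connections from the graph, then an attention mechanism and a graph autoencoder are introduced to extract better feature representations and learn low-dimensional node representations, and finally cross-validation is used to assess classification accuracy. The steps are:

Weight pruning: remove unnecessary connections from the graph to reduce redundant information.
Graph attention mechanism: dynamically assign different attention weights to different nodes, making the model more flexible.
Graph autoencoder: extract better feature representations from the graph and learn low-dimensional node representations.
Cross-validation: assess the model's classification accuracy.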
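
As an illustration of the attention step, the sketch below computes single-head GAT-style attention weights in NumPy: a score LeakyReLU(a · [h_i || h_j]) for each neighbor j, followed by a softmax over the neighborhood. The function name `attention_weights`, the feature values, the neighbor lists and the attention vector `a` are all hypothetical; in a trained GAT both the feature transform and `a` are learned.

```python
import numpy as np

def attention_weights(h, neighbors, a, slope=0.2):
    """Attention of each node over its neighbors (single head).

    h: already-transformed node features, shape [N, F]
    neighbors: dict node -> list of neighbor ids (include the node itself for a self-loop)
    a: attention vector of shape [2F]
    """
    alphas = {}
    for i, nbrs in neighbors.items():
        scores = np.array([a @ np.concatenate([h[i], h[j]]) for j in nbrs])
        scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
        exp = np.exp(scores - scores.max())                     # numerically stable softmax
        alphas[i] = exp / exp.sum()
    return alphas

# Hypothetical values, for illustration only.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
a = np.array([0.5, -0.5, 1.0, 0.0])
alphas = attention_weights(h, neighbors, a)
print(alphas)
```

The weights for each node sum to 1, so a neighbor with a low score contributes little to the aggregated representation; this is how attention can down-weight redundant connections without deleting them.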

import pandas as pd
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GATConv, GCNConv, GAE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess the graph data
def load_graph_data(file_path):
    # Read the raw file; turning it into nodes, edges and labels is left for you to fill in
    raw = pd.read_csv(file_path)
    # Placeholder graph for illustration; replace it with the real molecules
    # (a 2-node graph is too small for meaningful training)
    edge_index = torch.tensor([[0, 1], [1, 0]], dtype=torch.long)  # hypothetical edges
    x = torch.tensor([[1], [2]], dtype=torch.float)  # hypothetical node features
    y = torch.tensor([0, 1], dtype=torch.long)  # hypothetical labels
    data = Data(x=x, edge_index=edge_index.t().contiguous(), y=y)
    return data

# Weight pruning: drop edges whose weight is at or below the threshold
def prune_graph(data, threshold=0.1):
    edge_weight = data.edge_attr
    if edge_weight is None:
        # No edge weights available (as in the placeholder graph): nothing to prune
        return data
    mask = edge_weight.view(-1) > threshold  # assumes one scalar weight per edge
    data.edge_index = data.edge_index[:, mask]
    data.edge_attr = edge_weight[mask]
    return data

# Graph attention model
class GAT(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, heads=1):
        super(GAT, self).__init__()
        self.conv1 = GATConv(input_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, output_dim, heads=1, concat=False)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# Graph autoencoder model
class GraphAutoencoder(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(GraphAutoencoder, self).__init__()
        # GAE wraps an encoder module; its default decoder is the inner-product decoder
        self.encoder = GAE(GCNConv(input_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, data):
        z = self.encoder.encode(data.x, data.edge_index)
        return z

# Model training and evaluation
def train_and_evaluate(model, data, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    # The model already outputs log-probabilities, so NLLLoss is the matching criterion
    criterion = torch.nn.NLLLoss()

    # Training
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        out = model(data)
        loss = criterion(out, data.y)
        loss.backward()
        optimizer.step()

    # Evaluation (on the same data used for training; use a held-out split in a real experiment)
    model.eval()
    with torch.no_grad():
        pred = model(data).argmax(dim=1)
    y_true, y_pred = data.y.cpu().numpy(), pred.cpu().numpy()
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1, y_pred

# Plot the confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(title)
    plt.show()

# Main function
def main(file_path, epochs=100):
    data = load_graph_data(file_path)
    data = prune_graph(data, threshold=0.1)  # weight pruning
    input_dim = data.num_node_features
    hidden_dim = 16
    output_dim = int(data.y.max()) + 1
    y_true = data.y.cpu().numpy()

    # Graph attention model
    gat_model = GAT(input_dim, hidden_dim, output_dim)

    print("Training GAT model with attention mechanism...")
    gat_accuracy, gat_precision, gat_recall, gat_f1, gat_pred = train_and_evaluate(gat_model, data)
    print(f'GAT - Accuracy: {gat_accuracy:.4f}, Precision: {gat_precision:.4f}, Recall: {gat_recall:.4f}, F1 Score: {gat_f1:.4f}')
    plot_confusion_matrix(y_true, gat_pred, 'GAT Confusion Matrix')

    # Graph autoencoder model
    autoencoder_model = GraphAutoencoder(input_dim, hidden_dim)
    print("Training Graph Autoencoder...")
    optimizer = torch.optim.Adam(autoencoder_model.parameters(), lr=0.01, weight_decay=5e-4)
    autoencoder_model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        z = autoencoder_model(data)
        # Reconstruction loss over all observed edges; negative edges are sampled internally.
        # Hold out some edges first if reconstruction quality itself needs to be measured.
        loss = autoencoder_model.encoder.recon_loss(z, data.edge_index)
        loss.backward()
        optimizer.step()

    autoencoder_model.eval()
    with torch.no_grad():
        z = autoencoder_model(data)
    data.x = z  # replace the original features with the low-dimensional representation

    gat_model = GAT(hidden_dim, hidden_dim, output_dim)
    gat_accuracy, gat_precision, gat_recall, gat_f1, gat_pred = train_and_evaluate(gat_model, data)
    print(f'Graph Autoencoder + GAT - Accuracy: {gat_accuracy:.4f}, Precision: {gat_precision:.4f}, Recall: {gat_recall:.4f}, F1 Score: {gat_f1:.4f}')
    plot_confusion_matrix(y_true, gat_pred, 'Graph Autoencoder + GAT Confusion Matrix')

if __name__ == '__main__':
    file_path = 'path_to_your_data.csv'
    main(file_path)
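
Cross-validation is listed as the last step of the approach, but the code above does not implement it. A minimal sketch of the index bookkeeping is given here, assuming the dataset is a list of molecules indexed 0..N-1; `k_fold_indices` is a hypothetical helper, and in practice scikit-learn's `KFold` or `StratifiedKFold` can be used instead.

```python
import numpy as np

def k_fold_indices(num_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)   # shuffle once so folds are random
    folds = np.array_split(order, k)       # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical dataset of 10 molecules split into 5 folds.
splits = list(k_fold_indices(num_samples=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, sorted(train_idx.tolist()), sorted(test_idx.tolist()))
```

For each fold, a fresh model would be trained on the training indices and scored on the test indices, and the mean and standard deviation of the scores across folds are what go into the paper.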

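The baseline code is for reference only. Coding habits differ, so some people will write more than this, and the numbers and assumptions given here are only illustrative. Material that goes into a paper needs to be more rigorous and more complete.

That is all for this write-up; everything above is for reference only. This is my first post of this kind, so it may have shortcomings. Feedback and discussion in the comments are welcome, and apologies if my coding approach comes across as scattered.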