SSCLMD Model Code Implementation Explained

1. Project Source Code Structure

The source code of the SSCLMD project is organized as follows:

SSCLMD-main/
├── README.md
├── ST4.xlsx
├── Supplementary File.docx
├── code/
│   ├── calculating_similarity.py
│   ├── data_preparation.py
│   ├── data_preprocess.py
│   ├── layer.py
│   ├── main.py
│   ├── parms_setting.py
│   ├── train.py
│   └── utils.py
└── data/
    ├── dataset1.rar
    └── dataset2.rar

2. Core Model Components in Detail

2.1 Model Definition (layer.py)

The model is defined in layer.py and contains the following key classes:

  1. The Attention class
class Attention(nn.Module):
    def __init__(self, in_size, hidden_size=128):   # hidden_size: 128 for LDA; 16 for MDA and LMI
        super(Attention, self).__init__()

        self.project = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1, bias=False)
        )

    def forward(self, z):
        w = self.project(z)
        beta = torch.softmax(w, dim=1)
        return (beta * z).sum(1), beta

This implements an attention mechanism: it computes a weight for each view and aggregates the per-view features by their weighted sum.
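To make the shapes concrete, here is a minimal usage sketch (the sizes are illustrative, matching how the forward pass below stacks three views):

import torch

h1, h2, h_com = (torch.randn(997, 128) for _ in range(3))  # three 128-dim views of 997 nodes
att = Attention(in_size=128)

emb = torch.stack([h1, h2, h_com], dim=1)  # (997, 3, 128)
fused, beta = att(emb)                     # fused: (997, 128); beta: (997, 3, 1) per-view weights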

  2. The GCN class
class GCN(nn.Module):
    def __init__(self, nfeat, nhid, out, dropout = 0.5):
        super(GCN, self).__init__()
        self.gc1 = GCNConv(nfeat, nhid)
        self.prelu1 = nn.PReLU(nhid)
        self.gc2 = GCNConv(nhid, out)
        self.prelu2 = nn.PReLU(out)
        self.dropout = dropout

    def forward(self, x, adj):  # adj is a PyG edge_index tensor of shape (2, num_edges), not a dense matrix
        x = self.prelu1(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.prelu2(self.gc2(x, adj))
        return x

This is a two-layer graph convolutional network (GCN) that extracts node embeddings from a graph.
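A quick sanity check on random data (GCNConv from torch_geometric takes node features plus an edge_index tensor):

import torch

x = torch.randn(997, 512)                      # node features, in_dim = 512
edge_index = torch.randint(0, 997, (2, 5000))  # random edges, for illustration only
enc = GCN(nfeat=512, nhid=256, out=128)
h = enc(x, edge_index)                         # (997, 128) node embeddings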

  3. The Discriminator class
class Discriminator(nn.Module):
    def __init__(self, dim):
        super(Discriminator, self).__init__()
        self.fn = nn.Bilinear(dim, dim, 1)

    def forward(self, h1, h2, h3, h4, c1, c2):
        c_x1 = c1.expand_as(h1).contiguous()
        c_x2 = c2.expand_as(h2).contiguous()

        # positive
        sc_1 = self.fn(h1, c_x1).squeeze(1)
        sc_2 = self.fn(h2, c_x2).squeeze(1)

        # negative
        sc_3 = self.fn(h3, c_x1).squeeze(1)
        sc_4 = self.fn(h4, c_x2).squeeze(1)

        logits = torch.cat((sc_1, sc_2, sc_3, sc_4))

        return logits

This is the discriminator for self-supervised contrastive learning: it scores node embeddings against the global summaries to distinguish positive samples from negative (corrupted) ones.
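The returned logits are ordered as (positive view 1, positive view 2, negative view 1, negative view 2), each segment of length N = number of nodes. This is exactly why train.py (section 4) builds its contrastive target as 2N ones followed by 2N zeros:

import torch

N = 997  # nodes in dataset1
lbl = torch.cat((torch.ones(2 * N), torch.zeros(2 * N)))
# loss = torch.nn.BCEWithLogitsLoss()(logits, lbl)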

  4. The SSCLMD class
class SSCLMD(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, decoder1):
        super(SSCLMD, self).__init__()

        self.encoder1 = GCN(in_dim, hid_dim, out_dim)
        self.encoder2 = GCN(in_dim, hid_dim, out_dim)
        self.encoder3 = GCN(in_dim, hid_dim, out_dim)
        self.encoder4 = GCN(in_dim, hid_dim, out_dim)

        self.pooling = AvgReadout()
        self.attention = Attention(out_dim)

        self.disc = Discriminator(out_dim)
        self.act_fn = nn.Sigmoid()

        self.local_mlp = nn.Linear(out_dim, out_dim)
        self.global_mlp = nn.Linear(out_dim, out_dim)

        self.decoder1 = nn.Linear(out_dim * 4, decoder1)
        self.decoder2 = nn.Linear(decoder1, 1)

This is the main SSCLMD model class, integrating the encoders, the attention mechanism, the discriminator, and the decoder.

2.2 Forward Pass

The forward pass of the SSCLMD model is as follows:

def forward(self, data_s, data_f, idx):
    # unpack node features and graph structures
    feat, s_graph = data_s.x, data_s.edge_index
    shuff_feat, f_graph = data_f.x, data_f.edge_index
    
    # encode the structure graph and the feature graph
    h1 = self.encoder1(feat, s_graph)
    h2 = self.encoder2(feat, f_graph)
    
    h1 = self.local_mlp(h1)
    h2 = self.local_mlp(h2)
    
    # encode the corrupted (row-shuffled) features as negative samples
    h3 = self.encoder1(shuff_feat, s_graph)
    h4 = self.encoder2(shuff_feat, f_graph)
    
    h3 = self.local_mlp(h3)
    h4 = self.local_mlp(h4)
    
    # extra encodings used for relation prediction
    # (note: encoder3 is reused on both graphs; encoder4 from __init__ is unused here)
    h5 = self.encoder3(feat, s_graph)
    h6 = self.encoder3(feat, f_graph)
    
    # global (graph-level) summaries
    c1 = self.act_fn(self.global_mlp(self.pooling(h1)))
    c2 = self.act_fn(self.global_mlp(self.pooling(h2)))
    
    # self-supervised contrastive scores
    out = self.disc(h1, h2, h3, h4, c1, c2)
    
    # multi-view fusion via attention
    h_com = (h5 + h6) / 2
    emb = torch.stack([h1, h2, h_com], dim=1)
    emb, att = self.attention(emb)
    
    # select entity embeddings by task type
    # (dataset1 node ordering implied by the offsets: lncRNAs from 0, diseases from 386, miRNAs from 702)
    if args.task_type == 'LDA':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 386]
    
    if args.task_type == 'MDA':
        entity1 = emb[idx[0] + 702]
        entity2 = emb[idx[1] + 386]
    
    if args.task_type == 'LMI':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 702]
    
    # multi-relation decoder: sum, elementwise product, and concatenation
    add = entity1 + entity2
    product = entity1 * entity2
    concatenate = torch.cat((entity1, entity2), dim=1)
    
    # feature dim = 4 * out_dim (d + d + 2d), matching decoder1's input size
    feature = torch.cat((add, product, concatenate), dim=1)
    
    log1 = F.relu(self.decoder1(feature))
    log = self.decoder2(log1)
    
    return out, log

3. Data Preprocessing in Detail

Data preprocessing is implemented mainly in data_preprocess.py; the key steps are:

  1. Loading the data and building positive/negative samples
positive = np.loadtxt(args.in_file, dtype=np.int64)
link_size = int(positive.shape[0])  # equals the full set, so the slice below keeps all positive links
np.random.seed(args.seed)
np.random.shuffle(positive)
positive = positive[:link_size]

negative_all = np.loadtxt(args.neg_sample, dtype=np.int64)
np.random.shuffle(negative_all)
negative = np.asarray(negative_all[:positive.shape[0]])

positive = np.concatenate([positive, np.ones(positive.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)
negative = np.concatenate([negative, np.zeros(negative.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)

all_data = np.vstack((positive, negative))
  2. Building the K-fold cross-validation datasets (the Data_class wrapper used here is not shown in the repository excerpts; a sketch appears after this list)
kf = KFold(n_splits=n_splits, shuffle=True, random_state=args.seed)

cv_train_loaders = []
cv_test_loaders = []

for train_index, test_index in kf.split(all_data):
    train_data = all_data[train_index]
    test_data = all_data[test_index]
    
    train_positive = train_data[train_data[:, 2] == 1][:, :2]
    
    # build the adjacency matrices ...
    
    # build the data loaders
    training_set = Data_class(train_data)
    train_loader = DataLoader(training_set, **params)
    
    test_set = Data_class(test_data)
    test_loader = DataLoader(test_set, **params)
    
    cv_train_loaders.append(train_loader)
    cv_test_loaders.append(test_loader)
  3. Building the graph data structures (how shuf_feature is produced is sketched after this list as well)
# build the edge indices
edges_s = s_adj.nonzero()
edge_index_s = torch.tensor(np.vstack((edges_s[0], edges_s[1])), dtype=torch.long)

edges_f = f_adj.nonzero()
edge_index_f = torch.tensor(np.vstack((edges_f[0], edges_f[1])), dtype=torch.long)

# convert features to tensors
x = torch.tensor(node_feature, dtype=torch.float)
shuf_feature = torch.tensor(shuf_feature, dtype=torch.float)

# create PyG Data objects
data_s = Data(x=x, edge_index=edge_index_s)
data_f = Data(x=shuf_feature, edge_index=edge_index_f)
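Two pieces used above do not appear in the snippets: the Data_class wrapper and the construction of shuf_feature. The sketch below is an assumption based on how train.py consumes the loader as (label, inp) and on the standard DGI-style corruption of row-shuffling node features; the real implementations may differ:

import numpy as np
from torch.utils.data import Dataset

class Data_class(Dataset):
    # wraps (entity1, entity2, label) triples; yields (label, (idx0, idx1))
    def __init__(self, triple):
        self.entity1 = triple[:, 0]
        self.entity2 = triple[:, 1]
        self.label = triple[:, 2]

    def __len__(self):
        return len(self.label)

    def __getitem__(self, index):
        return self.label[index], (self.entity1[index], self.entity2[index])

# DGI-style corruption: permute the feature rows to build negative samples
perm = np.random.permutation(node_feature.shape[0])
shuf_feature = node_feature[perm]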

4. Training Process in Detail

Training is implemented in train.py and consists of the following steps:

  1. Model initialization
model = SSCLMD(in_dim=args.dimensions, hid_dim=args.hidden1, out_dim=args.hidden2, decoder1=args.decoder1)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

m = torch.nn.Sigmoid()
loss_fct = torch.nn.BCEWithLogitsLoss()  # contrastive loss on the raw discriminator logits
loss_node = torch.nn.BCELoss()           # link-prediction loss on sigmoid outputs
  2. Training loop
for epoch in range(args.epochs):
    t = time.time()
    print('-------- Epoch ' + str(epoch + 1) + ' --------')
    y_pred_train = []
    y_label_train = []
    
    # contrastive labels: 2*N positives then 2*N negatives, N = number of nodes
    lbl_1 = torch.ones(997 * 2)  # dataset1: 997 nodes, dataset2: 1071
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()
    
    for i, (label, inp) in enumerate(train_loader):
        if args.cuda:
            label = label.cuda()
            
        model.train()
        optimizer.zero_grad()
        
        # forward pass
        output, log = model(data_s, data_f, inp)
        log = torch.squeeze(m(log))
        
        # compute losses
        loss_class = loss_node(log, label.float())
        loss_constra = loss_fct(output, lbl)
        loss_train = loss_class + args.loss_ratio1 * loss_constra
        
        # backward pass
        loss_train.backward()
        optimizer.step()
        
        # collect predictions
        label_ids = label.to('cpu').numpy()
        y_label_train = y_label_train + label_ids.flatten().tolist()
        y_pred_train = y_pred_train + log.flatten().tolist()
        
        if i % 100 == 0:
            print('epoch: ' + str(epoch + 1) + '/ iteration: ' + str(i + 1) + '/ loss_train: ' + str(
                loss_train.cpu().detach().numpy()))
    
    # ROC AUC on the training set
    roc_train = roc_auc_score(y_label_train, y_pred_train)
    
    print('epoch: {:04d}'.format(epoch + 1),
              'loss_train: {:.4f}'.format(loss_train.item()),
              'auroc_train: {:.4f}'.format(roc_train),
              'time: {:.4f}s'.format(time.time() - t))
  3. Testing
def test(model, loader, data_s, data_f, args):
    m = torch.nn.Sigmoid()
    loss_fct = torch.nn.BCEWithLogitsLoss()
    loss_node = torch.nn.BCELoss()
    
    # contrastive labels (same construction as in training)
    lbl_1 = torch.ones(997 * 2)
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()
    
    inp_id0 = []
    inp_id1 = []
    
    model.eval()
    y_pred = []
    y_label = []
    
    with torch.no_grad():
        for i, (label, inp) in enumerate(loader):
            inp_id0.append(inp[0])
            inp_id1.append(inp[1])
            
            if args.cuda:
                label = label.cuda()
            
            # forward pass
            output, log = model(data_s, data_f, inp)
            log = torch.squeeze(m(log))
            
            # compute losses
            loss_class = loss_node(log, label.float())
            loss_constra = loss_fct(output, lbl)
            loss = loss_class + args.loss_ratio1 * loss_constra
            
            # collect predictions
            label_ids = label.to('cpu').numpy()
            y_label = y_label + label_ids.flatten().tolist()
            y_pred = y_pred + log.flatten().tolist()
    
    # threshold once after all batches (the original recomputed this inside the loop)
    outputs = (np.asarray(y_pred) >= 0.5).astype(int)
    
    # compute evaluation metrics
    return roc_auc_score(y_label, y_pred), average_precision_score(y_label, y_pred), f1_score(y_label, outputs), loss

5. Main Program Flow (main.py)

The main program flow is straightforward:

# parse settings
args = settings()

# CUDA setup and seeding
args.cuda = not args.no_cuda and torch.cuda.is_available()
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)

# load data (renamed to the cv_* lists from data_preprocess.py to avoid shadowing the loop variables)
data_s, data_f, cv_train_loaders, cv_test_loaders = load_data(args, n_splits=5)

# train and test on each fold
for fold, (train_loader, test_loader) in enumerate(zip(cv_train_loaders, cv_test_loaders)):
    print(f"Training on fold {fold+1}")
    train_model(data_s, data_f, train_loader, test_loader, args)
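A typical invocation, using the flags defined in parms_setting.py (section 6) and assuming dataset1 has been extracted, would be: python main.py --in_file dataset1/LDA.edgelist --neg_sample dataset1/no_LDA.edgelist --task_type LDA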

6. Parameter Settings (parms_setting.py)

The model parameters are defined in parms_setting.py and mainly include:

def settings():
    parser = argparse.ArgumentParser()
    
    # common parameters
    parser.add_argument('--seed', type=int, default=0,
                        help='Random seed. Default is 0.')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='Disables CUDA training.')
    parser.add_argument('--workers', type=int, default=0,
                        help='Number of parallel workers. Default is 0.')
    
    # data path parameters
    parser.add_argument('--in_file', default="dataset1/LDA.edgelist",
                        help='Path to the positive edge list. e.g., dataset1/LDA.edgelist')
    parser.add_argument('--neg_sample', default="dataset1/no_LDA.edgelist",
                        help='Path to the negative-sample edge list. e.g., dataset1/no_LDA.edgelist')
    parser.add_argument('--task_type', default="LDA", choices=['LDA', 'MDA','LMI'],
                        help='Initial prediction task type. Default is LDA.')
    
    # training parameters
    parser.add_argument('--lr', type=float, default=5e-4,
                        help='Initial learning rate. Default is 5e-4.')
    parser.add_argument('--dropout', type=float, default=0.5,
                        help='Dropout rate. Default is 0.5.')
    parser.add_argument('--weight_decay', type=float, default=5e-4,
                        help='Weight decay (L2 penalty on parameters). Default is 5e-4.')
    parser.add_argument('--batch', type=int, default=25,
                        help='Batch size. Default is 25.')
    parser.add_argument('--epochs', type=int, default=80,
                        help='Number of epochs to train. Default is 80.')
    parser.add_argument('--loss_ratio1', type=float, default=0.1,
                        help='Weight of the self-supervision loss. Recommended: 1 (LDA), 0.1 (MDA, LMI).')
    
    # model parameters
    parser.add_argument('--dimensions', type=int, default=512,
                        help='Dimension of feature d. Default is 512 (LDA), 1024 (MDA and LMI).')
    parser.add_argument('--hidden1', type=int, default=256,
                        help='Embedding dimension of encoder layer 1 for SSCLMD. Default is d/2.')
    parser.add_argument('--hidden2', type=int, default=128,
                        help='Embedding dimension of encoder layer 2 for SSCLMD. Default is d/4.')
    parser.add_argument('--decoder1', type=int, default=512,
                        help='Embedding dimension of decoder layer 1 for SSCLMD. Default is 512.')
    
    args = parser.parse_args()
    
    return args

7. Similarity Computation (calculating_similarity.py)

This file computes the similarities between the different types of nodes and builds the intra-edge relations of the topology graph.
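The exact similarity measures are not shown in this walkthrough. Purely as an illustration of one measure that is common for association matrices in this domain, a Gaussian interaction profile (GIP) kernel could look like this (the function name and its use here are assumptions, not the confirmed SSCLMD code):

import numpy as np

def gip_kernel(assoc):
    # hypothetical GIP kernel similarity over the rows of a 0/1 association matrix;
    # bandwidth normalized by the mean squared norm of the interaction profiles
    gamma = 1.0 / np.mean(np.sum(assoc ** 2, axis=1))
    sq_dist = np.sum((assoc[:, None, :] - assoc[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq_dist)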

8. Data Preparation (data_preparation.py)

This file computes the k-mer features of the lncRNA/miRNA sequences and builds the attribute-based KNN graph.
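A sketch of what this preparation typically involves (the helper below is an illustrative assumption, not the repository's exact code): count normalized k-mer frequencies per sequence, then link each node to its nearest neighbors in that feature space:

from itertools import product

import numpy as np
from sklearn.neighbors import kneighbors_graph

def kmer_features(seq, k=3, alphabet='ACGU'):
    # enumerate all 4^k possible k-mers and count their occurrences in the sequence
    kmers = [''.join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in index:
            vec[index[seq[i:i + k]]] += 1
    return vec / max(1, len(seq) - k + 1)  # normalize by the number of windows

# attribute-based KNN graph over the stacked k-mer feature matrix X
# adj = kneighbors_graph(X, n_neighbors=10, metric='cosine', include_self=False)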

9. Utility Functions (utils.py)

utils.py contains helper functions such as Laplacian normalization and row normalization.
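A minimal sketch of those two helpers (the names are assumptions; the actual utils.py may differ):

import numpy as np

def laplacian_norm(adj):
    # symmetric normalization: D^(-1/2) (A + I) D^(-1/2)
    adj = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = np.power(adj.sum(1), -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    D = np.diag(d_inv_sqrt)
    return D @ adj @ D

def row_norm(mat):
    # scale each row to sum to 1; all-zero rows are left unchanged
    rowsum = mat.sum(1, keepdims=True)
    rowsum[rowsum == 0] = 1.0
    return mat / rowsum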

10. Reproduction Steps

  1. Environment setup

    • Install Python 3.7+
    • Install the required dependencies: numpy, torch, scikit-learn, torch-geometric
  2. Data preparation

    • Extract data/dataset1.rar and data/dataset2.rar
  3. Feature preprocessing

    • Run data_preparation.py to generate the k-mer features and the attribute graph
    • Run calculating_similarity.py to compute the similarities and the topology-graph intra-edges
  4. Model training and testing

    • Run main.py to start the training and testing process
    • Adjust the parameters in parms_setting.py as needed
  5. Evaluation

    • Inspect the reported AUROC, AUPRC, and F1 scores
    • Optionally save the trained model for later use

11. Suggested Code Improvements

  1. Modularization: separate data loading, model definition, training, and testing more cleanly
  2. Parameter management: use a configuration file instead of hard-coded parameter values
  3. Logging: add more detailed logging to ease debugging and analysis
  4. Visualization: plot the training process, e.g. loss curves and metric trajectories
  5. Data parallelism: add parallel data processing for large datasets
  6. Model saving: periodically save model checkpoints
  7. Early stopping: implement early stopping to avoid overfitting (a sketch follows below)
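For item 7, a simple patience-based helper that could wrap the per-epoch validation AUROC (a sketch, not part of the original code):

class EarlyStopping:
    # stop when the monitored metric has not improved for `patience` epochs
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop

# usage inside the epoch loop (auroc_val is a hypothetical validation metric):
# stopper = EarlyStopping(patience=10)
# if stopper.step(auroc_val):
#     break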