A Detailed Walkthrough of the SSCLMD Model Code
1. Project Source Structure
The SSCLMD project is laid out as follows:
SSCLMD-main/
├── README.md
├── ST4.xlsx
├── Supplementary File.docx
├── code/
│   ├── calculating_similarity.py
│   ├── data_preparation.py
│   ├── data_preprocess.py
│   ├── layer.py
│   ├── main.py
│   ├── parms_setting.py
│   ├── train.py
│   └── utils.py
└── data/
    ├── dataset1.rar
    └── dataset2.rar
2. Core Model Components
2.1 Model Definition (layer.py)
The model is defined in layer.py and is built from the following key classes:
- The Attention class:

# imports assumed at the top of layer.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Attention(nn.Module):
    def __init__(self, in_size, hidden_size=128):  # LDA: 128; MDA, LMI: 16
        super(Attention, self).__init__()
        self.project = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1, bias=False)
        )

    def forward(self, z):
        w = self.project(z)             # (N, views, 1) attention scores
        beta = torch.softmax(w, dim=1)  # normalize across views
        return (beta * z).sum(1), beta  # weighted sum plus the weights
This is an attention mechanism: it computes a weight for each view and aggregates the per-view features into a single weighted representation.
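A quick shape check clarifies the contract (a minimal sketch; the sizes are the dataset1/LDA values used elsewhere in this post):

import torch

att = Attention(in_size=128)
z = torch.randn(997, 3, 128)   # 997 nodes, 3 views, 128-dim embeddings
fused, beta = att(z)
print(fused.shape)             # torch.Size([997, 128]) - one fused vector per node
print(beta.shape)              # torch.Size([997, 3, 1]) - one weight per view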
- The GCN class:

class GCN(nn.Module):
    def __init__(self, nfeat, nhid, out, dropout=0.5):
        super(GCN, self).__init__()
        self.gc1 = GCNConv(nfeat, nhid)
        self.prelu1 = nn.PReLU(nhid)
        self.gc2 = GCNConv(nhid, out)
        self.prelu2 = nn.PReLU(out)
        self.dropout = dropout

    def forward(self, x, adj):
        x = self.prelu1(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.prelu2(self.gc2(x, adj))
        return x
This two-layer graph convolutional network extracts node representations from a graph. Note that despite the parameter name adj, GCNConv expects a PyG edge_index rather than a dense adjacency matrix.
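A usage sketch on a toy graph (hypothetical values; edge_index is a (2, num_edges) LongTensor):

import torch

gcn = GCN(nfeat=512, nhid=256, out=128)
x = torch.randn(3, 512)                    # 3 nodes with 512-dim features
edge_index = torch.tensor([[0, 1, 1, 2],   # edge sources
                           [1, 0, 2, 1]])  # edge targets
h = gcn(x, edge_index)
print(h.shape)                             # torch.Size([3, 128])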
- The Discriminator class:

class Discriminator(nn.Module):
    def __init__(self, dim):
        super(Discriminator, self).__init__()
        self.fn = nn.Bilinear(dim, dim, 1)

    def forward(self, h1, h2, h3, h4, c1, c2):
        c_x1 = c1.expand_as(h1).contiguous()
        c_x2 = c2.expand_as(h2).contiguous()
        # positive pairs: real node embeddings vs. their graph summary
        sc_1 = self.fn(h1, c_x1).squeeze(1)
        sc_2 = self.fn(h2, c_x2).squeeze(1)
        # negative pairs: corrupted node embeddings vs. the same summaries
        sc_3 = self.fn(h3, c_x1).squeeze(1)
        sc_4 = self.fn(h4, c_x2).squeeze(1)
        logits = torch.cat((sc_1, sc_2, sc_3, sc_4))
        return logits
This is the discriminator for the self-supervised contrastive objective: a bilinear scoring function that separates genuine (node, graph-summary) pairs from corrupted ones.
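The concatenation order of the four score blocks is what fixes the label layout used later in train.py; a sketch with hypothetical embeddings:

import torch

N = 997                         # nodes per graph in dataset1
disc = Discriminator(dim=128)
h1, h2, h3, h4 = (torch.randn(N, 128) for _ in range(4))
c1, c2 = torch.randn(1, 128), torch.randn(1, 128)   # graph summaries
logits = disc(h1, h2, h3, h4, c1, c2)
print(logits.shape)             # torch.Size([3988]) == 4 * N
# matching targets: the first 2N scores are positives, the last 2N negatives
lbl = torch.cat((torch.ones(2 * N), torch.zeros(2 * N)))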
- The SSCLMD class:

class SSCLMD(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, decoder1):
        super(SSCLMD, self).__init__()
        self.encoder1 = GCN(in_dim, hid_dim, out_dim)   # structure graph
        self.encoder2 = GCN(in_dim, hid_dim, out_dim)   # feature (attribute) graph
        self.encoder3 = GCN(in_dim, hid_dim, out_dim)   # shared encoder for relation prediction
        self.encoder4 = GCN(in_dim, hid_dim, out_dim)
        self.pooling = AvgReadout()   # mean readout over nodes, defined elsewhere in layer.py
        self.attention = Attention(out_dim)
        self.disc = Discriminator(out_dim)
        self.act_fn = nn.Sigmoid()
        self.local_mlp = nn.Linear(out_dim, out_dim)
        self.global_mlp = nn.Linear(out_dim, out_dim)
        self.decoder1 = nn.Linear(out_dim * 4, decoder1)
        self.decoder2 = nn.Linear(decoder1, 1)
This is the top-level SSCLMD module, tying together the encoders, the attention-based fusion, the contrastive discriminator, and the link-prediction decoder.
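Plugging in the default LDA hyper-parameters from parms_setting.py, the instantiation in train.py amounts to:

model = SSCLMD(in_dim=512, hid_dim=256, out_dim=128, decoder1=512)

Why decoder1 takes an input of width out_dim * 4 is checked after the forward pass below.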
2.2 Forward Pass
The forward pass of SSCLMD proceeds as follows:
def forward(self, data_s, data_f, idx):
    # unpack node features and the two graphs
    feat, s_graph = data_s.x, data_s.edge_index
    shuff_feat, f_graph = data_f.x, data_f.edge_index
    # encode the structure graph and the feature graph
    h1 = self.encoder1(feat, s_graph)
    h2 = self.encoder2(feat, f_graph)
    h1 = self.local_mlp(h1)
    h2 = self.local_mlp(h2)
    # encode the negative samples (row-shuffled features)
    h3 = self.encoder1(shuff_feat, s_graph)
    h4 = self.encoder2(shuff_feat, f_graph)
    h3 = self.local_mlp(h3)
    h4 = self.local_mlp(h4)
    # extra encodings used for relation prediction
    # (note: encoder3 is applied to both graphs; encoder4 is never used)
    h5 = self.encoder3(feat, s_graph)
    h6 = self.encoder3(feat, f_graph)
    # global (graph-level) summaries
    c1 = self.act_fn(self.global_mlp(self.pooling(h1)))
    c2 = self.act_fn(self.global_mlp(self.pooling(h2)))
    # self-supervised contrastive scores
    out = self.disc(h1, h2, h3, h4, c1, c2)
    # multi-view fusion
    h_com = (h5 + h6) / 2
    emb = torch.stack([h1, h2, h_com], dim=1)
    emb, att = self.attention(emb)
    # pick the entity embeddings for the current task; args is read as a
    # module-level global here. 386 and 702 are dataset-specific offsets:
    # the node list is ordered lncRNA, then disease, then miRNA.
    if args.task_type == 'LDA':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 386]
    if args.task_type == 'MDA':
        entity1 = emb[idx[0] + 702]
        entity2 = emb[idx[1] + 386]
    if args.task_type == 'LMI':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 702]
    # multi-relation decoder over sum, product, and concatenation
    add = entity1 + entity2
    product = entity1 * entity2
    concatenate = torch.cat((entity1, entity2), dim=1)
    feature = torch.cat((add, product, concatenate), dim=1)
    log1 = F.relu(self.decoder1(feature))
    log = self.decoder2(log1)
    return out, log
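The decoder input width of out_dim * 4 follows directly from how the pair features are assembled; a quick check, assuming out_dim = 128 and a batch of 25 pairs:

import torch

entity1, entity2 = torch.randn(25, 128), torch.randn(25, 128)
add = entity1 + entity2                           # (25, 128)
product = entity1 * entity2                       # (25, 128)
concatenate = torch.cat((entity1, entity2), 1)    # (25, 256)
feature = torch.cat((add, product, concatenate), 1)
print(feature.shape)                              # torch.Size([25, 512]) == 4 * 128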
3. Data Preprocessing
Preprocessing is implemented in data_preprocess.py; the key steps are:
- Loading the data and building positive/negative samples:

positive = np.loadtxt(args.in_file, dtype=np.int64)
link_size = int(positive.shape[0])
np.random.seed(args.seed)
np.random.shuffle(positive)
positive = positive[:link_size]
negative_all = np.loadtxt(args.neg_sample, dtype=np.int64)
np.random.shuffle(negative_all)
negative = np.asarray(negative_all[:positive.shape[0]])  # balance negatives 1:1 with positives
positive = np.concatenate([positive, np.ones(positive.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)
negative = np.concatenate([negative, np.zeros(negative.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)
all_data = np.vstack((positive, negative))
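After the two concatenations, every row of all_data is (node_i, node_j, label); a toy illustration:

import numpy as np

positive = np.array([[0, 5], [3, 7]])   # known links
negative = np.array([[1, 9], [2, 4]])   # sampled non-links
positive = np.concatenate([positive, np.ones((2, 1), dtype=np.int64)], axis=1)
negative = np.concatenate([negative, np.zeros((2, 1), dtype=np.int64)], axis=1)
print(np.vstack((positive, negative)))
# [[0 5 1]
#  [3 7 1]
#  [1 9 0]
#  [2 4 0]]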
- Building the K-fold cross-validation splits:

kf = KFold(n_splits=n_splits, shuffle=True, random_state=args.seed)
cv_train_loaders = []
cv_test_loaders = []
for train_index, test_index in kf.split(all_data):
    train_data = all_data[train_index]
    test_data = all_data[test_index]
    train_positive = train_data[train_data[:, 2] == 1][:, :2]
    # build the adjacency matrices from the training links only...
    # wrap each split in a data loader
    training_set = Data_class(train_data)
    train_loader = DataLoader(training_set, **params)
    test_set = Data_class(test_data)
    test_loader = DataLoader(test_set, **params)
    cv_train_loaders.append(train_loader)
    cv_test_loaders.append(test_loader)
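The params dict is defined elsewhere in data_preprocess.py; presumably something along these lines (an assumption based on the arguments exposed in parms_setting.py):

params = {'batch_size': args.batch,      # 25 by default
          'shuffle': True,
          'num_workers': args.workers}   # 0 by default

Data_class is the project's custom Dataset wrapper; it yields (label, index-pair) tuples, matching how the training loop unpacks each batch below.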
- Building the graph data structures:

# edge indices from the two adjacency matrices
edges_s = s_adj.nonzero()
edge_index_s = torch.tensor(np.vstack((edges_s[0], edges_s[1])), dtype=torch.long)
edges_f = f_adj.nonzero()
edge_index_f = torch.tensor(np.vstack((edges_f[0], edges_f[1])), dtype=torch.long)
# features to tensors
x = torch.tensor(node_feature, dtype=torch.float)
shuf_feature = torch.tensor(shuf_feature, dtype=torch.float)
# PyG Data objects (data_f carries the shuffled features used as negatives)
data_s = Data(x=x, edge_index=edge_index_s)
data_f = Data(x=shuf_feature, edge_index=edge_index_f)
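This is the standard adjacency-matrix-to-edge_index conversion; a self-contained toy version:

import numpy as np
import torch
from torch_geometric.data import Data

adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
edges = adj.nonzero()          # tuple of (row indices, col indices)
edge_index = torch.tensor(np.vstack(edges), dtype=torch.long)
data = Data(x=torch.eye(3), edge_index=edge_index)
print(edge_index)
# tensor([[0, 1, 1, 2],
#         [1, 0, 2, 1]])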
4. Training
Training is implemented in train.py and consists of the following steps:
- Model initialization:

model = SSCLMD(in_dim=args.dimensions, hid_dim=args.hidden1, out_dim=args.hidden2, decoder1=args.decoder1)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
m = torch.nn.Sigmoid()
loss_fct = torch.nn.BCEWithLogitsLoss()  # for the raw contrastive logits
loss_node = torch.nn.BCELoss()           # for the sigmoid-ed link predictions
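Two loss flavours appear here: loss_fct applies the sigmoid internally and is fed the raw contrastive logits, while loss_node expects probabilities, which is why the link scores are passed through m first. The two are numerically equivalent:

import torch

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
a = torch.nn.BCEWithLogitsLoss()(logits, targets)
b = torch.nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(a, b, atol=1e-6))   # True, up to floating-point error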
- The training loop:

for epoch in range(args.epochs):
    t = time.time()
    print('-------- Epoch ' + str(epoch + 1) + ' --------')
    y_pred_train = []
    y_label_train = []
    # contrastive targets: 2N positive scores, then 2N negative ones
    lbl_1 = torch.ones(997 * 2)   # N = 997 nodes in dataset1 (1071 in dataset2)
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()
    for i, (label, inp) in enumerate(train_loader):
        if args.cuda:
            label = label.cuda()
        model.train()
        optimizer.zero_grad()
        # forward pass
        output, log = model(data_s, data_f, inp)
        log = torch.squeeze(m(log))
        # supervised link loss plus weighted contrastive loss
        loss_class = loss_node(log, label.float())
        loss_constra = loss_fct(output, lbl)
        loss_train = loss_class + args.loss_ratio1 * loss_constra
        # backward pass
        loss_train.backward()
        optimizer.step()
        # collect predictions
        label_ids = label.to('cpu').numpy()
        y_label_train = y_label_train + label_ids.flatten().tolist()
        y_pred_train = y_pred_train + log.flatten().tolist()
        if i % 100 == 0:
            print('epoch: ' + str(epoch + 1) + '/ iteration: ' + str(i + 1) + '/ loss_train: ' + str(
                loss_train.cpu().detach().numpy()))
    # ROC AUC on the training set
    roc_train = roc_auc_score(y_label_train, y_pred_train)
    print('epoch: {:04d}'.format(epoch + 1),
          'loss_train: {:.4f}'.format(loss_train.item()),
          'auroc_train: {:.4f}'.format(roc_train),
          'time: {:.4f}s'.format(time.time() - t))
- Testing:

def test(model, loader, data_s, data_f, args):
    m = torch.nn.Sigmoid()
    loss_fct = torch.nn.BCEWithLogitsLoss()
    loss_node = torch.nn.BCELoss()
    # contrastive targets, same layout as in training
    lbl_1 = torch.ones(997 * 2)
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()
    inp_id0 = []
    inp_id1 = []
    model.eval()
    y_pred = []
    y_label = []
    with torch.no_grad():
        for i, (label, inp) in enumerate(loader):
            inp_id0.append(inp[0])
            inp_id1.append(inp[1])
            if args.cuda:
                label = label.cuda()
            # forward pass
            output, log = model(data_s, data_f, inp)
            log = torch.squeeze(m(log))
            # losses
            loss_class = loss_node(log, label.float())
            loss_constra = loss_fct(output, lbl)
            loss = loss_class + args.loss_ratio1 * loss_constra
            # collect predictions
            label_ids = label.to('cpu').numpy()
            y_label = y_label + label_ids.flatten().tolist()
            y_pred = y_pred + log.flatten().tolist()
    # binarize the predictions at 0.5 for F1
    outputs = np.asarray([1 if i else 0 for i in (np.asarray(y_pred) >= 0.5)])
    # AUROC, AUPRC, F1, and the loss
    return roc_auc_score(y_label, y_pred), average_precision_score(y_label, y_pred), f1_score(y_label, outputs), loss
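A more idiomatic equivalent of the thresholding line:

outputs = (np.asarray(y_pred) >= 0.5).astype(int)

Note also that the returned loss is the loss of the last test batch only, not an average over the whole test set.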
5. Main Program Flow (main.py)
The main program is short:
# parse arguments
args = settings()
# CUDA setup and seeding
args.cuda = not args.no_cuda and torch.cuda.is_available()
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)
# load the data with five-fold cross-validation
data_s, data_f, train_loaders, test_loaders = load_data(args, n_splits=5)
# train and test on each fold (loaders renamed here to avoid shadowing)
for fold, (train_loader, test_loader) in enumerate(zip(train_loaders, test_loaders)):
    print(f"Training on fold {fold + 1}")
    train_model(data_s, data_f, train_loader, test_loader, args)
6. Parameter Settings (parms_setting.py)
All model and training parameters are defined in parms_setting.py:
def settings():
    parser = argparse.ArgumentParser()
    # general settings
    parser.add_argument('--seed', type=int, default=0,
                        help='Random seed. Default is 0.')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='Disables CUDA training.')
    parser.add_argument('--workers', type=int, default=0,
                        help='Number of parallel workers. Default is 0.')
    # data path settings
    parser.add_argument('--in_file', default="dataset1/LDA.edgelist",
                        help='Path to data fold. e.g., data/LDA.edgelist')
    parser.add_argument('--neg_sample', default="dataset1/no_LDA.edgelist",
                        help='Path to data fold. e.g., data/LDA.edgelist')
    parser.add_argument('--task_type', default="LDA", choices=['LDA', 'MDA', 'LMI'],
                        help='Initial prediction task type. Default is LDA.')
    # training settings
    parser.add_argument('--lr', type=float, default=5e-4,
                        help='Initial learning rate. Default is 5e-4.')
    parser.add_argument('--dropout', type=float, default=0.5,
                        help='Dropout rate. Default is 0.5.')
    parser.add_argument('--weight_decay', default=5e-4,
                        help='Weight decay (L2 loss on parameters) Default is 5e-4.')
    parser.add_argument('--batch', type=int, default=25,
                        help='Batch size. Default is 25.')
    parser.add_argument('--epochs', type=int, default=80,
                        help='Number of epochs to train. Default is 80.')
    parser.add_argument('--loss_ratio1', type=float, default=0.1,
                        help='Ratio of self_supervision. Default is 1 (LDA), 0.1 (MDA,LMI)')
    # model settings
    parser.add_argument('--dimensions', type=int, default=512,
                        help='dimensions of feature d. Default is 512 (LDA), 1024 (MDA and LMI)')
    parser.add_argument('--hidden1', default=256,
                        help='Embedding dimension of encoder layer 1 for SSCLMD. Default is d/2.')
    parser.add_argument('--hidden2', default=128,
                        help='Embedding dimension of encoder layer 2 for SSCLMD. Default is d/4.')
    parser.add_argument('--decoder1', default=512,
                        help='Embedding dimension of decoder layer 1 for SSCLMD. Default is 512.')
    args = parser.parse_args()
    return args
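Since settings() reads sys.argv through argparse, overriding the defaults is just a matter of command-line flags; the Python equivalent of running python main.py --task_type MDA --dimensions 1024 is:

import sys

sys.argv = ['main.py', '--task_type', 'MDA', '--dimensions', '1024']
args = settings()
print(args.task_type, args.dimensions)   # MDA 1024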
7. Similarity Computation (calculating_similarity.py)
This script computes similarities between nodes of each type and builds the intra-type edges of the topology graph.
8. Data Preparation (data_preparation.py)
This script computes k-mer features for the lncRNA/miRNA sequences and builds the attribute-based KNN graph.
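The file itself is not excerpted in this post; as a rough illustration of the two steps it performs (a hypothetical helper, not the repository's exact code), 3-mer frequency counting plus a cosine-distance KNN graph could look like:

import numpy as np
from itertools import product
from sklearn.neighbors import kneighbors_graph

def kmer_profile(seq, k=3, alphabet='ACGU'):
    """Normalized frequency vector over all 4**k possible k-mers."""
    kmers = [''.join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in index:
            vec[index[seq[i:i + k]]] += 1
    return vec / max(len(seq) - k + 1, 1)

sequences = ['AUGGCUAGCUA', 'GCGCAUAUGGC', 'UUAGCGGAUCC']   # toy RNA sequences
features = np.stack([kmer_profile(s) for s in sequences])
# attribute KNN graph: connect each node to its most similar peers
f_adj = kneighbors_graph(features, n_neighbors=2, metric='cosine').toarray()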
9. Utility Functions (utils.py)
utils.py contains helper functions such as Laplacian normalization and row normalization.
10. Project Reproduction Steps
- Environment setup:
  - Install Python 3.7+
  - Install the required dependencies: numpy, torch, scikit-learn, torch-geometric
- Data preparation:
  - Unpack data/dataset1.rar and data/dataset2.rar
- Feature preprocessing:
  - Run data_preparation.py to generate the k-mer features and the attribute graph
  - Run calculating_similarity.py to compute the similarities and the intra-type edges of the topology graph
- Model training and testing:
  - Run main.py to launch training and testing
  - Adjust the parameters in parms_setting.py as needed
- Result evaluation:
  - Check the reported AUROC, AUPRC, and F1 scores
  - Optionally save the trained model for later use
11. Suggestions for Improving the Code
- Modularity: separate data loading, model definition, training, and testing more cleanly
- Parameter management: use a configuration file instead of hard-coded values
- Logging: add more detailed logging to ease debugging and analysis
- Visualization: plot the training process, e.g. loss curves and metric trends
- Data parallelism: add parallel data processing for large datasets
- Model saving: periodically save model checkpoints
- Early stopping: stop training when the validation metric stops improving, to avoid overfitting (a sketch covering the last two points follows)
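For the last two points, a minimal pattern that could be dropped into train_model (a sketch; the patience value and checkpoint path are arbitrary choices):

import torch

best_auc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(args.epochs):
    # ... existing training loop body ...
    val_auc, _, _, _ = test(model, test_loader, data_s, data_f, args)
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
        torch.save(model.state_dict(), 'best_model.pt')   # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f'Early stopping at epoch {epoch + 1}')
            break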