Exercise 1: Enzyme and Substrate Feature Extraction
0 Environment and Python Library Installation
Create and activate a Conda environment (e.g., pt12) in your terminal, then run the following commands line by line (the third command switches the default pip index to the Tsinghua mirror, which is recommended):
conda create -n pt12 python=3.12
conda activate pt12
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy
pip install rdkit
pip install ipykernel
pip install fair-esm
pip install matplotlib
pip install scikit-learn
pip install pandas
pip install tqdm
pip install biopython
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 -f https://mirrors.aliyun.com/pytorch-wheels/cu118
pip install torch-geometric
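After installation, you can optionally run a quick sanity check to confirm that the key libraries import correctly; the one-liner below is illustrative:
python -c "import torch, rdkit, esm, torch_geometric; print(torch.__version__, torch.cuda.is_available())"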
Enzyme and Substrate Feature Extraction
Contents
- Experimental preparation and data collection
- Introduction to enzyme and substrate data sources
- Data preprocessing and cleaning
- Compound feature extraction methods
- Sequence feature extraction methods
- Feature extraction with Python and related bioinformatics tools
- Clustering practice
1. Experimental Preparation and Data Collection
1.1 Understand feature extraction methods for enzymes and substrates
• Learn to use Python and related bioinformatics tools for feature extraction
• Master data preprocessing and cleaning techniques
1.2 Required tools and libraries
• Python: programming language
• Biopython: bioinformatics library
• RDKit: cheminformatics library
• Pandas: data manipulation library
• NumPy: numerical computing library
• Scikit-learn: machine learning library
• MMseqs2: protein similarity search tool
1.3 Data sources
• Protein sequences: UniProt
• Compounds: PubChem, ChEMBL
2. Introduction to Enzyme and Substrate Data Sources
2.1 Enzyme data sources
• UniProt: a comprehensive database providing detailed protein sequences, functional annotations, and information on protein interactions and pathways. It is a key resource for studying enzyme function and properties.
• PDB (Protein Data Bank): a repository of three-dimensional structural information on proteins, nucleic acids, and complex biomolecular assemblies. The PDB is essential for understanding enzyme mechanisms and interactions at the molecular level.
Question 1: What is the UniProt Entry/ID of the Saccharomyces cerevisiae gene YKL060C, and what is its amino acid length? Search the UniProt website for this information.
Answer 1: UniProt Entry (Entry Name): P14540 (ALF_YEAST)
Amino acid length: 359 amino acids
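If you prefer to verify this programmatically, here is a minimal sketch using UniProt's REST API (the endpoint URL pattern is an assumption based on UniProt's public documentation, not part of this exercise's listed tools):
import urllib.request
# Fetch the FASTA record for P14540 from the (assumed) UniProt REST endpoint
url = "https://rest.uniprot.org/uniprotkb/P14540.fasta"
fasta = urllib.request.urlopen(url).read().decode()
sequence = "".join(fasta.splitlines()[1:])  # drop the header line, join the sequence lines
print(len(sequence))  # expected: 359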
2.2 Substrate data sources
• PubChem: a free database providing comprehensive information on the chemical structures, biological activities, and safety data of a broad range of compounds. It is a valuable resource for researchers exploring substrate properties.
• ChEMBL: a database containing detailed information on drug targets, compound bioactivities, and pharmacological data. ChEMBL is particularly useful for researchers involved in drug discovery and development.
Question 2: What is the SMILES structure of the compound D-glucose? Search PubChem for this information.
Answer 2 (isomeric SMILES from PubChem): C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
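As a quick consistency check, you can parse this SMILES with RDKit and confirm that the molecular formula is C6H12O6, as expected for glucose; a minimal sketch:
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
mol = Chem.MolFromSmiles("C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
print(CalcMolFormula(mol))  # expected: C6H12O6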
3. Data Preprocessing and Cleaning
3.1 Data preprocessing
• Remove duplicates: make sure the dataset contains no duplicated sequences or compounds.
• Standardize: unify data formats, including amino acid sequences and SMILES strings.
• Handle missing values: fill in or remove missing data as needed.
3.2 Data cleaning
• Quality control: check data quality, such as sequence lengths and the integrity of compound structures.
• Format conversion: convert the data into formats suitable for analysis, such as CSV or FASTA (see the sketch below).
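As a small sketch of the format-conversion step, the snippet below writes a toy cleaned dataset to FASTA with Biopython (the identifiers and sequences are made-up placeholders):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
# Hypothetical cleaned dataset: (identifier, amino acid sequence) pairs
dataset = [("enzyme_1", "MKWVTFISLL"), ("enzyme_2", "ACDEFGHIKL")]
records = [SeqRecord(Seq(seq), id=name, description="") for name, seq in dataset]
SeqIO.write(records, "enzymes.fasta", "fasta")  # writes the records in FASTA format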
Question 3: Do the SMILES strings CC(=O)C(=O)[O-] and CC(=O)C([O-])=O represent the same compound? To find out, you can generate their structures with Python code and compare them visually. This exercise highlights that different data sources may use different data formats, which is why quality-control and data-unification steps are essential to keep datasets consistent.
# Import the necessary modules from RDKit
from rdkit import Chem
from rdkit.Chem import Draw
# Define two SMILES strings representing the compound
# SMILES is a notation system for describing the structure of chemical molecules
smiles1 = "CC(=O)C(=O)[O-]"  # pyruvate: a methyl ketone bearing a carboxylate group
smiles2 = "CC(=O)C([O-])=O"  # an alternative way of writing the same molecule, with slightly different SMILES
# Use RDKit's MolFromSmiles function to convert the SMILES strings into molecule objects
mol1 = Chem.MolFromSmiles(smiles1)  # create a molecule object from the first SMILES
mol2 = Chem.MolFromSmiles(smiles2)  # create a molecule object from the second SMILES
# Generate images of the molecules to visualize their structures
img1 = Draw.MolToImage(mol1)  # create an image of the first molecule
img2 = Draw.MolToImage(mol2)  # create an image of the second molecule
# Display the generated molecule images
img1.show()  # show the image of the first molecule
img2.show()  # show the image of the second molecule
# Check whether the two molecules are identical by comparing their InChIKeys
inchikey1 = Chem.MolToInchiKey(mol1)
inchikey2 = Chem.MolToInchiKey(mol2)
# Compare the results
if inchikey1 == inchikey2:
    print("The two compounds are the same.")
else:
    print("The two compounds are different.")
The two compounds are the same.
Question 4: Do the SMILES C(C(=O)COP(=O)(O)O)O and OCC(=O)COP(O)(O)=O represent the same compound? Fill in your answer here and enter your code in the box below.
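A minimal sketch of one possible answer, following the same InChIKey comparison used in Question 3 (run it to see whether the two SMILES denote the same compound):
from rdkit import Chem
smiles_a = "C(C(=O)COP(=O)(O)O)O"
smiles_b = "OCC(=O)COP(O)(O)=O"
mol_a = Chem.MolFromSmiles(smiles_a)
mol_b = Chem.MolFromSmiles(smiles_b)
# Identical InChIKeys mean the two SMILES describe the same compound
print(Chem.MolToInchiKey(mol_a) == Chem.MolToInchiKey(mol_b))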
4. Compound Feature Extraction Methods
from rdkit import Chem
from rdkit.Chem import Draw
# SMILES of the molecule
smiles = 'CNC(CO[P@TB13](=O)(O)OCC(C)(C)C(O)C(=O)NCCC(=O)NCCSC(=O)CC(C)C)C(C)=O'
# Parse the SMILES string into a molecule object (Mol) that can be used for computation
mol = Chem.MolFromSmiles(smiles)
# Visualize the molecule
Draw.MolToImage(mol)
4.1 Molecular Fingerprints: a "substructure presence" encoding that compresses whether the local substructures of a large molecule are present into a vector that is easy to compare.
Definition: raw chemical structures (e.g., SMILES, PDB) are not directly suitable as model input or for similarity comparison. A molecular fingerprint is a compact, comparable, computation-friendly representation that converts a molecular structure into a fixed-length binary vector. It is commonly used for:
- Molecular similarity search (e.g., Tanimoto similarity)
- Input to QSAR/QSPR models
- Compound clustering and classification
- Drug screening and virtual screening
Each position in the vector indicates the presence or absence of a specific substructure or feature in the molecule.
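For reference, the Tanimoto similarity mentioned above compares two fingerprint bit vectors by the ratio of shared on-bits to total on-bits: T(A, B) = c / (a + b − c), where a and b are the numbers of bits set in fingerprints A and B, and c is the number of bits set in both. Identical fingerprints score 1, and fingerprints with no bits in common score 0.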
• Examples:
■ Morgan fingerprints (ECFP): extended-connectivity fingerprints are circular fingerprints that capture the local environment around each atom in the molecule. For example, RDKit can generate an ECFP4 fingerprint with radius 2 and a 1024-bit vector.
■ MACCS keys: MACCS (Molecular ACCess System) keys are another type of molecular fingerprint based on a predefined set of 166 structural keys.
Common types
| Fingerprint type | Characteristics |
|---|---|
| Morgan fingerprint (ECFP) | Also called a circular fingerprint; the RDKit default. Based on atom neighborhoods; well suited to structural similarity calculations |
| MACCS keys | A 166-bit Boolean vector indicating the presence of common chemical substructures |
| Topological fingerprint | Path-based (atom paths); a more traditional approach |
| AtomPair, RDKit fingerprints | Various custom path-length or environment-encoding schemes |
| Pharmacophore fingerprint | Focuses on pharmacophore features such as hydrogen-bond donors/acceptors and charges |
■ RDKit example code:
# Import the necessary modules from RDKit
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
# Load a molecule from a SMILES string ('CCO' is ethanol)
mol = Chem.MolFromSmiles('CCO')  # the SMILES 'CCO' corresponds to the ethanol molecule (C2H5OH)
# Generate a Morgan fingerprint (also known as a circular fingerprint)
# '2' is the fingerprint radius and '1024' is the number of bits
# This produces a bit-vector representation of the molecular structure
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
# Convert the fingerprint to a numpy array for further processing or analysis
fingerprint_array1 = np.array(fp1)
# Print the fingerprint as a numpy array together with its shape (the number of bits)
print('Morgan Fingerprint is ', fingerprint_array1, '\n Its shape is', fingerprint_array1.shape)  # shape should be (1024,) since it is a 1024-bit vector
# Generate the MACCS keys fingerprint (a set of 166 predefined structural keys)
fp2 = AllChem.GetMACCSKeysFingerprint(mol)
# Convert the MACCS keys fingerprint to a numpy array
fingerprint_array2 = np.array(fp2)
# Print the MACCS keys fingerprint and its shape
# (in RDKit this is (167,): bit 0 is an unused placeholder, bits 1-166 are the MACCS keys)
print('MACCS keys fingerprint is ', fingerprint_array2, '\n Its shape is', fingerprint_array2.shape)
Output: the Morgan fingerprint has shape (1024,); the RDKit MACCS fingerprint has shape (167,), with bit 0 an unused placeholder for the 166 keys.
Question 5: Can you compute the similarity between two compounds using Morgan fingerprints? You need to fill in the code below.
• SMILES of compound 1: CC(=O)C(=O)[O-]
• SMILES of compound 2: C(C(=O)COP(=O)(O)O)O
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import FingerprintSimilarity
# Function to generate a Morgan fingerprint
def get_morgan_fingerprint(smiles, radius=2, nBits=2048):
    """
    Generate a Morgan (circular) fingerprint for a molecule given by its SMILES.
    Parameters:
    - smiles (str): SMILES string representing the molecule.
    - radius (int): radius of the fingerprint (default 2).
    - nBits (int): length of the fingerprint bit vector (default 2048).
    Returns:
    - the Morgan fingerprint as a bit vector.
    """
    # Convert the SMILES string into a molecule object
    mol = Chem.MolFromSmiles(smiles)
    # Check that the molecule is valid
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # Generate and return the Morgan fingerprint as a bit vector
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)
# SMILES of the two compounds
'''#################### YOU NEED TO REPLACE THIS PART ############################### PAY ATTENTION HERE ################################################'''
smiles1 = ""
smiles2 = ""
'''#################### YOU NEED TO REPLACE THIS PART ############################### PAY ATTENTION HERE ################################################'''
# Generate Morgan fingerprints for both compounds
fp1 = get_morgan_fingerprint(smiles1)
fp2 = get_morgan_fingerprint(smiles2)
# Calculate the Tanimoto similarity between the two fingerprints
similarity = FingerprintSimilarity(fp1, fp2)
# Print the similarity score with an explanation
print(f"Similarity between the two compounds ({smiles1} and {smiles2}): {similarity:.4f}")
### This prints the Tanimoto similarity between the two compounds, computed from their molecular fingerprints as a measure of how similar they are.
4.2 One-Hot Encoding of Molecules
Definition: one-hot encoding is a method that converts the atoms in a molecule into fixed-length binary vectors.
Each position in the vector corresponds to a specific atom type: the value is 1 if that atom type is present at that position in the molecule and 0 otherwise. This method captures the exact sequence of atoms in the molecule.
One-hot encoding: for a molecule given by a SMILES string, we can convert each atom in the molecule into a one-hot vector.
For example, consider the molecule "CCO" (ethanol). Its one-hot encoding can be generated using a predefined set of atoms.
■ RDKit example code:
import numpy as np
from rdkit import Chem
import matplotlib.pyplot as plt
# Define the list of atoms used in the one-hot encoding
atoms = ['C', 'H', 'O', 'N', 'S', 'P', 'F', 'Cl', 'Br', 'I']
# Create a dictionary mapping each atom symbol to an index,
# allowing fast index lookup during one-hot encoding
atom_to_index = {atom: i for i, atom in enumerate(atoms)}
def smiles_to_one_hot(smiles, atom_to_index, max_length=20):
    """
    Convert a SMILES string into a one-hot encoded matrix representation of its atoms.
    Parameters:
    - smiles (str): SMILES representation of the molecule.
    - atom_to_index (dict): dictionary mapping atom symbols to indices.
    - max_length (int): maximum number of atoms to consider in the encoding.
    Returns:
    - one_hot_matrix (np.ndarray): matrix of shape (max_length, len(atom_to_index)),
      where each row represents one atom encoded as a one-hot vector.
    """
    # Parse the SMILES string into a molecule object
    mol = Chem.MolFromSmiles(smiles)
    # Check that the SMILES string is valid
    if mol is None:
        raise ValueError("Invalid SMILES string")
    # Extract the atom symbols from the molecule
    atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    # If the molecule has more atoms than max_length, truncate it
    if len(atom_symbols) > max_length:
        atom_symbols = atom_symbols[:max_length]
    # Initialize a zero matrix of shape (max_length, number of unique atoms)
    one_hot_matrix = np.zeros((max_length, len(atom_to_index)), dtype=int)
    # Fill in a 1 at the appropriate position for each atom symbol
    for i, symbol in enumerate(atom_symbols):
        if symbol in atom_to_index:
            # Set the corresponding position in the matrix to 1
            one_hot_matrix[i, atom_to_index[symbol]] = 1
        else:
            raise ValueError(f"Unknown atom: {symbol}")
    return one_hot_matrix
smiles = "C(C(C(CS)O)O)S"
# Get the one-hot encoded matrix for the molecule
one_hot_matrix = smiles_to_one_hot(smiles, atom_to_index)
# Print the shape and contents of the one-hot matrix
print("One-Hot matrix representation of smiles is:")
print(one_hot_matrix)
print('Its shape is ', one_hot_matrix.shape)
# Plot the one-hot matrix
plt.figure(figsize=(10, 6))
# Display the one-hot matrix as an image (each row is an atom, each column an atom type)
plt.imshow(one_hot_matrix, cmap='binary', aspect='auto')
# Add a colorbar showing the binary scale
plt.colorbar()
# Set the title and axis labels of the plot
plt.title('One-Hot Encoding of Molecule')
plt.xlabel('Atom Type')  # the x-axis shows the atom type (e.g., C, N, O)
plt.ylabel('Position in Sequence')  # the y-axis shows the atom's position in the molecule
# Set the x-axis ticks to display the atom symbols
plt.xticks(ticks=np.arange(len(atom_to_index)), labels=atoms)
# Display the plot
plt.show()
4.3 MPNN
Definition: MPNNs (Message Passing Neural Networks) are a neural network architecture designed specifically for learning from graph-structured data, particularly in cheminformatics and drug discovery. These networks operate by iteratively passing messages between the nodes (atoms) of a molecular graph, which lets them capture local chemical environments and relationships.
Advantages of MPNNs: flexibility, interpretability, and transferability, which make them valuable for molecular representation learning and predictive modeling tasks.
Graph representation: in a molecular graph, atoms are represented as nodes and bonds as edges. Each node has a feature vector, which can include atomic properties such as atomic number, valence, and hybridization.
■ Message passing: during the message-passing phase, each node aggregates information from its neighboring nodes. This process is repeated for several iterations to let information propagate through the graph.
■ PyTorch Geometric example code:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import MessagePassing, global_mean_pool
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt
# Example SMILES for ibuprofen (used to generate an MPNN embedding)
ibuprofen_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
# Convert the SMILES into an RDKit molecule object
ibuprofen_molecule = Chem.MolFromSmiles(ibuprofen_smiles)
AllChem.Compute2DCoords(ibuprofen_molecule)  # generate 2D coordinates for visualization
# Define a function to extract atom features: atomic number, degree, formal charge, hybridization, and aromaticity
def atom_features(atom):
    return [
        atom.GetAtomicNum(),           # atomic number
        atom.GetDegree(),              # number of bonds the atom has
        atom.GetFormalCharge(),        # formal charge on the atom
        int(atom.GetHybridization()),  # hybridization state (sp, sp2, sp3, etc.)
        atom.GetIsAromatic()           # whether the atom is part of an aromatic ring
    ]
# Convert a molecule into a graph (nodes represent atoms, edges represent bonds)
def mol_to_graph(mol):
    # Create a list of atom features
    atom_features_list = []
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_features(atom))
    # Create the edge index (bonds between atoms)
    edge_index = []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()  # index of the first atom in the bond
        j = bond.GetEndAtomIdx()    # index of the second atom in the bond
        edge_index.append([i, j])   # add both directions (i -> j and j -> i)
        edge_index.append([j, i])
    # Convert the atom features into a tensor
    x = torch.tensor(atom_features_list, dtype=torch.float)
    # Convert the edge index into a tensor and make sure the shape is correct
    edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)
# Build the molecular graph for ibuprofen
graph = mol_to_graph(ibuprofen_molecule)
# Print the number of nodes (atoms) and edges (bonds) in the graph
# Nodes represent atoms and edges represent bonds between atoms
print("Number of nodes (atoms):", graph.num_nodes)  # number of atoms in ibuprofen
print("Number of edges (bonds):", graph.num_edges)  # number of edges in ibuprofen
# Define the Message Passing Neural Network (MPNN) class
class MPNN(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(MPNN, self).__init__(aggr='add')  # use 'add' as the aggregation method (summing messages)
        self.lin = nn.Linear(in_channels, out_channels)  # a linear layer to transform node features
    def forward(self, x, edge_index):
        # Apply propagation (message passing) through the network
        return self.propagate(edge_index, x=self.lin(x))  # apply the linear transformation and propagate messages
    def message(self, x_j):
        # In this example, the message is simply the neighboring atom's features (no transformation)
        return x_j
    def update(self, aggr_out):
        # The update function returns the aggregated messages (i.e., the updated node features)
        return aggr_out
# Define a graph-level MPNN model that processes the graph and produces a final representation
class GraphLevelMPNN(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(GraphLevelMPNN, self).__init__()
        self.conv1 = MPNN(in_channels, 64)       # first MPNN layer with 64 output channels
        self.conv2 = MPNN(64, 64)                # second MPNN layer with 64 output channels
        self.lin = nn.Linear(64, out_channels)   # final linear layer producing the desired embedding size
    def forward(self, data):
        # Apply the MPNN layers and a global pooling operation
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))  # apply the first MPNN layer and ReLU activation
        x = F.relu(self.conv2(x, edge_index))  # apply the second MPNN layer and ReLU activation
        x = global_mean_pool(x, data.batch)    # global mean pooling to obtain a graph-level representation
        x = self.lin(x)                        # apply the final linear layer to get the output
        return x
# Instantiate the model with the input feature size and the output size (embedding dimension)
mpnn_model = GraphLevelMPNN(graph.x.shape[1], 320)
# Create a batch tensor (required for batching multiple graphs, even though we only have one graph here)
graph.batch = torch.zeros(graph.num_nodes, dtype=torch.long)
# Get the embedding representation for ibuprofen
mpnn_output = mpnn_model(graph)
# Print the MPNN embedding representation of the ibuprofen molecule
print("The MPNN representation embedding for ibuprofen is:\n", mpnn_output)
Question 6: Can you generate the MPNN representation for CC(=O)C(=O)[O-]?
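One possible solution, reusing the mol_to_graph function and GraphLevelMPNN class defined above (the exact values will vary between runs because the model weights are randomly initialized and untrained):
# A possible answer to Question 6: an MPNN representation for pyruvate
pyruvate = Chem.MolFromSmiles("CC(=O)C(=O)[O-]")
pyruvate_graph = mol_to_graph(pyruvate)
pyruvate_graph.batch = torch.zeros(pyruvate_graph.num_nodes, dtype=torch.long)
pyruvate_model = GraphLevelMPNN(pyruvate_graph.x.shape[1], 320)
print("The MPNN representation for pyruvate is:\n", pyruvate_model(pyruvate_graph))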
4.4 GCN
Definition: a GCN (Graph Convolutional Network) is a neural network architecture designed specifically for learning from graph-structured data. It extends the idea of convolutional neural networks (CNNs) to operate on graphs, capturing the relationships and dependencies between nodes.
A GCN operates through multiple neural network layers: nodes are initialized with feature vectors, messages are passed between neighboring nodes, information is aggregated, and node features are updated.
GCNs offer several advantages, such as parameter sharing, capturing local connectivity, and scalability, which make them suitable for tasks like drug-target interaction (DTI) prediction.
Graph representation: in a molecular graph, atoms are represented as nodes and bonds as edges. Each node has a feature vector, which can include atomic properties such as atomic number, valence, and hybridization.
Convolutional layers: a GCN uses convolutional layers to aggregate information from neighboring nodes; each layer updates the node features based on information aggregated from the node's neighbors.
■ PyTorch Geometric example code:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt
# Define the SMILES string for the ibuprofen molecule
ibuprofen_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
# Convert SMILES string to an RDKit molecule object
ibuprofen_molecule = Chem.MolFromSmiles(ibuprofen_smiles)
# Compute 2D coordinates for visualization (this step is not used in the model but can be helpful for visualizing the molecule)
AllChem.Compute2DCoords(ibuprofen_molecule)
# Function to extract the atom features of each atom in the molecule
def atom_features(atom):
    """
    Extract features for each atom in the molecule.
    Features include atomic number, degree, formal charge, hybridization state, and aromaticity.
    """
    return [
        atom.GetAtomicNum(),           # Atomic number
        atom.GetDegree(),              # Number of bonds the atom has
        atom.GetFormalCharge(),        # Formal charge on the atom
        int(atom.GetHybridization()),  # Hybridization state (sp, sp2, sp3, etc.)
        atom.GetIsAromatic()           # Whether the atom is part of an aromatic ring
    ]
# Function to convert an RDKit molecule into a graph representation
def mol_to_graph(mol):
    """
    Convert a molecule into a graph structure, where:
    - nodes represent atoms and their features
    - edges represent the chemical bonds between atoms
    """
    atom_features_list = []
    # Extract features for every atom in the molecule
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_features(atom))
    edge_index = []
    # Create edges based on the bonds between atoms
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()  # index of the first atom in the bond
        j = bond.GetEndAtomIdx()    # index of the second atom in the bond
        edge_index.append([i, j])   # add the directed edge (i -> j)
        edge_index.append([j, i])   # add the reverse edge (j -> i) to make it undirected
    # Convert the atom features and edge index into PyTorch tensors
    x = torch.tensor(atom_features_list, dtype=torch.float)  # node feature tensor
    edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()  # edge index tensor (must be in COO format)
    # Return a Data object representing the graph
    return Data(x=x, edge_index=edge_index)
# Convert the ibuprofen molecule to a graph
graph = mol_to_graph(ibuprofen_molecule)
# Print the number of nodes (atoms) and edges (bonds) in the graph
# Nodes represent atoms, and edges represent bonds between atoms
print("Number of nodes (atoms):", graph.num_nodes) # Number of atoms in ibuprofen molecule
print("Number of edges (bonds):", graph.num_edges) # Number of bonds in ibuprofen molecule
# Define a Graph Convolutional Network (GCN) model
class GraphLevelGCN(nn.Module):
    def __init__(self, in_channels, out_channels):
        """
        Initialize the GCN model.
        - The model consists of two GCN layers followed by a linear layer to produce the final output representation.
        """
        super(GraphLevelGCN, self).__init__()
        # First GCN convolutional layer: maps input feature size to 64
        self.conv1 = GCNConv(in_channels, 64)
        # Second GCN convolutional layer: maintains 64 features
        self.conv2 = GCNConv(64, 64)
        # Linear layer to map from 64 features to the output size (320 in this case)
        self.lin = nn.Linear(64, out_channels)
    def forward(self, data):
        """
        Forward pass through the GCN model.
        - Propagate features through the GCN layers and apply global pooling.
        """
        x, edge_index = data.x, data.edge_index
        # Apply the first GCN layer and ReLU activation
        x = F.relu(self.conv1(x, edge_index))
        # Apply the second GCN layer and ReLU activation
        x = F.relu(self.conv2(x, edge_index))
        # Apply global mean pooling to aggregate node features into a single graph-level representation
        x = global_mean_pool(x, data.batch)
        # Final linear transformation to produce the output
        x = self.lin(x)
        return x
# Initialize the GCN model with input channels matching the atom feature size (5)
gcn_model = GraphLevelGCN(graph.x.shape[1], 320) # Output dimension is 320
# Create a batch attribute for the graph (necessary for pooling)
# Since we're using a single graph, all nodes belong to the same batch (batch=0 for all nodes)
graph.batch = torch.zeros(graph.num_nodes, dtype=torch.long)
# Get the output representation of the graph from the GCN model
gcn_output = gcn_model(graph)
# Print the resulting graph-level representation from the GCN
print("The GCN representation of ibuprofen:", gcn_output)
# This prints the learned representation (embedding) of the ibuprofen molecule.
5. Sequence Feature Extraction
5.1 One-hot Encoding
One-hot encoding represents each amino acid in a protein sequence as a binary vector. In this representation, each amino acid is mapped to a unique position in the vector: only that position is set to 1, while all other positions are set to 0.
Consider a simplified example with the amino acid sequence "ACDE", where the possible amino acids are "A", "C", "D", and "E".
The one-hot encoding for this sequence would be:
| Amino Acid | A | C | D | E |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| C | 0 | 1 | 0 | 0 |
| D | 0 | 0 | 1 | 0 |
| E | 0 | 0 | 0 | 1 |
So the representation of A is: [1, 0, 0, 0]
import numpy as np
# Function to perform one-hot encoding of an amino acid sequence
def one_hot_encode(seq):
    """
    This function takes an amino acid sequence as input and returns a one-hot encoded numpy array.
    Each amino acid in the sequence is mapped to a 20-dimensional vector (one-hot encoded).
    """
    # Define a dictionary that maps each amino acid to a unique integer
    aa_to_int = {'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4,
                 'G': 5, 'H': 6, 'I': 7, 'K': 8, 'L': 9,
                 'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14,
                 'S': 15, 'T': 16, 'V': 17, 'W': 18, 'Y': 19}
    # Initialize an array of zeros with shape (len(seq), 20), where 20 is the size of the amino acid alphabet
    one_hot_array = np.zeros((len(seq), 20))
    # Iterate over the amino acid sequence and update the one-hot array
    for i, amino_acid in enumerate(seq):  # enumerate gives both the index and the amino acid
        if amino_acid in aa_to_int:  # make sure the amino acid is valid
            # Set the column corresponding to the amino acid's index to 1
            one_hot_array[i, aa_to_int[amino_acid]] = 1
        else:
            raise ValueError("Unknown amino acid encountered: " + amino_acid)
    return one_hot_array
# Example amino acid sequence (protein sequence)
seq = "ACDEFGHIKLMNPQRSTVWYWVTSRQPNMLKIHGFEDCA"
# Alternatively, you can use a longer sequence (see the commented-out example below)
# seq = "MLDDRARMEAAKKEKVEQILAEFQLQEEDLKKVMRRMQKEMDRGLRLETHEEASVKMLPTYVRSTPEGSEVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKTKHQMYSIPEDAMTGTAEMLFDYISECISDFLDKHQMKHKKLPLGFTFSFPVRHEDIDKGILLNWTKGFKASGAEGNNVVGLLRDAIKRRGDFEMDVVAMVNDTVATMISCYYEDHQCEVGMIVGTGCNACYMEEMQNVELVEGDEGRMCVNTEWGAFGDSGELDEFLLEYDRLVDESSANPGQQLYEKLIGGKYMGELVRLVLLRLVDENLLFHGEASEQLRTRGAFETRFVSQVESDTGDRKQIYNILSTLGLRPSTTDCDIVRRACESVSTRAAHMCSAGLAGVINRMRESRSEDVMRITVGVDGSVYKLHPSFKERFHASVRRLTPSCEITFIESEEGSGRGAALVSAVACKKACMLGQ"
# Perform one-hot encoding on the defined sequence
encoded_seq = one_hot_encode(seq)
# Print the resulting one-hot encoded array
print("One-Hot Encoded Array of Sequence Is:")
print(encoded_seq)
# Note:
# The one-hot encoded representation is a matrix where:
# - Each row corresponds to an amino acid in the sequence.
# - Each column corresponds to one of the 20 standard amino acids.
# A '1' in the matrix indicates the presence of that amino acid at the given position in the sequence.
# '0's indicate the absence of the other amino acids at that position.
#
# This encoding is useful for machine learning tasks, such as:
# - Protein classification
# - Sequence analysis
# - Feature extraction for protein structure prediction
5.2 K-mer (N-gram) Encoding
Definition: k-mer encoding, also known as N-gram encoding, breaks a sequence into substrings of length k and counts the frequency of each substring. This method is commonly used in bioinformatics to capture local sequence patterns.
■ Consider a simplified amino acid sequence "ACDEFG" with k = 3.
■ The sequence is broken into the following 3-mers: "ACD", "CDE", "DEF", "EFG".
■ The frequency of each 3-mer is then counted.
■ Python example code:
import numpy as np
from collections import defaultdict
import pandas as pd
def extract_ngrams(seq, n=3):
    """
    Extract n-grams from a protein sequence.
    This function takes a sequence and a specified n (default 3) and returns a list of n-grams from the sequence.
    E.g., for n=3 and seq="MKQAWFI", it will return ['MKQ', 'KQA', 'QAW', 'AWF', 'WFI'].
    """
    ngrams = [seq[i:i+n] for i in range(len(seq)-n+1)]  # list comprehension to extract all n-grams of length n
    return ngrams
def build_ngram_index(ngrams_list):
    """
    Build an index for n-grams to map each unique n-gram to a unique number.
    This function uses a defaultdict to create an index that maps each n-gram to an integer.
    The index starts from 1 for the first unique n-gram.
    """
    index = defaultdict(int)  # default to 0 for unseen n-grams
    for ngrams in ngrams_list:
        for ngram in ngrams:
            if ngram not in index:
                index[ngram] = len(index) + 1  # assign a unique index starting from 1
    return index
def vectorize_sequences(ngrams_list, ngram_index):
    """
    Convert lists of n-grams into a fixed-size vector based on the n-gram index.
    This function converts each list of n-grams into a count vector of length equal to the number of unique n-grams.
    For each sequence, it records the occurrences of each n-gram.
    """
    vector_length = len(ngram_index)  # the size of the vector equals the number of unique n-grams
    vectors = np.zeros((len(ngrams_list), vector_length))  # initialize a zero matrix for the vectors
    for i, ngrams in enumerate(ngrams_list):
        for ngram in ngrams:
            if ngram in ngram_index:
                vectors[i, ngram_index[ngram] - 1] += 1  # increment the position given by the n-gram index
    return vectors
# Example protein sequences
sequences = ["MKQAWFI", "MRRKQLEDKVEELLSKNYHLENEVARLKKLVMKQAWFIWFIWFI"]
# Extract 3-grams from each sequence
n = 3
ngrams_list = [extract_ngrams(seq, n) for seq in sequences]
# print("ngrams_list:", ngrams_list)
# Build the index for the extracted n-grams
ngram_index = build_ngram_index(ngrams_list)
# print("ngram_index:", ngram_index)
# Convert sequences to fixed-size vectors based on the n-gram index
vectors = vectorize_sequences(ngrams_list, ngram_index)
# print("vectors:", vectors)
# use pandas to build DataFrame
sorted_ngrams = [ng for ng, idx in sorted(ngram_index.items(), key=lambda item: item[1])]
df = pd.DataFrame(data=vectors, columns=sorted_ngrams, index=[f"Sequence {i+1}" for i in range(len(vectors))])
print('The K-mer (N-gram) Encoding of sequences is:\n', vectors)
print('Using the n-grams from ngram_index as column headers and the vectors as rows gives the following table:')
df
5.3 Protein Language Models
Definition: protein language models such as ESM2 (Evolutionary Scale Modeling 2) are pre-trained deep learning models that encode protein sequences into numerical representations. These models capture contextual information about the amino acids in a sequence as well as structural properties.
• Example:
■ If the download is slow, you can fetch the model from https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt and place it in the /data/home/{yourusername}/.cache/torch/hub/checkpoints/ directory.
import torch
import esm
def esm_method(sequence):
    """
    This function takes a protein sequence as input and returns the token embeddings from an ESM model.
    It uses the pre-trained ESM2 model to encode the sequence and retrieve the embeddings from the last layer.
    """
    # Set the device to GPU if available, else use CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Load the pre-trained ESM2 model and its corresponding alphabet (vocabulary)
    model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
    model = model.to(device)  # Move the model to the appropriate device (GPU/CPU)
    model.eval()  # Set the model to evaluation mode (disables dropout, etc.)
    # Encode the input protein sequence into token IDs using the alphabet
    token_ids = torch.tensor([alphabet.encode(sequence)], device=device)
    # Get the embeddings by passing the token IDs through the model
    with torch.no_grad():  # Disable gradient computation to save memory
        results = model(token_ids, repr_layers=[6])  # Extract the representation from layer 6 (the last layer)
        token_embeddings = results["representations"][6]  # Retrieve the embeddings from the specified layer
    # Optionally, you can average the token embeddings over all tokens to get a single sequence embedding
    # sequence_embedding = token_embeddings.mean(dim=1)
    return token_embeddings  # Return the token embeddings for further processing
# Example protein sequence
sequence = "MKWVTFISLLFLFSSAYSAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
# Generate the token embeddings for the sequence
embeddings = esm_method(sequence)
# Print the shape and values of the generated embeddings
print("Encoded representations shape:", embeddings.shape)
print("Encoded representations:", embeddings)
Explanation:
■ Tokenization: the sequence is tokenized with the ESM2 alphabet, which converts the amino acid sequence into a format the model can process.
■ Model output: the model processes the token sequence and produces embeddings; results["representations"][6] contains a contextual embedding for each amino acid in the sequence.
■ Embedding shape: [1, sequence_length, embedding_dim]. For example, with a sequence length of 60 and an embedding dimension of 320, the shape is [1, 60, 320].
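If a single fixed-size vector per sequence is needed (as in the clustering section below), one common choice is to average the per-residue embeddings; a minimal sketch continuing from the example above:
# embeddings has shape [1, sequence_length, 320] (from esm_method above)
sequence_embedding = embeddings.mean(dim=1)  # average over residues
print(sequence_embedding.shape)  # torch.Size([1, 320])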
6. Clustering Practice
6.1 Protein similarity calculation
Overview: protein similarity calculation is a common bioinformatics task used to compare how similar proteins are. The exact method depends on the perspective and tools used: it can be based on sequence, structure, or functional domains. Here we focus on sequence-based similarity.
■ Consider two protein sequences, e.g., "ACDEFG" and "ACDEGH"; the code below applies the same idea to two longer sequences.
from Bio import pairwise2
def calculate_identity(seq1, seq2):
    # Perform a global pairwise sequence alignment.
    # The function globalxx uses a simple scoring system where matches score +1 and mismatches score 0.
    alignments = pairwise2.align.globalxx(seq1, seq2)
    # Extract the best alignment from the list of possible alignments.
    # alignments[0] contains the best alignment (highest score).
    best_alignment = alignments[0]
    # Index 2 of the alignment tuple is the score; with globalxx scoring, it equals the number of
    # positions where the two sequences have identical characters (the match count).
    matches = best_alignment[2]
    # Index 4 is the end position of the alignment, i.e., the total number of aligned columns (including gaps).
    total = best_alignment[4]
    # Calculate the identity percentage: the proportion of matches among all aligned columns.
    identity_percent = (matches / total) * 100
    # Return the calculated identity percentage.
    return identity_percent
# Example sequences to compare
seq1 = 'MTEITAAMVKELRESTGAGMMDCKNALSETQHEWFAAKRQGKLSPWITGRKTGQDEHILLMNDGWQ'
seq2 = 'MTEITAAMVKELRESTGAGMMDCKNALSETQHEWFAALLMNDGWQ'
# Calculate the identity percentage between the two sequences
identity = calculate_identity(seq1, seq2)
# Print the identity percentage
identity
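Since seq2 in this example is seq1 with an internal stretch of 21 residues removed, an optimal globalxx alignment matches all 45 residues of seq2 across 66 aligned columns, so the printed identity should be roughly 68%.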
6.2 Molecular similarity calculation
Example:
■ Consider two compounds represented by their SMILES strings: "CCO" (ethanol) and "CCCO" (1-propanol).
■ We will use the RDKit library to calculate the molecular similarity between these compounds using different types of molecular fingerprints (a MACCS-based variant follows the Morgan example below).
# Importing necessary modules from RDKit
from rdkit import Chem # RDKit's core functionality for handling molecules
from rdkit.Chem import AllChem # RDKit's extended functionality for molecular computations
from rdkit.DataStructs import TanimotoSimilarity # Function to calculate Tanimoto similarity between two bit vectors
# Step 1: Convert SMILES strings into RDKit molecule objects
mol1 = Chem.MolFromSmiles("CCO") # Convert SMILES string for ethanol (C2H5OH) into a molecule object
mol2 = Chem.MolFromSmiles("CCCO") # Convert SMILES string for propanol (C3H7OH) into a molecule object
# Step 2: Generate Morgan Fingerprints for each molecule
# Morgan fingerprint is a type of circular fingerprint (substructural representation of a molecule)
# '2' is the radius of the fingerprint (i.e., the number of bonds considered around each atom)
# '1024' is the length of the bit vector representing the fingerprint (higher bits = more detailed)
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2, nBits=1024) # Generate fingerprint for mol1 (ethanol)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2, nBits=1024) # Generate fingerprint for mol2 (propanol)
# Step 3: Calculate the Tanimoto similarity between the two fingerprints
# The Tanimoto similarity measures how similar two sets (or bit vectors) are.
# A similarity score of 1 means the fingerprints are identical, and 0 means no similarity.
similarity = TanimotoSimilarity(fp1, fp2)
# Step 4: Output the similarity score
print(f"The tanimoto similarity between ethanol and propanol is: {similarity}") # Print the Tanimoto similarity between ethanol and propanol
t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is a machine learning algorithm for dimensionality reduction that is especially well suited to visualizing high-dimensional datasets. Unlike PCA (Principal Component Analysis), which focuses on preserving global structure, t-SNE aims to preserve local structure, making it particularly useful for visualizing clusters and patterns in complex data.
# Import necessary libraries
import pickle # For loading the precomputed embeddings from a file
import numpy as np # For numerical operations
import torch # PyTorch library for tensor operations
from Bio import SeqIO # Biopython for handling biological sequences in FASTA format
from sklearn.manifold import TSNE # For dimensionality reduction using t-SNE
import matplotlib.pyplot as plt # For plotting the 2D visualization
import esm # For using the pre-trained ESM model (Evolutionary Scale Modeling)
# Function to obtain sequence embeddings using ESM model
def esm_method(sequence):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Check if a GPU is available for faster processing
    # Load the pre-trained ESM model and its alphabet
    model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
    model = model.to(device)  # Move the model to the GPU if available
    model.eval()  # Set the model to evaluation mode
    # Encode the sequence into token IDs using the model's alphabet
    token_ids = torch.tensor([alphabet.encode(sequence)], device=device)
    # Pass the token IDs through the model to get the embeddings (representation of the sequence)
    with torch.no_grad():  # Disable gradient computation for inference
        results = model(token_ids, repr_layers=[6])  # Get the output embeddings from the final layer
        token_embeddings = results["representations"][6]  # Extract the embeddings from this layer
    # Take the mean of the token embeddings to get a fixed-size representation of the entire sequence
    sequence_embedding = token_embeddings.mean(dim=1)
    return sequence_embedding
# Function to read sequences from a FASTA file
def read_fasta(file_path):
    sequences = []  # List to store sequences from the FASTA file
    for record in SeqIO.parse(file_path, "fasta"):  # Parse the FASTA file
        sequences.append(str(record.seq))  # Extract the sequence as a string
        if len(sequences) >= 40:  # Limit to the first 40 sequences for demonstration
            break
    return sequences
# Example usage: Reading sequences from a FASTA file (this part is commented out as the file path is not provided)
# fasta_file = 'pr_dataset.fasta'
# sequences = read_fasta(fasta_file)
# embeddings = [] # List to store the sequence embeddings
# for seq in sequences: # For each sequence in the FASTA file
# embedding = esm_method(seq) # Get the embedding for the sequence using ESM model
# embeddings.append(embedding.cpu().detach().numpy()) # Convert tensor to NumPy array and store
# embeddings = np.vstack(embeddings) # Stack all embeddings into a single matrix
# Alternatively, load precomputed embeddings from a pickle file (for demonstration)
embeddings = pickle.load(open('embedding.pkl', 'rb'))
# Apply t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the dimensionality of the embeddings
# from high-dimensional space to 2D for visualization
tsne = TSNE(n_components=2, random_state=42) # Set the number of components (2D) and random seed for reproducibility
embeddings_2d = tsne.fit_transform(embeddings) # Perform t-SNE dimensionality reduction
# Visualize the 2D representation of the embeddings using a scatter plot
plt.figure(figsize=(10, 8)) # Set figure size for better readability
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=5) # Plot the 2D points with small markers
plt.title('t-SNE visualization of embeddings from FASTA sequences') # Set title for the plot
plt.xlabel('t-SNE Component 1') # Label for the X-axis
plt.ylabel('t-SNE Component 2') # Label for the Y-axis
plt.grid() # Display grid on the plot
plt.show() # Show the plot
Summary:
■ Feature extraction is a key step toward understanding enzyme-substrate interactions.
■ Python and its related libraries provide powerful tool support.
■ Combining multiple feature extraction methods can improve prediction accuracy.