Download the data from the CATH website (CATH)
wget -P ./ http://download.cathdb.info/cath/releases/latest-release/non-redundant-data-sets/cath-dataset-nonredundant-S40.list
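To find complex structures that contain multiple polypeptide chains, starting from the cath-dataset-nonredundant-S40.list file of the CATH database, the workflow combines the CATH data with the corresponding PDB files. The scripts can be written in Python or another scripting language. The rough analysis steps, the tools that may be needed, and the scripting approach are outlined below.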
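As a first pass at this workflow, the domain IDs in the list can be grouped by PDB ID to see which entries appear with more than one chain. The following is a minimal sketch, not the final pipeline. It assumes each line holds a 7-character domain ID such as `1a15A00` (4-character PDB ID, 1-character chain ID, 2-character domain number). The helper names `parse_domain_id` and `chains_per_pdb` are hypothetical, and every example ID other than `1a15A00` is made up for illustration.

```python
from collections import defaultdict


def parse_domain_id(domain_id):
    """Split a 7-character CATH domain ID such as '1a15A00' into its parts."""
    domain_id = domain_id.strip()
    if len(domain_id) != 7:
        raise ValueError(f"unexpected domain ID: {domain_id!r}")
    # Assumed layout: PDB ID (4 chars), chain ID (1 char), domain number (2 chars)
    return domain_id[:4], domain_id[4], domain_id[5:]


def chains_per_pdb(domain_ids):
    """Map each PDB ID to the set of chain IDs seen for it in the list."""
    chains = defaultdict(set)
    for raw in domain_ids:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        pdb_id, chain_id, _domain_number = parse_domain_id(raw)
        chains[pdb_id].add(chain_id)
    return dict(chains)


# Illustrative input; only '1a15A00' comes from the notes, the rest is made up.
example_ids = ["1a15A00", "1a15B00", "2xyzA01", "2xyzA02"]
grouped = chains_per_pdb(example_ids)

# PDB entries that show up with more than one chain in the list
multi_chain = sorted(pdb for pdb, chains in grouped.items() if len(chains) > 1)
print(multi_chain)
```

Because the S40 set is non-redundant, a PDB entry with only one chain in the list may still be a multi-chain complex, so this grouping only yields candidates. The actual chain count has to be confirmed from the PDB file itself.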
Data analysis code
import pandas as pd
### 1. Read the CATH data
cath_file = '/Users/zhengxueming/Downloads/cath-dataset-nonredundant-S40.list'
pdb_ids = []
chain_ids = []
pdb_chain_ids = []
domain_ids = []
with open(cath_file, 'r') as file:
    for line in file:
        line = line.strip()
        #print(line.strip()) # use strip() to remove the trailing newline from each line
        # 1a15A00
        pdb_id = line[:4] # extract the PDB ID
        chain_id = line[4:5] # extract the chain ID
        domain_id = line[5