I couldn't find anything about this online apart from a crawler-based downloader, which is far too slow at the million-plus scale.
First, download accession_ids.csv from the AlphaFold FTP site (Index of /pub/databases/alphafold on ebi.ac.uk). It lists the UniProt accessions of every structure AlphaFold has predicted; see README.txt there for the column descriptions. If the official download is slow, you can paste the link into a download manager such as FDM.
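The CSV can also be fetched from Python with a streamed requests download so the whole file never has to sit in memory. This is a minimal sketch; the HTTPS URL below is an assumption based on the EBI FTP layout, so double-check it against the index page:

```python
import requests

def stream_to_file(chunks, path):
    """Write an iterable of byte chunks to disk incrementally."""
    with open(path, "wb") as w:
        for chunk in chunks:
            if chunk:  # skip keep-alive empty chunks
                w.write(chunk)

def download_accession_ids(path="accession_ids.csv"):
    # Assumed mirror URL for the FTP site; verify before running.
    url = "https://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.csv"
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        stream_to_file(r.iter_content(chunk_size=1 << 20), path)
```

Calling `download_accession_ids()` saves the file next to the script; the 1 MiB chunk size is arbitrary.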
After downloading, the file contents look like this:
A0A2I1YHU5,1,933,AF-A0A2I1YHU5-F1,4
A0A5H2Z360,1,342,AF-A0A5H2Z360-F1,4
A0A6L5B7P9,1,275,AF-A0A6L5B7P9-F1,4
The last two columns are the AlphaFold accession and the version number, so the download links can be built by hand:
with open("accession_ids.csv") as f, open("./1.AlphaFoldDownloadLink.txt", "w") as w:
    for line in f:
        spl = line.strip("\n").split(",")  # accession_ids.csv is comma-separated
        uniprot = spl[0]
        AlphaFold_Accession, Version = spl[-2], spl[-1]
        AlphaFold_PDB_Link = f"https://alphafold.ebi.ac.uk/files/{AlphaFold_Accession}-model_v{Version}.pdb"
        w.write(f"{uniprot}\t{AlphaFold_PDB_Link}\n")
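As a quick sanity check on the link format, here is a tiny helper (hypothetical, just for illustration) that maps one sample CSV row to its URL:

```python
def accession_to_url(csv_line: str) -> str:
    # Columns: UniProt accession, first residue, last residue,
    # AlphaFold accession, latest version (see README.txt)
    fields = csv_line.strip().split(",")
    af_accession, version = fields[-2], fields[-1]
    return f"https://alphafold.ebi.ac.uk/files/{af_accession}-model_v{version}.pdb"

print(accession_to_url("A0A2I1YHU5,1,933,AF-A0A2I1YHU5-F1,4"))
# https://alphafold.ebi.ac.uk/files/AF-A0A2I1YHU5-F1-model_v4.pdb
```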
The PDB files themselves can then be downloaded with Python's requests. A plain for loop is too slow, so use multiple threads:
import os
import threading
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

os.makedirs("./AlphaFold", exist_ok=True)
count = 0
write_lock = threading.Lock()  # serialize appends from worker threads

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

# Retry transient server errors with exponential backoff
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)
session.mount('https://', adapter)

def download(uniprot, link, count):
    # Here I pack 1000 PDB files into one text file; adjust to taste
    try:
        response = session.get(link, timeout=60)
        response.raise_for_status()  # raise HTTPError on non-2xx status
        pdbcontent = "@".join(response.text.splitlines())
        with write_lock:
            with open(f"./AlphaFold/AlphaFold_{count // 1000}.txt", "a") as w:
                w.write(f"{uniprot}\t{pdbcontent}\n")
        return 1
    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error downloading {link}: {e}")
        return 0

# Download with a thread pool
with open("./1.AlphaFoldDownloadLink.txt") as f:
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for line in f:
            uniprot, link = line.strip("\n").split("\t")
            futures.append(executor.submit(download, uniprot, link, count))
            count += 1
            if len(futures) >= 100:  # cap the number of queued tasks
                for future in futures:
                    future.result()  # wait for the current batch to finish
                futures = []  # clear the batch
        # drain any remaining tasks
        for future in futures:
            future.result()

print(f"Total processed: {count}")
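Since each bundle stores one PDB per line with newlines replaced by "@", reading a structure back just reverses that substitution. A minimal sketch (the function name is mine):

```python
def iter_bundle(path):
    """Yield (uniprot, pdb_text) pairs from one AlphaFold_N.txt bundle."""
    with open(path) as f:
        for line in f:
            uniprot, packed = line.rstrip("\n").split("\t", 1)
            yield uniprot, packed.replace("@", "\n")

# Usage:
# for uniprot, pdb in iter_bundle("./AlphaFold/AlphaFold_0.txt"):
#     ...  # parse or write out pdb as needed
```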
Depending on your network, with a few more threads you can manage roughly a hundred structures per second.