AlphaFold2 Configuration and Deployment
Reference: https://github.com/kalininalab/alphafold_non_docker
1.Environment
CPU: Intel® Xeon® Gold 6240 CPU @ 2.60GHz
GPU: Tesla V100S PCIe 32GB * 4
OS: Ubuntu 20.04.2 LTS
CUDA: 11.3
NVIDIA driver: 470.57.02
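To compare your own machine with the list above, the versions can be queried from the shell. The following is a minimal sketch: `report_tool` is a hypothetical helper name, and it only assumes that installed tools are on `PATH`.

```shell
# Run a tool and show its output, or print a note if the tool is not installed.
# report_tool is a hypothetical helper name used only for this sketch.
report_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" "$@" 2>&1
    else
        echo "$tool: not found"
    fi
}
```

For example, `report_tool nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` reports the GPU model and driver, `report_tool nvcc --version` reports the CUDA toolkit, and `report_tool lsb_release -d` reports the OS release.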
2.Install environment dependencies
Choose the cudnn and cudatoolkit versions to match your own machine's configuration.
conda install -y -c conda-forge openmm==7.5.1 cudnn==8.2.1.32 cudatoolkit==11.0.3 pdbfixer==1.7
conda install -y -c bioconda hmmer==3.3.2 hhsuite==3.3.0 kalign2==2.04
3.Download chemical properties to the common folder
wget -q -P alphafold/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
4.Install alphafold dependencies
pip install absl-py==0.13.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.4 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.2.14 ml-collections==0.1.0 numpy==1.19.5 scipy==1.7.0 tensorflow==2.5.0
pip install --upgrade jax jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_releases.html
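Notes:
1.The jaxlib build must be the right one for your CUDA version; otherwise AlphaFold cannot use the GPU when it runs.
2.jaxlib has no Windows package, so this setup needs Linux; without a Linux machine it is not worth attempting.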
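A quick way to confirm that the installed jaxlib matches CUDA is to ask JAX which devices it sees. This is a minimal sketch: `check_jax_devices` is a hypothetical helper name, and it assumes the conda environment with jax installed is already activated.

```shell
# Print the devices JAX can see, using the given interpreter (default: python).
# Returns 2 if the interpreter is missing; otherwise returns python's exit status.
# check_jax_devices is a hypothetical helper name used only for this sketch.
check_jax_devices() {
    py="${1:-python}"
    if ! command -v "$py" >/dev/null 2>&1; then
        echo "python interpreter '$py' not found"
        return 2
    fi
    "$py" -c 'import jax; print(jax.devices())'
}
```

Call it as `check_jax_devices`. If the printed list contains only CPU devices, the jaxlib build most likely does not match the CUDA installation.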
5.Apply OpenMM patch
cd ~/anaconda3/envs/alphafold/lib/python3.8/site-packages/
patch -p0 < $alphafold_path/docker/openmm.patch
Here $alphafold_path is the location of the alphafold source tree on your machine.
6.Download all databases
Follow the official download instructions: https://github.com/deepmind/alphafold#genetic-databases
The unpacked data take up about 2.2 TB; storing them on an SSD gives better performance.
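Since the download is large, it helps to confirm that nothing is missing before the first run. The sketch below checks for the top-level database directories that the run script in section 7 reads in the full_dbs configuration. `missing_databases` is a hypothetical helper name, and the directory names are taken from the paths in that script.

```shell
# Print the name of each expected database directory that is missing under data_dir.
# Directory names follow the paths used by the run script (full_dbs configuration);
# the reduced_dbs preset reads small_bfd instead of bfd and uniclust30.
# missing_databases is a hypothetical helper name used only for this sketch.
missing_databases() {
    data_dir="$1"
    for name in bfd mgnify pdb70 pdb_mmcif uniclust30 uniref90; do
        if [ ! -d "$data_dir/$name" ]; then
            echo "$name"
        fi
    done
}
```

`missing_databases ./alphafold_data` prints nothing when every directory is present. `df -h ./alphafold_data` shows how much space is left on the disk that holds the data.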
7.Run AlphaFold2
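There are two options:
1.Edit run_alphafold.py in the source tree directly, setting the path of each database, the output path, and the other parameters (for example the GPU settings and the amino-acid sequence of the protein to predict [stored in a FASTA file]).
2.Write a script that assembles the database paths, sets the related parameters, and runs run_alphafold.py, as shown below (this script is by the author of the reference above).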
#!/bin/bash
# Description: AlphaFold non-docker version
# Author: Sanjay Kumar Srikakulam
usage() {
    echo ""
    echo "Please make sure all required parameters are given"
    echo "Usage: $0 <OPTIONS>"
    echo "Required Parameters:"
    echo "-d <data_dir> Path to directory of supporting data"
    echo "-o <output_dir> Path to a directory that will store the results."
    echo "-m <model_names> Names of models to use (a comma separated list)"
    echo "-f <fasta_path> Path to a FASTA file containing one sequence"
    echo "-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets"
    echo "Optional Parameters:"
    echo "-n <openmm_threads> OpenMM threads (default: all available cores)"
    echo "-b <benchmark> Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'False')"
    echo "-g <use_gpu> Enable NVIDIA runtime to run with GPUs (default: True)"
    echo "-a <gpu_devices> Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)"
    echo "-p <preset> Choose preset model configuration - no ensembling and smaller genetic database config (reduced_dbs), no ensembling and full genetic database config (full_dbs) or full genetic database config and 8 model ensemblings (casp14)"
    echo ""
    exit 1
}
while getopts ":d:o:m:f:t:g:n:a:p:b" i; do
    case "${i}" in
    d)
        data_dir=$OPTARG
    ;;
    o)
        output_dir=$OPTARG
    ;;
    m)
        model_names=$OPTARG
    ;;
    f)
        fasta_path=$OPTARG
    ;;
    t)
        max_template_date=$OPTARG
    ;;
    g)
        use_gpu=$OPTARG
    ;;
    n)
        openmm_threads=$OPTARG
    ;;
    a)
        gpu_devices=$OPTARG
    ;;
    p)
        preset=$OPTARG
    ;;
    b)
        benchmark=true
    ;;
    esac
done
# Parse input and set defaults
if [[ "$data_dir" == "" || "$output_dir" == "" || "$model_names" == "" || "$fasta_path" == "" || "$max_template_date" == "" ]] ; then
    usage
fi
if [[ "$benchmark" == "" ]] ; then
    benchmark=false
fi
if [[ "$use_gpu" == "" ]] ; then
    use_gpu=true
fi
if [[ "$gpu_devices" == "" ]] ; then
    gpu_devices=0
fi
if [[ "$preset" == "" ]] ; then
    preset="full_dbs"
fi
if [[ "$preset" != "full_dbs" && "$preset" != "casp14" && "$preset" != "reduced_dbs" ]] ; then
    echo "Unknown preset! Using default ('full_dbs')"
    preset="full_dbs"
fi
# This bash script looks for the run_alphafold.py script in its current working directory, if it does not exist then exits
current_working_dir=$(pwd)
alphafold_script="$current_working_dir/run_alphafold.py"
if [ ! -f "$alphafold_script" ]; then
    echo "Alphafold python script $alphafold_script does not exist."
    exit 1
fi
# Export ENVIRONMENT variables and set CUDA devices for use
# CUDA GPU control (the -g value is compared case-insensitively, so 'True' and 'true' both enable the GPU)
export CUDA_VISIBLE_DEVICES=-1
if [[ "${use_gpu,,}" == true ]] ; then
    export CUDA_VISIBLE_DEVICES=0
    if [[ "$gpu_devices" ]] ; then
        export CUDA_VISIBLE_DEVICES=$gpu_devices
    fi
fi
# OpenMM threads control
if [[ "$openmm_threads" ]] ; then
    export OPENMM_CPU_THREADS=$openmm_threads
fi
# TensorFlow control
export TF_FORCE_UNIFIED_MEMORY='1'
# JAX control
export XLA_PYTHON_CLIENT_MEM_FRACTION='4.0'
# Path and user config (change me if required)
bfd_database_path="$data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
small_bfd_database_path="$data_dir/small_bfd/bfd-first_non_consensus_sequences.fasta"
mgnify_database_path="$data_dir/mgnify/mgy_clusters.fa"
template_mmcif_dir="$data_dir/pdb_mmcif/mmcif_files"
obsolete_pdbs_path="$data_dir/pdb_mmcif/obsolete.dat"
pdb70_database_path="$data_dir/pdb70/pdb70"
uniclust30_database_path="$data_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
uniref90_database_path="$data_dir/uniref90/uniref90.fasta"
# Binary path (change me if required)
hhblits_binary_path=$(which hhblits)
hhsearch_binary_path=$(which hhsearch)
jackhmmer_binary_path=$(which jackhmmer)
kalign_binary_path=$(which kalign)
# Run AlphaFold with required parameters
# python is invoked directly: wrapping the call in $(...) would make bash try to execute whatever it prints to stdout
# 'reduced_dbs' preset does not use bfd and uniclust30 databases
if [[ "$preset" == "reduced_dbs" ]]; then
    python "$alphafold_script" --hhblits_binary_path="$hhblits_binary_path" --hhsearch_binary_path="$hhsearch_binary_path" --jackhmmer_binary_path="$jackhmmer_binary_path" --kalign_binary_path="$kalign_binary_path" --small_bfd_database_path="$small_bfd_database_path" --mgnify_database_path="$mgnify_database_path" --template_mmcif_dir="$template_mmcif_dir" --obsolete_pdbs_path="$obsolete_pdbs_path" --pdb70_database_path="$pdb70_database_path" --uniref90_database_path="$uniref90_database_path" --data_dir="$data_dir" --output_dir="$output_dir" --fasta_paths="$fasta_path" --model_names="$model_names" --max_template_date="$max_template_date" --preset="$preset" --benchmark="$benchmark" --logtostderr
else
    python "$alphafold_script" --hhblits_binary_path="$hhblits_binary_path" --hhsearch_binary_path="$hhsearch_binary_path" --jackhmmer_binary_path="$jackhmmer_binary_path" --kalign_binary_path="$kalign_binary_path" --bfd_database_path="$bfd_database_path" --mgnify_database_path="$mgnify_database_path" --template_mmcif_dir="$template_mmcif_dir" --obsolete_pdbs_path="$obsolete_pdbs_path" --pdb70_database_path="$pdb70_database_path" --uniclust30_database_path="$uniclust30_database_path" --uniref90_database_path="$uniref90_database_path" --data_dir="$data_dir" --output_dir="$output_dir" --fasta_paths="$fasta_path" --model_names="$model_names" --max_template_date="$max_template_date" --preset="$preset" --benchmark="$benchmark" --logtostderr
fi
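If you edited the source directly, just run python run_alphafold.py.
If you use the bash script, run it with the commands below: -d is the directory holding the databases, -o is the output directory for the results, -m selects the model (several may be given), -f is the path to the query sequence file, and -t is the cutoff date, so that only database content released up to that date is used.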
# Example run (Uses the GPU with index id 0 as default)
bash run_alphafold.sh -d ./alphafold_data/ -o ./outputs/ -m model_1 -f ./xxx/T100.fasta -t 2020-05-14
# OR for CPU only run
bash run_alphafold.sh -d ./alphafold_data/ -o ./outputs/ -m model_1 -f ./xxx/T100.fasta -t 2020-05-14 -g False
8.View protein 3D structure
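1.Install PyMOL, a viewer for biological macromolecules; for details see: http://pymol.chenzhaoqiang.com/intro/overview.html
I used the sequence that the AlphaFold2 authors use in their Colab notebook under "Single sequence input (no MSA)":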
query_sequence = "GWSTELEKHREELKEFLKKEGITNVEIRIDNGRLEVRVEGGTERLKRFLEELRQKLEKKGYTVDIKIE"
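The run script takes its input as a FASTA file (the -f option), so the query sequence has to be saved in that format first. A minimal sketch follows: `write_fasta` and the record name `query` are hypothetical choices, not names from the source.

```shell
# Write one sequence to a FASTA file: a ">" header line followed by the sequence.
# write_fasta and the record name passed to it are hypothetical choices for this sketch.
write_fasta() {
    name="$1"
    sequence="$2"
    path="$3"
    printf '>%s\n%s\n' "$name" "$sequence" > "$path"
}
```

For example, `write_fasta query "GWSTELEKHREELKEFLKKEGITNVEIRIDNGRLEVRVEGGTERLKRFLEELRQKLEKKGYTVDIKIE" ./query.fasta` creates a file that can be passed to the run script with `-f ./query.fasta`.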
After the run finishes, the outputs directory contains several files; open a file with the .pdb extension in PyMOL to view the predicted 3D structure.
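To locate the structure files without browsing the output directory by hand, they can be listed from the shell. This is a minimal sketch: `list_pdb_files` is a hypothetical helper name, and it only assumes that the predictions are written as .pdb files somewhere under the output directory.

```shell
# Print every .pdb file under the given directory, sorted by path.
# list_pdb_files is a hypothetical helper name used only for this sketch.
list_pdb_files() {
    find "$1" -type f -name '*.pdb' | sort
}
```

`list_pdb_files ./outputs` prints the candidates. Any of them can then be opened with `pymol <file>.pdb`, assuming the pymol command is on `PATH`.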
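To validate the result, look up amino-acid sequences in the PDB database that already have known 3D structures and compare the predictions against them.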