Colabfold 1.5.2
- Source : https://github.com/sokrypton/ColabFold
- release version: 1.5.2
- first version with model AlphaFold2-multimer-v3
Colabfold Inputs:
To run ColabFold, use the colabfold_batch command with several arguments.
- to get a full list of arguments, type colabfold_batch -h (default values have been added here):
usage: colabfold_batch [-h] [--stop-at-score STOP_AT_SCORE]
[--num-recycle NUM_RECYCLE]
[--recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE]
[--num-ensemble NUM_ENSEMBLE] [--num-seeds NUM_SEEDS]
[--random-seed RANDOM_SEED] [--num-models {1,2,3,4,5}]
[--recompile-padding RECOMPILE_PADDING]
[--model-order MODEL_ORDER] [--host-url HOST_URL]
[--data DATA]
[--msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}]
[--model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}]
[--amber] [--num-relax NUM_RELAX] [--templates]
[--custom-template-path CUSTOM_TEMPLATE_PATH]
[--rank {auto,plddt,ptm,iptm,multimer}]
[--pair-mode {unpaired,paired,unpaired_paired}]
[--sort-queries-by {none,length,random}]
[--save-single-representations]
[--save-pair-representations] [--use-dropout]
[--max-seq MAX_SEQ] [--max-extra-seq MAX_EXTRA_SEQ]
[--max-msa MAX_MSA] [--disable-cluster-profile] [--zip]
[--use-gpu-relax] [--save-all] [--save-recycles]
[--overwrite-existing-results]
[--disable-unified-memory]
input results
positional arguments:
input Can be one of the following: Directory with fasta/a3m
files, a csv/tsv file, a fasta file or an a3m file
results Directory to write the results to
optional arguments:
-h, --help show this help message and exit
--stop-at-score STOP_AT_SCORE
Compute models until plddt (single chain) or ptmscore
(complex) > threshold is reached. This can make
colabfold much faster by only running the first model
for easy queries.
--num-recycle NUM_RECYCLE
Number of prediction recycles.Increasing recycles can
improve the quality but slows down the prediction.
--recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE
Specify convergence criteria.Run until the distance
between recycles is within specified value.
--num-ensemble NUM_ENSEMBLE
Number of ensembles.The trunk of the network is run
multiple times with different random choices for the
MSA cluster centers.
--num-seeds NUM_SEEDS
Number of seeds to try. Will iterate from
range(random_seed, random_seed+num_seeds)..
--random-seed RANDOM_SEED
Changing the seed for the random number generator can
result in different structure predictions.
--num-models {1,2,3,4,5}
--recompile-padding RECOMPILE_PADDING
Whenever the input length changes, the model needs to
be recompiled.We pad sequences by specified length, so
we can e.g. compute sequence from length 100 to 110
without recompiling.The prediction will become
marginally slower for the longer input, but overall
performance increases due to not recompiling. Set to 0
to disable.
--model-order MODEL_ORDER
--host-url HOST_URL
--data DATA
--msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}
Using an a3m file as input overwrites this option
--model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}
predict strucutre/complex using the following
model.Auto will pick "alphafold2_ptm" for structure
predictions and "alphafold2_multimer_v3" for
complexes.
--amber Use amber for structure refinement.To control number
of top ranked structures are relaxed set --num-relax.
--num-relax NUM_RELAX
specify how many of the top ranked structures to relax
using amber.
--templates Use templates from pdb
--custom-template-path CUSTOM_TEMPLATE_PATH
Directory with pdb files to be used as input
--rank {auto,plddt,ptm,iptm,multimer}
rank models by auto, plddt or ptmscore
--pair-mode {unpaired,paired,unpaired_paired}
rank models by auto, unpaired, paired, unpaired_paired
--sort-queries-by {none,length,random}
sort queries by: none, length, random
--save-single-representations
saves the single representation embeddings of all
models
--save-pair-representations
saves the pair representation embeddings of all models
--use-dropout activate dropouts during inference to sample from
uncertainity of the models
--max-seq MAX_SEQ number of sequence clusters to use
--max-extra-seq MAX_EXTRA_SEQ
number of extra sequences to use
--max-msa MAX_MSA defines: `max-seq:max-extra-seq` number of sequences
to use
--disable-cluster-profile
EXPERIMENTAL: for multimer models, disable cluster
profiles
--zip zip all results into one <jobname>.result.zip and
delete the original files
--use-gpu-relax run amber on GPU instead of CPU
--save-all save ALL raw outputs from model to a pickle file
--save-recycles save all intermediate predictions at each recycle
--overwrite-existing-results
--disable-unified-memory
if you are getting tensorflow/jax errors it might help
to disable this
- launch ColabFold using a .csv file as input sequence(s), with a format like this for a tetramer:
id,sequence
Complex,<SEQUENCE>:<SEQUENCE>:<SEQUENCE>:<SEQUENCE>
- replace each <SEQUENCE> with the sequence of one chain (chains are separated by ":"), then run:
$ colabfold_batch test.csv out_dir
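The CSV format above can also be written by a short script instead of by hand. The following is a minimal sketch: the helper join_chains and the two placeholder sequences are hypothetical and only illustrate how the chains of one complex are joined with ":" on a single line.

```shell
#!/bin/sh
# Join all arguments with ':' (runs in a subshell so IFS is not changed globally).
join_chains() (
    IFS=:
    printf '%s' "$*"
)

# Hypothetical placeholder sequences; replace with your own chains.
chain_a="MKTAYIAKQR"
chain_b="GSHMLEDPVA"

# Two copies of each chain give a 4-chain complex, matching the tetramer format.
sequence=$(join_chains "$chain_a" "$chain_a" "$chain_b" "$chain_b")

# Write the header line and one complex entry.
printf 'id,sequence\n%s,%s\n' "Complex" "$sequence" > test.csv
cat test.csv
```

The resulting test.csv can then be passed to colabfold_batch as the input argument.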
IPOPUP cluster job submission:
SRUN submission
To launch ColabFold on the IPOPUP cluster, here is a command-line example. The job can be launched on any queue list holding GPU nodes (rpbs, cmpli or master-bi). In the following example, the srun command is launched on the cmpli queue list, requesting one GPU and 10 CPUs.
$ srun -p cmpli --gres=gpu:1 -c 10 singularity run --bind /shared/banks/alphafold2/2022-12-13/:/root/.cache/colabfold --nv /shared/software/singularity/images/alphafold-colabfold_1.5.2-rpbs.sif colabfold_batch test.csv out_dir --num-seeds 20 --num-recycle 12 --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer --pair-mode unpaired_paired --num-models 5 --use-dropout --save-recycles
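The --num-seeds and --num-models values in the command above largely determine how long the job runs. According to the help text, seeds are taken from range(random_seed, random_seed+num_seeds). The sketch below estimates the seed range and the number of predictions, assuming that --random-seed is left at 0 (an assumption; check colabfold_batch -h), that every seed is run through every model, and that --stop-at-score does not end the run early.

```shell
#!/bin/sh
# Values taken from the srun example; random_seed=0 is an assumed default.
random_seed=0
num_seeds=20
num_models=5

# Seeds iterate over range(random_seed, random_seed + num_seeds).
first_seed=$random_seed
last_seed=$((random_seed + num_seeds - 1))

# Assumption: one prediction per (seed, model) pair.
total=$((num_seeds * num_models))

echo "seeds used: ${first_seed}..${last_seed}"
echo "predictions to compute: ${total}"
```

Under these assumptions the example job computes 100 predictions, so reducing --num-seeds is the simplest way to shorten a test run.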
SBATCH submission
- A .sbatch file needs to be created. To use the same options as previously, you could use:
#!/bin/bash
#SBATCH -p cmpli
#SBATCH -c 10
#SBATCH --gres=gpu:1
#SBATCH -o test.out
#SBATCH -e test.err
#SBATCH --job-name=test
# run colabfold_batch inside the Singularity container
srun singularity run --bind /shared/banks/alphafold2/2022-12-13/:/root/.cache/colabfold --nv /shared/software/singularity/images/alphafold-colabfold_1.5.2-rpbs.sif colabfold_batch test.csv out_dir --num-seeds 20 --num-recycle 12 --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer --pair-mode unpaired_paired --num-models 5 --use-dropout --save-recycles
- Submit the job. With an sbatch file called test.sbatch, launch:
$ sbatch test.sbatch
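When several inputs need the same options, the sbatch file can be generated rather than copied by hand. This is only a sketch: write_sbatch and its three arguments (job name, input file, output directory) are hypothetical helpers, not part of ColabFold or Slurm, and the partition, paths and options are copied from the example above.

```shell
#!/bin/sh
# Hypothetical helper: write <job>.sbatch for one ColabFold run.
write_sbatch() {
    job=$1; input=$2; outdir=$3
    # Unquoted heredoc so that ${job}, ${input} and ${outdir} are expanded.
    cat > "${job}.sbatch" <<EOF
#!/bin/bash
#SBATCH -p cmpli
#SBATCH -c 10
#SBATCH --gres=gpu:1
#SBATCH -o ${job}.out
#SBATCH -e ${job}.err
#SBATCH --job-name=${job}

srun singularity run --bind /shared/banks/alphafold2/2022-12-13/:/root/.cache/colabfold --nv /shared/software/singularity/images/alphafold-colabfold_1.5.2-rpbs.sif colabfold_batch ${input} ${outdir} --num-seeds 20 --num-recycle 12 --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer --pair-mode unpaired_paired --num-models 5 --use-dropout --save-recycles
EOF
}

# Recreate the example job: writes test.sbatch in the current directory.
write_sbatch test test.csv out_dir
```

The generated file is submitted in the same way, with sbatch test.sbatch.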
Running the image locally
Build docker image:
- create your own docker image:
$ git clone -b colabfold_1.5.2 https://gitlab.rpbs.univ-paris-diderot.fr/docker/alphafold/
$ cd alphafold
$ docker build -t colabfold_1.5.2 .
- run docker image:
$ docker run --gpus all --rm -v $(pwd):$(pwd) -w $(pwd) colabfold_1.5.2 colabfold_batch -h
Build singularity image:
- if you have already made a docker image:
$ sudo singularity build colabfold_1.5.2.sif docker-daemon://colabfold_1.5.2:latest
- run singularity
$ singularity run --cleanenv --nv colabfold_1.5.2.sif colabfold_batch -h
Reference:
- Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all. bioRxiv (2021) doi: 10.1101/2021.08.15.456425
- If you’re using AlphaFold, please also cite: Jumper et al. “Highly accurate protein structure prediction with AlphaFold.” Nature (2021) doi: 10.1038/s41586-021-03819-2
- If you’re using AlphaFold-multimer, please also cite: Evans et al. “Protein complex prediction with AlphaFold-Multimer.” bioRxiv (2021) doi: 10.1101/2021.10.04.463034v1