ColabFold 1.5.2

  • Source: https://github.com/sokrypton/ColabFold
  • release version: 1.5.2
  • first version with model AlphaFold2-multimer-v3
  • DOI

Colabfold Inputs:

To run ColabFold, use the colabfold_batch command with the arguments described here.

  • to get the full list of arguments, type colabfold_batch -h (default values have been added here):
usage: colabfold_batch [-h] [--stop-at-score STOP_AT_SCORE]
                       [--num-recycle NUM_RECYCLE]
                       [--recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE]
                       [--num-ensemble NUM_ENSEMBLE] [--num-seeds NUM_SEEDS]
                       [--random-seed RANDOM_SEED] [--num-models {1,2,3,4,5}]
                       [--recompile-padding RECOMPILE_PADDING]
                       [--model-order MODEL_ORDER] [--host-url HOST_URL]
                       [--data DATA]
                       [--msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}]
                       [--model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}]
                       [--amber] [--num-relax NUM_RELAX] [--templates]
                       [--custom-template-path CUSTOM_TEMPLATE_PATH]
                       [--rank {auto,plddt,ptm,iptm,multimer}]
                       [--pair-mode {unpaired,paired,unpaired_paired}]
                       [--sort-queries-by {none,length,random}]
                       [--save-single-representations]
                       [--save-pair-representations] [--use-dropout]
                       [--max-seq MAX_SEQ] [--max-extra-seq MAX_EXTRA_SEQ]
                       [--max-msa MAX_MSA] [--disable-cluster-profile] [--zip]
                       [--use-gpu-relax] [--save-all] [--save-recycles]
                       [--overwrite-existing-results]
                       [--disable-unified-memory]
                       input results

positional arguments:
  input                 Can be one of the following: Directory with fasta/a3m
                        files, a csv/tsv file, a fasta file or an a3m file
  results               Directory to write the results to

optional arguments:
  -h, --help            show this help message and exit
  --stop-at-score STOP_AT_SCORE
                        Compute models until plddt (single chain) or ptmscore
                        (complex) > threshold is reached. This can make
                        colabfold much faster by only running the first model
                        for easy queries.
  --num-recycle NUM_RECYCLE
                        Number of prediction recycles.Increasing recycles can
                        improve the quality but slows down the prediction.
  --recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE
                        Specify convergence criteria.Run until the distance
                        between recycles is within specified value.
  --num-ensemble NUM_ENSEMBLE
                        Number of ensembles.The trunk of the network is run
                        multiple times with different random choices for the
                        MSA cluster centers.
  --num-seeds NUM_SEEDS
                        Number of seeds to try. Will iterate from
                        range(random_seed, random_seed+num_seeds)..
  --random-seed RANDOM_SEED
                        Changing the seed for the random number generator can
                        result in different structure predictions.
  --num-models {1,2,3,4,5}
  --recompile-padding RECOMPILE_PADDING
                        Whenever the input length changes, the model needs to
                        be recompiled.We pad sequences by specified length, so
                        we can e.g. compute sequence from length 100 to 110
                        without recompiling.The prediction will become
                        marginally slower for the longer input, but overall
                        performance increases due to not recompiling. Set to 0
                        to disable.
  --model-order MODEL_ORDER
  --host-url HOST_URL
  --data DATA
  --msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}
                        Using an a3m file as input overwrites this option
  --model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}
                        predict strucutre/complex using the following
                        model.Auto will pick "alphafold2_ptm" for structure
                        predictions and "alphafold2_multimer_v3" for
                        complexes.
  --amber               Use amber for structure refinement.To control number
                        of top ranked structures are relaxed set --num-relax.
  --num-relax NUM_RELAX
                        specify how many of the top ranked structures to relax
                        using amber.
  --templates           Use templates from pdb
  --custom-template-path CUSTOM_TEMPLATE_PATH
                        Directory with pdb files to be used as input
  --rank {auto,plddt,ptm,iptm,multimer}
                        rank models by auto, plddt or ptmscore
  --pair-mode {unpaired,paired,unpaired_paired}
                        rank models by auto, unpaired, paired, unpaired_paired
  --sort-queries-by {none,length,random}
                        sort queries by: none, length, random
  --save-single-representations
                        saves the single representation embeddings of all
                        models
  --save-pair-representations
                        saves the pair representation embeddings of all models
  --use-dropout         activate dropouts during inference to sample from
                        uncertainity of the models
  --max-seq MAX_SEQ     number of sequence clusters to use
  --max-extra-seq MAX_EXTRA_SEQ
                        number of extra sequences to use
  --max-msa MAX_MSA     defines: `max-seq:max-extra-seq` number of sequences
                        to use
  --disable-cluster-profile
                        EXPERIMENTAL: for multimer models, disable cluster
                        profiles
  --zip                 zip all results into one <jobname>.result.zip and
                        delete the original files
  --use-gpu-relax       run amber on GPU instead of CPU
  --save-all            save ALL raw outputs from model to a pickle file
  --save-recycles       save all intermediate predictions at each recycle
  --overwrite-existing-results
  --disable-unified-memory
                        if you are getting tensorflow/jax errors it might help
                        to disable this
  • launch ColabFold with a .csv file as input. For a complex, the chain sequences are joined with colons; a tetramer looks like this:
id,sequence
Complex,<SEQUENCE>:<SEQUENCE>:<SEQUENCE>:<SEQUENCE>
  • replace each <SEQUENCE> with the sequence of one chain, then run:
$ colabfold_batch test.csv out_dir
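
The CSV layout shown above can also be written from the shell. This is only a minimal sketch for a tetramer made of two copies each of two chains; the sequences and the variable names chain_a and chain_b are hypothetical placeholders, to be replaced with your own.

```shell
# Hypothetical placeholder sequences (not real proteins); use your own chains.
chain_a="MKTAYIAKQR"
chain_b="GSHMLEDPVD"

# Chains of one complex are joined with ':' in the sequence column.
sequence="${chain_a}:${chain_a}:${chain_b}:${chain_b}"

# Write the header line and one query named "Complex" to test.csv.
printf 'id,sequence\n%s,%s\n' "Complex" "$sequence" > test.csv

cat test.csv
```

The resulting test.csv can then be passed to colabfold_batch as the input argument.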

IPOPUP cluster job submission:

SRUN submission

  • To launch ColabFold on the IPOPUP cluster, use a command line like the example shown here. The job can be launched on any queue (Slurm partition) that holds GPU nodes: rpbs, cmpli or master-bi.

  • In the following example, the srun command is launched on the cmpli queue and requests one GPU and 10 CPUs.

$ srun -p cmpli --gres=gpu:1 -c 10 singularity run --bind /shared/banks/alphafold2/2022-12-13/:/root/.cache/colabfold --nv /shared/software/singularity/images/alphafold-colabfold_1.5.2-rpbs.sif colabfold_batch test.csv out_dir --num-seeds 20 --num-recycle 12 --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer --pair-mode unpaired_paired --num-models 5 --use-dropout --save-recycles
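
As a rough sizing aid for the options above, the number of predictions per query can be estimated before submitting. This sketch assumes that ColabFold runs each selected model once per seed, so the count is the product of --num-seeds and --num-models; the variable names are illustrative only.

```shell
# Values copied from the srun example above.
num_seeds=20    # --num-seeds
num_models=5    # --num-models

# Assumption: every selected model is run once for every seed.
total=$((num_seeds * num_models))

echo "predictions per query: $total"
```

With --save-recycles, each of these predictions also writes its intermediate structures, so the output directory can grow quickly for large values of --num-seeds and --num-recycle.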

SBATCH submission

  • A .sbatch file needs to be created. To use the same options as in the srun example, you could use:
#!/bin/bash

#SBATCH -p cmpli
#SBATCH -c 10
#SBATCH --gres=gpu:1
#SBATCH -o test.out
#SBATCH -e test.err
#SBATCH --job-name=test

# run colabfold_batch inside the Singularity image
srun singularity run --bind /shared/banks/alphafold2/2022-12-13/:/root/.cache/colabfold --nv /shared/software/singularity/images/alphafold-colabfold_1.5.2-rpbs.sif colabfold_batch test.csv out_dir --num-seeds 20 --num-recycle 12 --msa-mode mmseqs2_uniref_env --model-type alphafold2_multimer_v3 --rank multimer --pair-mode unpaired_paired --num-models 5 --use-dropout --save-recycles
  • To submit the job with a sbatch file called test.sbatch, run:
$ sbatch test.sbatch

Running the image locally

Build docker image:

  • create your own docker image:
$ git clone -b colabfold_1.5.2 https://gitlab.rpbs.univ-paris-diderot.fr/docker/alphafold/
$ cd alphafold
$ docker build -t colabfold_1.5.2 .
  • run docker image:
$ docker run --gpus all --rm -v $(pwd):$(pwd) -w $(pwd) colabfold_1.5.2 colabfold_batch -h

Build singularity image:

  • if you have already made a docker image:
$ sudo singularity build colabfold_1.5.2.sif docker-daemon://colabfold_1.5.2:latest
  • run singularity
$ singularity run --cleanenv --nv  colabfold_1.5.2.sif colabfold_batch -h

Reference:

  • Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold - Making protein folding accessible to all.
    bioRxiv (2021) doi: 10.1101/2021.08.15.456425
  • If you’re using AlphaFold, please also cite:
    Jumper et al. “Highly accurate protein structure prediction with AlphaFold.”
    Nature (2021) doi: 10.1038/s41586-021-03819-2
  • If you’re using AlphaFold-multimer, please also cite:
    Evans et al. “Protein complex prediction with AlphaFold-Multimer.”
    bioRxiv (2021) doi: 10.1101/2021.10.04.463034