Cloned AlphaFold 3 repo into this one

Merged. Fortune Iriaye requested to merge `clone-alphafold3-github` into `main`.
7 files changed: +263 −165
@@ -186,7 +186,7 @@ git clone https://github.com/google-deepmind/alphafold3.git
## Obtaining Genetic Databases
-This step requires `curl` and `zstd` to be installed on your machine.
+This step requires `wget` and `zstd` to be installed on your machine.
AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
@@ -200,20 +200,20 @@ AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
* [RFam](https://rfam.org/)
* [RNACentral](https://rnacentral.org/)
-We provide a Python program `fetch_databases.py` that can be used to download
-and set up all of these databases. This process takes around 45 minutes when not
+We provide a bash script `fetch_databases.sh` that can be used to download and
+set up all of these databases. This process takes around 45 minutes when not
installing on local SSD. We recommend running the following in a `screen` or
`tmux` session as downloading and decompressing the databases takes some time.
```sh
cd alphafold3 # Navigate to the directory with cloned AlphaFold 3 repository.
-python3 fetch_databases.py --download_destination=<DATABASES_DIR>
+./fetch_databases.sh <DB_DIR>
```
This script downloads the databases from a mirror hosted on GCS, with all
versions being the same as used in the AlphaFold 3 paper.
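For example, to keep the download running even if your connection drops, you can wrap it in a `tmux` session, as recommended above (standard `tmux` usage; the session name `af3_databases` is just an illustration):

```sh
tmux new -s af3_databases       # start a named session
./fetch_databases.sh <DB_DIR>   # run the download inside it
# Detach with Ctrl-b d; reattach later with: tmux attach -t af3_databases
```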
-:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a
+:ledger: **Note: The download directory `<DB_DIR>` should *not* be a
subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker
build will be slow as the large databases will be copied during the image
creation.
@@ -221,17 +221,17 @@ creation.
:ledger: **Note: The total download size for the full databases is around 252 GB
and the total size when unzipped is 630 GB. Please make sure you have sufficient
hard drive space, bandwidth, and time to download. We recommend using an SSD for
-better genetic search performance, and faster runtime of `fetch_databases.py`.**
+better genetic search performance.**
:ledger: **Note: If the download directory and datasets don't have full read and
write permissions, it can cause errors with the MSA tools, with opaque
(external) error messages. Please ensure the required permissions are applied,
-e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.**
+e.g. with the `sudo chmod 755 --recursive <DB_DIR>` command.**
Once the script has finished, you should have the following directory structure:
```sh
-pdb_2022_09_28_mmcif_files.tar # ~200k PDB mmCIF files in this tar.
+mmcif_files/ # Directory containing ~200k PDB mmCIF files.
bfd-first_non_consensus_sequences.fasta
mgy_clusters_2022_05.fa
nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
@@ -242,6 +242,18 @@ uniprot_all_2021_04.fa
uniref90_2022_05.fa
```
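Once the expected files are in place, a quick way to confirm the download unpacked fully (the ~630 GB figure is from the size note above; exact `du` output varies by filesystem):

```sh
du -sh <DB_DIR>   # expect roughly 630 GB once everything is unzipped
ls <DB_DIR>       # should list the files shown above
```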
+Optionally, after the script finishes, you may want to copy the databases to an
+SSD. You can use these two scripts (a usage sketch follows the list):
+
+* `src/scripts/gcp_mount_ssd.sh <SSD_MOUNT_PATH>` Mounts and formats an
+  unmounted GCP SSD drive. It skips either step if the disk is already
+  formatted or already mounted. The default `<SSD_MOUNT_PATH>` is
+  `/mnt/disks/ssd`.
+* `src/scripts/copy_to_ssd.sh <DB_DIR> <SSD_DB_DIR>` Copies as many files as it
+  can fit onto the SSD. The default `<DB_DIR>` is `$HOME/public_databases` and
+  the default `<SSD_DB_DIR>` is `/mnt/disks/ssd/public_databases`.
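A minimal usage sketch of the two scripts above, run from the repository root on a GCP VM and assuming the default paths listed:

```sh
# Mount and format the local GCP SSD at the default mount point.
./src/scripts/gcp_mount_ssd.sh /mnt/disks/ssd

# Copy as many of the downloaded databases as will fit onto the SSD.
./src/scripts/copy_to_ssd.sh $HOME/public_databases /mnt/disks/ssd/public_databases
```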
## Obtaining Model Parameters
To request access to the AlphaFold 3 model parameters, please complete
@@ -267,7 +279,7 @@ docker run -it \
--volume $HOME/af_input:/root/af_input \
--volume $HOME/af_output:/root/af_output \
--volume <MODEL_PARAMETERS_DIR>:/root/models \
---volume <DATABASES_DIR>:/root/public_databases \
+--volume <DB_DIR>:/root/public_databases \
--gpus all \
alphafold3 \
python run_alphafold.py \
@@ -280,6 +292,27 @@ docker run -it \
persistent disk, which is slow.** If you want better genetic and template search
performance, make sure all databases are placed on a local SSD.
+If you have databases on an SSD in `<SSD_DB_DIR>`, you can use it as the
+primary location to look up databases, while allowing for multiple fallbacks
+with `--db_dir`, which can be specified multiple times:
+```sh
+docker run -it \
+--volume $HOME/af_input:/root/af_input \
+--volume $HOME/af_output:/root/af_output \
+--volume <MODEL_PARAMETERS_DIR>:/root/models \
+--volume <SSD_DB_DIR>:/root/public_databases \
+--volume <DB_DIR>:/root/public_databases_fallback \
+--gpus all \
+alphafold3 \
+python run_alphafold.py \
+--json_path=/root/af_input/fold_input.json \
+--model_dir=/root/models \
+--db_dir=/root/public_databases \
+--db_dir=/root/public_databases_fallback \
+--output_dir=/root/af_output
+```
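Because `copy_to_ssd.sh` only copies what fits, it can be worth checking which databases actually made it onto the SSD before picking the mounts above (host-side paths are the defaults used earlier):

```sh
ls /mnt/disks/ssd/public_databases   # mounted as /root/public_databases
ls $HOME/public_databases            # mounted as /root/public_databases_fallback
```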
If you get an error like the following, make sure the models and databases are
mounted in the correct locations (via the `--volume` flags above).
@@ -346,7 +379,7 @@ singularity exec \
--bind $HOME/af_input:/root/af_input \
--bind $HOME/af_output:/root/af_output \
--bind <MODEL_PARAMETERS_DIR>:/root/models \
---bind <DATABASES_DIR>:/root/public_databases \
+--bind <DB_DIR>:/root/public_databases \
alphafold3.sif \
python alphafold3/run_alphafold.py \
--json_path=/root/af_input/fold_input.json \
@@ -354,3 +387,21 @@ singularity exec \
--db_dir=/root/public_databases \
--output_dir=/root/af_output
```
+Or with some databases on SSD in location `<SSD_DB_DIR>`:
+
+```sh
+singularity exec \
+--nv \
+--bind $HOME/af_input:/root/af_input \
+--bind $HOME/af_output:/root/af_output \
+--bind <MODEL_PARAMETERS_DIR>:/root/models \
+--bind <SSD_DB_DIR>:/root/public_databases \
+--bind <DB_DIR>:/root/public_databases_fallback \
+alphafold3.sif \
+python alphafold3/run_alphafold.py \
+--json_path=/root/af_input/fold_input.json \
+--model_dir=/root/models \
+--db_dir=/root/public_databases \
+--db_dir=/root/public_databases_fallback \
+--output_dir=/root/af_output
+```