diff --git a/docs/installation.md b/docs/installation.md index de855ed2fbcd988fabe83d4a50a8304630e4c344..5be190e2979f6bf034a006f18beca92361c4bbfa 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -186,7 +186,7 @@ git clone https://github.com/google-deepmind/alphafold3.git ## Obtaining Genetic Databases -This step requires `curl` and `zstd` to be installed on your machine. +This step requires `wget` and `zstd` to be installed on your machine. AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run: @@ -200,20 +200,20 @@ AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run: * [RFam](https://rfam.org/) * [RNACentral](https://rnacentral.org/) -We provide a Python program `fetch_databases.py` that can be used to download -and set up all of these databases. This process takes around 45 minutes when not +We provide a bash script `fetch_databases.sh` that can be used to download and +set up all of these databases. This process takes around 45 minutes when not installing on local SSD. We recommend running the following in a `screen` or `tmux` session as downloading and decompressing the databases takes some time. ```sh cd alphafold3 # Navigate to the directory with cloned AlphaFold 3 repository. -python3 fetch_databases.py --download_destination=<DATABASES_DIR> +./fetch_databases.sh <DB_DIR> ``` This script downloads the databases from a mirror hosted on GCS, with all versions being the same as used in the AlphaFold 3 paper. -:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a +:ledger: **Note: The download directory `<DB_DIR>` should *not* be a subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker build will be slow as the large databases will be copied during the image creation. @@ -221,17 +221,17 @@ creation. :ledger: **Note: The total download size for the full databases is around 252 GB and the total size when unzipped is 630 GB. 
Please make sure you have sufficient hard drive space, bandwidth, and time to download. We recommend using an SSD for -better genetic search performance, and faster runtime of `fetch_databases.py`.** +better genetic search performance.** :ledger: **Note: If the download directory and datasets don't have full read and write permissions, it can cause errors with the MSA tools, with opaque (external) error messages. Please ensure the required permissions are applied, -e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.** +e.g. with the `sudo chmod 755 --recursive <DB_DIR>` command.** Once the script has finished, you should have the following directory structure: ```sh -pdb_2022_09_28_mmcif_files.tar # ~200k PDB mmCIF files in this tar. +mmcif_files/ # Directory containing ~200k PDB mmCIF files. bfd-first_non_consensus_sequences.fasta mgy_clusters_2022_05.fa nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta @@ -242,6 +242,18 @@ uniprot_all_2021_04.fa uniref90_2022_05.fa ``` +Optionally, after the script finishes, you may want to copy databases to an SSD. +You can use these two scripts: + +* `src/alphafold3/scripts/gcp_mount_ssd.sh <SSD_MOUNT_PATH>` formats and mounts + an unmounted GCP SSD drive. It skips formatting if the disk is already + formatted and skips mounting if it is already mounted. The default + `<SSD_MOUNT_PATH>` is `/mnt/disks/ssd`. +* `src/alphafold3/scripts/copy_to_ssd.sh <DB_DIR> <SSD_DB_DIR>` copies as many + files as will fit onto the SSD. The default `<DB_DIR>` is + `$HOME/public_databases` and the default `<SSD_DB_DIR>` is + `/mnt/disks/ssd/public_databases`. 
+ ## Obtaining Model Parameters To request access to the AlphaFold 3 model parameters, please complete @@ -267,7 +279,7 @@ docker run -it \ --volume $HOME/af_input:/root/af_input \ --volume $HOME/af_output:/root/af_output \ --volume <MODEL_PARAMETERS_DIR>:/root/models \ - --volume <DATABASES_DIR>:/root/public_databases \ + --volume <DB_DIR>:/root/public_databases \ --gpus all \ alphafold3 \ python run_alphafold.py \ @@ -280,6 +292,27 @@ docker run -it \ persistent disk, which is slow.** If you want better genetic and template search performance, make sure all databases are placed on a local SSD. +If you have databases on an SSD in `<SSD_DB_DIR>`, you can use it as the first +location to search for databases, with one or more fallback locations, by +specifying `--db_dir` multiple times. + +```sh +docker run -it \ + --volume $HOME/af_input:/root/af_input \ + --volume $HOME/af_output:/root/af_output \ + --volume <MODEL_PARAMETERS_DIR>:/root/models \ + --volume <SSD_DB_DIR>:/root/public_databases \ + --volume <DB_DIR>:/root/public_databases_fallback \ + --gpus all \ + alphafold3 \ + python run_alphafold.py \ + --json_path=/root/af_input/fold_input.json \ + --model_dir=/root/models \ + --db_dir=/root/public_databases \ + --db_dir=/root/public_databases_fallback \ + --output_dir=/root/af_output +``` + If you get an error like the following, make sure the models and data are in the paths (flags named `--volume` above) in the correct locations. 
@@ -346,7 +379,7 @@ singularity exec \ --bind $HOME/af_input:/root/af_input \ --bind $HOME/af_output:/root/af_output \ --bind <MODEL_PARAMETERS_DIR>:/root/models \ - --bind <DATABASES_DIR>:/root/public_databases \ + --bind <DB_DIR>:/root/public_databases \ alphafold3.sif \ python alphafold3/run_alphafold.py \ --json_path=/root/af_input/fold_input.json \ @@ -354,3 +387,21 @@ singularity exec \ --db_dir=/root/public_databases \ --output_dir=/root/af_output ``` + +Or with some databases on SSD in location `<SSD_DB_DIR>`: + +```sh +singularity exec --nv \ + --bind $HOME/af_input:/root/af_input \ + --bind $HOME/af_output:/root/af_output \ + --bind <MODEL_PARAMETERS_DIR>:/root/models \ + --bind <SSD_DB_DIR>:/root/public_databases \ + --bind <DB_DIR>:/root/public_databases_fallback \ + alphafold3.sif \ + python alphafold3/run_alphafold.py \ + --json_path=/root/af_input/fold_input.json \ + --model_dir=/root/models \ + --db_dir=/root/public_databases \ + --db_dir=/root/public_databases_fallback \ + --output_dir=/root/af_output +``` diff --git a/fetch_databases.py b/fetch_databases.py deleted file mode 100644 index b4d47a11f1410bc99ca79e6c5fd9a13202e97e33..0000000000000000000000000000000000000000 --- a/fetch_databases.py +++ /dev/null @@ -1,137 +0,0 @@ -# Copyright 2024 DeepMind Technologies Limited -# -# AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of -# this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ -# -# To request access to the AlphaFold 3 model parameters, follow the process set -# out at https://github.com/google-deepmind/alphafold3. You may only use these -# if received directly from Google. Use is subject to terms of use available at -# https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md - -"""Downloads the AlphaFold v3.0 databases from GCS and decompresses them. - -Curl is used to download the files and Zstandard (zstd) is used to decompress -them. 
Make sure both are installed on your system before running this script. -""" - -import argparse -import concurrent.futures -import functools -import os -import pathlib -import shutil -import subprocess -import sys - - -DATABASE_FILES = ( - 'bfd-first_non_consensus_sequences.fasta.zst', - 'mgy_clusters_2022_05.fa.zst', - 'nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.zst', - 'pdb_2022_09_28_mmcif_files.tar.zst', - 'pdb_seqres_2022_09_28.fasta.zst', - 'rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta.zst', - 'rnacentral_active_seq_id_90_cov_80_linclust.fasta.zst', - 'uniprot_all_2021_04.fa.zst', - 'uniref90_2022_05.fa.zst', -) - -BUCKET_URL = 'https://storage.googleapis.com/alphafold-databases/v3.0' - - -def download_and_decompress( - filename: str, *, bucket_url: str, download_destination: pathlib.Path -) -> None: - """Downloads and decompresses a ztsd-compressed file.""" - print( - f'STARTING download {filename} from {bucket_url} to' - f' {download_destination}' - ) - # Continue (`continue-at -`) for resumability of a partially downloaded file. - # --progress-bar is used to show some progress in the terminal. - # tr '\r' '\n' is used to remove the \r characters which are used by curl to - # updated the progress bar, which can be confusing when multiple calls are - # made at once. - subprocess.run( - args=( - 'curl', - '--progress-bar', - *('--continue-at', '-'), - *('--output', str(download_destination / filename)), - f'{bucket_url}/{filename}', - *('--stderr', '/dev/stdout'), - ), - check=True, - stdout=subprocess.PIPE, - stderr=subprocess.PIPE, - # Same as text=True in Python 3.7+, used for backwards compatibility. - universal_newlines=True, - ) - print( - f'FINISHED downloading {filename} from {bucket_url} to' - f' {download_destination}.' - ) - - print(f'STARTING decompressing of {filename}') - - # The original compressed file is kept so that if it is interupted it can be - # resumed, skipping the need to download the file again. 
- subprocess.run( - ['zstd', '--decompress', '--force', f'{download_destination}/{filename}'], - check=True, - ) - print(f'FINISHED decompressing of {filename}') - - -def main(argv=('',)) -> None: - """Main function.""" - parser = argparse.ArgumentParser(description='Downloads AlphaFold databases.') - parser.add_argument( - '--download_destination', - default='/srv/alphafold3_data/public_databases', - help='The directory to download the databases to.', - ) - args = parser.parse_args(argv) - - destination = pathlib.Path(os.path.expanduser(args.download_destination)) - - print(f'Downloading all data to: {str(destination)}') - try: - destination.mkdir(exist_ok=True) - if not os.access(destination, os.W_OK): - raise PermissionError() - except PermissionError as e: - raise PermissionError( - 'You do not have write permissions to the destination directory' - f' {destination}.' - ) from e - - if shutil.which('curl') is None: - raise ValueError('curl is not installed. Please install it and try again.') - if shutil.which('zstd') is None: - raise ValueError('zstd is not installed. Please install it and try again.') - - # Download each of the files and decompress them in parallel. - with concurrent.futures.ThreadPoolExecutor( - max_workers=len(DATABASE_FILES) - ) as pool: - any( - pool.map( - functools.partial( - download_and_decompress, - bucket_url=BUCKET_URL, - download_destination=destination, - ), - DATABASE_FILES, - ) - ) - - # Delete all zstd files at the end (after successfully decompressing them). 
- for filename in DATABASE_FILES: - os.remove(destination / filename) - - print('All databases have been downloaded and decompressed.') - - -if __name__ == '__main__': - main(sys.argv[1:]) diff --git a/fetch_databases.sh b/fetch_databases.sh new file mode 100644 index 0000000000000000000000000000000000000000..d2c4ad43b241e43964160ddd57da03eb84286ff9 --- /dev/null +++ b/fetch_databases.sh @@ -0,0 +1,45 @@ +#!/bin/bash +# Copyright 2024 DeepMind Technologies Limited +# +# AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of +# this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ +# +# To request access to the AlphaFold 3 model parameters, follow the process set +# out at https://github.com/google-deepmind/alphafold3. You may only use these +# if received directly from Google. Use is subject to terms of use available at +# https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md + +set -euo pipefail + +readonly db_dir=${1:-$HOME/public_databases} + +for cmd in wget tar zstd ; do + if ! command -v "${cmd}" > /dev/null 2>&1; then + echo "${cmd} is not installed. Please install it." 
+ fi +done + +echo "Fetching databases to ${db_dir}" +mkdir -p "${db_dir}" + +readonly SOURCE=https://storage.googleapis.com/alphafold-databases/v3.0 + +echo "Start Fetching and Untarring 'pdb_2022_09_28_mmcif_files.tar'" +wget --quiet --output-document=- \ + "${SOURCE}/pdb_2022_09_28_mmcif_files.tar.zst" | \ + tar --use-compress-program=zstd -xf - --directory="${db_dir}" & + +for NAME in mgy_clusters_2022_05.fa \ + bfd-first_non_consensus_sequences.fasta \ + uniref90_2022_05.fa uniprot_all_2021_04.fa \ + pdb_seqres_2022_09_28.fasta \ + rnacentral_active_seq_id_90_cov_80_linclust.fasta \ + nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \ + rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta ; do + echo "Start Fetching '${NAME}'" + wget --quiet --output-document=- "${SOURCE}/${NAME}.zst" | \ + zstd --decompress > "${db_dir}/${NAME}" & +done + +wait +echo "Complete" diff --git a/run_alphafold.py b/run_alphafold.py index 69a1a988d2fee5ae12861d685c9cd5af28c7bf72..468113476eba47fb2e28876651112b4bfc14beb7 100644 --- a/run_alphafold.py +++ b/run_alphafold.py @@ -137,11 +137,13 @@ _HMMBUILD_BINARY_PATH = flags.DEFINE_string( ) # Database paths. -_DB_DIR = flags.DEFINE_string( +_DB_DIR = flags.DEFINE_multi_string( 'db_dir', - DEFAULT_DB_DIR.as_posix(), - 'Path to the directory containing the databases.', + (DEFAULT_DB_DIR.as_posix(),), + 'Path to the directory containing the databases. Can be specified multiple' + ' times to search multiple directories in order.', ) + _SMALL_BFD_DATABASE_PATH = flags.DEFINE_string( 'small_bfd_database_path', '${DB_DIR}/bfd-first_non_consensus_sequences.fasta', @@ -180,7 +182,7 @@ _RNA_CENTRAL_DATABASE_PATH = flags.DEFINE_string( ) _PDB_DATABASE_PATH = flags.DEFINE_string( 'pdb_database_path', - '${DB_DIR}/pdb_2022_09_28_mmcif_files.tar', + '${DB_DIR}/mmcif_files', 'PDB database directory with mmCIF files path, used for template search.', ) _SEQRES_DATABASE_PATH = flags.DEFINE_string( @@ -480,6 +482,22 @@ def process_fold_input( ... 
+def replace_db_dir(path_with_db_dir: str, db_dirs: Sequence[str]) -> str: + """Replaces the DB_DIR placeholder in a path with the given DB_DIR.""" + template = string.Template(path_with_db_dir) + if 'DB_DIR' in template.get_identifiers(): + for db_dir in db_dirs: + path = template.substitute(DB_DIR=db_dir) + if os.path.exists(path): + return path + raise FileNotFoundError( + f'{path_with_db_dir} with ${{DB_DIR}} not found in any of {db_dirs}.' + ) + if not os.path.exists(path_with_db_dir): + raise FileNotFoundError(f'{path_with_db_dir} does not exist.') + return path_with_db_dir + + def process_fold_input( fold_input: folding_input.Input, data_pipeline_config: pipeline.DataPipelineConfig | None, @@ -606,28 +624,24 @@ def main(_): print('\n'.join(notice)) if _RUN_DATA_PIPELINE.value: - replace_db_dir = lambda x: string.Template(x).substitute( - DB_DIR=_DB_DIR.value - ) + expand_path = lambda x: replace_db_dir(x, _DB_DIR.value) data_pipeline_config = pipeline.DataPipelineConfig( jackhmmer_binary_path=_JACKHMMER_BINARY_PATH.value, nhmmer_binary_path=_NHMMER_BINARY_PATH.value, hmmalign_binary_path=_HMMALIGN_BINARY_PATH.value, hmmsearch_binary_path=_HMMSEARCH_BINARY_PATH.value, hmmbuild_binary_path=_HMMBUILD_BINARY_PATH.value, - small_bfd_database_path=replace_db_dir(_SMALL_BFD_DATABASE_PATH.value), - mgnify_database_path=replace_db_dir(_MGNIFY_DATABASE_PATH.value), - uniprot_cluster_annot_database_path=replace_db_dir( + small_bfd_database_path=expand_path(_SMALL_BFD_DATABASE_PATH.value), + mgnify_database_path=expand_path(_MGNIFY_DATABASE_PATH.value), + uniprot_cluster_annot_database_path=expand_path( _UNIPROT_CLUSTER_ANNOT_DATABASE_PATH.value ), - uniref90_database_path=replace_db_dir(_UNIREF90_DATABASE_PATH.value), - ntrna_database_path=replace_db_dir(_NTRNA_DATABASE_PATH.value), - rfam_database_path=replace_db_dir(_RFAM_DATABASE_PATH.value), - rna_central_database_path=replace_db_dir( - _RNA_CENTRAL_DATABASE_PATH.value - ), - 
pdb_database_path=replace_db_dir(_PDB_DATABASE_PATH.value), - seqres_database_path=replace_db_dir(_SEQRES_DATABASE_PATH.value), + uniref90_database_path=expand_path(_UNIREF90_DATABASE_PATH.value), + ntrna_database_path=expand_path(_NTRNA_DATABASE_PATH.value), + rfam_database_path=expand_path(_RFAM_DATABASE_PATH.value), + rna_central_database_path=expand_path(_RNA_CENTRAL_DATABASE_PATH.value), + pdb_database_path=expand_path(_PDB_DATABASE_PATH.value), + seqres_database_path=expand_path(_SEQRES_DATABASE_PATH.value), jackhmmer_n_cpu=_JACKHMMER_N_CPU.value, nhmmer_n_cpu=_NHMMER_N_CPU.value, ) diff --git a/run_alphafold_test.py b/run_alphafold_test.py index 746405bae489b7ad4a6fb863df93ef7a91a863e3..b279b9e07c7298c52ba4dadd78dea5079d5ba335 100644 --- a/run_alphafold_test.py +++ b/run_alphafold_test.py @@ -461,6 +461,30 @@ class InferenceTest(test_utils.StructureTestCase): [1.0] * actual_inf.predicted_structure.num_atoms, ) + @parameterized.product(num_db_dirs=tuple(range(1, 3))) + def test_replace_db_dir(self, num_db_dirs: int) -> None: + """Test that the db_dir is replaced correctly.""" + db_dirs = [pathlib.Path(self.create_tempdir()) for _ in range(num_db_dirs)] + db_dirs_posix = [db_dir.as_posix() for db_dir in db_dirs] + + for i, db_dir in enumerate(db_dirs): + for j in range(i + 1): + (db_dir / f'filename{j}.txt').write_text(f'hello world {i}') + + for i in range(num_db_dirs): + self.assertEqual( + pathlib.Path( + run_alphafold.replace_db_dir( + f'${{DB_DIR}}/filename{i}.txt', db_dirs_posix + ) + ).read_text(), + f'hello world {i}', + ) + with self.assertRaises(FileNotFoundError): + run_alphafold.replace_db_dir( + f'${{DB_DIR}}/filename{num_db_dirs}.txt', db_dirs_posix + ) + if __name__ == '__main__': absltest.main() diff --git a/src/alphafold3/scripts/copy_to_ssd.sh b/src/alphafold3/scripts/copy_to_ssd.sh new file mode 100755 index 0000000000000000000000000000000000000000..9d53716db1122d2a733a5aceb2817175246c54f2 --- /dev/null +++ 
b/src/alphafold3/scripts/copy_to_ssd.sh @@ -0,0 +1,54 @@ +#!/bin/bash +# Copyright 2024 DeepMind Technologies Limited +# +# AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of +# this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ +# +# To request access to the AlphaFold 3 model parameters, follow the process set +# out at https://github.com/google-deepmind/alphafold3. You may only use these +# if received directly from Google. Use is subject to terms of use available at +# https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md + +set -euo pipefail + +readonly SOURCE_DIR=${1:-$HOME/public_databases} +readonly TARGET_DIR=${2:-/mnt/disks/ssd/public_databases} + +mkdir -p "${TARGET_DIR}" + +FILES=(pdb_seqres_2022_09_28.fasta \ + uniprot_all_2021_04.fa \ + mgy_clusters_2022_05.fa \ + uniref90_2022_05.fa \ + bfd-first_non_consensus_sequences.fasta \ + rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \ + nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \ + rnacentral_active_seq_id_90_cov_80_linclust.fasta) + +NOT_COPIED_FILES=() + +while (( ${#FILES[@]} )); do + # Get total size of files to copy in bytes + SOURCE_FILES=( "${FILES[@]/#/${SOURCE_DIR}/}" ) + TOTAL_SIZE=$(du -sbc "${SOURCE_FILES[@]}" | awk 'END{print $1}') + + # Get available space on target drive in bytes + AVAILABLE_SPACE=$(df --portability --block-size=1 "$TARGET_DIR" | awk 'END{print $4}') + + # Compare sizes and copy if enough space + if (( TOTAL_SIZE <= AVAILABLE_SPACE )); then + printf 'Copying files... 
%s\n' "${FILES[@]}" + echo "From ${SOURCE_DIR} -> ${TARGET_DIR}" + + for file in "${FILES[@]}"; do + cp -r "${SOURCE_DIR}/${file}" "${TARGET_DIR}/" & + done + break + else + NOT_COPIED_FILES+=("${FILES[-1]}") + unset 'FILES[-1]' + fi +done + +printf 'No room left on ssd for: %s\n' "${NOT_COPIED_FILES[@]}" +wait diff --git a/src/alphafold3/scripts/gcp_mount_ssd.sh b/src/alphafold3/scripts/gcp_mount_ssd.sh new file mode 100755 index 0000000000000000000000000000000000000000..18f97888adec1005e347915ba3f6336cbbdaa3fe --- /dev/null +++ b/src/alphafold3/scripts/gcp_mount_ssd.sh @@ -0,0 +1,47 @@ +#!/bin/bash +# Copyright 2024 DeepMind Technologies Limited +# +# AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of +# this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ +# +# To request access to the AlphaFold 3 model parameters, follow the process set +# out at https://github.com/google-deepmind/alphafold3. You may only use these +# if received directly from Google. Use is subject to terms of use available at +# https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md + +set -euo pipefail + +readonly MOUNT_DIR="${1:-/mnt/disks/ssd}" + +if [[ -d "${MOUNT_DIR}" ]]; then + echo "Mount directory ${MOUNT_DIR} already exists, skipping" + exit 0 +fi + +for SSD_DISK in $(realpath "$(find /dev/disk/by-id/ | grep google-local)") +do + # Check if the disk is already formatted + if ! blkid -o value -s TYPE "${SSD_DISK}" > /dev/null 2>&1; then + echo "Disk ${SSD_DISK} is not formatted, format it." + mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard "${SSD_DISK}" || continue + fi + + # Check if the disk is already mounted + if grep -qs "^${SSD_DISK} " /proc/mounts; then + grep -s "^${SSD_DISK} " /proc/mounts + echo "Disk ${SSD_DISK} is already mounted, skip it." 
+ continue + fi + + # Disk is not mounted, mount it + echo "Mounting ${SSD_DISK} to ${MOUNT_DIR}" + mkdir -p "${MOUNT_DIR}" + chmod -R 777 "${MOUNT_DIR}" + mount "${SSD_DISK}" "${MOUNT_DIR}" + break +done + +if [[ ! -d "${MOUNT_DIR}" ]]; then + echo "No unmounted SSD disks found" + exit 1 +fi
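
Reviewer note: the ordered lookup that `--db_dir` now performs can be tried in isolation. The following is a minimal sketch of that behaviour, not the code in `run_alphafold.py`. The helper name `resolve_with_fallback`, the `demo` function, and the temporary directories standing in for `<SSD_DB_DIR>` and `<DB_DIR>` are assumptions made for illustration. Unlike `replace_db_dir` in this change, the sketch does not special-case paths without a `${DB_DIR}` placeholder.

```python
import os
import string
import tempfile
from collections.abc import Sequence


def resolve_with_fallback(path_template: str, db_dirs: Sequence[str]) -> str:
    """Returns the first existing path after substituting ${DB_DIR} in order.

    Hypothetical helper: mirrors the idea of trying each --db_dir value in
    the order given and taking the first directory that contains the file.
    """
    template = string.Template(path_template)
    for db_dir in db_dirs:
        candidate = template.safe_substitute(DB_DIR=db_dir)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(
        f'{path_template} not found in any of {list(db_dirs)}.'
    )


def demo() -> dict[str, str]:
    """Simulates an SSD that holds only one of two databases."""
    with tempfile.TemporaryDirectory() as ssd_dir, \
            tempfile.TemporaryDirectory() as disk_dir:
        # The slower disk holds both files; only one was copied to the "SSD".
        layout = (
            (disk_dir, 'uniref90_2022_05.fa'),
            (disk_dir, 'mgy_clusters_2022_05.fa'),
            (ssd_dir, 'uniref90_2022_05.fa'),
        )
        for directory, name in layout:
            with open(os.path.join(directory, name), 'w') as f:
                f.write('>placeholder\nACGT\n')

        locations = {}
        for name in ('uniref90_2022_05.fa', 'mgy_clusters_2022_05.fa'):
            path = resolve_with_fallback(
                '${DB_DIR}/' + name, [ssd_dir, disk_dir]
            )
            # Record which directory the lookup settled on.
            on_ssd = os.path.dirname(path) == ssd_dir
            locations[name] = 'ssd' if on_ssd else 'fallback'
        return locations


if __name__ == '__main__':
    for name, location in demo().items():
        print(f'{name}: {location}')
```

Listing the SSD directory first means any database that fit on the SSD is read from there, while the rest are still found on the slower disk without further configuration.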