# Data Policies and Standards

[TOC]

## Locations

The following table describes where each type of CGDS data or file should be stored:

| Data Type | Examples | Storage | Location |
| --- | --- | --- | --- |
| Anything containing PHI | Pictures, Medical Records, Clinical Descriptions | Box | <https://uab.app.box.com/> |
| Project analysis files not used for computation | IRB Documents, project charter, deidentified data | Box | CGDS/Projects/{project identifier}/ |
| Project sequence files | Fastqs | Cluster | /data/project/worthey_lab/projects/{project identifier}/raw/{sample identifier}/ |
| Project analysis files used for computation | BAM, VCF, output from other tools | Cluster | /data/project/worthey_lab/projects/{project identifier}/analysis/{sample identifier}/ |
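
The two cluster rows of the table can be expressed as a small path helper. This is only a minimal sketch, not existing lab tooling: the function names `raw_dir` and `analysis_dir` are hypothetical, and only the path templates are taken from the table.

```python
from pathlib import PurePosixPath

# Root of the lab's project space on the cluster (path taken from the table).
PROJECTS_ROOT = PurePosixPath("/data/project/worthey_lab/projects")


def raw_dir(project: str, sample: str) -> PurePosixPath:
    """Hypothetical helper: where project sequence files (e.g. FASTQs) go."""
    return PROJECTS_ROOT / project / "raw" / sample


def analysis_dir(project: str, sample: str) -> PurePosixPath:
    """Hypothetical helper: where computed analysis files (BAM, VCF, ...) go."""
    return PROJECTS_ROOT / project / "analysis" / sample


if __name__ == "__main__":
    # Project and sample identifiers here are illustrative only.
    print(raw_dir("CF_TLOAF", "sample_1"))
    print(analysis_dir("CF_TLOAF", "sample_1"))
```

Building paths from one shared root, rather than typing them by hand, is one way to keep scripts consistent with the locations listed here.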

## Box

In general, the Box directories should be laid out as follows:

```text
.
├── Alexander and Worthey Labs Collaboration
└── CGDS
    ├── General
    │   ├── Grants\ Info
    │   ├── Job\ Descriptions
    │   ├── LabCharter
    │   └── MeetingNotes
    └── Projects
        ├── CF\ Projects
        │   ├── CF-Brothers\ Three
        │   ├── CF-First
        │   └── Gerber\ CF
        ├── Data\ Requests
        ├── KidsNetwork\ RO3
        ├── MECFS
        └── TSC
```

### Collaborations

From time to time, our lab needs to share files with other labs. To this end, we create collaborative
directories in Box that grant access to both our lab and the collaborator's lab. For example, a collaboration directory
for work done with the Alexander lab exists at the same level as the root CGDS directory (described below): when first
entering Box, you would see the directory `Alexander and Worthey Labs Collaboration` alongside the `CGDS`
directory (see the directory diagram above).

Remember that a collaboration directory is for sharing large-ish files and results between the labs. Permissions are set
according to the needs of the particular collaboration.

### Root CGDS Directory

The root directory in Box is for all files related to the CGDS. The only directories within the root directory are the
`Projects` and `General` directories. Descriptions and organization of those directories are below.

### Projects Directory

This directory is the space for storing all "project" related data. The term `Project` is meant to encompass a wide
range of activities executed in the lab. A project is not defined by the amount of time, size, space, or work it will
involve. Rather, a project describes and organizes information about a topic of work, research, or development. For
example, the following all define projects in the lab (this is not an exhaustive list):

- application development
- pipeline development
- grant funded research
- hypothesis driven research
- Standard Operating Procedures

#### Projects Directory Structure Guidelines

Under this directory, there should be only other directories. Each directory should represent a single project or a
group of projects with a common theme. For instance, in the layout above, the `CF Projects` directory contains several
project directories that are all related to CF. Either way of representing your project is fine as long as it sticks to
this paradigm and suits your needs.

It is recommended and encouraged that there be some sort of charter documentation at the root of each project directory
to describe the project, its purpose, goals, etc. Eventually there will be a template that can be used to create a new
charter for a project, but until then please provide a Box Note with that information.

### General Directory

This directory is the space for storing all "non-project" related data. Things like a Lab Charter or Grant Information
would end up being located here. Grant Information is a fuzzy topic so it may be desirable to keep this in the Projects
directory. If you have strong feelings on which space a directory should live under then it can be discussed with the
data manager for the lab (Brandon Wilk, bwilk@peds.uab.edu).

One way or another we can work together to make sure information is stored in a representative location.

#### General Directory Structure Guidelines

Under this directory, there should be only other directories. Each directory should represent a single section of work
that needs organizational structure.

It is recommended and encouraged that there be some sort of info documentation at the root of each directory to
describe the info contained within to some degree; this doesn't have to be anything longer than a few sentences.

Basically, organization in this space is not dictated and is allowed to grow organically according to users' needs, so
have at it!

---

## Cluster

The [Cheaha cluster](cheaha-cluster.md) is the main massive compute infrastructure on campus.
This is **the place** where our large lab data should live.

**WARNING**: until further notice, PHI and HIPAA-protected data must not be stored on the cluster.

---

### Lab Space

CGDS has its own designated team space on the Cheaha cluster. This is a central location for storing
multiple types of data (samples, analysis, computation, pipelines, etc.) related to the various
projects in the lab.

#### Lab Space Location

`/data/project/worthey_lab/`

#### Lab Space General Layout

At a high level there are several main directories into which various pieces of information are stored. Below is an
example representation of how this information is sorted:

```text
/data/project/worthey_lab
├── datasets_central
├── projects
├── samples
└── tools
```

Although this is not an exhaustive list, it covers the directories of major interest. The table below briefly
describes the purpose of each directory:

| Directory | Description |
| --- | --- |
| [datasets_central](data-policies.md#datasets-central) | storage for reference, annotation, and generic data sets |
| [projects](data-policies.md#cluster-projects) | storage for **ALL** of the lab's cluster appropriate project data |
| samples | storage for control samples and other samples unspecific to a given lab project |
| tools | installation location for lab specific cluster tools, except pipelines and their associated tooling |

The following sections cover these directories and their internal structure in greater depth and should be read
carefully before interacting with data in the lab cluster space.

### Datasets Central

This is the directory space on the cluster file system (GPFS) where reference, annotation, and generic data sets used
for analysis should be stored.

Please see the guidelines, documentation, and code responsible for populating this space in GitLab under the
[Datasets Central Manager](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/datasets_central_manager).

#### Datasets Central Location

`/data/project/worthey_lab/datasets_central/`

#### Datasets Central General Layout

The Datasets Central Manager codifies how we obtain our data and standardizes how it is stored.
Manual manipulation of any data in this directory is therefore discouraged. Instead, the preferred way to add a dataset
to this space is to create or extend an Ansible role in the
[Datasets Central Manager](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/datasets_central_manager).

Below is a directory tree showing how datasets are organized in Datasets Central. Each dataset corresponds to
a role in the [Datasets Central Manager](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/datasets_central_manager)
and to a directory under `/data/project/worthey_lab/datasets_central` (e.g. the `dbsnp` and `human_reference_genome`
directories). Within a dataset directory are a README file describing the dataset and directories storing the different
downloaded versions of the data (see the `build_146` directory). Within a specific version, there may be a directory
separating the data further by genomic build (if applicable). At the deepest level, data is split
into two directories, `raw` and `processed`. The `raw` directory contains data downloaded directly from the
source, and the `processed` directory holds any files that result from processing or transforming the data in the
`raw` directory.

```text
/data/project/worthey_lab/datasets_central
├── dbsnp
│   ├── build_146
│   │   └── hg38
│   │       └── raw
│   │           ├── dbsnp_146.hg38.vcf.gz
│   │           └── dbsnp_146.hg38.vcf.gz.tbi
│   └── README.md
└── human_reference_genome
    ├── GRCh38
    │   ├── processed
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.1.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.2.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.3.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.4.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.5.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.6.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.7.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.8.ht2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.amb
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.ann
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.1.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.2.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.3.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.4.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.rev.1.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.rev.2.bt2
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
    │   │   ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.pac
    │   │   └── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.sa
    │   └── raw
    │       ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
    │       ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
    │       ├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.hisat2_index.tar.gz
    │       ├── GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.bowtie_index.tar.gz
    │       └── GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.bwa_index.tar.gz
    └── README.md
```
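
The nesting described above (dataset, then version, then an optional genomic build, then `raw` or `processed`) can be sketched as a path builder. This is an illustrative sketch under stated assumptions, not part of the Datasets Central Manager: the function name `dataset_dir` and its parameters are hypothetical, and only the directory names are taken from the tree above.

```python
from pathlib import PurePosixPath
from typing import Optional

# Location of Datasets Central (from this document).
DATASETS_ROOT = PurePosixPath("/data/project/worthey_lab/datasets_central")
# The two directories found at the deepest level of every dataset.
STAGES = ("raw", "processed")


def dataset_dir(dataset: str, version: str, stage: str,
                build: Optional[str] = None) -> PurePosixPath:
    """Hypothetical helper: <root>/<dataset>/<version>/[<build>/]<stage>."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    path = DATASETS_ROOT / dataset / version
    if build is not None:
        # The genomic build level is only present when it applies to the data.
        path = path / build
    return path / stage


if __name__ == "__main__":
    # Both calls mirror directories shown in the tree above.
    print(dataset_dir("dbsnp", "build_146", "raw", build="hg38"))
    print(dataset_dir("human_reference_genome", "GRCh38", "processed"))
```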

---

### Projects

This directory space houses **ALL** of the lab's project data that needs to reside on the cluster (e.g. FASTQ, BAM,
VCF, analysis pipeline outputs, etc.).

1. ALL project data on the cluster should live here, regardless of the project's nature, size, scale, or status
   (exploratory or active). There are NO EXCEPTIONS: lab data is not to reside in your home or data directory.

2. Exploratory projects should follow the layouts and guidelines below even if they don't take off, because doing so
   gives us a succinct record of what was done and where it is located.

#### Projects Location

`/data/project/worthey_lab/projects/`

#### Projects Directory Layout and Guidelines

```text
projects
├── CF_TLOAF
│   ├── analysis
│   │   └── sample_1
│   │       ├── bams
│   │       ├── mei
│   │       │   ├── output.ALU.vcf
│   │       │   └── output.LINE.vcf
│   │       └── small_variants
│   │           └── output.vcf
│   └── raw
│       └── sample_1
│           └── fastq
│               ├── r1.fastq
│               └── r2.fastq
└── DMD
    ├── analysis
    │   ├── sample_2
    │   │   ├── bams
    │   │   ├── mei
    │   │   │   └── output.ALU.vcf
    │   │   └── small_variants
    │   │       └── output.vcf
    │   └── sample_3
    │       ├── bams
    │       ├── mei
    │       │   └── output.SINE.vcf
    │       └── small_variants
    │           └── output.vcf
    └── raw
```

`analysis` is intended to contain all of the output from any work done to analyze the data contained in the `raw`
directory. Sometimes this information (e.g. a BAM or a VCF) will be supplied by a collaborator instead of being
produced by our lab. Regardless, this is the location where those files should end up.

`raw` is the storage location for sequence-level information (if the project involves it). For non-sequencing projects,
it is the location for the data supplied by the originating source. Not all projects will
have this information available.
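
As an illustration of the layout above, the sketch below creates the `raw` and `analysis` directories for one sample. It is a minimal example under stated assumptions, not existing lab tooling: the function name `scaffold_sample` is hypothetical, and the demo runs against a temporary directory rather than the real `projects` location.

```python
import tempfile
from pathlib import Path
from typing import List


def scaffold_sample(projects_root: Path, project: str, sample: str) -> List[Path]:
    """Hypothetical helper: create <project>/raw/<sample> and <project>/analysis/<sample>."""
    created = []
    for stage in ("raw", "analysis"):
        sample_dir = projects_root / project / stage / sample
        # parents=True builds the project and stage levels as needed;
        # exist_ok=True makes the call safe to repeat.
        sample_dir.mkdir(parents=True, exist_ok=True)
        created.append(sample_dir)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory; on the cluster the root would be the
    # lab's projects directory instead.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_sample(Path(tmp), "DMD", "sample_2"):
            print(path.relative_to(tmp).as_posix())
```

Subdirectories such as `bams` or `small_variants` are left out on purpose, since they depend on the analyses actually run for the sample.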

---

### Experimental Pipelines

This directory is temporary space for the development of pipelines for the lab, including tests of all third-party
pipelines.

1. Third-party pipelines (pipelines designed outside of the lab)
    1. Unless *ABSOLUTELY* necessary, try not to install the tools leveraged by the pipeline directly on the cluster
       (i.e. make sure all tools used by the pipeline can be containerized in a Docker or Singularity image).
2. Experimental lab-developed pipelines can go here until they are ready to be used in "Production"
    1. In many cases, a pipeline is "Production" ready in the R&D space when we as a lab start using its output to
       analyze samples/cases/cohorts/data for research.

#### Experimental Pipelines Location

`/data/project/worthey_lab/projects/experimental_pipelines/`

#### Experimental Pipelines Directory Layout and Guidelines

```text
experimental_pipelines/
├── annovar_vcf_annotation
│   └── Annotate_Sample_Annovar.py
├── mana
│   ├── choosing_refgenome
│   │   ├── configs
│   │   ├── data
│   │   ├── logs
│   │   ├── notebooks
│   │   ├── README.md
│   │   └── Snakefile
│   ├── small_var_caller_pipelines
│   │   ├── crossman_pipeline
│   │   ├── data
│   │   ├── dna-seq-gatk-variant-calling
│   │   └── evaluate_calls
│   └── sv_caller
│       ├── data
│       ├── manta
│       └── parliament2
└── mei
    └── mei_pipeline.tar
```

As shown in the example directory tree above, the directory structure of `experimental_pipelines` is quite loose and
left up to the developer. Besides keeping each pipeline within a named directory under `experimental_pipelines` (e.g.
the `mei` directory), the only restrictions to follow are those listed above.

---

### User Data Home Directory

**NOTE**: this space is neither visible nor accessible to other users, even those in the lab.

1. Do **NOT** put pipeline- or analysis-related code or data here, and do not store files intended for shared use here.
2. This should be a very temporary location for data stored on the cluster.
    1. It is best practice and a lab requirement that all project-related data live in the project-specific directory
       in the lab's group space.
    2. It is best practice and a lab requirement that all common data files and datasets live in a commonly accessible
       location under the datasets central directory.
3. Really, the goal is to prevent this space from being used at all.
    1. The theory is that the less data and the fewer scripts, pipelines, and apps that live only in your home
       directory, the more transparency and redundancy will be built into our lab processes.
    2. This also spares us a really difficult time figuring out what someone else is doing in case of
       unforeseen outages (think: what would happen to your work if you were suddenly unavailable for an extended
       period?).
4. Any disagreement with this policy should be brought to the lab's cluster data manager
   (Brandon Wilk, bwilk@peds.uab.edu), and the policy can be reviewed and possibly revised. Otherwise, they will work
   with you to set up your project, analysis, data storage, and pipeline needs to enable you to work at your best! <3
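
A script could guard against writing lab data to personal space with a check along these lines. This is a hedged sketch rather than an enforced lab mechanism: the function name `classify_path` and its return labels are hypothetical, and the roots are the lab space, `/data/user`, and `/home` locations named in this document.

```python
from pathlib import PurePosixPath

# Shared lab space on the cluster (from this document).
LAB_SPACE = PurePosixPath("/data/project/worthey_lab")
# Per-user spaces that other lab members cannot see.
PERSONAL_ROOTS = (PurePosixPath("/data/user"), PurePosixPath("/home"))


def classify_path(path: str) -> str:
    """Hypothetical helper: label a path as 'lab', 'personal', or 'other'.

    This is a purely lexical check on absolute paths; it does not resolve
    symlinks or '..' components.
    """
    p = PurePosixPath(path)
    if p == LAB_SPACE or LAB_SPACE in p.parents:
        return "lab"
    if any(root in p.parents for root in PERSONAL_ROOTS):
        return "personal"
    return "other"


if __name__ == "__main__":
    # "blazerid" stands in for a real Blazer ID.
    for candidate in (
        "/data/project/worthey_lab/projects/DMD/analysis/sample_2",
        "/data/user/blazerid/output.vcf",
    ):
        print(candidate, "->", classify_path(candidate))
```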

#### User Data Home Directory Location

`/data/user/{Blazer ID}/`

#### User Data Home Directory Layout and Guidelines

This is at your discretion as a user, but this space should really be seldom used, if at all.

---

### User Home Directory

**WARNING**: *The home directory must not be used to store large amounts of data*

**NOTE**: this space is neither visible nor accessible to other users, even those in the lab.

This is a small space provided to each user for storage of small scripts and other small
odds and ends while working on the cluster.

As stated in other sections, we *HIGHLY* recommend against storing anything in this directory
space unless absolutely necessary.

#### User Home Directory Location

`/home/{Blazer ID}/`