# Pipeline Development Guidelines

An excerpt from [snaplogic.com](https://www.snaplogic.com/glossary/data-pipeline-architecture) does a great job of describing
data processing pipelines at a high level:

```text
A data pipeline architecture is a system that captures, organizes, and routes data so that it can be used to gain
insights. Raw data contains too many data points that may not be relevant. A data pipeline architecture organizes data
events to make reporting, analysis, and using data easier. A customized combination of software technologies and
protocols automate the management, visualization, transformation, and movement of data from multiple resources
according to research and analysis goals.
```

Why is this important?

1. All `raw` data is processed and handled in the same manner
2. Best practices and conventions of the lab can be followed with minimal effort
3. Organization of data is maintainable and well defined
4. Process, tools, and organization are documented
5. Advanced data processing techniques become accessible
6. *REPRODUCIBILITY*
7. The process is transparent

## What Does a Pipeline Look Like?

A pipeline is a composed set of processing steps, where each step has a set of data input(s) and output(s) and the
processing it must complete to transform the input information into the output information. In this way the output from
one step (or multiple steps) becomes the input to a later processing step. The ultimate goal is the transformation of
the `raw` input data into a final form that can be analyzed.
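
As a minimal sketch of that chaining (written in [Snakemake](https://snakemake.readthedocs.io/en/stable/), one of the
workflow managers suggested later; all file names and commands are illustrative, not taken from a CGDS pipeline), note
how the output of the first step is declared as the input of the second:

```snakemake
# Final target: requesting this file drives the whole chain.
rule all:
    input:
        "results/sample1.summary.txt"

# Step 1: transform the raw input into an intermediate form.
rule clean:
    input:
        "raw/sample1.txt"
    output:
        "intermediate/sample1.cleaned.txt"
    shell:
        "sort {input} | uniq > {output}"

# Step 2: step 1's output is this step's input, so Snakemake
# runs `clean` first whenever the summary is requested.
rule summarize:
    input:
        "intermediate/sample1.cleaned.txt"
    output:
        "results/sample1.summary.txt"
    shell:
        "wc -l {input} > {output}"
```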

## Where Do Pipelines Fit?

Pipelines fit in anywhere standardized data processing can be done: on your local machine, on a High Performance
Computing (HPC) cluster, or in the cloud, as a few examples. They can also fit in at any point in the flow of data for a
given process.

A classic example: in an NGS data analysis lab, a typical high-level data workflow from sample extraction to final
results would look like this: ![NGS-workflow-diagram](img/NGS-workflow.png)

Data processing pipelines are the processes labeled in the diagram as `secondary`. The raw data that comes off of the
sequencer during the `primary` phase needs to be processed, refined, and transformed into a usable format for an analyst
to interpret in the `tertiary` phase.
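
As an illustration only (the tools here, bwa, samtools, and bcftools, and the paths are hypothetical choices for this
sketch, not a CGDS standard), a fragment of that `secondary` phase might look like:

```snakemake
# Illustrative secondary-processing fragment: raw reads -> alignments -> variants.
rule align_reads:
    input:
        ref="ref/genome.fa",
        reads="raw/sample1.fastq.gz"
    output:
        "aligned/sample1.bam"
    shell:
        "bwa mem {input.ref} {input.reads} | samtools sort -o {output} -"

# The sorted BAM produced above feeds straight into variant calling.
rule call_variants:
    input:
        ref="ref/genome.fa",
        bam="aligned/sample1.bam"
    output:
        "variants/sample1.vcf.gz"
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -Oz -o {output}"
```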

## Pipeline Development

The following sections cover best practices and required parts of pipelines developed by CGDS.

### Source Code Management

- Source code must be tracked using Git and be managed as a repository in [CGDS Gitlab](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science).
- Conventions set forth in [CGDS SCM standards](standards-definitions.md#source-control) should be followed:
  - The master branch must be [protected](https://docs.gitlab.com/ee/user/project/protected_branches.html#configuring-protected-branches).
    No one is allowed to push directly to the master branch, including the repo owner.
  - Any changes (bug fixes, feature additions, etc.) must be made on git branches, submitted for peer review in GitLab,
    and then merged to the master branch after passing peer review (a sketch of this flow follows below).
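
In practice, that flow looks roughly like this (the branch name is illustrative; the merge itself happens through a
GitLab merge request, never a direct push to master):

```sh
# Create a feature branch for the change
git checkout -b fix/empty-input-handling

# Commit the work and push the branch up to GitLab
git add .
git commit -m "Handle empty input files gracefully"
git push -u origin fix/empty-input-handling

# Then open a merge request in the GitLab UI targeting master;
# the branch is merged only after peer review passes.
```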

### Tools and Dependencies

Applications/Tools leveraged by the pipeline for processing should be managed through a combination of [Anaconda3](https://www.anaconda.com/distribution/)
environment setup and application/tool containerization (using [Docker](https://www.docker.com/) and/or
[Singularity](https://sylabs.io/singularity/)).
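
For example, a Snakemake rule (Snakemake is one of the workflow managers suggested below) can declare both a conda
environment and a container image per rule; the environment file, image, and command here are hypothetical:

```snakemake
# Illustrative per-rule dependency management: Snakemake builds the conda
# environment when invoked with --use-conda, or runs the rule inside the
# container when invoked with --use-singularity.
rule align_reads:
    input:
        "raw/sample1.fastq.gz"
    output:
        "aligned/sample1.bam"
    conda:
        "envs/align.yaml"  # hypothetical environment spec pinning bwa/samtools
    container:
        "docker://biocontainers/bwa:v0.7.17_cv1"  # hypothetical image
    shell:
        "bwa mem ref/genome.fa {input} | samtools sort -o {output} -"
```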

If you don't know where to start, take a look at the following workflow management tools:

- [Snakemake](https://snakemake.readthedocs.io/en/stable/)
- [Nextflow](https://www.nextflow.io/)

You can also check out these CGDS pipelines as working examples:

- [Structural Variant caller for NGS data](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/manta_sv_caller_pipeline)
- [MEI Variant caller for NGS data](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mei_pipeline)
- [Sequence Alignment and Small Variant caller for NGS data](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/small_variant_caller_pipeline)

### Documentation

- `Readme.md` includes:
  - Description of the pipeline
  - Tools required
  - Any prerequisites necessary
  - Any dependencies necessary
  - Any configuration set up
  - How to run the pipeline

- `Contributing.md` includes:
  - Tools required
  - Guidelines to be followed for contribution
  - Types of testing implemented and how to run them

- `Changelog.md` includes:
  - Formatting required

- Any known issues and troubleshooting

### Cluster Location

Testing of pipelines in development is restricted to the following space on the Cheaha cluster:
`/data/project/worthey_lab/projects/experimental_pipelines/`

## Pipeline "Production" Deployment

Developed pipelines that reach the point of being ready for [In-House Use](standards-definitions.md#in-house-use) can be
considered for "Production" deployment and use. See the [Production Pipeline page](prod-pipeline-guidelines.md)
for more information on the process and requirements for a developed pipeline to move into production.