
Pipeline Development Guidelines

An excerpt from snaplogic.com does a great job of describing data processing pipelines at a high level:

A data pipeline architecture is a system that captures, organizes, and routes data so that it can be used to gain
insights. Raw data contains too many data points that may not be relevant. A data pipeline architecture organizes data
events to make reporting, analysis, and using data easier. A customized combination of software technologies and
protocols automate the management, visualization, transformation, and movement of data from multiple resources
according to research and analysis goals.

Why is this important?

  1. all raw data is processed and handled in the same manner
  2. best practices and conventions of the lab can be followed with minimal effort
  3. organization of data is maintainable and well defined
  4. documentation of process, tools, and organization
  5. accessibility to advanced data processing techniques
  6. REPRODUCIBILITY
  7. process transparency

What Does a Pipeline Look Like?

A pipeline is a composed set of processing steps, where each step has one or more data inputs and outputs and the processing it must complete to transform the input information into the output information. In this way, the output from one step (or multiple steps) becomes the input to a later processing step. The ultimate goal is the transformation of the raw input data into a final form that can be analyzed.
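To make the composition concrete, here is a minimal sketch in Python; the step functions and toy data are hypothetical, purely to illustrate how each step's output feeds the next:

```python
from functools import reduce

# Hypothetical processing steps: each takes the previous step's output as input.
def trim_reads(reads):
    """Strip trailing ambiguous bases ('N') from each read."""
    return [r.rstrip("N") for r in reads]

def drop_short_reads(reads, min_len=5):
    """Discard reads that are too short to be informative."""
    return [r for r in reads if len(r) >= min_len]

def summarize(reads):
    """Reduce the cleaned reads to a final, analyzable summary."""
    return {"n_reads": len(reads),
            "mean_len": sum(map(len, reads)) / max(len(reads), 1)}

def run_pipeline(data, steps):
    """Feed the raw input through each step in order."""
    return reduce(lambda out, step: step(out), steps, data)

raw = ["ACGTACGTNN", "ACGNN", "ACGTACGTACGT"]
print(run_pipeline(raw, [trim_reads, drop_short_reads, summarize]))
# -> {'n_reads': 2, 'mean_len': 10.0}
```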

Where Do Pipelines Fit?

Pipelines fit in anywhere standardized data processing is done: on your local machine, on a High Performance Compute cluster, or in the cloud, to name a few examples. They can also fit in at any point in the flow of data for a given process.

A classic example: in an NGS data analysis lab, a typical high-level data workflow from sample extraction to final results looks like this:

[NGS workflow diagram: primary → secondary → tertiary processing phases]

Data processing pipelines are the processes labeled in the diagram as secondary. The raw data that comes off of the sequencer during the primary phase needs to be processed, refined, and transformed into a usable format for an analyst to interpret in the tertiary phase.

Pipeline Development

The following sections cover the best practices and required components of pipelines developed by CGDS.

Source Code Management

  • Source code must be tracked using Git and managed as a repository in the CGDS GitLab.
  • Conventions set forth in the CGDS SCM standards should be followed:
    • The master branch must be protected. No one may push directly to the master branch, including the repo owner.
    • Any changes (bug fixes, feature additions, etc.) must be made on a git branch, submitted for peer review in GitLab, and merged into the master branch only after passing review, as sketched below.
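A minimal sketch of the branch-and-review workflow this implies, expressed in Python for consistency with the other examples (branch and commit names are hypothetical, and the equivalent plain `git` commands on the command line work just as well):

```python
import subprocess

def git(*args):
    """Run a git command and fail loudly if it errors."""
    subprocess.run(["git", *args], check=True)

# Work happens on a feature branch, never directly on master.
git("checkout", "-b", "fix/handle-empty-fastq")
# ...make and test the change, then commit and push the branch.
git("add", "-A")
git("commit", "-m", "Handle empty FASTQ inputs gracefully")
git("push", "-u", "origin", "fix/handle-empty-fastq")
# Finally, open a merge request in GitLab and merge only after peer review passes.
```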

Tools and Dependencies

Applications/Tools leveraged by the pipeline for processing should be managed through a combination of Anaconda3 environment setup and application/tool containerization (using Docker and/or Singularity).

If you don't know where to start, take a look at established workflow management tools for a potential starting point, and check out existing CGDS pipelines for working examples.
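To illustrate the container-first approach described above, here is a minimal sketch of a tool wrapper that prefers a Singularity image and falls back to a named Anaconda environment; the image path and environment name are hypothetical:

```python
import shutil
import subprocess

def run_tool(cmd, sif_image="containers/tools.sif", conda_env="pipeline-env"):
    """Run a pipeline tool inside a Singularity container when available,
    otherwise fall back to the project's Anaconda environment."""
    if shutil.which("singularity"):
        full_cmd = ["singularity", "exec", sif_image, *cmd]
    else:
        full_cmd = ["conda", "run", "-n", conda_env, *cmd]
    subprocess.run(full_cmd, check=True)

# Example: run FastQC on a sample through whichever runtime is available.
run_tool(["fastqc", "sample_R1.fastq.gz", "--outdir", "qc/"])
```

Pinning both the container image and the conda environment in the repository keeps the pipeline reproducible whether it runs on the cluster, a local machine, or in the cloud.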

Documentation

  • Readme.md includes:

    • Description of the pipeline
    • Tools required
    • Any prerequisites necessary
    • Any dependencies necessary
    • Any configuration set up
    • How to run the pipeline
  • Contributing.md includes:

    • Tools required
    • Guidelines to be followed for contribution
    • Types of testing implemented and how to run them
  • Changelog.md includes:

    • Formatting required
  • Any known issues and troubleshooting
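As a quick sanity check against the list above, a short Python sketch (the script itself is illustrative, not a CGDS requirement) can verify that the required documentation files exist in a repository:

```python
from pathlib import Path
import sys

# The documentation files required above.
REQUIRED_DOCS = ["Readme.md", "Contributing.md", "Changelog.md"]

def check_docs(repo_root="."):
    root = Path(repo_root)
    missing = [name for name in REQUIRED_DOCS if not (root / name).is_file()]
    if missing:
        sys.exit(f"Missing required documentation: {', '.join(missing)}")
    print("All required documentation files are present.")

if __name__ == "__main__":
    check_docs()
```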

Cluster Location

Testing of pipelines in development is restricted to the following space on the Cheaha cluster: /data/project/worthey_lab/projects/experimental_pipelines/

Pipeline "Production" Deployment

Developed pipelines that reach the point of being ready for In-House Use can be considered for "Production" deployment and use. See the Production Pipeline page for more information on the process and requirements for moving a developed pipeline into production.