# Pipeline Development Guidelines
An excerpt from snaplogic.com does a great job of describing data processing pipelines at a high level:
> A data pipeline architecture is a system that captures, organizes, and routes data so that it can be used to gain insights. Raw data contains too many data points that may not be relevant. A data pipeline architecture organizes data events to make reporting, analysis, and using data easier. A customized combination of software technologies and protocols automate the management, visualization, transformation, and movement of data from multiple resources according to research and analysis goals.
## Why is this important?
- all raw data is processed and handled in the same manner
- best practices and conventions of the lab can be followed with minimal effort
- organization of data is maintainable and well defined
- documentation of process, tools, and organization
- accessibility to advanced data processing techniques
- REPRODUCIBILITY
- process transparency
## What Does a Pipeline Look Like
A pipeline is a composed set of processing steps where each step has a set of data input(s) and output(s) and processing that it must complete to transform the input information into the output information. In this way, the output from one (or multiple) steps becomes the input to a later processing step. The ultimate goal is the transformation of the raw input data into a final form that can be analyzed.
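As a minimal illustration of that structure, the sketch below composes a few steps in plain Python. The step names, file names, and "transformations" are entirely hypothetical; a real CGDS pipeline would typically delegate this wiring to a workflow manager.

```python
# Minimal sketch of a pipeline as composed steps (all names are hypothetical).
# Each step reads its input file(s) and writes an output file, so the output
# of one step becomes the input of the next.
from pathlib import Path

def trim_reads(raw: Path, trimmed: Path) -> Path:
    """Step 1: clean up the raw input (placeholder transformation)."""
    trimmed.write_text(raw.read_text().upper())
    return trimmed

def summarize(trimmed: Path, report: Path) -> Path:
    """Step 2: transform step 1's output into an analyzable form."""
    report.write_text(f"records: {len(trimmed.read_text().splitlines())}\n")
    return report

# Compose: raw data -> intermediate -> final, ready-to-analyze form.
raw = Path("sample.raw.txt")
raw.write_text("acgt\nttga\n")  # stand-in for real raw data
final = summarize(trim_reads(raw, Path("sample.trimmed.txt")),
                  Path("sample.report.txt"))
print(final.read_text())
```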
## Where Do Pipelines Fit
Pipelines fit anywhere standardized data processing can be done: on your local machine, on a High Performance Compute cluster, or in the cloud, as a few examples. They can also fit in at any point in the flow of data for a given process.
A classic example: in an NGS data analysis lab, a typical high-level data workflow from sample extraction to final results would look like this:
Data processing pipelines are the processes labeled in the diagram as *secondary*. The raw data that comes off of the sequencer during the *primary* phase needs to be processed, refined, and transformed into a usable format for an analyst to interpret in the *tertiary* phase.
## Pipeline Development
The following sections cover best practices and required parts of pipelines developed by CGDS.
### Source Code Management
- Source code must be tracked using Git and managed as a repository in CGDS GitLab.
- Conventions set forth in the CGDS SCM standards should be followed.
- The master branch must be protected. No one is allowed to push directly to the master branch, including the repo owner.
- Any changes (bug fixes, feature additions, etc.) must be made via Git branches, submitted for peer review in GitLab, and then merged to the master branch after passing peer review (see the sketch after this list).
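As a rough illustration of that branch-and-review flow, here is a sketch that drives the Git commands from Python. The branch name, commit message, and remote are hypothetical; in practice you would run the equivalent `git` commands directly.

```python
# Hypothetical feature-branch workflow, shelling out to git from Python.
# Branch name, commit message, and remote below are examples only.
import subprocess

def git(*args: str) -> None:
    """Run a git command and fail loudly if it errors."""
    subprocess.run(["git", *args], check=True)

git("checkout", "-b", "fix/depth-filter")  # work on a branch, never on master
# ... edit files here ...
git("add", "-A")
git("commit", "-m", "Fix read-depth filter off-by-one")
git("push", "-u", "origin", "fix/depth-filter")
# Next: open a merge request in GitLab; after peer review passes,
# the branch is merged to the protected master branch.
```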
### Tools and Dependencies
Applications/Tools leveraged by the pipeline for processing should be managed through a combination of Anaconda3 environment setup and application/tool containerization (using Docker and/or Singularity).
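As a hedged example of what that looks like from inside a pipeline step, the sketch below runs a tool through a Singularity container from Python. The image path, tool, and BAM file name are assumptions for illustration, not a prescribed CGDS layout.

```python
# Minimal sketch of invoking a containerized tool from a pipeline step.
# The image path and the BAM file name below are hypothetical.
import subprocess

IMAGE = "/containers/samtools_1.9.sif"  # assumed Singularity image location

def run_in_container(cmd: list[str]) -> None:
    """Execute a command inside the pinned container image."""
    subprocess.run(["singularity", "exec", IMAGE, *cmd], check=True)

# Example step: index an aligned BAM with the containerized samtools.
run_in_container(["samtools", "index", "sample1.bam"])
```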
If you don't know where to start, take a look at the following workflow management tools for a potential starting point, and check out these CGDS pipelines:
- Structural Variant caller for NGS data
- MEI Variant caller for NGS data
- Sequence Alignment and Small Variant caller for NGS data
### Documentation
- `Readme.md` includes:
  - Description of the pipeline
  - Tools required
  - Any prerequisites necessary
  - Any dependencies necessary
  - Any configuration setup
  - How to run the pipeline
- `Contributing.md` includes:
  - Tools required
  - Guidelines to be followed for contribution
  - Types of testing implemented and how to run them
- `Changelog.md` includes:
  - Formatting required
- Any known issues and troubleshooting
### Cluster Location
Testing of pipelines in development is restricted to the following space on the Cheaha cluster:

`/data/project/worthey_lab/projects/experimental_pipelines/`
Pipeline "Production" Deployment
Developed pipelines that reach the point of being ready for In-House Use can be considered for "Production" deployment and use. See the Production Pipeline page for more information on the process and requirements for a developed pipeline to move into production.