The nf-core framework for community-curated bioinformatics pipelines

PA Ewels, A Peltzer, S Fillinger, H Patel… - Nature Biotechnology, 2020 - nature.com
To the Editor—The standardization, portability and reproducibility of analysis pipelines are key issues within the bioinformatics community. Most bioinformatics pipelines are designed for use on-premises; as a result, the associated software dependencies and execution logic are likely to be tightly coupled with proprietary computing environments. This can make it difficult or even impossible for others to reproduce the ensuing results, which is a fundamental requirement for the validation of scientific findings. Here, we introduce the nf-core framework as a means for the development of collaborative, peer-reviewed, best-practice analysis pipelines (Fig. 1). All nf-core pipelines are written in Nextflow and so inherit the ability to be executed on most computational infrastructures, as well as having native support for container technologies such as Docker and Singularity. The nf-core community (Supplementary Fig. 1) has developed a suite of tools that automate pipeline creation, testing, deployment and synchronization. Our goal is to provide a framework for high-quality bioinformatics pipelines that can be used across all institutions and research facilities.

Being able to reproduce scientific results is the central tenet of the scientific method. However, moving toward FAIR (findable, accessible, interoperable and reusable) research methods1 in data-driven science is complex2,3. Central repositories, such as bio.tools4, omictools5 and the Galaxy toolshed6, make it possible to find existing pipelines and their associated tools. However, it is still notoriously challenging to develop analysis pipelines that are fully reproducible and interoperable across multiple systems and institutions—primarily because of differences in hardware, operating systems and software versions. Although the recommended guidelines for some analysis pipelines have become standardized (for example, GATK best practices7), the actual implementations are usually developed on a case-by-case basis. As such, there is often little incentive to test, document and implement pipelines in a way that permits their reuse by other researchers. This can hamper the sustainable sharing of data and tools, and results in a proliferation of heterogeneous analysis pipelines, making it difficult for newcomers to find what they need to address a specific analysis question.
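To illustrate the portability described above, an nf-core pipeline can be launched directly from its GitHub repository with a single Nextflow command, selecting the container engine through a profile. The following is a minimal sketch; the pipeline name, revision and helper-command syntax are illustrative and may differ between pipeline and nf-core/tools versions:

  # Launch a released nf-core pipeline with Docker, using its bundled small test dataset
  nextflow run nf-core/rnaseq -r 1.4.2 -profile test,docker

  # The same pipeline on an HPC system where Singularity is available instead of Docker
  nextflow run nf-core/rnaseq -r 1.4.2 -profile test,singularity

  # Helper commands from the nf-core/tools package: list, scaffold and lint pipelines
  nf-core list
  nf-core create
  nf-core lint .

Because both the workflow code and the software containers are versioned, the same command is intended to recreate the same software environment on a laptop, an HPC cluster or in the cloud.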
As the scale of omics data and their associated analytical tools has grown, the scientific community is increasingly moving toward the use of specialized workflow management systems to build analysis pipelines8. These systems separate the requirements of the underlying compute infrastructure from the analysis and workflow description.
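A minimal sketch of how this separation looks in Nextflow: site-specific settings live in configuration profiles rather than in the pipeline code, so the same workflow can be dispatched to a local Docker installation or a Slurm cluster by switching profiles. The profile and queue names below are hypothetical, not taken from the paper:

  // nextflow.config (sketch): infrastructure settings kept apart from the analysis logic
  profiles {
      docker {
          docker.enabled = true          // run every process in its Docker container
      }
      my_cluster {                       // hypothetical institutional profile
          process.executor    = 'slurm'  // submit jobs to a Slurm scheduler
          process.queue       = 'standard'
          singularity.enabled = true     // use Singularity where Docker is unavailable
      }
  }

The analysis steps themselves are unchanged; only the profile passed with -profile decides where and how they run.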