This lesson is still being designed and assembled (Pre-Alpha version)

Getting Started with Nextflow

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What is a workflow and what are workflow management systems?

  • Why should I use a workflow management system?

  • What is Nextflow?

  • What are the main features of Nextflow?

  • What are the main components of a Nextflow script?

  • How do I run a Nextflow script?

Objectives
  • Understand what a workflow management system is.

  • Understand the benefits of using a workflow management system.

  • Explain the benefits of using Nextflow as part of your bioinformatics workflow.

  • Explain the components of a Nextflow script.

  • Run a Nextflow script.

Workflows

Analysing data involves a sequence of tasks, including gathering, cleaning, and processing data. This sequence of tasks is called a workflow or a pipeline. Workflows typically require executing multiple software packages, sometimes running in different computing environments, such as a desktop or a compute cluster. Traditionally, these workflows have been joined together in scripts using general-purpose programming languages such as Bash or Python.



Example bioinformatics variant calling workflow/pipeline diagram from [nf-core](https://nf-co.re/sarek) and simple RNA-Seq pipeline in DAG format.


However, as workflows become larger and more complex, the management of the programming logic and software becomes difficult.

Workflow management systems

Workflow Management Systems (WfMS), such as Snakemake, Galaxy, and Nextflow, have been developed specifically to manage computational data-analysis workflows in fields such as Bioinformatics, Imaging, Physics, and Chemistry.

WfMS contain multiple features that simplify the development, monitoring, execution and sharing of pipelines.

Nextflow basic concepts

Nextflow is a workflow management system that combines a runtime environment (software that is designed to run other software) with a domain-specific language (DSL) that eases the writing of computational pipelines.

Nextflow is built around the idea that Linux is the lingua franca of data science. Nextflow follows Linux’s “small pieces loosely joined” philosophy, in which many simple but powerful command-line and scripting tools, when chained together, facilitate more complex data manipulations.

Nextflow extends this approach, adding the ability to define complex program interactions and an accessible (high-level) parallel computational environment based on the dataflow programming model, whereby processes are connected via their outputs and inputs to other processes, and run as soon as they receive an input.

The diagram below illustrates the differences between a dataflow model and a simple linear program.



A simple program (a) and its dataflow equivalent (b) https://doi.org/10.1145/1013208.1013209.


In a simple program (a), these statements would be executed sequentially. Thus, the program would execute in three units of time. In the dataflow programming model (b), this program takes only two units of time. This is because the read quantitation and QC steps have no dependencies on each other and can therefore execute in parallel.
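As a minimal sketch of this idea in Nextflow itself, the script below defines two processes that read from the same channel. The process names QC and QUANT, the sample name, and the echo commands are placeholders invented for illustration and are not part of the lesson's pipeline; the syntax is explained later in this lesson. Because neither process uses the other's output, Nextflow is free to run both at the same time.

```nextflow
nextflow.enable.dsl=2

// Hypothetical QC step: depends only on the sample name it receives
process QC {
    input:
    val sample

    output:
    stdout

    script:
    """
    echo "QC finished for ${sample}"
    """
}

// Hypothetical quantification step: also depends only on the sample name
process QUANT {
    input:
    val sample

    output:
    stdout

    script:
    """
    echo "quantification finished for ${sample}"
    """
}

workflow {
    // One upstream channel feeds both processes
    samples_ch = Channel.of('sampleA')

    // No data flows between QC and QUANT, so they can run in parallel
    QC(samples_ch)
    QUANT(samples_ch)

    QC.out.view()
    QUANT.out.view()
}
```

The order in which the two messages are printed may differ between runs, which is what you would expect when neither step waits for the other.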

Nextflow core features

  1. Fast prototyping: A simple syntax for writing pipelines that enables you to reuse existing scripts and tools for fast prototyping.

  2. Reproducibility: Nextflow supports several container technologies, such as Docker and Singularity, as well as the package manager Conda. This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions, and reproduce any former configuration.

  3. Portability: Nextflow’s syntax separates the functional logic (the steps of the workflow) from the execution settings (how the workflow is executed). This allows the pipeline to be run on multiple platforms, e.g. local compute vs. a university compute cluster or a cloud service like AWS, without changing the steps of the workflow.

  4. Simple parallelism: Nextflow is based on the dataflow programming model which greatly simplifies the splitting of tasks that can be run at the same time (parallelisation).

  5. Continuous checkpoints: All the intermediate results produced during the pipeline execution are automatically tracked. This allows you to resume a pipeline from the last successfully executed step, regardless of why it stopped.

Scripting language

Nextflow scripts are written using a language intended to simplify the writing of workflows. Languages written for a specific field are called Domain Specific Languages (DSL), e.g., SQL is used to work with databases, and AWK is designed for text processing.

In practical terms, the Nextflow scripting language is an extension of the Groovy programming language, which in turn is a superset of the Java programming language. Groovy simplifies the writing of code and is more approachable than Java. Groovy semantics (syntax, control structures, etc.) are described in the Groovy documentation.

The approach of having a simple DSL built on top of a more powerful general purpose programming language makes Nextflow very flexible. The Nextflow syntax can handle most workflow use cases with ease, and then Groovy can be used to handle corner cases which may be difficult to implement using the DSL.
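A small sketch of how the two layers combine, assuming two made-up file names: the channel factory and the map and view operators belong to the Nextflow DSL, while the code inside the curly braces is ordinary Groovy.

```nextflow
nextflow.enable.dsl=2

workflow {
    Channel
        // Nextflow: create a channel from two illustrative file names
        .of('ref1_1.fq.gz', 'ref1_2.fq.gz')
        // Groovy: split each name on '.' and keep the first piece
        .map { name -> name.tokenize('.')[0] }
        // Nextflow operator with a Groovy interpolated string
        .view { id -> "sample id: ${id}" }
}
```

Run with nextflow run, this should print one line per element, sample id: ref1_1 and sample id: ref1_2. The string handling is plain Groovy, so anything the DSL does not offer directly can usually be written this way.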

DSL1 syntax

Earlier versions of Nextflow (< 20.07.1) used a different syntax, known as DSL1. The current default syntax of Nextflow is DSL2.

If you encounter a code snippet in a Nextflow pipeline similar to the one below, it’s because in the past, when the default version was DSL1, it was necessary to explicitly state if DSL2 was being used.

nextflow.enable.dsl=2

Processes, channels, and workflows

Nextflow workflows have three main parts: processes, channels, and workflows. Processes describe a task to be run. A process script can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.). Processes spawn a task for each complete input set. Each task is executed independently and cannot interact with another task. The only way data can be passed between process tasks is via asynchronous queues, called channels.

Processes define inputs and outputs for a task. Channels are then used to manipulate the flow of data from one process to the next. The interaction between processes, and ultimately the pipeline execution flow itself, is then explicitly defined in a workflow section.

In the following example we have a channel containing three elements, e.g., 3 data files. We have a process that takes the channel as input. Since the channel has three elements, three independent instances (tasks) of that process are run in parallel. Each task generates an output, which is passed to another channel and used as input for the next process.
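The same pattern can be sketched as a short script. This is an illustration under stated assumptions rather than part of the lesson's pipeline: the process names TO_UPPER and REPORT are invented, and the channel carries three plain strings instead of real data files so that the script runs without any input data.

```nextflow
nextflow.enable.dsl=2

// First process: one task is spawned for each element of the input channel
process TO_UPPER {
    input:
    val name

    output:
    stdout

    script:
    """
    printf '${name}' | tr 'a-z' 'A-Z'
    """
}

// Second process: consumes the output channel of the first process
process REPORT {
    input:
    val text

    output:
    stdout

    script:
    """
    echo "received: ${text}"
    """
}

workflow {
    // A channel with three elements (stand-ins for three data files)
    names_ch = Channel.of('file1', 'file2', 'file3')

    // Three independent TO_UPPER tasks are run, one per element
    TO_UPPER(names_ch)

    // Each TO_UPPER result becomes the input of one REPORT task
    REPORT(TO_UPPER.out)

    REPORT.out.view()
}
```

Because the tasks run independently, the three lines printed by view() may appear in a different order from one run to the next.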

Processes and channels
Nextflow process flow diagram

Workflow execution

While a process defines what command or script has to be executed, the executor determines how that script is actually run in the target system.

If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development, testing, and small-scale workflows, but for large-scale computational pipelines, a high-performance computing (HPC) cluster or cloud platform is often required.

Nextflow Executors

Nextflow provides a separation between the pipeline’s functional logic and the underlying execution platform. This makes it possible to write a pipeline once, and then run it on your computer, compute cluster, or the cloud, without modifying the workflow, by defining the target execution platform in a configuration file.

Nextflow provides out-of-the-box support for major batch schedulers and cloud platforms such as Sun Grid Engine, the SLURM job scheduler, the AWS Batch service, and Kubernetes. A full list can be found in the Nextflow documentation.
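As a hedged sketch of what such a configuration file might look like, the fragment below defines two profiles in a nextflow.config file. The profile names and the queue name are hypothetical and would need to match your own cluster; only the executor names local and slurm are taken from Nextflow itself.

```nextflow
// nextflow.config (illustrative sketch, not part of the lesson files)
profiles {

    // Run every process on the local computer
    standard {
        process.executor = 'local'
    }

    // Submit every process as a SLURM job; 'short' is a made-up queue name
    cluster {
        process.executor = 'slurm'
        process.queue    = 'short'
    }
}
```

With a file like this in place, the same workflow script could be launched with nextflow run wc.nf -profile cluster to target the scheduler, or without the -profile option to run locally, without editing the workflow itself.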

Your first script

We are now going to look at a sample Nextflow script that counts the number of lines in a file.

Open the file wc.nf in the script directory with your favourite text editor.

This is a Nextflow script. It contains:

  1. An optional interpreter directive (“Shebang”) line, specifying the location of the Nextflow interpreter.
  2. nextflow.enable.dsl=2 used to enable DSL2 syntax.
  3. A multi-line Nextflow comment, written using C style block comments, followed by a single line comment.
  4. A pipeline parameter params.input, whose default value is a string holding the relative path to a compressed fastq file.
  5. An unnamed workflow execution block, which is the default workflow to run.
  6. A Nextflow channel used to read in data to the workflow.
  7. A call to the process NUM_LINES.
  8. An operation on the process output, using the channel operator view().
  9. A Nextflow process block named NUM_LINES, which defines what the process does.
  10. An input definition block that assigns the input to the variable read, and declares that it should be interpreted as a file path.
  11. An output definition block that uses the Linux/Unix standard output stream stdout from the script block.
  12. A script block that contains the bash commands `printf '${read} '` to print the name of the read file, and `gunzip -c ${read} | wc -l` to count the number of lines in the gzipped read file.
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

/*  Comments are uninterpreted text included with the script.
    They are useful for describing complex parts of the workflow
    or providing useful information such as workflow usage.

    Usage:
       nextflow run wc.nf --input <input_file>

    Multi-line comments start with a slash asterisk /* and finish with an asterisk slash. */
//  Single line comments start with a double slash // and finish on the same line

/*  Workflow parameters are written as params.<parameter>
    and can be initialised using the `=` operator. */
params.input = "data/yeast/reads/ref1_1.fq.gz"

//  The default workflow
workflow {

    //  Input data is received through channels
    input_ch = Channel.fromPath(params.input)

    /*  The script to execute is called by its process name,
        and input is provided between brackets. */
    NUM_LINES(input_ch)

    /*  Process output is accessed using the `out` channel.
        The channel operator view() is used to print
        process output to the terminal. */
    NUM_LINES.out.view()
}

/*  A Nextflow process block
    Process names are written, by convention, in uppercase.
    This convention is used to enhance workflow readability. */
process NUM_LINES {

    input:
    path read

    output:
    stdout

    script:
    /*  Triple-double-quoted strings (""") may span multiple lines, so the
        script does not need to be split into several pieces or joined with
        concatenation or newline escape characters. Variables such as
        ${read} are interpolated inside them. */
    """
    printf '${read} '
    gunzip -c ${read} | wc -l
    """
}

To run a Nextflow script use the command nextflow run <script_name>.

Run a Nextflow script

Run the script by entering the following command in your terminal:

$ nextflow run wc.nf

Solution

You should see output similar to the text shown below:

N E X T F L O W  ~  version 20.10.0
Launching `wc.nf` [fervent_babbage] - revision: c54a707593
executor >  local (1)
[21/b259be] process > NUM_LINES (1) [100%] 1 of 1 ✔

 ref1_1.fq.gz 58708
  1. The first line shows the Nextflow version number.
  2. The second line shows the run name fervent_babbage (adjective and scientist name) and revision id c54a707593.
  3. The third line tells you the process has been executed locally (executor > local).
  4. The next line shows the process id 21/b259be, the process name, the task index in parentheses, the percentage of tasks completed, and how many instances of the process have been run.
  5. The final line is the output of the view operator.

Process identification

The hexadecimal numbers, like 21/b259be, identify a unique process execution. These numbers are also the prefix of the directory where each process is executed. You can inspect the files produced by changing to the directory $PWD/work and using these numbers to find the process-specific execution path. We will learn exactly how Nextflow uses the work directory to execute processes in the following sections.

Key Points

  • A workflow is a sequence of tasks that process a set of data.

  • A workflow management system (WfMS) is a computational platform that provides an infrastructure for the set-up, execution and monitoring of workflows.

  • Nextflow is a workflow management system that comprises both a runtime environment and a domain specific language (DSL).

  • Nextflow scripts comprise channels for controlling inputs and outputs, and processes for defining workflow tasks.

  • You run a Nextflow script using the nextflow run command.