.. _snakefiles-rules:

=====
Rules
=====

A rule consists of a name (optional; leaving it out creates an anonymous rule), input files, output files, and a shell command that generates the output from the input, e.g.


.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        shell: "somecommand {input} {output}"

Inside the shell command, all local and global variables, especially input and output files can be accessed via their names in the `python format minilanguage <http://docs.python.org/py3k/library/string.html#formatspec>`_. Here input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. ``path/to/inputfile path/to/other/inputfile``).
From Snakemake 3.8.0 on, adding the special formatting instruction ``:q`` (e.g. ``"somecommand {input:q} {output:q}"``) will let Snakemake quote each of the list or tuple elements that contains whitespace.
Instead of a shell command, a rule can run some Python code to generate the output:

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", somename = "path/to/another/outputfile"
        run:
            for f in input:
                ...
                with open(output[0], "w") as out:
                    out.write(...)
            with open(output.somename, "w") as out:
                out.write(...)

As can be seen, instead of accessing input and output as a whole, we can also access by index (``output[0]``) or by keyword (``output.somename``).
Note that, when adding keywords or names for input or output files, their order won't be preserved when accessing them as a whole via e.g. ``{output}`` in a shell command.

Shell commands like the one above can also be invoked inside a Python-based rule, via the function ``shell``, which takes a string with the command and allows the same formatting as in the rule above, e.g.:

.. code-block:: python

    shell("somecommand {output.somename}")

Further, this combination of Python and shell commands makes it possible to iterate over the output of the shell command, e.g.:

.. code-block:: python

    for line in shell("somecommand {output.somename}", iterable=True):
        ... # do something in python

Note that shell commands in Snakemake use the bash shell in `strict mode <http://redsymbol.net/articles/unofficial-bash-strict-mode/>`_ by default.
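
The expansion of ``{input}`` and ``{input:q}`` described earlier in this section can be illustrated in plain Python. The following is only a minimal sketch and not Snakemake's implementation: the class ``FileList`` is a hypothetical stand-in, and ``shlex.quote`` is assumed as the quoting function.

.. code-block:: python

    import shlex


    class FileList(list):
        """Hypothetical stand-in for the objects behind {input} and {output}."""

        def __format__(self, spec):
            if spec == "q":
                # quote every element that needs it (e.g. contains whitespace)
                return " ".join(shlex.quote(str(item)) for item in self)
            # default: plain space-separated list
            return " ".join(str(item) for item in self)


    input_files = FileList(["path/to/inputfile", "path/to/other input"])

    # -> somecommand path/to/inputfile path/to/other input
    plain = "somecommand {input}".format(input=input_files)
    # -> somecommand path/to/inputfile 'path/to/other input'
    quoted = "somecommand {input:q}".format(input=input_files)

Without ``:q``, the file name containing a space would be seen by the shell as two separate arguments.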

.. _snakefiles-wildcards:

Wildcards
---------

Usually, it is useful to generalize a rule so that it can be applied to a number of datasets, for example. For this purpose, wildcards can be used.
Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems.
Consider the following example.

.. code-block:: python

    rule complex_conversion:
        input:
            "{dataset}/inputfile"
        output:
            "{dataset}/file.{group}.txt"
        shell:
            "somecommand --group {wildcards.group} < {input} > {output}"

Here, we define two wildcards, ``dataset`` and ``group``. With these, the rule can produce all files that follow the regular expression pattern ``.+/file\..+\.txt``, i.e. the wildcards are replaced by the regular expression ``.+``. If the rule's output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable ``wildcards``, which is also used in the shell command here. The ``wildcards`` object can be accessed in the same way as ``input`` and ``output``, as described above.

For example, if another rule in the workflow requires the file ``101/file.A.txt``, Snakemake recognizes that this rule is able to produce it by setting ``dataset=101`` and ``group=A``.
Thus, it requests file ``101/inputfile`` as input and executes the command ``somecommand --group A  < 101/inputfile  > 101/file.A.txt``.
Of course, the input file might have to be generated by another rule with different wildcards.
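
The resolution described above can be approximated with Python's ``re`` module. This is a minimal sketch under the assumption that every wildcard is replaced by ``.+``, as stated above; the helper names ``pattern_to_regex`` and ``match_output`` are hypothetical and not part of Snakemake.

.. code-block:: python

    import re


    def pattern_to_regex(pattern):
        """Turn "{name}" placeholders into named groups matching ".+".

        Sketch only: each wildcard name may occur once per pattern.
        """
        regex = ""
        last = 0
        for m in re.finditer(r"\{(\w+)\}", pattern):
            regex += re.escape(pattern[last:m.start()])
            regex += "(?P<{}>.+)".format(m.group(1))
            last = m.end()
        regex += re.escape(pattern[last:])
        return re.compile(regex + "$")


    def match_output(output_pattern, requested_file):
        """Return the wildcard values if the pattern can produce the file."""
        m = pattern_to_regex(output_pattern).match(requested_file)
        return m.groupdict() if m else None


    wildcards = match_output("{dataset}/file.{group}.txt", "101/file.A.txt")
    # propagate the matched values to the input pattern
    input_file = "{dataset}/inputfile".format(**wildcards)

Here ``wildcards`` holds ``dataset="101"`` and ``group="A"``, so ``input_file`` becomes ``101/inputfile``. A file that does not fit the pattern yields ``None``.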

Importantly, the wildcard names in input and output must be named identically. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not the input section.


Multiple wildcards in one filename can cause ambiguity.
Consider the pattern ``{dataset}.{group}.txt`` and assume that a file ``101.B.normal.txt`` is available.
It is not clear whether ``dataset=101.B`` and ``group=normal`` or ``dataset=101`` and ``group=B.normal`` in this case.

Hence wildcards can be constrained to given regular expressions.
Here we could restrict the wildcard ``dataset`` to consist of digits only using ``\d+`` as the corresponding regular expression.
With Snakemake 3.8.0, there are three ways to constrain wildcards.
First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:

.. code-block:: python

    output: "{dataset,\d+}.{group}.txt"

Second, a wildcard can be constrained within the rule via the keyword ``wildcard_constraints``:

.. code-block:: python

    rule complex_conversion:
        input:
            "{dataset}/inputfile"
        output:
            "{dataset}/file.{group}.txt"
        wildcard_constraints:
            dataset="\d+"
        shell:
            "somecommand --group {wildcards.group}  < {input}  > {output}"

Finally, you can also define global wildcard constraints that apply for all rules:

.. code-block:: python

    wildcard_constraints:
        dataset="\d+"

    rule a:
        ...

    rule b:
        ...

See the `Python documentation on regular expressions <http://docs.python.org/py3k/library/re.html>`_ for detailed information on regular expression syntax.
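
The effect of a constraint on the ambiguous file name from above can be reproduced with a small regular expression sketch. The helper ``build_regex`` is hypothetical and only mimics the idea. How the unconstrained case is resolved here is a property of Python's greedy matching, not a statement about Snakemake's behavior.

.. code-block:: python

    import re


    def build_regex(pattern, constraints=None):
        """Compile a file pattern; unconstrained wildcards match ".+".

        Sketch only: inline constraints must not contain braces.
        """
        constraints = constraints or {}
        regex, last = "", 0
        # placeholders have the form "{name}" or "{name,regex}"
        for m in re.finditer(r"\{(\w+)(?:,([^{}]+))?\}", pattern):
            name, inline = m.group(1), m.group(2)
            expr = inline or constraints.get(name, ".+")
            regex += re.escape(pattern[last:m.start()])
            regex += "(?P<{}>{})".format(name, expr)
            last = m.end()
        return re.compile(regex + re.escape(pattern[last:]) + "$")


    filename = "101.B.normal.txt"

    # no constraint: greedy matching yields dataset="101.B", group="normal"
    unconstrained = build_regex("{dataset}.{group}.txt").match(filename).groupdict()
    # constraint inside the pattern: dataset="101", group="B.normal"
    inline = build_regex(r"{dataset,\d+}.{group}.txt").match(filename).groupdict()
    # the same constraint given separately, like wildcard_constraints
    by_keyword = build_regex(
        "{dataset}.{group}.txt", {"dataset": r"\d+"}
    ).match(filename).groupdict()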


.. _snakefiles-targets:

Targets
-------

By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:

.. code-block:: python

    rule all:
      input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]


Here, for each dataset in a python list ``DATASETS`` defined before, the file ``{dataset}/file.A.txt`` is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule ``complex_conversion`` shown above.

The above expression can be simplified to the following:

.. code-block:: python

    rule all:
      input: expand("{dataset}/file.A.txt", dataset=DATASETS)


This may be used for "aggregation" rules for which files from multiple or all datasets are needed to produce a specific output (say, *allSamplesSummary.pdf*).
Note that *dataset* is NOT a wildcard here because it is resolved by Snakemake due to the ``expand`` statement (see below also for more information).



The ``expand`` function also allows combining different variables, e.g.

.. code-block:: python

    rule all:
      input: expand("{dataset}/file.A.{ext}", dataset=DATASETS, ext=PLOTFORMATS)

If ``PLOTFORMATS=["pdf", "png"]`` contains the desired output formats, ``expand`` automatically combines every dataset with each of these extensions.

Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.

.. code-block:: python

    expand(["{dataset}/plot1.{ext}", "{dataset}/plot2.{ext}"], dataset=DATASETS, ext=PLOTFORMATS)

leads to

.. code-block:: python

    ["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.pdf", "ds2/plot2.pdf", "ds1/plot1.png", "ds1/plot2.png", "ds2/plot1.png", "ds2/plot2.png"]

By default, ``expand`` uses the Python itertools function ``product``, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. ``zip``:

.. code-block:: python

    expand("{dataset}/plot1.{ext} {dataset}/plot2.{ext}".split(), zip, dataset=DATASETS, ext=PLOTFORMATS)

leads to

.. code-block:: python

    ["ds1/plot1.pdf", "ds1/plot2.pdf", "ds2/plot1.png", "ds2/plot2.png"]

You can also mask a wildcard expression in expand such that it will be kept, e.g.

.. code-block:: python

    expand("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)

will create strings with all values for ext but starting with ``"{dataset}"``.
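
To make the combinatorics explicit, the following sketch re-implements the idea of ``expand`` in plain Python. It is not Snakemake's code: the function ``expand_sketch`` is hypothetical, and the order of the returned strings may differ from what Snakemake produces.

.. code-block:: python

    from itertools import product


    def expand_sketch(patterns, combinator=product, **wildcards):
        """Fill every pattern with every combination of the given values."""
        if isinstance(patterns, str):
            patterns = [patterns]
        names = list(wildcards)
        results = []
        for pattern in patterns:
            for values in combinator(*(wildcards[name] for name in names)):
                # "{{name}}" survives str.format as the literal "{name}"
                results.append(pattern.format(**dict(zip(names, values))))
        return results


    DATASETS = ["ds1", "ds2"]
    PLOTFORMATS = ["pdf", "png"]

    # every dataset with every extension (4 strings)
    all_combinations = expand_sketch(
        "{dataset}/plot1.{ext}", dataset=DATASETS, ext=PLOTFORMATS
    )
    # first with first, second with second (2 strings)
    pairwise = expand_sketch(
        "{dataset}/plot1.{ext}", zip, dataset=DATASETS, ext=PLOTFORMATS
    )
    # the masked wildcard is kept for later resolution
    masked = expand_sketch("{{dataset}}/plot1.{ext}", ext=PLOTFORMATS)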


.. _snakefiles-threads:

Threads
-------

Further, a rule can be given a number of threads to use, i.e.

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        threads: 8
        shell: "somecommand --threads {threads} {input} {output}"

Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built in variable ``threads`` rather than hardcoding it into the shell command.
In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. ``threads = min(threads, cores)`` with ``cores`` being the number of cores specified at the command line (option ``--cores``). On a cluster node, Snakemake uses as many cores as available on that node. Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by ``--local-cores``, which only applies to jobs running on the master node.

Starting from version 3.7, threads can also be a callable that returns an ``int`` value. The signature of the callable should be ``callable(wildcards[, input])`` (``input`` is an optional parameter). It is also possible to refer to a predefined variable (e.g., ``threads: threads_max``) so that the number of cores for a set of rules can be changed in one place by altering the value of the variable ``threads_max``.
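
The adjustment ``threads = min(threads, cores)`` and the use of a callable can be sketched as follows. The function ``effective_threads`` and the wildcard name ``dataset`` are assumptions made for illustration, and the ``wildcards`` object is simulated with ``SimpleNamespace``.

.. code-block:: python

    from types import SimpleNamespace


    def effective_threads(threads, cores, wildcards=None):
        """Resolve the thread count of a job for a given number of cores."""
        if callable(threads):
            # a callable receives the wildcards of the job
            threads = threads(wildcards)
        # the declared threads are an upper bound
        return min(threads, cores)


    threads_max = 8  # hypothetical value shared by several rules

    fixed = effective_threads(threads_max, cores=4)    # limited by --cores
    enough = effective_threads(threads_max, cores=16)  # declared maximum
    dynamic = effective_threads(
        lambda wildcards: 2 if wildcards.dataset == "small" else 8,
        cores=16,
        wildcards=SimpleNamespace(dataset="small"),
    )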


.. _snakefiles-resources:

Resources
---------

In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:

.. code-block:: python

    rule:
        input:     ...
        output:    ...
        resources:
            mem_mb=100
        shell:
            "..."

If limits for the resources are given via the command line, e.g.

.. code-block:: console

    $ snakemake --resources mem_mb=100

the scheduler will ensure that the given resources are not exceeded by running jobs.
If no limits are given, the resources are ignored.
Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs), this allows you to control scheduling in various ways, e.g. to limit IO-heavy jobs by assigning an artificial IO-resource to them and limiting it via the ``--resources`` flag.
Resources must be ``int`` values.
Note that you are free to choose any names for the given resources.
When defining memory constraints, it is however advised to use ``mem_mb``, because there are
Snakemake execution modes that make use of this information (e.g., when using :ref:`kubernetes`).

Resources can also be callables that return ``int`` values.
The signature of the callable has to be ``callable(wildcards [, input] [, threads] [, attempt])`` (``input``, ``threads``, and ``attempt`` are optional parameters).

The parameter ``attempt`` allows you to adjust resources based on how often the job has been restarted (see :ref:`all_options`, option ``--restart-times``).
This is handy when executing a Snakemake workflow in a cluster environment, where jobs can fail, for example, because of too limited resources.
When Snakemake is executed with ``--restart-times 3``, it will try to restart a failed job 3 times before it gives up.
The parameter ``attempt`` then contains the current attempt number (starting from ``1``).
This can be used to adjust the required memory as follows:

.. code-block:: python

    rule:
        input:    ...
        output:   ...
        resources:
            mem_mb=lambda wildcards, attempt: attempt * 100
        shell:
            "..."

Here, the first attempt will require 100 MB memory, the second attempt will require 200 MB memory and so on.
When passing memory requirements to the cluster engine, this lets you automatically try out larger nodes if it turns out to be necessary.
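
How a resource callable with optional parameters might be evaluated can be sketched with ``inspect.signature``. This is an illustration under assumptions, not Snakemake's implementation: ``evaluate_resource`` is a hypothetical helper that passes only those optional parameters the callable declares.

.. code-block:: python

    import inspect


    def evaluate_resource(value, wildcards, input_files=None, threads=1, attempt=1):
        """Evaluate a resource that is either an int or a callable."""
        if not callable(value):
            return value
        available = {"input": input_files, "threads": threads, "attempt": attempt}
        # the first parameter is always wildcards; the rest is optional
        optional = list(inspect.signature(value).parameters)[1:]
        return value(wildcards, **{name: available[name] for name in optional})


    mem_mb = lambda wildcards, attempt: attempt * 100

    # memory requested in the first, second and third attempt
    requested = [evaluate_resource(mem_mb, wildcards={}, attempt=n) for n in (1, 2, 3)]

With the definition of ``mem_mb`` from the rule above, ``requested`` is ``[100, 200, 300]``.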

Messages
--------

When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:


.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        threads: 8
        message: "Executing somecommand with {threads} threads on the following files {input}."
        shell: "somecommand --threads {threads} {input} {output}"

Note that access to wildcards is also possible via the variable ``wildcards`` (e.g., ``{wildcards.sample}``), in the same way as in shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.

Priorities
----------

Snakemake allows rules to specify numeric priorities:


.. code-block:: python

    rule:
      input: ...
      output: ...
      priority: 50
      shell: ...

By default, each rule has a priority of 0. Any rule that specifies a higher priority will be preferred by the scheduler over all rules that are ready to execute at the same time and have a lower priority.

Furthermore, the ``--prioritize`` or ``-P`` command line flag allows you to specify files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible.
The ``--dryrun`` or ``-n`` option allows you to see the scheduling plan including the assigned priorities.



Log-Files
---------

Each rule can specify a log file where information about the execution is written to:

.. code-block:: python

    rule abc:
        input: "input.txt"
        output: "output.txt"
        log: "logs/abc.log"
        shell: "somecommand --log {log} {input} {output}"

The variable ``log`` can be used inside a shell command to tell the used tool to which file to write the logging information. Of course the log file can use the same wildcards as input and output files, e.g.

.. code-block:: python

    log: "logs/abc.{dataset}.log"


For programs that do not have an explicit ``log`` parameter, you can use ``2> {log}`` to redirect standard error to a file (here, the ``log`` file) on Linux-based systems.
Note that it is also possible to specify multiple (named) log files:

.. code-block:: python

    rule abc:
        input: "input.txt"
        output: "output.txt"
        log: log1="logs/abc.log", log2="logs/xyz.log"
        shell: "somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}"




Non-file parameters for rules
-----------------------------

Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the ``params`` keyword for this purpose:


.. code-block:: python

    rule:
        input:
            ...
        params:
            prefix="somedir/{sample}"
        output:
            "somedir/{sample}.csv"
        shell:
            "somecommand -o {params.prefix}"

The ``params`` keyword allows you to specify additional parameters depending on the wildcards values. This allows you to circumvent the need to use ``run:`` and python code for non-standard commands like in the above case.
Here, the command ``somecommand`` expects the prefix of the output file instead of the actual one. The ``params`` keyword helps here since you cannot simply add the prefix as an output file (as the file won't be created, Snakemake would throw an error after execution of the rule).

Furthermore, for enhanced readability and clarity, the ``params`` section is also an excellent place to name and assign parameters and variables for your subsequent command.


Similar to ``input``, ``params`` can take functions as well (see :ref:`snakefiles-input_functions`), e.g. you can write

.. code-block:: python

    rule:
        input:
            ...
        params:
            prefix=lambda wildcards, output: output[0][:-4]
        output:
            "somedir/{sample}.csv"
        shell:
            "somecommand -o {params.prefix}"

to get the same effect as above. Note that in contrast to the ``input`` directive, the
``params`` directive can optionally take more arguments than only ``wildcards``, namely ``input``, ``output``, ``threads``, and ``resources``.
From the Python perspective, they can be seen as optional keyword arguments without a default value.
Their order does not matter, apart from the fact that ``wildcards`` has to be the first argument.
In the example above, this allows you to derive the prefix name from the output file.
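
That both variants of ``params`` lead to the same prefix can be checked outside of Snakemake. In this small sketch the ``wildcards`` object is simulated with ``SimpleNamespace`` and the sample name ``A`` is an arbitrary assumption.

.. code-block:: python

    from types import SimpleNamespace

    wildcards = SimpleNamespace(sample="A")
    # the output file after the wildcard has been filled in
    output = ["somedir/{sample}.csv".format(sample=wildcards.sample)]

    # first variant: prefix given as a pattern containing the wildcard
    prefix_from_pattern = "somedir/{sample}".format(sample=wildcards.sample)
    # second variant: prefix derived from the output file (".csv" stripped)
    prefix_from_function = (lambda wildcards, output: output[0][:-4])(wildcards, output)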


.. _snakefiles-external_scripts:

External scripts
----------------

A rule can also point to an external script instead of a shell command or inline Python code, e.g.

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        script:
            "path/to/script.py"

The script path is always relative to the Snakefile (in contrast to the input and output file paths, which are relative to the working directory).
Inside the script, you have access to an object ``snakemake`` that provides access to the same objects that are available in the ``run`` and ``shell`` directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use ``snakemake.input[0]`` to access the first input file of above rule.

Apart from Python scripts, this mechanism also allows you to integrate R_ and R Markdown_ scripts with Snakemake, e.g.

.. _R: https://www.r-project.org
.. _Markdown: http://rmarkdown.rstudio.com

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        script:
            "path/to/script.R"

In the R script, an S4 object named ``snakemake``, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with ``snakemake@input[[1]]`` (note that the first file does not have index ``0`` here, because R starts counting from ``1``). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. ``snakemake@input[["myfile"]]``.

An example external Python script could look like this:

.. code-block:: python

    def do_something(data_path, out_path, threads, myparam):
        ...  # python code

    do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])

You can use the Python debugger from within the script if you invoke Snakemake with ``--debug``.
An equivalent script written in R would look like this:

.. code-block:: r

    do_something <- function(data_path, out_path, threads, myparam) {
        # R code
    }

    do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])


To debug R scripts, you can save the workspace with ``save.image()``, and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the ``snakemake`` variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.
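
One possible layout for such a Python script is sketched below. The task ``count_lines`` and the entry point ``main`` are hypothetical examples. The check for ``snakemake`` in ``globals()`` assumes that the object is only defined when the script is run via the ``script`` directive.

.. code-block:: python

    from types import SimpleNamespace


    def count_lines(data_path, out_path):
        """Hypothetical task: write the number of lines of data_path to out_path."""
        with open(data_path) as src:
            n = sum(1 for _ in src)
        with open(out_path, "w") as out:
            out.write("{}\n".format(n))
        return n


    def main(smk):
        # smk only needs to offer the attributes that are used here
        return count_lines(smk.input[0], smk.output[0])


    if "snakemake" in globals():
        # run via the script directive: Snakemake provides the object
        main(snakemake)

Outside of Snakemake, the same code can be exercised with a stand-in object, e.g. ``main(SimpleNamespace(input=["in.txt"], output=["out.txt"]))``.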

An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/report.html",
        script:
            "path/to/report.Rmd"

In the R Markdown file you can insert output from an R command, and access variables stored in the S4 object named ``snakemake``:

.. code-block:: R

    ---
    title: "Test Report"
    author:
        - "Your Name"
    date: "`r format(Sys.time(), '%d %B, %Y')`"
    params:
       rmd: "report.Rmd"
    output:
      html_document:
        highlight: tango
        number_sections: no
        theme: default
        toc: yes
        toc_depth: 3
        toc_float:
          collapsed: no
          smooth_scroll: yes
    ---

    ## R Markdown

    This is an R Markdown document.

    Test include from snakemake `r snakemake@input`.

    ## Source
    <a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>

A link to the R Markdown document with the snakemake object can be inserted. For this purpose, a variable called ``rmd`` needs to be added to the ``params`` section in the header of the ``report.Rmd`` file. The generated R Markdown file with snakemake object will be saved in the file specified in this ``rmd`` variable. This file can be embedded into the HTML document using base64 encoding and a link can be inserted as shown in the example above.
Other input and output files can also be embedded in this way to make a portable report. Note that the above method with a data URI only works for small files. An experimental technique for embedding larger files is the JavaScript Blob object_.

.. _object: https://developer.mozilla.org/en-US/docs/Web/API/Blob

Protected and Temporary Files
-----------------------------

A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as ``protected``:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile"
        output:
            protected("path/to/outputfile")
        shell:
            "somecommand {input} {output}"

A protected file will be write-protected after the rule that produces it is completed.

Further, an output file marked as ``temp`` is deleted after all rules that use it as an input are completed:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile"
        output:
            temp("path/to/outputfile")
        shell:
            "somecommand {input} {output}"

Ignoring timestamps
-------------------

For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file.
This behavior can be overridden by marking an input file as ``ancient``.
The timestamp of such files is ignored and always assumed to be older than any of the output files:

.. code-block:: python

    rule NAME:
        input:
            ancient("path/to/inputfile")
        output:
            "path/to/outputfile"
        shell:
            "somecommand {input} {output}"

Here, this means that the file ``path/to/outputfile`` will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future.
Note that any flag that forces re-creation of files still also applies to files marked as ``ancient``.
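
The timestamp rule and the effect of ``ancient`` can be summarized in a few lines of Python. This is a simplified sketch: the function ``needs_rerun`` is hypothetical, timestamps are passed in as plain numbers, and other reasons for re-creating a file are ignored.

.. code-block:: python

    def needs_rerun(output_mtime, input_mtimes, ancient=()):
        """Return True if any non-ancient input is newer than the output."""
        return any(
            mtime > output_mtime
            for path, mtime in input_mtimes.items()
            if path not in ancient
        )


    # the input was modified (200.0) after the output was written (100.0)
    inputs = {"path/to/inputfile": 200.0}

    regular = needs_rerun(100.0, inputs)  # True
    ignored = needs_rerun(100.0, inputs, ancient={"path/to/inputfile"})  # False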

Shadow rules
------------

With shadow rules, each execution of the rule runs in an isolated temporary directory. This "shadow" directory contains symlinks to files and directories in the current workdir. This is useful for running programs that generate lots of unused files which you don't want to clean up manually in your Snakemake workflow. It can also be useful if you want to keep your workdir clean while the program executes, or simplify your workflow by not having to worry about unique filenames for all outputs of all rules.

By setting ``shadow: "shallow"``, the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting ``shadow: "full"`` fully shadows the entire subdirectory structure of the current workdir. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by ``output``.

Shadow directories are stored one per rule execution in ``.snakemake/shadow/``, and are cleared on subsequent snakemake invocations unless the ``--keep-shadow`` command line argument is used.

Typically, you will not need to modify your rule for compatibility with ``shadow``, unless you reference parent directories relative to your workdir in a rule.

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile"
        output: "path/to/outputfile"
        shadow: "shallow"
        shell: "somecommand --other_outputs other.txt {input} {output}"

Flag files
----------

Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by "touching" empty files that denote that a certain task was completed. Snakemake supports this via the ``touch`` flag:

.. code-block:: python

    rule all:
        input: "mytask.done"

    rule mytask:
        output: touch("mytask.done")
        shell: "mycommand ..."

With the ``touch`` flag, Snakemake touches (i.e. creates or updates) the file ``mytask.done`` after ``mycommand`` has finished successfully.


.. _snakefiles-job_properties:

Job Properties
--------------

When executing a workflow on a cluster using the ``--cluster`` parameter (see below), Snakemake creates a job script for each job to execute.
This script is then invoked using the provided cluster submission command (e.g. ``qsub``).
Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters.
As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script.
For convenience, there exists a parser function ``snakemake.utils.read_job_properties`` that can be used to access the properties.
The following shows an example job submission wrapper:

.. code-block:: python

    #!/usr/bin/env python3
    import os
    import sys

    from snakemake.utils import read_job_properties

    jobscript = sys.argv[1]
    job_properties = read_job_properties(jobscript)

    # do something useful with the threads
    threads = job_properties["threads"]

    # access property defined in the cluster configuration file (Snakemake >=3.6.0)
    job_properties["cluster"]["time"]

    os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))


.. _snakefiles-dynamic_files:

Dynamic Files
-------------

Snakemake provides experimental support for dynamic files.
Dynamic files can be used whenever the number of output files of a rule is unknown before the rule has been executed.
This is useful, for example, with certain clustering algorithms:

.. code-block:: python

    rule cluster:
        input: "afile.csv"
        output: dynamic("{clusterid}.cluster.csv")
        run: ...

Now the results of the rule can be used in Snakemake although it does not know how many files will be present before executing the rule ``cluster``, e.g. by:


.. code-block:: python

    rule all:
        input: dynamic("{clusterid}.cluster.plot.pdf")

    rule plot:
        input: "{clusterid}.cluster.csv"
        output: "{clusterid}.cluster.plot.pdf"
        run: ...

Here, Snakemake determines the input files for the rule ``all`` after the rule ``cluster`` has been executed, and then dynamically inserts jobs of the rule ``plot`` into the DAG to create the desired plots.

.. note::

    Note that dynamic file support is still experimental.
    Especially, using more than one wildcard within dynamic files can introduce various problems.
    Before using dynamic files, think about alternative, static solutions, where you know beforehand how many output files your rule will produce.
    In four years and hundreds of workflows, I needed dynamic files only once.


.. _snakefiles-input_functions:

Functions as Input Files
------------------------

Instead of specifying strings or lists of strings as input files, Snakemake can also make use of functions that return a single input file **or** a list of input files:

.. code-block:: python

    def myfunc(wildcards):
        return [... a list of input files depending on given wildcards ...]

    rule:
        input: myfunc
        output: "someoutput.{somewildcard}.txt"
        shell: "..."

The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files.
Note that you can also use `lambda expressions <https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions>`_ instead of full function definitions.
By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.

Note that the function will be executed when the rule is evaluated and before the workflow actually starts to execute. Further note that using a function as input overrides the default mechanism of replacing wildcards with their values inferred from the output files. You have to take care of that yourself with the given wildcards object.

Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have.
In particular, input functions should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.

For a practical example, see the :ref:`tutorial` (:ref:`tutorial-input_functions`).
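
The dictionary-based approach mentioned above could look like the following sketch. The mapping ``SAMPLE_FILES`` and its paths are made up for illustration, and the ``wildcards`` object is simulated with ``SimpleNamespace`` so that the function can be tried out without Snakemake.

.. code-block:: python

    from types import SimpleNamespace

    # hypothetical mapping from a wildcard value to files in different locations
    SAMPLE_FILES = {
        "A": ["/data/run1/A.fastq"],
        "B": ["/archive/2015/B_1.fastq", "/archive/2015/B_2.fastq"],
    }


    def myfunc(wildcards):
        """Return the input files for the sample named by the wildcard."""
        try:
            return SAMPLE_FILES[wildcards.somewildcard]
        except KeyError:
            # fail with a clear message for unexpected wildcard values
            raise ValueError("unknown sample: {}".format(wildcards.somewildcard))


    # differs in number and location from the files of sample "A"
    files_for_b = myfunc(SimpleNamespace(somewildcard="B"))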

.. _snakefiles-unpack:

Input Functions and ``unpack()``
--------------------------------

In some cases, you might want to have your input functions return named input files.
This can be done by having them return ``dict()`` objects with the names as the dict keys and the file names as the dict values and using the ``unpack()`` keyword.

.. code-block:: python

    def myfunc(wildcards):
        return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}

    rule:
        input: unpack(myfunc)
        output: "someoutput.{token}.txt"
        shell: "..."

Note that ``unpack()`` is only necessary for input functions returning ``dict``.
While it also works for ``list``, remember that lists (and nested lists) of strings are automatically flattened.

Also note that if you do not pass in a *function* into the input list but you directly *call a function* then you don't use ``unpack()`` either.
Here, you can simply use Python's double-star (``**``) operator for unpacking the parameters.

Note that as Snakefiles are translated into Python for execution, the same rules as for using the `star and double-star unpacking Python operators <https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists>`_ apply.
These restrictions do not apply when using ``unpack()``.

.. code-block:: python

    def myfunc1():
        return ['foo.txt']

    def myfunc2():
        return {'foo': 'nowildcards.txt'}

    rule:
        input:
            *myfunc1(),
            **myfunc2(),
        output: "..."
        shell: "..."
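
The effect of the two operators can be reproduced in plain Python.
In the sketch below, ``collect`` is a hypothetical stand-in for the ``input`` directive that simply records which arguments arrive positionally and which arrive by keyword.

.. code-block:: python

    def collect(*args, **kwargs):
        """Hypothetical stand-in for ``input``: record positional and named items."""
        return list(args), dict(kwargs)


    def myfunc1():
        return ['foo.txt']


    def myfunc2():
        return {'foo': 'nowildcards.txt'}


    # Positional items must come first; ``collect(**myfunc2(), *myfunc1())``
    # would be a SyntaxError in Python.
    positional, named = collect(*myfunc1(), **myfunc2())
    print(positional)
    print(named)

Following the note above, the same ordering constraint can be expected inside a rule: unnamed files have to precede named ones.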

.. _snakefiles-version_tracking:

Version Tracking
----------------

Rules can specify a version that is tracked by Snakemake together with the output files. When the version changes, Snakemake informs you if you use the flag ``--summary`` or ``--list-version-changes``.
The version can be specified with the ``version`` directive, which takes a string:

.. code-block:: python

    rule:
        input:   ...
        output:  ...
        version: "1.0"
        shell:   ...

The version can of course also be filled with the output of a shell command, e.g.:

.. code-block:: python

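    import subprocess

    # check_output returns bytes; decode and strip to obtain a plain string
    SOMECOMMAND_VERSION = subprocess.check_output("somecommand --version", shell=True).decode().strip()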

    rule:
        version: SOMECOMMAND_VERSION

Alternatively, you might want to use file modification times in case of local scripts:

.. code-block:: python

    import os

    SOMECOMMAND_VERSION = str(os.path.getmtime("path/to/somescript"))

    rule:
        version: SOMECOMMAND_VERSION

A re-run can be automated by invoking Snakemake as follows:

.. code-block:: console

    $ snakemake -R `snakemake --list-version-changes`

With the availability of the ``conda`` directive (see :ref:`integrated_package_management`)
the ``version`` directive has become **obsolete** in favor of defining isolated
software environments that can be automatically deployed via the conda package
manager.


.. _snakefiles-code_tracking:

Code Tracking
-------------

Snakemake tracks the code that was used to create your files.
In combination with ``--summary`` or ``--list-code-changes`` this can be used to see what files may need a re-run because the implementation changed.
A re-run can be automated by invoking Snakemake as follows:

.. code-block:: console

    $ snakemake -R `snakemake --list-code-changes`


.. _snakefiles-job_lifetime_handlers:

Onstart, onsuccess and onerror handlers
---------------------------------------

Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user).
Since Snakemake 3.2.1, this is possible via the ``onsuccess`` and ``onerror`` keywords:

.. code-block:: python

    onsuccess:
        print("Workflow finished, no error")

    onerror:
        print("An error occurred")
        shell("mail -s 'an error occurred' youremail@provider.com < {log}")

The ``onsuccess`` handler is executed if the workflow finished without error. Else, the ``onerror`` handler is executed.
In both handlers, you have access to the variable ``log``, which contains the path to a logfile with the complete Snakemake output.
Snakemake 3.6.0 adds an ``onstart`` handler that will be executed before the workflow starts.
Note that dry-runs do not trigger any of the handlers.
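
Since ``log`` holds the path of a log file, a handler can process it with ordinary Python.
As a sketch, the hypothetical helper ``tail`` below returns the last lines of a text file, which could, for example, be printed from within ``onerror``.
The helper name and the default of ten lines are assumptions made for this example.

.. code-block:: python

    from collections import deque


    def tail(path, n=10):
        """Return the last ``n`` lines of the text file at ``path`` as one string."""
        with open(path) as handle:
            # A deque with maxlen keeps only the most recent n lines.
            return "".join(deque(handle, maxlen=n))

Inside a handler this might be used as ``print(tail(log))``.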


Rule dependencies
-----------------

From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

.. code-block:: python

    rule a:
        input:  "path/to/input"
        output: "path/to/output"
        shell:  ...

    rule b:
        input:  rules.a.output
        output: "path/to/output/of/b"
        shell:  ...

Be aware that referring to rule ``a`` here requires that rule ``a`` is defined above rule ``b`` in the file, since the object has to be known already.
This feature also makes it possible to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:

.. code-block:: python

    rule a:
        input:  "path/to/input"
        output: a = "path/to/output", b = "path/to/output2"
        shell:  ...

    rule b:
        input:  rules.a.output.a
        output: "path/to/output/of/b"
        shell:  ...


.. _snakefiles-ambiguous-rules:

Handling Ambiguous Rules
------------------------

When two rules can produce the same output file, Snakemake cannot decide by default which one to use, and an ``AmbiguousRuleException`` is thrown.
The proposed strategy to deal with such ambiguity is to provide a ``ruleorder`` for the conflicting rules.
Note that ``ruleorder`` is not intended to bring rules into the correct execution order (this is solely guided by the names of the input and output files you use); it only helps Snakemake decide which rule to use when several can create the same output file. For example:

.. code-block:: python

    ruleorder: rule1 > rule2 > rule3

Here, ``rule1`` is preferred over ``rule2`` and ``rule3``, and ``rule2`` is preferred over ``rule3``.
Only if ``rule1`` and ``rule2`` cannot be applied (e.g. due to missing input files) is ``rule3`` used to produce the desired output file.
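
The preference described above can be pictured with a few lines of plain Python.
This is only a conceptual model and not how Snakemake is implemented; the function ``pick_rule`` and the representation of applicable rules as a set of names are assumptions for illustration.

.. code-block:: python

    # Preference order as declared by ``ruleorder: rule1 > rule2 > rule3``.
    RULEORDER = ["rule1", "rule2", "rule3"]


    def pick_rule(applicable):
        """Return the most preferred rule among those that can be applied."""
        for name in RULEORDER:
            if name in applicable:
                return name
        raise LookupError("no applicable rule")


    print(pick_rule({"rule2", "rule3"}))  # rule1 is not applicable here
    print(pick_rule({"rule3"}))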

Alternatively, rule dependencies (see above) can also resolve ambiguities.

Another (quick and dirty) possibility is to tell Snakemake to allow ambiguity via a command line option

.. code-block:: console

    $ snakemake --allow-ambiguity

so that, as in GNU Make, the first matching rule is always used. In this case, a warning that summarizes Snakemake's decision is printed to the terminal.

Local Rules
-----------

When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target rule like ``all``, see :ref:`snakefiles-targets`).
The keyword ``localrules`` allows marking a rule as local, so that it is not submitted to the cluster and is instead executed on the host node:

.. code-block:: python

    localrules: all, foo

    rule all:
        input: ...

    rule foo:
        ...

    rule bar:
        ...

Here, only jobs from the rule ``bar`` will be submitted to the cluster, whereas ``all`` and ``foo`` will be run locally.
Note that you can use the localrules directive **multiple times**. The result will be the union of all declarations.

Benchmark Rules
---------------

Since version 3.1, Snakemake provides support for benchmarking the run times of rules.
This can be used to create complex performance analysis pipelines.
With the ``benchmark`` keyword, a rule can be declared to store a benchmark of its code into the specified location. For example, the rule

.. code-block:: python

    rule benchmark_command:
        input:
            "path/to/input.{sample}.txt"
        output:
            "path/to/output.{sample}.txt"
        benchmark:
            "benchmarks/somecommand/{sample}.txt"
        shell:
            "somecommand {input} {output}"

benchmarks the CPU and wall clock time of the command ``somecommand`` for the given output and input files.
For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark txt file (which will contain a tab-separated table of run times and memory usage in MiB).
By default, Snakemake executes the job once, generating one run time.
With ``snakemake --benchmark-repeats``, this number can be changed to e.g. generate timings for two or three runs.
The resulting txt file can be used as input for other rules, just like any other output file.
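
Because the benchmark file is a tab-separated table, a downstream rule or script can read it with Python's ``csv`` module.
The sketch below makes no assumption about the column names and returns each row as a dict keyed by the header.
The function name ``read_benchmark`` is hypothetical, and the column names ``s`` and ``max_rss`` used in the demonstration are assumptions for the example.

.. code-block:: python

    import csv
    import os
    import tempfile


    def read_benchmark(path):
        """Read a tab-separated benchmark file into a list of dicts."""
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle, delimiter="\t"))


    # Demonstration with a made-up file; the header below is an assumption.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("s\tmax_rss\n1.5\t20.0\n")

    rows = read_benchmark(tmp.name)
    os.remove(tmp.name)
    print(rows[0]["s"])

All values are returned as strings, so numeric columns need to be converted (e.g. with ``float``) before further analysis.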

.. note::

    Benchmarking is only possible in a reliable fashion for subprocesses (that is, for tasks run through the ``shell``, ``script``, and ``wrapper`` directives).
    In the ``run`` block, the variable ``bench_record`` is available, which you can pass to ``shell()`` as ``bench_record=bench_record``.
    When using ``shell(..., bench_record=bench_record)``, the maximum of all measurements over all ``shell()`` calls is recorded, except for the running time, which is that of the whole rule execution including any Python code.
