Bactopia is a flexible, Nextflow-based set of workflows for the analysis of bacterial genomes. The default workflow is based on de novo assembly, but workflows for reference-based analysis with snippy, along with several other auxiliary tools and workflows, are also provided.
Bactopia can be installed with conda from the Bioconda channels by running:
conda create -n bactopia bactopia
This installs the bactopia tool and its supporting scripts. Before running this command, always make sure that your conda installation is correctly configured, including with the Bioconda channels.
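If conda has not yet been set up with the required channels, the standard Bioconda channel configuration can be applied first; this is a one-time setup sketch following the current Bioconda recommendations (channel order and strict priority matter):

```shell
# One-time channel configuration recommended by Bioconda
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```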
Bactopia stores some additional data used by the tool in the bactopia directory, which by default is $HOME/.bactopia. This includes datasets used for species identification and AMR prediction, which can be downloaded using the bactopia datasets command.
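Fetching these datasets before a first run can look like the sketch below; the exact options accepted by the subcommand vary between Bactopia versions, so check bactopia datasets --help for your installation:

```shell
# Download the supporting datasets into the default $HOME/.bactopia directory
bactopia datasets
```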
For running tools, bactopia supports conda as well as both the Docker and Singularity container systems. This support is provided by the Nextflow workflow engine.
The bactopia prepare command scans a directory of FASTQ files and prepares a sample sheet for use by bactopia. For example:
bactopia prepare --path reads/ >sample-sheet.tsv
This command looks for files ending in .fastq.gz and with paired-end identifiers such as _1 or _R1. These patterns can be configured using, for example, the --fastq-ext, --pe1-pattern and --pe2-pattern options. See bactopia prepare --help for more details.
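For example, if your reads use a different extension, the pattern options mentioned above can be adjusted; this is a hypothetical invocation (the file naming and the option value format are assumptions, so confirm against bactopia prepare --help):

```shell
# Hypothetical: reads named like sampleA_R1.fq.gz / sampleA_R2.fq.gz
bactopia prepare --path reads/ --fastq-ext .fq.gz > sample-sheet.tsv
```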
Since Bactopia provides multiple workflows, it is useful to have a common setup script to configure bactopia runs. A bactopia_common.sh script can look like this:
#!/bin/bash
if [[ -z $(which bactopia) ]] ; then
    echo "Please ensure that bactopia is in your PATH" >&2
    exit 1
fi
NXF_SINGULARITY_CACHEDIR=${NXF_SINGULARITY_CACHEDIR:-$HOME/singularity_cache}
export NXF_SINGULARITY_CACHEDIR
if [[ ! -d $NXF_SINGULARITY_CACHEDIR ]] ; then
    mkdir "$NXF_SINGULARITY_CACHEDIR"
fi
MAX_MEM=8
MAX_CPUS=2
MAX_RETRY=1
# Singularity
PROFILE="-profile singularity"
#Docker
#PROFILE="-profile docker"
#Conda
#PROFILE=""
BACTOPIA="bactopia --max_memory $MAX_MEM --max_cpus $MAX_CPUS --max_retry $MAX_RETRY $PROFILE -resume -ansi-log false"
This bactopia_common.sh script should be in the same directory from which you run bactopia. It does a few things:
Check that bactopia is actually in your PATH (i.e. you have run conda activate bactopia or similar).
Set up a directory (referred to by the environment variable NXF_SINGULARITY_CACHEDIR) where Singularity container images are stored (cached). If the user has already set the NXF_SINGULARITY_CACHEDIR variable, that value is used; otherwise $HOME/singularity_cache is used by default.
Set the maximum memory and maximum CPUs that a task run by bactopia can use. This both ensures that the resources of the computer running the workflow aren't exhausted and makes efficient use of resources so that more tasks can run in parallel.
Set the maximum number of times that bactopia will retry running a task. This is set to the conservative value of 1 because tasks that fail often won't run successfully on retries either.
Configure three different profiles, for running with Singularity containers, Docker containers or conda. Uncomment the PROFILE line that corresponds to the system you want to use to manage your software dependencies.
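The ${VAR:-default} expansion used for NXF_SINGULARITY_CACHEDIR in the script is a standard shell idiom worth knowing: if the variable is unset (or empty), the value after :- is substituted, otherwise the variable's own value wins. A minimal demonstration:

```shell
# When the variable is unset, the fallback after ':-' is used
unset NXF_SINGULARITY_CACHEDIR
echo "${NXF_SINGULARITY_CACHEDIR:-$HOME/singularity_cache}"   # prints $HOME/singularity_cache expanded

# When the variable is already set, its value is kept
NXF_SINGULARITY_CACHEDIR=/tmp/my_cache
echo "${NXF_SINGULARITY_CACHEDIR:-$HOME/singularity_cache}"   # prints /tmp/my_cache
```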
You can download this file using:
wget https://gist.githubusercontent.com/pvanheus/c30fe84e8c6671a44f7d25f800840ca6/raw/1d5e8a4bb3d2c8061e38358017f5ee0596d47a9b/bactopia_common.sh
or
curl -O https://gist.githubusercontent.com/pvanheus/c30fe84e8c6671a44f7d25f800840ca6/raw/1d5e8a4bb3d2c8061e38358017f5ee0596d47a9b/bactopia_common.sh
For the purposes of our workflow we'll focus on reference-based construction of a bacterial genome consensus using the Bactopia snippy workflow. To prepare for using this workflow we need to do quality control on our reads, which we can do using the cleanyerreads workflow.
With the sample-sheet.tsv prepared using bactopia prepare and our bactopia_common.sh in place, we can run the workflow with this simple script:
#!/bin/bash
set -e
set -u
. bactopia_common.sh
$BACTOPIA --wf cleanyerreads --samples sample-sheet.tsv --outdir cleaned_reads
This script can be downloaded (using wget or curl) here: https://gist.githubusercontent.com/pvanheus/ac575a1c1adab47c9b44e60eb355650b/raw/5b3649b3db5387de2b2453a7237581d671fd1b2b/run_bactopia_cleanyerreads.sh
This runs the cleanyerreads quality control workflow which, for Illumina data, uses fastp by default to trim adapters and remove low-quality bases. The output of the workflow is placed in the cleaned_reads directory. This output includes:
a directory for each input sample (e.g. SRR8364253). Within each of these directories there is a main directory (because this is a main Bactopia workflow) and a directory for the outputs of each workflow step. The qc subdirectory has a summary subdirectory with outputs from tools such as fastp.
a bactopia-runs folder that contains information on the overall workflow run. Within this directory there will be timestamped directories with names like cleanyerreads-20240313-151308, where the format is WORKFLOWNAME-YYYYMMDD-HHMMSS based on when the workflow run started. Inside this directory is an nf-reports directory that contains HTML-format reports on the steps of the workflow, how long they took and how many resources they consumed.
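Putting the two points above together, the output layout looks roughly like this (the sample name and timestamp are taken from the examples above; the exact nesting may differ between Bactopia versions):

```
cleaned_reads/
├── SRR8364253/
│   └── main/
│       └── qc/
│           └── summary/          # fastp and other per-sample QC summaries
└── bactopia-runs/
    └── cleanyerreads-20240313-151308/
        └── nf-reports/           # HTML reports on steps, runtimes and resources
```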
It is useful to consolidate the QC reports from the cleanyerreads workflow. This can be done using MultiQC, which can be installed with conda create -n multiqc multiqc. To switch between conda environments, deactivate your current environment and activate the new one like this:
conda deactivate
conda activate multiqc
In the same directory where you ran bactopia, run multiqc like this:
multiqc --module fastp --outdir fastp_summary cleaned_reads/SRR*
This will gather the fastp output summaries into a folder called fastp_summary. In your web browser, open the multiqc_report.html file from the fastp_summary directory to view the consolidated report. An example output is at this link.