Task 2: Fetch and analyse data

In the ~/data directory, you will find two files: file_list1.csv and file_list2.csv. Here is a part of file_list1.csv`:

SRR8364252,https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364252_1.fastq.gz
SRR8364252,https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364252_2.fastq.gz
SRR8364253,https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364253_1.fastq.gz
SRR8364253,https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364253_2.fastq.gz

The structure of the file is a list of sample names and then URLs for some gzipped FASTQ files. We would like to fetch all the sequence files listed in file_list1.txt. Before that, we need to focus on shell scripting.

Shell variables

Variables in the shell are either user variables or shell environment variables. Environment variables like HOME and PATH are set in the shell environment and can be used to find out more about the shell environment, for example where the user’s home directory is or what locations the shell searches for programs to run.

User variables are set by the user and either used in their current shell session or by programs that the user runs. Setting a variable looks like this:

MYVAR=value

Note that there is no space between the variable name and the value. The value set this way can be retrieved using $. For example echo $MYVAR. Shell variables can either be UPPERCASE or lowercase, but using UPPERCASE is the convention, so that these can be distinguished from e.g. the names of commands.

Shell variables can be specified using braces e.g. echo ${MYVAR}. These braces clearly distinguish where the shell variable name starts and ends so that you can do things like echo ${MYVAR}_word. Without the braces echo $MYVAR_word would be interpreted as a reference to a variable called MYVAR_word.

The variables that you set are only available in the shell session where you set them unless you use the export command. If you say export MYVAR then MYVAR is available to any commands that you run in your shell. For more on export, read this guide from DigitalOcean.

Advanced shell variable expansion

The GNU Bash manual has a section on shell variable expansion. Here are some advanced ways to use variables:

# print the value of MYVAR but if MYVAR is not defined, print Hello
echo ${MYVAR:-Hello}

MYVAR=/home/user/patrice
# print the part of the variable's value that comes after the first pattern match:
# this will print home/user/patrice
echo ${MYVAR#*/}

# print the part of the variable's value that comes after the last pattern match
# this will print patrice
echo ${MYVAR##*/}

# NOTE: the pattern now changes from /* to */ for the next two examples

# print the part of the variable's value that comes before the last pattern match
# this will print /home/user
echo ${MYVAR%/*}

# print the part of the variable's value that comes before the first pattern match
# this will print a blank line
echo ${MYVAR%%/*}

Shell scripts

A shell script is a collection of commands, typically stored in a file whose name ends in .sh. E.g. fetch.sh. A shell script should have

#!/bin/bash

as its first line. This line is not interpreted by Bash itself (it starts with #, the comment character), but it tells the Linux or Unix system what command to use to interpret the commands in the file.

The are two ways to run a shell script. Either you can tell bash to run the script, e.g.:

bash fetch.sh

Or, if you make the shell script file executable with the chmod a+x command, you can run the script by referring to its name (including the full or relative path), e.g.

# only works after chmod a+x fetch.sh
./fetch.sh
Shell parameter variables

Any parameters that you pass on the command line are passed to the shell script as variables named $1, $2, etc based on the position of the parameter in the command line. The number of parameters used in the command line will be listed in the $# variable. E.g. in a script called test1.sh which contains:

#!/bin/bash

echo "Number of parameters: $#"
echo Param1: $1
echo Param2: $2

If you run ./test1.sh one two three you will see:

Number of parameters: 3
Param1: one
Param2: two

And if you run ./test1.sh one` you will see:

Number of parameters: 1
Param1: one
Param2:

Notice that in the first example, the third parameter is ignored and in the second example the second parameter is blank.

Shell flags

You can set some special flags in a shell script. In test.sh with the contents:

#!/bin/bash

# exit if any errors are encountered
set -e

# print the commands as they are executed (useful for debugging)
set -x

# make it an error to use an undefined variable
set -u

# if any part of a pipe files, make trigger an error

set -o pipefile

The set -e option is especially recommended because it stops a shell script from continuing past the first error that it encounters.

Shell loops

The Bash shell supports several different types of loops. The basics of shell loops are explained in this Software Carpentry lesson. In essence, the loop is a foreach loop. For each element provided it will run once.

for LOOPVAR in one two three ; do
    echo $LOOPVAR
done

will print:

one
two
three

The Bash shell $() syntax provides an effective way to create lists to loop over. E.g. $(cat myfile.txt) creates a list of all the lines in the file and runs the loop once for each line (and for each word in a line if the lines have spaces in them). This is called command substitution and what it does is to replace the $() with the output (stdout) of the command inside the brackets.

For example, using an example from Task 1, $(tail -n+2 cases.csv |cut -d, -f3) would give a list of all the dates in the cases.csv file. When this is used with a for loop, all whitespace (spaces, tabs and newlines) acts to separate items and then the loop runs once for each item in the list. For a more formal description of what command substitution is doing, look in the GNU Bash manual.

Shell conditionals (If statements)

One final component of Bash shell scripts is if statements. The basic structure of a conditional (i.e. if statement) is:

if SOMETEST
then
    # do something
fi

For an introductory tutorial see here and for a more complete reference consult the GNU Bash manual.

An important point about the SOMETEST part listed above is that it is actually run as a command. Most commonly one uses the test command, which is expressed using square brackets ([[ ]]) but any command can be used. Any command that returns an exit status of 0 counts as True, and any other exit status counts as False.

The square bracket tests ([[ ]]) allow conditional expressions and combinations of these conditional expressions. Some of the most common tests are -f to test if a file exists and -d to test if a path is a directory, = to test if two strings are equal and -eq to test if two numbers are equal. There are many others, please refer to the linked tutorials.

Expressions can be combined with logical operators: && for AND, || for OR, ! for NOT (changing True to False or False to True) and brackets ( ) for grouping.

A practical shell script

Here is an example shell script that fetches files using a file like the file_list1.csv mentioned previously:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#!/bin/bash

set -u
set -e

LISTFILE=$1

if [[ ! -f $LISTFILE ]] ; then
        echo "Incorrect filename, $LISTFILE not found" >&2
        exit 1
fi

if [[ ! -d tracking ]] ; then
        mkdir tracking
fi

for URL in $(cut -d, -f2 $LISTFILE); do
        # links look like
        # https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364252_1.fastq.gz
        OUTPUT=${URL##*/}
        if [[ ! -f tracking/$OUTPUT ]] ; then
                wget -c -O $OUTPUT $URL
                # curl -L -O $OUTPUT $URL
                touch tracking/$OUTPUT
        fi
done
  1. Line 1 has the #!/bin/bash, the so-called shebang line, that tells the system how to run this script

  2. Line 3 (set -u) tells the Bash shell that referring to uninitialised values should trigger an error. This will force the user to provide a command line parameter (i.e. the $1 variable)

  3. Line 4 (set -e) tells the Bash shell to exit if there is any error. This means that if e.g. a download fails and wget exits with an error, the script won’t carry on running.

  4. Line 6 sets the LISTFILE variable to the value of $1 and lines 8 through 11 check if the file exists. If the file does not exist (the ! -f means “file does not exist”) then a message is printed to stderr (that’s what the >&2 redirect does) and the program exits with an error (exit 1).

  5. Lines 13 through check for the existence of a directory (a subdirectory of the directory in which this script is run) called tracking and create it if it doesn’t exist. This directory will be used to keep track of which file has been downloaded.

  6. Line 17 sets up the loop. cut -d, -f2 is used to extract the URL part from the LISTFILE and the list of URLs (links) is returned using a $() operator.

  7. Line 20 uses a pattern match to remove everything before the last match of the shell (i.e. glob) pattern */. This leaves only the filename part of the URL. The OUTPUT variable is set to the destination filename.

  8. Line 21 tests to see if there is a file in the tracking directory with the OUTPUT filename. If no such file exists, the download will proceed

  9. Line 22 downloads the file, storing it in the OUTPUT filename. The -c flag of wget will continue a download if it finds a partially downloaded file.

  10. Line 23 is commented out but shows how to do a download using curl. The -L flag to curl ensures that if the server tells curl that the file has moved, curl will download from the new link. Note that this curl command line will not resume partial downloads. The -C - option of curl does support resuming downloads but it has some technical differences from the way wget does resume partial downloads that make it a bit trickier to use.

  11. Line 24 uses touch to make an empty file in the tracking directory with the same name as the OUTPUT filename.

  12. Lines 25 and 26 end the if statement and the for loop.

In summary, this script downloads files as specified in file_list1.csv and uses a tracking system to ensure that already-downloaded files are not downloaded a second time. If it turns out to be necessary to re-download a file, the corresponding tracking file can be deleted from the tracking directory.

Examining the downloaded files

The files are gzip-compressed FASTQ data files from a DNA sequencer. The presumption is that each file is the output of a sequencing run. Let us examine the downloaded files. One simple test is to check the sizes of the files. Are there any files with unusual sizes?

Show answer

Sort the listing of file sizes with ls -l *.fastq.gz |sort -k5nr. The -k5 specifies that sort must use the 5th column for sorting and the n specifies that this must be a numeric sort and finally the r specifies that the sort order must be reversed, with the results in descending, not ascending, order.

Looking at the files sorted this way shows that SRR8364257_1.fastq.gz and SRR8364257_2.fastq.gz are dramatically smaller than the other files. This could be a signal that there is something wrong. Perhaps there were problems with the sequencing run or the sample being sequenced?

Checksums and using them

Checksums are calculations used to verify that a file has not been altered or has been downloaded correctly. The checksums for the downloaded files are available at this link and can be downloaded using e.g. wget https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/checksums.txt. The checksums in this file are SHA256 checksums generated using the sha256sum tool. Another common checksum type is MD5 which can be generated using the md5sum tool.

The checksums.txt file contains lines like:

d4d7dac720ae285317f8d8adba2ba9077a9e5e4f6099bc9aee3f9900c77a518a  SRR8364252_1.fastq.gz
38827e72486e2b1025270c619ea110ead5d80901a1f7f7a7ecc5a5f3f604b69b  SRR8364252_2.fastq.gz

where the first column is the checksum and the second is the filename that the checksum applies to. With this file in hand and the files already downloaded, we can check if the files match their checksums:

sha256sum -c checksums.txt

What does this show? What might be going on?

Show answer

For most of the files, we get the response that the file is OK and matches it’s checksum. For sample SRR8364261_2.fastq.gz we get the message FAILED.

Our download script did not show any errors, and we can verify this by manually re-downloading the file using wget or curl and checking the checksums again. This shows that either there is a problem with the checksum provided or with the file on the remote server. This is a problem that you would raise with the people who are providing the downloadable files and checksums.

Resources

Introduction to Bash Shell Scripting

Advanced Bash Scripting Guide

Introduction to Shell scripting and Intermediate Shell Scripting from Sambit Sinha

What is the difference between the Bash operators [[ vs [ vs ( vs ((?