Tutorial 1: Wrapping samtools sort

Time to complete

20 minutes plus Platform execution time (approximately 5 minutes with smaller files and 75 minutes with larger files).

Objective

In this tutorial, we will:

  • identify the options for the command line tool and determine which ones we want to expose to the users in the wrapped tool
  • create the Docker image containing the tool and the environment it needs to run
  • use the Rabix Composer tool editor to describe the tool and the options we are exposing
  • use Rabix Executor to run and test our tool locally
  • upload our tool to the Platform and run it as a task on the Platform.

This tutorial uses the concepts explained in the introduction to tool wrapping. If you are not familiar with tool wrapping, then we recommend you read that section first.

Prerequisites

Before starting this tutorial, you need to:

Step 1: Identify the required options for the command line

In this tutorial, we’ll wrap samtools sort, one of the tools in the samtools suite. samtools sort takes an input BAM-format file containing short DNA sequence reads and sorts it.

To keep the example simple, we will use the default values for most parameters and options, and aim to build a command line that looks like:


samtools sort -O bam -T tmp_ -o <sorted-bam-file.bam>  <input-bam-file.bam>

The command breaks down as follows:

  • samtools sort: the command and subcommand. This corresponds to the base command.
  • -O bam: the format of the output file. We are hard-coding this rather than allowing the user to specify it when the tool is run, so this is an argument.
  • -T tmp_: the prefix to use for the temporary files. Again, we are hard-coding this rather than allowing the user to specify it when the tool is run, so this is an argument.
  • -o <sorted-bam-file.bam>: the name of the output file to generate. We want to allow the user to specify the name of the output file as an input to the command, so this is an input port. Note that the file that is generated will be an output port.
  • <input-bam-file.bam>: the BAM file to be sorted. We want to allow the user to specify this file as an input to the command, so this is an input port.
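To see how these parts fit together, here is a minimal shell sketch that assembles the same command line from its pieces and prints it instead of running it. The variable names, and the file names sorted.bam and unsorted.bam, are placeholders chosen for illustration only.

```shell
# Sketch only: these variable names are illustrative, not part of samtools or the Platform.
base_command="samtools sort"      # command and subcommand (the base command)
arguments="-O bam -T tmp_"        # hard-coded arguments; the user never sets these
sorted_file_name="sorted.bam"     # input port: the name of the output file
input_bam_file="unsorted.bam"     # input port: the BAM file to be sorted

# Join the pieces in the order they appear on the command line.
command_line="$base_command $arguments -o $sorted_file_name $input_bam_file"

# Print the command rather than running it.
echo "$command_line"
```

Running the sketch prints samtools sort -O bam -T tmp_ -o sorted.bam unsorted.bam, which matches the command line shown above.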

Alternatively, we could use a dynamic expression to derive the output file name from the input file name. For example, we could set the output file name to <input-bam-file>_sorted.bam where <input-bam-file> is the first part of the input file name. In this case, when the tool is run, the user will not need to specify a value for the output filename. So the output filename parameter is no longer an input port (specified when the tool is run) but an argument containing a dynamic expression (either fixed, or derived automatically from other information). We won’t use dynamic expressions in this tutorial, but in the next tutorial, we’ll see how to modify this example to use a dynamic expression.
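The string manipulation behind such a derived name can be sketched in shell. This is only an illustration of the idea: the helper name derive_sorted_name and the example path are hypothetical, and the expression syntax used in the tool editor itself is left to the next tutorial.

```shell
# Hypothetical helper: derive "<input-bam-file>_sorted.bam" from an input path.
derive_sorted_name() {
    input_name=$(basename "$1")                     # drop any leading directories
    printf '%s_sorted.bam\n' "${input_name%.bam}"   # strip a trailing .bam, then add the suffix
}

derive_sorted_name /data/sample1.bam
```

For the example path, the helper prints sample1_sorted.bam.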

Step 2: Create the Docker image

Note: A Docker image in the image repository can be accessed by anyone who knows the path and name. So you should avoid including any sensitive data in the image.

Open a terminal window and enter


docker run -ti ubuntu

to create a Docker container from the ubuntu base image. Here, we are using a minimal ubuntu base image as that is suitable for samtools, but you can start with any image that is suitable for the tools you want to use.

The terminal prompt changes to root@<containerid> where <containerid> is a unique id for the Docker container you are creating. Make a note of <containerid> as you will need it shortly.

Load the container with the tools you need. In this case, you need to enter the following commands to download and build samtools.


# Update the package index inside the container
apt-get update
# Install the tools we need to download and compile samtools
apt-get install wget build-essential zlib1g-dev libncurses5-dev
# Download the samtools source code (version 1.2, or a later version if you prefer)
wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2
# Unpack the archive
tar jxf samtools-1.2.tar.bz2
# Go into the directory containing the unpacked Samtools source code
cd samtools-1.2
# Compile the code
make
# Install the resulting binaries
make install

Test that the samtools executable has been built and installed successfully. Enter


samtools --version

and verify that the version information for samtools is displayed.

Exit the container (remember to make a note of the container id).


exit

Step 3: Save the Docker image

To save the container as an image in the Seven Bridges image repository, first log in to the repository. In the terminal window, enter


docker login images.sbgenomics.com

When prompted for a username, enter your Platform username. When prompted for a password, enter your Platform authentication token, not your Platform login password.

You will see a message saying the login has succeeded, then you will be returned to the terminal prompt. Note that this login times out after a while, so if you don’t access the Seven Bridges image repository promptly, you may need to log in again in order to do so.

Commit the image to the repository as follows:


docker commit <containerid> images.sbgenomics.com/<username>/samtools:v1

where <containerid> is the container id you made a note of above, and <username> is your Platform username, modified if necessary to be all in lowercase, with any hyphens or full stops replaced by underscores. In this example, we have called the image samtools, and have tagged it as v1. If the commit is successful, you will see a message similar to this:


sha256:4dcd3c6911776ba0417e322dd40d0d4881e1806f9b3027516888798b21b8203f

Push the image to the image registry:


docker push images.sbgenomics.com/<username>/samtools:v1

where <username> is your Platform username, modified if necessary, as above.

If the push is successful, you will see several messages, ending with a message similar to this:


v1: digest: sha256:64a47b5dcdb95a4b6184e880365694b40e1cd85e4151074a11ba1f37c8b56f1f size: 1570
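The username rule used in the commit and push commands (all lowercase, with hyphens and full stops replaced by underscores) can be applied mechanically. The following is a minimal sketch, assuming a made-up username Jane.Doe-Lab; the helper name normalize_username is hypothetical and not part of Docker or the Platform.

```shell
# Hypothetical helper: lowercase a username and replace hyphens and full stops with underscores.
normalize_username() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[-.]/_/g'
}

# "Jane.Doe-Lab" is an invented example username.
username=$(normalize_username "Jane.Doe-Lab")

# Build the image reference in the form used by docker commit and docker push above.
image="images.sbgenomics.com/${username}/samtools:v1"
echo "$image"
```

For the example username, this prints images.sbgenomics.com/jane_doe_lab/samtools:v1.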

If you want to know more about Docker commands, you will find a list of common Docker commands here.

Step 5: Create the new tool

In Rabix Composer, click +, then select Create a command line tool. Enter the following:

  • Set App Name to samtools-sort
  • Set CWL version to sbg:draft-2
  • Set App Type to Command line tool
  • Set Destination Folder to Choose a local folder, and select a folder in your local workspace. Leave the default Save As name as samtools-sort.cwl. Click Done.

Click Create to create the tool. The tool editor opens.

Step 6: Specify the Docker image

In the Docker Image section of the tool editor, set Docker repository to images.sbgenomics.com/<username>/samtools:v1 where <username> is your Platform username, all in lowercase, and with any hyphens or full stops replaced by underscores.

Step 7: Specify the base command

In the Base Command section of the tool editor, enter samtools sort.

Click Preview at the bottom right to open a preview pane showing a preview of the command we are building up. You should see samtools sort in the preview pane.

Step 8: Specify the arguments

We need to specify the output file format as a fixed argument (the string is -O bam).

In the Arguments section of the tool editor, click Add an Argument. An argument is added and the object inspector opens on the right hand side showing the properties of the argument.

In the object inspector:

  • Leave CommandLineBinding selected
  • Set Prefix to -O
  • Set Expression to bam
  • Leave Separate Value and Prefix selected (the syntax requires a space between the prefix and the expression)
  • Leave Position set to 0 (as long as this argument is after the base command at the beginning of the command line and before the input file at the end of the command line, the actual position relative to the other items on the command line doesn’t matter).

In the preview pane you should see samtools sort -O bam.

We also need to specify the temporary file prefix as a fixed argument (the string is -T tmp_).

In the Arguments section of the tool editor, click Add an Argument. Then, in the object inspector:

  • Leave CommandLineBinding selected.
  • Set Prefix to -T
  • Set Expression to tmp_
  • Leave Separate Value and Prefix selected (the syntax requires a space between the prefix and the expression)
  • Leave Position set to 0 (as long as this argument is after the base command at the beginning of the command line and before the input file at the end of the command line, the actual position relative to the other items on the command line doesn’t matter).

In the preview pane you should see samtools sort -O bam -T tmp_.

Step 9: Specify the input ports

We need to specify the name of the output file as an input port (the string is -o <sorted-bam-file.bam>).

In the Input ports section of the tool editor, click Add an Input. An input port is added, with a default name of input, and the object inspector opens on the right hand side showing the properties of the input.

In the object inspector:

  • Select Required. This will be a mandatory input.
  • Set Id to sorted_file_name.
  • Set Type to String.
  • Leave Allow array as well as single item unselected.
  • Select Include in Command Line.
  • Leave Value blank. This is where we could insert a dynamic expression to derive the name as a function of the input file name if we wanted to. Because we have left this blank, the user of the tool will be asked to specify a value when the tool executes.
  • Set Position to 0 (as long as this part of the command is after the base command at the beginning of the command line and before the input file at the end of the command line, the actual position relative to the other items on the command line doesn’t matter).
  • Set Prefix to -o.
  • Leave Separate Value and Prefix selected.
  • Expand the Description drop-down, and set Label to Sorted file name. Add a Description if you like. When the tool is placed in a workflow, the Label is displayed against the input port (if not supplied, the Id is used instead).

In the preview pane, you should see samtools sort -O bam -T tmp_ -o sorted_file_name.

  • To specify a dummy value instead of the placeholder value, sorted_file_name, click the Test tab, and in the Uncategorised section, for sorted_file_name enter sorted.bam.
  • In the preview pane, you should now see samtools sort -O bam -T tmp_ -o sorted.bam. Click Visual to go back to the tool editor.

We also need to specify the input file as an input port. In the Input ports section of the tool editor, click Add an Input.

In the object inspector:

  • Select Required.
  • Set Id to input_bam_file.
  • Set Type to File.
  • Leave Value blank.
  • Set Position to 1 (this must appear in the command line after the other arguments and inputs).
  • Leave Prefix blank.
  • Leave Separate Value and Prefix selected.
  • Expand the Description drop-down, and set Label to Input BAM file. Add a Description if you like. Set File types to BAM (only valid for CWL v1.0 workflows). When the tool is placed in a workflow, the Label is displayed against the input port (if not supplied, the Id is used instead). For CWL v1.0 tools only, File types allows the workflow editor to check that output nodes are connected to input nodes of the correct type.

In the preview pane, you should see samtools sort -O bam -T tmp_ -o sorted.bam /path/to/input.ext. Click Visual to go back to the tool editor.

  • To specify a dummy value instead of the placeholder value, /path/to/input.ext, click the Test tab, and in the Uncategorised section, for input_bam_file enter unsorted.bam.

In the preview pane, you should now see samtools sort -O bam -T tmp_ -o sorted.bam unsorted.bam, which is the command we wanted to generate. Click Visual to go back to the tool editor.
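The effect of the Position values can be pictured with a small shell sketch: each part of the command carries its position, the parts are ordered by that number, and the result is appended to the base command. This is a simplified model, not the tool editor's actual implementation. The helper name order_by_position is hypothetical, and the assumption that parts with equal positions keep their listed order is ours; as noted above, the relative order of the position-0 parts doesn't matter for this command.

```shell
# Hypothetical helper: read "<position> <text>" lines on stdin, order them by
# position, and print them joined into a single line.
# Assumption: ties keep their input order (stable sort via -s).
order_by_position() {
    sort -s -n -k1,1 | cut -d' ' -f2- | paste -s -d' ' -
}

# The input file is deliberately listed first to show that position 1 moves it to the end.
ordered=$(printf '%s\n' \
    '1 unsorted.bam' \
    '0 -O bam' \
    '0 -T tmp_' \
    '0 -o sorted.bam' | order_by_position)

echo "samtools sort $ordered"
```

The sketch prints samtools sort -O bam -T tmp_ -o sorted.bam unsorted.bam: the input file lands last because it is the only part with position 1.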

Step 10: Specify the output port

Now we need to specify the output file as an output port. Note that we have already set the name of the output file as an input. But we also need to specify an output port for the file in order to retrieve the output. In the Output ports section of the tool editor, click Add an Output. An output port is added and the object inspector opens on the right hand side showing the properties of the output.

In the object inspector:

  • Select Required.
  • Set Id to sorted_bam_file.
  • Set Type to File.
  • Set Glob to *.bam. This means that any file that matches this filter will be reported as an output of the tool. We could use a dynamic expression instead to specify only files that match the specified output file name, but this simpler option will be enough for now.
  • Expand the Description drop-down, and set Label to Sorted BAM file. Add a Description if you like. Set File types to BAM (only valid for CWL v1.0 workflows). When the tool is placed in a workflow, the Label is displayed against the output port (if not supplied, the Id is used instead) and, for CWL v1.0 tools only, File types allows the workflow editor to check that output nodes are connected to input nodes of the correct type.
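To get a feel for what the *.bam filter matches, the following shell sketch creates a few empty files in a temporary directory and lists the names that match the pattern. The file names are invented for illustration, and shell globbing is used here only as a stand-in for how the Glob field is evaluated.

```shell
# Create a scratch directory with a few example file names (all invented).
workdir=$(mktemp -d)
touch "$workdir/sorted.bam" "$workdir/sorted.bam.bai" "$workdir/job.log"

# List the names in the directory that match the pattern *.bam.
matches=$(cd "$workdir" && ls -- *.bam)
echo "$matches"

# Clean up the scratch directory.
rm -rf "$workdir"
```

Only sorted.bam is listed. The pattern has to match the whole file name, so sorted.bam.bai and job.log are not picked up, but any other file ending in .bam in the same directory would be.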

Step 11: Save the tool description

Click the Save icon at the top right to save the tool description.

Step 12: Test the tool locally

Testing the tool locally allows you to rapidly check that you have wrapped it correctly and that it generates the results you expect before uploading it to the Platform.

First, we need to obtain a BAM file. We’ll download a paired-end read file from the Platform public files. It’s small, and won’t take too long to download or process. On the Platform, select Data > Public Reference files and enter G26234.HCC1187_1M.aligned.bam in the search box.

Click on the file of this name (not the one ending with .bai) then click Download.

Open a terminal window in the folder where you downloaded Rabix Executor. To run it, we will use the following format of the command


./rabix <app-name> -- {--<input-id> <input-value>} ...

where <app-name> is the name of the tool we are testing, <input-id> is the id of one of the tool input ports, and <input-value> is the value we want to assign to that port. In our case, we have two input ports. We will set input port id input_bam_file to the file we have just downloaded, and input port id sorted_file_name to sorted.bam.

So, in the terminal window enter (on a single line)


./rabix <path-to-tool>/samtools-sort.cwl -- --input_bam_file <path-to-bam-file>/G26234.HCC1187_1M.aligned.bam --sorted_file_name sorted.bam

where <path-to-tool> is the location where you saved samtools-sort.cwl and <path-to-bam-file> is the location where you saved the downloaded BAM file.

When the command completes, you will see


[INFO] Job root has completed

followed by some information about the task, and the location of the output file.
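If you also have samtools installed locally, one way to sanity-check the result is to look at the sort order recorded in the output file's header. The sketch below is a suggestion rather than part of the original workflow. It assumes the header contains an @HD line with an SO tag, and the helper name header_sort_order is hypothetical. The demo feeds it a hand-written header so that it runs without samtools.

```shell
# Hypothetical helper: read SAM header text on stdin and print the value of the
# SO (sort order) tag from the @HD line, if there is one.
header_sort_order() {
    grep '^@HD' | tr '\t' '\n' | sed -n 's/^SO://p'
}

# Demo on a hand-written header (fields in SAM header lines are tab-separated).
printf '@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | header_sort_order
```

The demo prints coordinate. For a real file, the header can be piped in with samtools view -H <path-to-output>/sorted.bam | header_sort_order, where <path-to-output> is the output location reported by Rabix Executor.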

Step 13: Test the tool on the Platform

When you are happy that the tool is working, you can push it to the Platform and run it with real data. Before doing this, ensure you have added at least one Platform project to your workspace.

We are going to use a typical BAM file from the 1000 Genomes project to test the tool on the Platform, so first you need to copy it to your project. On the Platform, select Data > Public Reference files and enter NA12878.ga2.exome.maq.raw.bam in the search box. Note that this is a far bigger file than we used above, and the task will take longer to run (around 75 minutes).

Note: If you prefer, you could use the paired end read file we used to test the tool locally instead, as this will return a result in a few minutes. If so, search for G26234.HCC1187_1M.aligned.bam instead.

Select the file then click Copy to, and specify a destination project of your choice.

In Rabix Composer, open the local samtools-sort app. Click the Push to Platform icon to push it to a Platform project. Specify an App Name of your choice and the Destination Project you copied the test data file to, then click Push. The tool is copied to the Platform and shown below the appropriate project folder in the Rabix Composer navigation pane.

In Rabix Composer, select the Platform copy of the tool and open it in the tool editor. Click the Open on Platform icon to open the app on Platform. From here, click Run to create a task to run the tool. Note that this doesn’t actually run the tool: it just creates a task from which the tool can be run.

You will see warning messages on the Set Input Data tab because no inputs have been specified yet. Click Select Files, and select NA12878.ga2.exome.maq.raw.bam. Click Save.

Go to the Define App Settings tab and for sorted_file_name, enter sorted_bam_file.bam. Click Run.

This analysis will take around 75 minutes to run, and you will receive an email when it completes.

Step 14: View the results

When you receive the notification email, click the link in the email to view the results. You should see that the task was successful and that a single 17.8 GB output file, 1_sorted_bam_file.bam, was created containing the sorted BAM data.
