Pegasus GT-FAR


GT-FAR is an RNA-seq pipeline that performs alignment, quantification, differential expression, and variant calling. The pipeline has been modeled as a Pegasus workflow. Pegasus enables users to execute the pipeline on a wide variety of execution environments, ranging from local clusters and grids to computational clouds.

With funding support from iSeqTools, we have packaged this pipeline along with Pegasus as a cloud-based solution. This Amazon-based solution allows users to start a virtual machine on Amazon EC2, upload their datasets, run the GT-FAR analysis, and make the outputs available in Amazon S3. The interface allows scientists to run and monitor their analyses. It also generates error reports that can be sent to the developers if an analysis fails.

The web interface for the cloud-based solution allows users to:

  • upload custom datasets,
  • start and monitor the progress of their GT-FAR analysis,
  • receive notifications when the analysis is done,
  • make the output datasets available for download from S3.

Instructions for starting the Pegasus GT-FAR Amazon EC2 virtual image can be found here.


What is GT-FAR?


The Genome and Transcriptome-Free Analysis of RNA (GT-FAR) software package is a collection of tools for analyzing RNA reads. GT-FAR differs from most RNA analysis programs because it can perform data quantification with or without a reference genome and gene model. In the simplest example, imagine you have human RNA data from a control group and a treatment group and want to investigate how RNA differences relate to phenotypic differences. In each case, running an RNA file through GT-FAR invokes a sequential workflow with the following steps:

  1. Read quality control and adaptor trimming/removal, where reads are verified as high quality or, when possible, trimmed from either end to remove low-quality positions or adaptor contamination.
  2. Alignment of verified/trimmed reads to a length-specific gene/intergene model created using a reference genome and a GTF file.
  3. Gapped alignment of the reads that did not find a suitable alignment to the gene/intergene model, in order to predict novel genes and splice junctions.
  4. Sequence enumeration and lightweight assembly of the reads that did not find alignments in steps 2 and 3.
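
The routing of reads through these four steps can be sketched in a few lines of Python. This is only an illustration of the control flow, not GT-FAR's implementation: the function names, the toy reads, and the use of exact set membership in place of a real aligner are all assumptions made for the example.

```python
from collections import Counter


def trim_read(read, adaptor, min_len):
    """Step 1 (toy): cut at the first adaptor occurrence; drop reads left too short."""
    cut = read.find(adaptor)
    if cut != -1:
        read = read[:cut]
    return read if len(read) >= min_len else None


def run_stages(reads, adaptor, min_len, gene_model, gapped_hits):
    """Route reads through the four steps; each step only sees what earlier steps left over."""
    # Step 1: quality control and adaptor trimming.
    trimmed = [t for t in (trim_read(r, adaptor, min_len) for r in reads) if t is not None]
    # Step 2: "alignment" to the gene/intergene model (placeholder: exact membership).
    aligned = [r for r in trimmed if r in gene_model]
    leftover = [r for r in trimmed if r not in gene_model]
    # Step 3: "gapped alignment" of the leftovers (placeholder: exact membership).
    gapped = [r for r in leftover if r in gapped_hits]
    unmapped = [r for r in leftover if r not in gapped_hits]
    # Step 4: enumerate the sequences that are still unmapped.
    return {"aligned": aligned, "gapped": gapped, "unmapped_counts": Counter(unmapped)}


# Hypothetical toy data, chosen only to exercise every branch.
result = run_stages(
    reads=["ACGTACGTTTAGGC", "GGGGCCCC", "TTTTAAAA", "TTTTAAAA", "ACAGGC"],
    adaptor="AGGC",
    min_len=6,
    gene_model={"ACGTACGTTT"},
    gapped_hits={"GGGGCCCC"},
)
```

The point of the sketch is the cascade: a read is handled by the first step that can place it, so the unmapped counts in step 4 contain only sequences that steps 2 and 3 could not explain.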

For each read file, multiple levels of quantitative analysis are produced. At the highest level, each sample is summarized by its data quality and its alignment rate to ribosomal RNA, mtRNA, mRNA, preMRNA, and the intergenic portion of the genome. At lower levels, samples are quantified by their expression levels for mRNA gene models, preMRNA gene models, specific splice sites or isoforms (including novel splice sites), and finally by the assembled unmapped sequences in the file. This allows users to discover whether sample and control groups differ largely because of differential expression of known mRNA gene models or because of differences in lesser-known expressed units; for example, the most significant difference could lie in a novel splice site of a little-known gene.

What is Pegasus WMS?

The Pegasus project encompasses a set of technologies that help workflow-based applications execute in a number of different environments, including desktops, campus clusters, grids, and clouds. Pegasus bridges the scientific domain and the execution environment by automatically mapping high-level workflow descriptions onto distributed resources. It automatically locates the input data and computational resources needed for workflow execution. Pegasus enables scientists to construct workflows in abstract terms without worrying about the details of the underlying execution environment or the particulars of the low-level specifications required by the middleware (Condor, Globus, or Amazon EC2). Pegasus also bridges the current cyberinfrastructure by effectively coordinating multiple distributed resources.
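
To make the idea of mapping an abstract workflow concrete, here is a minimal sketch in plain Python. It does not use the Pegasus API; the `plan` function, the job dictionary layout, and the file paths are hypothetical and exist only for this example. Jobs are described purely by the logical files they read and write, and the sketch derives the execution order and the locations of the external inputs from that description.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan(jobs, data_locations):
    """Derive an execution order and resolve external inputs for an abstract workflow."""
    # Which job produces each logical file.
    producers = {out: name for name, job in jobs.items() for out in job["outputs"]}
    # A job depends on the producers of its inputs.
    deps = {
        name: {producers[f] for f in job["inputs"] if f in producers}
        for name, job in jobs.items()
    }
    order = list(TopologicalSorter(deps).static_order())
    # Inputs that no job produces must be located outside the workflow.
    external = {f for job in jobs.values() for f in job["inputs"] if f not in producers}
    missing = sorted(external - data_locations.keys())
    if missing:
        raise KeyError(f"no known location for inputs: {missing}")
    return order, {f: data_locations[f] for f in sorted(external)}


# Hypothetical three-job workflow and file locations.
jobs = {
    "align": {"inputs": ["trimmed.fq", "genome.fa"], "outputs": ["aligned.sam"]},
    "trim": {"inputs": ["reads.fq"], "outputs": ["trimmed.fq"]},
    "quantify": {"inputs": ["aligned.sam"], "outputs": ["counts.tsv"]},
}
locations = {"reads.fq": "/data/reads.fq", "genome.fa": "/refs/genome.fa"}
order, staged = plan(jobs, locations)
```

The workflow author never states an execution order or a physical path inside the job descriptions; both are derived at planning time, which is the separation between abstract workflow and execution environment that the paragraph above describes.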

Pegasus has been used in a number of scientific domains, including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others. When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done. It cleans up storage as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources. Pegasus keeps track of what has been done (provenance), including the locations of data used and produced, and which software was used with which parameters.
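
Two of the recovery strategies listed above, retrying tasks and falling back to a rescue workflow, can be illustrated with a short Python sketch. This is a simplified model under stated assumptions and not how Pegasus is implemented: `run_with_rescue`, its parameters, and the toy jobs are hypothetical names introduced for the example.

```python
def run_with_rescue(order, actions, max_retries=2, done=None):
    """Run jobs in order, retrying failures; return (done, rescue).

    `rescue` lists only the jobs that still remain, so a later call can
    resume from the failure point instead of starting over.
    """
    done = set(done or ())
    for i, name in enumerate(order):
        if name in done:
            continue  # finished in an earlier run
        for _attempt in range(1 + max_retries):
            try:
                actions[name]()
                done.add(name)
                break
            except Exception:
                continue  # retry the same job
        else:
            # All attempts failed: hand back the remaining work.
            return done, [n for n in order[i:] if n not in done]
    return done, []


# Toy jobs: "align" fails once and then succeeds, "quantify" always fails.
attempts = {"align": 0}


def flaky_align():
    attempts["align"] += 1
    if attempts["align"] < 2:
        raise RuntimeError("transient failure")


def broken_quantify():
    raise RuntimeError("persistent failure")


order = ["trim", "align", "quantify"]
done, rescue = run_with_rescue(
    order,
    {"trim": lambda: None, "align": flaky_align, "quantify": broken_quantify},
    max_retries=1,
)

# After the underlying problem is fixed, rerun only what the rescue list names.
done_after, rescue_after = run_with_rescue(rescue, {"quantify": lambda: None}, done=done)
```

In this model the transient failure in `align` is absorbed by a retry, while the persistent failure in `quantify` produces a rescue list containing just that job, so the completed `trim` and `align` work is not repeated on the second run.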