RseqFlow Description

RseqFlow is a pipeline to analysis RNA-Seq datasets for all annotated species, producing QC analysis, quailifying expression level,identifying differentially expressed genes, and alignment files format conversion. The whole framework is shown as follows:

The whole framework has four running branches, which can be run simultaneously or individually.

  • Branch 1: Based on the merging of alignments to transcriptome and genome, the Quality Control, SNP Calling were produced.
  • Branch 2:  Alignment File Format Conversion will be implemented.
  • Branch 3: Based on only the alignment to transcriptome, expression level for Gene/Exon/Splice Junction
  • Branch 4: Differentially expressed genes identification.

Software environment and Running mode

  • Preinstall software environment is Python 2.7, R 2.11, GCC
  • The RseqFlow2 will be implemented in Pegasus workflow management running mode and Simple Unix Shell running mode, which will be described in another two manual files.

Implementation

Aligning RNA-Seq Datasets

  1. Alignment tools: PerM[1], Bowtie[2], BWA[3]
  2. Alignment targets: Genome and/or Transcriptome
  3. QC, SNP calling, Alignment File Format Conversion modules will be implemented based on the merging of alignments to transcriptome and genome.
  4. Expression Quantification and Differentially Expressed Gene analysis modules will be implemented based on only the alignment to transcriptome.

 

Quality Control

This module output the following pre-alignment (with only fastq, fasta files) and post-alignment QC analysis (with .sam or .bam files)metrics. Some drawing and processing codes are from RSeQC[4]. The detailed processing is as follows:

1. Pre-alignment>>Read Quality: Output the boxplot and heatmap based on the Phred Quality Score. This analysis is available only when the RNA-Seq is FASTQ format. The output files are as following:

Output.qual.boxplot.pdf

Output.qual.heatmap.pdf

2. Pre-alignment>>Nucleotide distribution: Output nucleotide composition for each position of read. This module is used to check the nucleotide composition bias. The output files are as follows:

Output.reads_stat_ATCG.txt
==============================================
PositionInRead          #Reads       A%     C%      G%     T%       N%
==============================================

Output.reads_stat_ATCG_plot.pdf

3. Pre-alignment>>GC content detection: Output two types of GC analysis, one is to count GC content across the read length; the other is reads number histogram of GC content. The output files include two image files and their corresponding value txt files as follows:

Output.reads_stat_GC.txt(Number value for Output.reads_stat_GC_plot.pdf)

===========================================
PositionInRead           #Reads             GC content
===========================================

Output.reads_stat_GC_plot.pdf

Output_hg19.GC.xls (Number value for  Output_hg19.GC_plot.pdf)

===========================================
GC content            read_count
===========================================

Output_hg19.GC_plot.pdf

4. Post-alignment>>Statistics of Reads Alignments: Output the number of sequenced reads, mapped reads, multiple-mapped reads and unique-mapped reads. The output file are as follows:

Output_mapping_report.txt

==========================================
Number of Total Reads: 41,102,255
Number of Mapped Reads: 30,314,711 73.75%
Number of Multiple-mapped reads: 3,265,968 7.94%
Number of Unique-mapped reads: 27,048,743 65.81%
==========================================

5. Post-alignment>>Statistics of Strand specificity: Output how reads were stranded for RNA-seq data. Based on annotation of transcriptome, through comparing reads’ mapping information to the underneath gene model. The output file are as following:

Output_strand_ stat.txt
==========================================
This is SingleEnd Data
For transcriptome Alignments
Percentage of reads mapped with “++”: 23.14%
Percentage of reads mapped with “–“: 26.37%
Percentage of reads mapped with “+-“: 26.67%
Percentage of reads mapped with “-+”: 23.82%
Note:
++:read mapped to ‘+’ strand indicates parental gene on ‘+’ strand
–:read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
+-:read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
-+:read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
==========================================

6. Post-alignment>>Gene Body Coverage Distribution: Output the average reads distribution over the gene body. This module scales all transcripts length to 100. For ①all annotated genes and ② the genes with single annotated transcript, we calculate the average number of reads for each scaled point. Finally, it generates a plot illustrating the The output files include the two images of coverage profile along the gene body for ①② genes and corresponding value txt files.

Output.geneBodyCoverage.AllGene.txt (Number value   of  Output.geneBodyCoverage.AllGene.pdf)

===========================================
Scaled Transcripts Point                     CoveringReads Number
===========================================

Output.geneBodyCoverage.AllGene.pdf

Output.geneBodyCoverage.Single-transcriptGene.txt  (Number value of   Output.geneBodyCoverage.Single-transcriptGene.pdf)

=========================================
Scaled Transcripts Point                  CoveringReads Number
==================================

Output.geneBodyCoverage.Single-transcriptGene.txt

7. Post-alignment>>Reads distribution along genome: Output how mapped reads were distributed over genome, like exon, Intron, Intergenic regions. When genome features are overlapped (e.g. a region could be annotated as both exon and intron by two different transcripts) , they are prioritize as: Exons > Introns > Intergenic regions, for example, if a read was mapped to both Exon and intron, it will be assigned to CDS exons. The output file is as following:

Output_read_distribution.txt

Percentage of Assigned Tags 84.90%
===========================================
Regions Percentage of Tags
Exons 80.41 %
Introns 14.99 %
TSS_up_1kb 0.44 %
TSS_up_5kb 0.78 %
TSS_up_10kb 1.47 %
TES_down_1kb 1.02 %
TES_down_5kb 2.04 %
TES_down_10kb 3.13 %
Note:
If reads spliced once, it will be counted as 2 tags
TSS= Transcription Start Site, TES=Transcription End Site, down=downstream, up=upstream
===========================================

8. Post-alignment-optional>>Alignments to ribosomal RNA and Mitochondrial genome: Output the mapping rate of reads to ribosome RNA and Mitochondrial genome. The output file is as follows:

Output.chrM_and_rRNA_mapping_report.txt

Mitochondrial genome mapping report
==========================================
Number of Reads: 41,021,821
Number of Mapped Reads: 10,331,743
Mapping Rate: 25.19%
==========================================

ribosomal RNA mapping report
==========================================
Number of Reads: 41,021,821
Number of Mapped Reads: 7,870,589
Mapping Rate: 19.19%
==========================================

Expression Level Quantification

 

  1. Four quantification options to deal with multi-mapped reads:
  • Remove the multi-mapped reads directly.
  • Assign the multi-mapped reads randomly.
  • Assign the multi-mapped reads according the proportion of gene expression level
  • Only counting the reads falling into mappable regions(Detail can be found in the supplementary of [5]), and it is based on the unmappable repetitive region library, which can be built automatically with this option (A little time comsuming).

2. All the analysis is based on the alignments from RNA-Seq to reference transcriptome.

3. Output Information

GeneExpressionLevel_method.txt
==========================================
GeneID chromosome Strand Reads_Number RPKM
=========================================

ExonExpressionLevel_method.txt
=========================================
GeneId:ExonStart:ExonEnd Chromosome Strand Reads_Number RPKM
=========================================

SpliceJunctionExpressionLevel_method.txt: RPM (Reads Per Million Mapped Reads)
============================================
GeneId:JuncStart:JuncEnd Chromosome Strand Reads_Number RPM
============================================

 

Differentially Expressed Gene Identification

1. Two calculations for conditions with and without replicates:

  • For conditions with replicates: We compute p-values for differentially expressed genes with DESeq[6].
  • For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test[7].

2. All the analysis is based on the outputs from Expression Level Quantification .

3. Output Information

  • DE_all_With(out)Replicate_Condition1Sample_Condition2Sample_Gene(Exon)ReadCount_Table.txt: This file output the statistical information for all genes.
  • DE_Significant_With(out)Replicate_Condition1Sample_Condition2Sample_Gene(Exon)ReadCount_Table.txt: This file output the signification differentially expressed genes.

The above two files will output in the following format:

=================================================================
GeneId  BaseMean  BaseMeanA  BaseMeanB  FoldChange  log2FoldChange  pval  padj  resVarA resVarB
==================================================================

 

SNPs Calling

  1. This module is using the uniq-mapped alignments based on merged alignments to genome and transcriptome. So generally, this module is implemented with QC part.
  2. The SNP calling is analyzed by Samtools[8]and output .bcf files for bcftools to view.

 

Alignment File Format Conversion module

  1. This module implements the format conversion for backup and visualization: from sam to bam to mrf and wig/bed format, from bam to wig/bed, from mrf to wig/bed format.
  2. The conversion is implemented with samtools[8], RseqTools[9] etc.

 

Input files and their Format

1.Input Files for QC (Combined with SNPs calling)

  • Genome Annotation GTF file* (Required)
  • Genome Annotation Reference Sequences (Reference transcriptome) * (Required)
  • Genome reference seqeuences* (Required)
  • RNA-Seq Fest/Fasta file (Required)
  • Reference sequences of Mitochondrial in Chromosome(Optional)
  • Reference sequences of ribosome RNA(Optional)

2. Input Files for Expression Level Quantification

  • Genome Annotation GTF file* (Required)
  • Genome Annotation Reference Sequences (Reference transcriptome) * (Required)
  • Genome reference seqeuences* (optional for Mappable Region Quantification)
  • RNA-Seq Fest/Fasta file (Required)

3. Input Files for DE Gene Identification

  • Output files from Expression Level Quantification module, Gene(Exon)ExpressionLevel_method.txt

4. Format of Some Input Files

  • Genome Annotation GTF file

Format from GTF 2.0 to the latest GTF3.0 (Sample file) is required;

  • Genome Annotation Reference Sequences (Reference transcriptome)

The transcripts name which start with “>” should accord with the following format:

“>$genomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End”

Other information followed the $End with a space as separator will not affect the normal running,

For example,

“>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5’pad=0 3’pad=0 strand=- repeatMasking=none”

  • Genome reference sequences

The chromosome name which start with “>”should be accord with the following format:

“>$chromsome”

And no more information followed the “>$chromsome” information.

For example,

“>chr1”, “>chrM”,”>chrUr”, etc

The above three annotation files should be one file each for the whole genome (or the whole processing data) , they will be split by RseqFlow automatically if necessary in the following processing.

All species with the above formats can be analyzed in the RseqFlow Pipeline.

 Reference

[1]Chen, Y., Souaiaia, T. and Chen, T. (2009) PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds, bioinformatics, 25, 2514-2521.

[2]Langmead B, Trapnell C, Pop M, Salzberg SL.(2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

[3]Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.

[4]Wang L, Wang S, Li W RSeQC: quality control of RNA-seq experiments Bioinformatics (2012) 28 (16): 2184-2185.

[5]Ying Wang, Gaurang Mehta, Rajiv Mayani, Tade Souaiaia, Yangho Chen, Andrew Clark, Lin Wan, Oleg V. Evgrafov, James A. Knowles, Ewa Deelman and Ting Chen, RseqFlow: workflows for RNA-Seq data analysis. Bioinformatics, 2011, 27 (18): 2598–2600.

[6]Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010) 11:R106

[7]Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292.

[8]Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and Subgroup, G.P.D.P. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079.

[9]Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics.