Primer on RNA Sequencing

Eric Almeida, Kevin Luczak, Dasith Perera

RNA Sequencing is used to investigate the transcriptional profile of a cell

Genome (DNA) -> Transcriptome (RNA) -> Protein

RNA Transcripts are used to:

  • transcribe proteins
  • form the ribosomal translation machinery
  • regulate gene expression
  • modulate proteins

RNA can be sequenced using Next Generation Sequencing

? 2017 Illumina, Inc. All rights reserved.

Illumina sequencing-by-synthesis

Example RNA-sequencing workflow

  • Step 1. Experimental Design
  • Step 2. Library Preparation
  • Step 3. Execute the experiment
  • Step 4. Align sequencing data to reference genome using BowTie2
  • Step 5. Obtain count data using htseq-count
  • Step 6. Move data into R and generate results

NOISeq Install from Bioconductor

source("https://bioconductor.org/biocLite.R") #Downloading from Bioconductor 
biocLite("NOISeq")
library(NOISeq)

Help Viewer in R

Documentation and help can be easily accessed through the “? (command)” command

In example:

  • ?head # takes you to Help viewer for ‘head’ command

  • ?noiseq # takes you to Help viewer for noiseq package

View Input Data

Load your data into R and make your own treatment groups

data(Marioni)
head(mycounts) #Count data (data from upstream)

##                 R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver
## ENSG00000177757          2         1          0         0         1
## ENSG00000187634         49        27         43        34        23
## ENSG00000188976         73        34         77        56        45
## ENSG00000187961         15         8         15        13        11
## ENSG00000187583          1         0          1         1         0
## ENSG00000187642          4         0          5         0         2
##                 R1L7Kidney R1L8Liver R2L2Kidney R2L3Liver R2L6Kidney
## ENSG00000177757          2         0          1         1          3
## ENSG00000187634         41        35         42        25         47
## ENSG00000188976         68        55         70        42         82
## ENSG00000187961         13        12         12        20         15
## ENSG00000187583          3         0          0         2          3
## ENSG00000187642         12         1          9         4          9

head(myfactors) #Treatment groups / replicates as you define it

##            Tissue TissueRun
## R1L1Kidney Kidney  Kidney_1
## R1L2Liver   Liver   Liver_1
## R1L3Kidney Kidney  Kidney_1
## R1L4Liver   Liver   Liver_1
## R1L6Liver   Liver   Liver_1
## R1L7Kidney Kidney  Kidney_1

myfactors = data.frame(Tissue = c("Kidney", "Liver", "Kidney", "Liver", "Liver", "Kidney", "Liver", "Kidney", "Liver", "Kidney"), TissueRun = c("Kidney_1", "Liver_1", "Kidney_1", "Liver_1", "Liver_1", "Kidney_1", "Liver_1",  "Kidney_2", "Liver_2", "Kidney_2"), Run = c(rep("R1", 7), rep("R2", 3)))
myfactors #Should see the other defined treatment groups

##    Tissue TissueRun Run
## 1  Kidney  Kidney_1  R1
## 2   Liver   Liver_1  R1
## 3  Kidney  Kidney_1  R1
## 4   Liver   Liver_1  R1
## 5   Liver   Liver_1  R1
## 6  Kidney  Kidney_1  R1
## 7   Liver   Liver_1  R1
## 8  Kidney  Kidney_2  R2
## 9   Liver   Liver_2  R2
## 10 Kidney  Kidney_2  R2

Additional Bioinformation, from Ensembl Biomart

Provided for us in the Marioni data set, but easy to DIY. In example, first download the unique gene ID and biotype to a csv file from an online database that contains this information on your organism of study, then:

#GENES <- read.csv("C:/Users/labuser/Documents/R/GENES.txt")
#GENES1 <- data.frame(GENES$'Gene.stable.ID')
#GENES1$Gene.Type <- GENES$'Gene.type'
#GENES2 <- unique(GENES1)
#ID <- GENES2$'GENES.Gene.stable.ID'
#TYPE <- GENES2$'Gene.Type'
#BIOTYPES <- data.frame(TYPE)
#row.names(BIOTYPES) <- GENES2$'GENES.Gene.stable.ID'

Creating a NOISeq Object

mydata <- readData(data = mycounts, length = mylength, gc = mygc, biotype = mybiotypes, chromosome = mychroms, factors = myfactors)
mydata

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 5088 features, 10 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: R1L1Kidney R1L2Liver ... R2L6Kidney (10 total)
##   varLabels: Tissue TissueRun Run
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000177757 ENSG00000187634 ...
##     ENSG00000201145 (5088 total)
##   fvarLabels: Length GC ... GeneEnd (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

QC Plots

Biases (error) are introduced through RNA extraction, library preparation, and sequencing processes

Quality control tests offered by NOISeq:

  • Biotype detection
  • QC Bias
  • Length Bias
  • RNA composition

Biotype Detection

  • Used to quickly visualize potential contamination or technical biases introduced during library preparation
mybiodetection <- dat(mydata, k=0, type = "biodetection", factor = NULL)

## Biotypes detection is to be computed for:
##  [1] "R1L1Kidney" "R1L2Liver"  "R1L3Kidney" "R1L4Liver"  "R1L6Liver" 
##  [6] "R1L7Kidney" "R1L8Liver"  "R2L2Kidney" "R2L3Liver"  "R2L6Kidney"

par(mfrow = c(1,2))
explo.plot(mybiodetection, samples = c(1,2), plottype ="persample")

Biotype Detection, output 2

mybiodetection <- dat(mydata, k=0, type = "biodetection", factor = NULL)

## Biotypes detection is to be computed for:
##  [1] "R1L1Kidney" "R1L2Liver"  "R1L3Kidney" "R1L4Liver"  "R1L6Liver" 
##  [6] "R1L7Kidney" "R1L8Liver"  "R2L2Kidney" "R2L3Liver"  "R2L6Kidney"

par(mfrow = c(1,2))
explo.plot(mybiodetection, samples = c(1,2), toplot = "protein_coding", plottype ="comparison")

## [1] "Percentage of protein_coding biotype in each sample:"
## R1L1Kidney  R1L2Liver 
##    91.0181    91.7060 
## [1] "Confidence interval at 95% for the difference of percentages: R1L1Kidney - R1L2Liver"
## [1] -1.7991  0.4233
## [1] "The percentage of this biotype is NOT significantly different for these two samples (p-value = 0.2302 )."

Biotype Detection, output 3

mycountsbio <- dat(mydata, factor = NULL, type = "countsbio")

## [1] "Count distributions are to be computed for:"
##  [1] "R1L1Kidney" "R1L2Liver"  "R1L3Kidney" "R1L4Liver"  "R1L6Liver" 
##  [6] "R1L7Kidney" "R1L8Liver"  "R2L2Kidney" "R2L3Liver"  "R2L6Kidney"

explo.plot(mycountsbio, toplot = 1, samples = 1, plottype ="boxplot")

GC Bias

  • Used to gauge whether sequencing bias has favored transcripts with low GC content
myGCbias = dat(mydata, factor = "Tissue", type = "GCbias")

## [1] "GC content bias detection is to be computed for:"
## [1] "Kidney" "Liver" 
## [1] "Kidney"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.136  -4.568   0.832  10.406  48.852 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  89.4686    20.2643   4.415 0.000267 ***
## bx1           0.2447    28.1216   0.009 0.993144    
## bx2         177.5119    36.0617   4.922 8.22e-05 ***
## bx3         -68.5025    57.1321  -1.199 0.244534    
## bx4         107.6385    64.0141   1.681 0.108220    
## bx5               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.26 on 20 degrees of freedom
## Multiple R-squared:  0.6192, Adjusted R-squared:  0.543 
## F-statistic:  8.13 on 4 and 20 DF,  p-value: 0.0004613
## 
## [1] "Liver"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.424  -9.185   0.000   9.384  22.594 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   49.862     15.000   3.324  0.00338 **
## bx1           22.700     20.816   1.091  0.28845   
## bx2           75.753     26.694   2.838  0.01017 * 
## bx3            7.595     42.291   0.180  0.85927   
## bx4           18.511     47.385   0.391  0.70019   
## bx5               NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 20 degrees of freedom
## Multiple R-squared:  0.4385, Adjusted R-squared:  0.3262 
## F-statistic: 3.904 on 4 and 20 DF,  p-value: 0.01678

explo.plot(myGCbias, samples = NULL, toplot = "global")

Length Bias

  • Used to elucidate if bias is brought about by gene length
mylengthbias = dat(mydata, factor = "Tissue", type = "lengthbias")

## [1] "Length bias detection information is to be computed for:"
## [1] "Kidney" "Liver" 
## [1] "Kidney"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.399 -19.603   3.049  25.848  74.827 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   121.33      48.63   2.495  0.02148 * 
## bx1           -34.97      52.24  -0.670  0.51082   
## bx2           269.24      88.41   3.045  0.00639 **
## bx3         -1301.13     719.73  -1.808  0.08570 . 
## bx4          6292.42    4655.39   1.352  0.19158   
## bx5               NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.63 on 20 degrees of freedom
## Multiple R-squared:  0.4804, Adjusted R-squared:  0.3764 
## F-statistic: 4.622 on 4 and 20 DF,  p-value: 0.008332
## 
## [1] "Liver"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.84 -18.19   0.00  21.29  42.30 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    42.18      29.68   1.421 0.170729    
## bx1            10.75      31.88   0.337 0.739524    
## bx2           222.14      53.96   4.117 0.000535 ***
## bx3         -1141.68     439.26  -2.599 0.017161 *  
## bx4          5758.12    2841.23   2.027 0.056243 .  
## bx5               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.68 on 20 degrees of freedom
## Multiple R-squared:  0.5228, Adjusted R-squared:  0.4273 
## F-statistic: 5.477 on 4 and 20 DF,  p-value: 0.003815

explo.plot(mylengthbias, samples = NULL, toplot ="global")

Length Bias, output 2

show(mylengthbias)

## [1] "Kidney"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.399 -19.603   3.049  25.848  74.827 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   121.33      48.63   2.495  0.02148 * 
## bx1           -34.97      52.24  -0.670  0.51082   
## bx2           269.24      88.41   3.045  0.00639 **
## bx3         -1301.13     719.73  -1.808  0.08570 . 
## bx4          6292.42    4655.39   1.352  0.19158   
## bx5               NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.63 on 20 degrees of freedom
## Multiple R-squared:  0.4804, Adjusted R-squared:  0.3764 
## F-statistic: 4.622 on 4 and 20 DF,  p-value: 0.008332
## 
## [1] "Liver"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.84 -18.19   0.00  21.29  42.30 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    42.18      29.68   1.421 0.170729    
## bx1            10.75      31.88   0.337 0.739524    
## bx2           222.14      53.96   4.117 0.000535 ***
## bx3         -1141.68     439.26  -2.599 0.017161 *  
## bx4          5758.12    2841.23   2.027 0.056243 .  
## bx5               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.68 on 20 degrees of freedom
## Multiple R-squared:  0.5228, Adjusted R-squared:  0.4273 
## F-statistic: 5.477 on 4 and 20 DF,  p-value: 0.003815

RNA composition

mycd = dat(mydata, type = "cd", norm = FALSE, refColumn = 1)

## [1] "Reference sample is: R1L1Kidney"
## [1] "Confidence intervals for median of M:"
##            0.28%                 99.72%                Diagnostic Test
## R1L2Liver  "-0.891798603865612"  "-0.766824549577716"  "FAILED"       
## R1L3Kidney "-0.0471703644087826" "-0.0471703644087824" "FAILED"       
## R1L4Liver  "-0.879624287764932"  "-0.7540048872931"    "FAILED"       
## R1L6Liver  "-0.903134087665681"  "-0.761454067082924"  "FAILED"       
## R1L7Kidney "0.0348451027997053"  "0.0348451027997056"  "FAILED"       
## R1L8Liver  "-0.904500838561669"  "-0.760538326376263"  "FAILED"       
## R2L2Kidney "-0.0850229820491386" "-0.0468824681954803" "FAILED"       
## R2L3Liver  "-0.879935949269671"  "-0.749483184477486"  "FAILED"       
## R2L6Kidney "-0.0726984636810064" "-0.0363901813919369" "FAILED"       
## [1] "Diagnostic test: FAILED. Normalization is required to correct this bias."

explo.plot(mycd)

Batch effect

Here, we introduce a batch effect in example, and will later correct for it:

set.seed(123)
mycounts2 = mycounts 
mycounts2[, 1:4] = mycounts2[, 1:4] 
runif(nrow(mycounts2) * 4, 3, 5)
myfactors = data.frame(myfactors, batch = c(rep(1, 4), rep(2, 6)))
mydata2 = readData(mycounts2, factors = myfactors)
myPCA = dat(mydata2, type = "PCA") 
par(mfrow = c(1, 2)) 
explo.plot(myPCA, factor = "Tissue") 
explo.plot(myPCA, factor = "batch")

Sequencing Depth

A key question in RNA-Seq:

  • Are the number of sequencing reads enough to study our genome and to properly quantify feature expression?

Saturation plot

  • How many features in the genome are being detected with more than a given number of counts, k (k=0 or k=5)

  • How the number of detected features would change with increasing sequencing depth

    • For all features or for each biotype

Saturation plot

mysaturation = dat(mydata, k = 0, ndepth = 7, type = "saturation")
explo.plot(mysaturation, toplot = 1, samples = 1:2, yleftlim = NULL, 
yrightlim = NULL)

Distribution of protein-coding genes of a specific biotype

explo.plot(mysaturation, toplot="protein_coding", samples=1:4)

Count distribution for all samples for a certain biotype

explo.plot(mycountsbio, toplot = "protein_coding", samples = NULL, 
plottype = "boxplot")

Sensitivity plot

explo.plot(mycountsbio, toplot = 1, samples = NULL, plottype = "barplot")

Filtering low counts

  • Excluding low counts generally improves differential expression results
    • noise is reduced
  • The best procedure to filter low count features is unknown
  • NOISeq proposes 3 different methods to filter out features with low counts:
    1. CPM Method
    2. Wilcoxon test
    3. Proportion test
myfilt = filtered.data(mycounts, factor = myfactors$Tissue, norm = FALSE,
depth = NULL, method = 1, cv.cutoff = 100, cpm = 1)

## Filtering out low count features...
## 4406 features are to be kept for differential expression analysis with filtering method 1

Normalization

  • Necessary to analyze between samples and remove sequencing depth bias
  • Can normalize in RPKM, Upper Quartile, and TMM
##                 R1L1Kidney  R1L2Liver R1L3Kidney  R1L4Liver  R1L6Liver
## ENSG00000177757   1.866057  0.8160788  0.0000000  0.0000000  0.8375346
## ENSG00000187634  22.597824 10.8910916 19.1928426 13.6356594  9.5215171
## ENSG00000188976  43.360221 17.6638209 44.2649178 28.9256646 23.9932411
## ENSG00000187961   6.946966  3.2406417  6.7235013  5.2356905  4.5730336
## ENSG00000187583   0.270246  0.0000000  0.2615529  0.2350098  0.0000000
## ENSG00000187642   1.334678  0.0000000  1.6146812  0.0000000  0.5990378
##                 R1L7Kidney R1L8Liver R2L2Kidney  R2L3Liver R2L6Kidney
## ENSG00000177757   1.911677  0.000000   0.879631  0.7854861  2.5794742
## ENSG00000187634  19.370633 14.972295  18.261034  9.7063073 19.9748412
## ENSG00000188976  41.377759 30.302685  39.198749 21.0020365 44.8846064
## ENSG00000187961   6.167891  5.155075   5.239510  7.7978955  6.4019183
## ENSG00000187583   0.830558  0.000000   0.000000  0.4550224  0.7471288
## ENSG00000187642   4.101920  0.309504   2.831161  1.1236213  2.7674127

Normalization Cheat Sheet

  • long = Numeric vector containing the length of the features

  • k, Counts equal to 0 are changed to k in order to avoid indeterminations when applying logarithms. By default, k = 0

  • k = NULL, makes all zero values into a value between 0 and the lowest non-zero value.

  • lc = Correction factor for length normalization. This correction is done by dividing the counts vector by (length/1000)lc. If lc = 0, no length correction is applied. By default, lc = 1 for RPKM and lc = 0 for the other methods.

Batch Effect Correction

  • ARSyNseq filters the noise associated to identified or not identified batch effects considering the experimental design and applying Principal Component Analysis (PCA) to the analysis of variance (ANOVA) parameters and residuals.

If the batch for each sample is known:

Batch is unknown

If the batch for each sample is unknown:

NOISeq options

  • Designed to compute differential expression on data:

    • With technical replicates (NOISeq-real)
    • No replicates at all! (NOISeq-sim)
  • NOISeq-real used when technical replicates are available.

    • Summarizes by summing them up
  • NOISeq-real can be used on Biological replicates

    • Averages replicates instead of summing.
    • Manual recommends use of NOISeqBIO for biological replicates

How NOISeq-real works- Summarized

NOISeq-real estimates the probability distribution for M and D in an empirical way, by computing M and D values for every pair of replicates within the same experimental condition and for every feature. Then, all these values are pooled together to generate the noise distribution*

M = log2-ratio of the two conditions

D = the value of the difference between conditions.

A feature is differentially expressed if its corresponding M and D values are likely to be higher than in noise.

Disclaimer: Biological replicates are necessary if the goal is to make any inference about the population. Deriving differential expression from technical replicates is useful for drawing conclusions about the specific samples being compared in the study but not for extending these conclusions to the whole population

*When using NOISeq-real for biological values are averaged instead of summed up.

NOISeq-sim

NOISeq can simulate technical replicates

  • Assumes that read counts follow a multinomial distribution

  • probabilities for each feature in the multinomial distribution are the probablility of a read to map onto that feature.

  • mapping probability are approximated by using counts in the only sample of the corresponding experimental condition.

NOISeq-real & NOISeq-sim

NOISeq-real

mynoiseq = noiseq(mydata, k= 0.5, norm = "rpkm", factor = "Tissue", pnr = 0.2, 
nss = 5, v = 0.02, lc = 1, replicates = "technical")

## [1] "Computing (M,D) values..."
## [1] "Computing probability of differential expression..."

NOISeq-sim

myresults <- noiseq(mydata, factor = "Tissue", k = NULL, norm = "n", pnr = 0.2, 
nss = 5, v = 0.02, lc = 1, replicates = "no")

## [1] "Computing (M,D) values..."
## [1] "Computing probability of differential expression..."

NOISeqBIO

Optimized for use on biological replicates (at least 2 per condition).

Too much stats- see page 20 of NOISeq manual

mynoiseqbio = noiseqbio(mydata, k = 0.5, norm = "rpkm", factor = "Tissue", 
lc = 1, r = 20, adj = 1.5, plot = FALSE, a0per = 0.9, random.seed = 12345, 
filter = 2)

## Computing Z values...
## Filtering out low count features...
## 5088 features are to be kept for differential expression analysis with filtering method 2
## [1] "r = 1"
## [1] "r = 2"
## [1] "r = 3"
## [1] "r = 4"
## [1] "r = 5"
## [1] "r = 6"
## [1] "r = 7"
## [1] "r = 8"
## [1] "r = 9"
## [1] "r = 10"
## [1] "r = 11"
## [1] "r = 12"
## [1] "r = 13"
## [1] "r = 14"
## [1] "r = 15"
## [1] "r = 16"
## [1] "r = 17"
## [1] "r = 18"
## [1] "r = 19"
## [1] "r = 20"
## Computing probability of differential expression...
## p0 = 0.132948760535283
## Probability
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.9049  0.9937  0.8680  1.0000  1.0000

Results

To look at the number of differentially expressed genes (DEGs):

Works-cited

## 
##   Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A., &
##   Conesa, A. (2011). Differential expression in RNA-seq: a matter
##   of depth. Genome Research, 21(12), 2213-2223.
## 
##   Tarazona, S., Furio-Tari, P., Turra, D., Di Pietro, A., Nueda,
##   M.J., Ferrer, A., & Conesa, A. (2015). Data quality aware
##   analysis of differential expression in RNA-seq with NOISeq
##   R/Bioc package. Nucleic Acids Research.

Comments