diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 0c67f35..0000000 Binary files a/.DS_Store and /dev/null differ diff --git a/data/.DS_Store b/data/.DS_Store deleted file mode 100644 index b31f438..0000000 Binary files a/data/.DS_Store and /dev/null differ diff --git a/episode2.md b/episode2.md index 95e3b77..d7eb7ae 100644 --- a/episode2.md +++ b/episode2.md @@ -111,7 +111,7 @@ download.file(url = "https://zenodo.org/record/8125141/files/E-MTAB-11349.counts In R studio, open your project workbook and read the raw counts data. Then check the dimensions of the matrix to confirm we have the expected number of samples and transcript IDs. -```r +``` r raw.counts.ibd <- read.table(file="data/E-MTAB-11349.counts.matrix.csv", sep=",", header=T, @@ -121,7 +121,7 @@ raw.counts.ibd <- read.table(file="data/E-MTAB-11349.counts.matrix.csv", writeLines(sprintf("%i %s", c(dim(raw.counts.ibd)[1], dim(raw.counts.ibd)[2]), c("rows corresponding to transcript IDs", "columns corresponding to samples"))) ``` -```{.output} +``` output 22751 rows corresponding to transcript IDs 592 columns corresponding to samples ``` @@ -136,11 +136,11 @@ View a small subset of the data, (e.g. first ten rows and 8 columns) to see how :::::::::::::::::::::::: solution -```r +``` r raw.counts.ibd[1:10,1:8] ``` -```{.output} +``` output read Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 1 1 * 13961 16595 20722 17696 25703 20848 2 2 ERCC-00002 0 0 0 0 0 0 @@ -163,7 +163,7 @@ raw.counts.ibd[1:10,1:8] Now let's read the sdrf file (a plain text file) into R and check the dimensions of the file. -```r +``` r # read in the sdrf file samp.info.ibd <- read.table(file="data/E-MTAB-11349.sdrf.txt", sep="\t", header=T, fill=T, check.names=F) @@ -171,26 +171,26 @@ samp.info.ibd <- read.table(file="data/E-MTAB-11349.sdrf.txt", sep="\t", header= sprintf("There are %i rows, corresponding to the samples", dim(samp.info.ibd)[1]) ``` -```{.output} +``` output [1] "There are 590 rows, corresponding to the samples" ``` -```r +``` r sprintf("There are %i columns, corresponding to the available variables for each sample", dim(samp.info.ibd)[2]) ``` -```{.output} +``` output [1] "There are 32 columns, corresponding to the available variables for each sample" ``` If we view the column names, we can see that the file does indeed contain a set of variables describing both phenotypical and experimental protocol information relating to each sample. -```r +``` r colnames(samp.info.ibd) ``` -```{.output} +``` output [1] "Source Name" [2] "Characteristics[organism]" [3] "Characteristics[age]" diff --git a/episode3.md b/episode3.md index 8e03fe7..ec75acf 100644 --- a/episode3.md +++ b/episode3.md @@ -112,36 +112,41 @@ The function `getGEO()` from the `GEOquery` library provides a convenient way to If there is more than one SOFT file for a GEO Series, `getGEO()` will return a list of datasets. Let's download GSE212041. -```r +``` r gse212041 <- GEOquery::getGEO("GSE212041") ``` -```{.output} +``` warning +Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by +'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment' +``` + +``` output Setting options('download.file.method.GEOquery'='auto') ``` -```{.output} +``` output Setting options('GEOquery.inmemory.gpl'=FALSE) ``` -```{.output} +``` output Found 2 file(s) ``` -```{.output} +``` output GSE212041-GPL18573_series_matrix.txt.gz ``` -```{.output} +``` output GSE212041-GPL24676_series_matrix.txt.gz ``` -```r +``` r sprintf("Number of files downloaded: %i", length(gse212041)) ``` -```{.output} +``` output [1] "Number of files downloaded: 2" ``` @@ -154,11 +159,11 @@ Write the code to check that the number of samples in each file gives us the tot :::::::::::::::::::::::: solution -```r +``` r writeLines(sprintf("file %i: %i samples", 1:2, c(dim(gse212041[[1]])[2], dim(gse212041[[2]])[2]))) ``` -```{.output} +``` output file 1: 16 samples file 2: 765 samples ``` @@ -175,13 +180,13 @@ file 2: 765 samples We'll extract the metadata for the larger dataset (765 samples) and examine the column names to verify that the file contains the expected metadata about the experiment. -```r +``` r samp.info.cov19 <- Biobase::pData(gse212041[[2]]) colnames(samp.info.cov19) ``` -```{.output} +``` output [1] "title" "geo_accession" [3] "status" "submission_date" [5] "last_update_date" "type" @@ -215,11 +220,11 @@ colnames(samp.info.cov19) If we use the `exprs()` function to extract the counts data from the expression slot of the downloaded dataset, we'll see that in this data series, the counts matrix is not there. The expression set only contains a list of the accession numbers of the samples included in the expression set, but not the actual count data. (The code below attempts to view a sample of the data). -```r +``` r Biobase::exprs(gse212041[[2]])[,1:10] ``` -```{.output} +``` output GSM6507615 GSM6507616 GSM6507617 GSM6507618 GSM6507619 GSM6507620 GSM6507621 GSM6507622 GSM6507623 GSM6507624 ``` @@ -228,11 +233,11 @@ Biobase::exprs(gse212041[[2]])[,1:10] We can verify this by looking at the dimensions of the object in the exprs slot. -```r +``` r dim(Biobase::exprs(gse212041[[2]])) ``` -```{.output} +``` output [1] 0 765 ``` diff --git a/episode4.md b/episode4.md index 69f6c54..0254354 100644 --- a/episode4.md +++ b/episode4.md @@ -49,7 +49,7 @@ We'll start by reading in the dataset from our `data` subfolder. -```r +``` r samp.info.ibd <- read.table(file="./data/E-MTAB-11349.sdrf.txt", sep="\t", header=T, fill=T, check.names=F) raw.counts.ibd <- read.table(file="./data/E-MTAB-11349.counts.matrix.csv", sep="," , header=T, fill=T, check.names=F) @@ -82,11 +82,11 @@ Item | Check For... | Rationale We'll now apply these steps sequentially to the sample information for the IBD dataset, contained in our variable `samp.info.ibd`. First let's take a look at the data: -```r +``` r dplyr::glimpse(samp.info.ibd) ``` -```{.output} +``` output Rows: 590 Columns: 32 $ `Source Name` "Sample 1", "Sample 2", "Sam… @@ -142,7 +142,7 @@ All of the checklist items apply in this case. Let's go through them... 1. Let's verify that there is a unique identifier that matches between the sample information and counts matrix, and identify the name of the column in the sample information. We manually find the names of the samples in `raw.counts.ibd` (the counts matrix), and set to a new variable `ibd.samp.names`. We then write a `for loop` that evaluates which columns in the sample information match the sample names. There is one match, the column named `Source Name`. Note that there are other columns such as `Assay Name` in this dataset that contain identifiers for some but not all samples. This is a good illustration of why it is important to check carefully to ensure you have a complete set of unique identifiers. -```r +``` r ibd.samp.names <- colnames(raw.counts.ibd)[3:ncol(raw.counts.ibd)] lst.colnames <- c() @@ -153,7 +153,7 @@ for(i in seq_along(1:ncol(samp.info.ibd))){ sprintf("The unique IDs that match the counts matrix are in column: %s", colnames(samp.info.ibd)[which(lst.colnames)]) ``` -```{.output} +``` output [1] "The unique IDs that match the counts matrix are in column: Source Name" ``` @@ -164,7 +164,7 @@ sprintf("The unique IDs that match the counts matrix are in column: %s", colname -```r +``` r samp.info.ibd.sel <- dplyr::select(samp.info.ibd, 'Source Name', 'Characteristics[age]', @@ -178,7 +178,7 @@ samp.info.ibd.sel <- dplyr::select(samp.info.ibd, 3. Let's then rename the variables to something more easy to interpret for humans, avoiding spaces and special characters. We'll save these selected columns to a new variable name `samp.info.ibd.sel`. -```r +``` r samp.info.ibd.sel <- dplyr::rename(samp.info.ibd.sel, 'sampleID' = 'Source Name', 'age' = 'Characteristics[age]', @@ -192,11 +192,11 @@ samp.info.ibd.sel <- dplyr::rename(samp.info.ibd.sel, 4. Now check how each of the variables are encoded in our reduced sample information data to identify any errors, data gaps and inconsistent coding of categorical variables. Firstly let's take a look at the data. -```r +``` r dplyr::glimpse(samp.info.ibd.sel) ``` -```{.output} +``` output Rows: 590 Columns: 4 $ sampleID "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", … @@ -219,13 +219,13 @@ What steps do we need to take to ensure that these four data columns are consist * There is clearly an issue with the coding of the `condition` Crohn's disease. We'll fix that first, and then check the consistency of the coding of categorical variables `sex` and `condition`. -```r +``` r samp.info.ibd.sel$condition[agrep("Crohns", samp.info.ibd.sel$condition)] <- "crohns_disease" unique(c(samp.info.ibd.sel$sex, samp.info.ibd.sel$condition)) ``` -```{.output} +``` output [1] "male" "female" "crohns_disease" [4] "normal" "ulcerative colitis" ``` @@ -237,7 +237,7 @@ Before we do this, since we'll be using the pipe operator from the tidyverse lib -```r +``` r `%>%` <- magrittr::`%>%` samp.info.ibd.sel <- samp.info.ibd.sel %>% @@ -255,14 +255,14 @@ samp.info.ibd.sel <- samp.info.ibd.sel %>% -```r +``` r samp.info.ibd.sel <- samp.info.ibd.sel %>% dplyr::mutate(class = dplyr::if_else(condition == 'normal', -1, 1)) ``` -```r +``` r samp.info.ibd.sel[c('sex', 'condition', 'class')] <- lapply(samp.info.ibd.sel[c('sex', 'condition', 'class')], factor) ``` @@ -271,11 +271,11 @@ samp.info.ibd.sel[c('sex', 'condition', 'class')] <- lapply(samp.info.ibd.sel[c( 6. Finally, let's check the distribution of the classes by creating a `table` from the `class` column. -```r +``` r table(samp.info.ibd.sel$class) ``` -```{.output} +``` output -1 1 267 323 @@ -286,11 +286,11 @@ table(samp.info.ibd.sel$class) The two classes are approximately equally represented, so let's check everything one last time. -```r +``` r dplyr::glimpse(samp.info.ibd.sel) ``` -```{.output} +``` output Rows: 590 Columns: 5 $ sampleID "Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5", … @@ -330,11 +330,11 @@ Item | Check For... | Rationale 1. Take a look at a sample of columns and rows to see what the downloaded file looks like -```r +``` r raw.counts.ibd[1:10,1:8] ``` -```{.output} +``` output read Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 1 1 * 13961 16595 20722 17696 25703 20848 2 2 ERCC-00002 0 0 0 0 0 0 @@ -361,7 +361,7 @@ Can you see the irrelevant information that we need to remove from the counts ma * Move the column named `read` that contains to the transcript IDs to the row names -```r +``` r counts.mat.ibd <- raw.counts.ibd[-1,-1] rownames(counts.mat.ibd) <- NULL @@ -371,7 +371,7 @@ counts.mat.ibd <- counts.mat.ibd %>% tibble::column_to_rownames('read') counts.mat.ibd[1:10,1:6] ``` -```{.output} +``` output Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 ERCC-00002 0 0 0 0 0 0 ERCC-00003 0 0 0 0 0 0 @@ -394,19 +394,19 @@ ERCC-00019 0 0 0 0 0 0 2. Let's check for duplicate sampleIDs and transcript IDs. Provided there are no duplicate row or column names, the following should return `interger(0)`. -```r +``` r which(duplicated(rownames(counts.mat.ibd))) ``` -```{.output} +``` output integer(0) ``` -```r +``` r which(duplicated(colnames(counts.mat.ibd))) ``` -```{.output} +``` output integer(0) ``` @@ -415,12 +415,12 @@ integer(0) 3. We'll double check the sampleIDs match the sample information file. -```r +``` r if(!identical(colnames(counts.mat.ibd), samp.info.ibd.sel$sampleID)){stop()} ``` -```{.error} -Error in eval(expr, envir, enclos): +``` error +Error: ``` ::::::::::::::::::::::::::::::::::::: challenge @@ -434,7 +434,7 @@ Why did we get an error here? We renamed the samples in the sample information to remove spaces, so we need to do the same here. -```r +``` r colnames(counts.mat.ibd) <- gsub(x = colnames(counts.mat.ibd), pattern = "\ ", replacement = "_") ``` @@ -447,13 +447,13 @@ colnames(counts.mat.ibd) <- gsub(x = colnames(counts.mat.ibd), pattern = "\ ", r -```r +``` r allMissValues <- function(x){all(is.na(x) | x == "")} allMissValues(counts.mat.ibd) ``` -```{.output} +``` output [1] FALSE ``` @@ -462,11 +462,11 @@ allMissValues(counts.mat.ibd) Take a final look at the cleaned up matrix. -```r +``` r counts.mat.ibd[1:10,1:6] ``` -```{.output} +``` output Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 ERCC-00002 0 0 0 0 0 0 ERCC-00003 0 0 0 0 0 0 @@ -480,19 +480,19 @@ ERCC-00017 2 0 0 0 1 0 ERCC-00019 0 0 0 0 0 0 ``` -```r +``` r sprintf("There are %i rows, corresponding to the transcript IDs", dim(counts.mat.ibd)[1]) ``` -```{.output} +``` output [1] "There are 22750 rows, corresponding to the transcript IDs" ``` -```r +``` r sprintf("There are %i columns, corresponding to the samples", dim(counts.mat.ibd)[2]) ``` -```{.output} +``` output [1] "There are 590 columns, corresponding to the samples" ``` @@ -537,7 +537,7 @@ Read the sdrf file into R and take a look at the data. Make a list of the potent As a help, the code to read the file in from your data directory is: -```r +``` r samp.info.tb <- read.table(file="./data/E-MTAB-6845.sdrf.txt", sep="\t", header=T, fill=T, check.names=F) ``` @@ -605,7 +605,7 @@ dplyr::glimpse() -```r +``` r samp.info.tb %>% dplyr::select( @@ -630,7 +630,7 @@ samp.info.tb %>% dplyr::glimpse() # view output ``` -```{.output} +``` output Rows: 360 Columns: 3 $ sampleID "PR123_S19", "PR096_S13", "PR146_S14", "PR158_S12", "PR095… diff --git a/episode5.md b/episode5.md index 3f332dd..9ca7cc2 100644 --- a/episode5.md +++ b/episode5.md @@ -50,14 +50,14 @@ download.file(url = "https://zenodo.org/record/8125141/files/counts.mat.ibd.txt" And now read the files into R... -```r +``` r # suppressPackageStartupMessages(library(dplyr, quietly = TRUE)) # suppressPackageStartupMessages(library(ggplot2, quietly = TRUE)) # suppressPackageStartupMessages(library(tibble, quietly = TRUE)) ``` -```r +``` r samp.info.ibd.sel <- read.table(file="./data/ibd.sample.info.txt", sep="\t", header=T, fill=T, check.names=F) counts.mat.ibd <- read.table(file="./data/counts.mat.ibd.txt", sep='\t', header=T, fill=T, check.names=F) @@ -68,7 +68,7 @@ Run the following code to view the histogram giving the frequency of the maximum A simple filtering approach is to remove all genes where the maximum read count for that gene over all samples is below a given threshold. The next step is to determine what this threshold should be. -```r +``` r `%>%` <- magrittr::`%>%` data.frame(max_count = apply(counts.mat.ibd, 1, max, na.rm=TRUE)) %>% @@ -79,12 +79,14 @@ data.frame(max_count = apply(counts.mat.ibd, 1, max, na.rm=TRUE)) %>% ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma) ``` -```{.warning} -Warning: Transformation introduced infinite values in continuous x-axis +``` warning +Warning in ggplot2::scale_x_log10(n.breaks = 6, labels = scales::comma): log-10 +transformation introduced infinite values. ``` -```{.warning} -Warning: Removed 10 rows containing non-finite values (`stat_bin()`). +``` warning +Warning: Removed 10 rows containing non-finite outside the scale range +(`stat_bin()`). ``` @@ -111,7 +113,7 @@ There is no right answer. However, looking at the histogram, you can see an appr In order to be able to compare read counts between samples, we must first adjust ('normalise') the counts to control for differences in sequence depth and sample composition between samples. To achieve this, we run the following code that will normalise the counts matrix using the median-of-ratios method implemented in the R package `DESeq2`. For more information on the rationale for scaling RNA-Seq counts and a comparison of the different metrics used see [RDMBites | RNAseq expression data](https://www.youtube.com/watch?v=tO2H3zuBouw). -```r +``` r # convert the condition variable to a factor as required by DESeq2 samp.info.ibd.sel[c('condition')] <- lapply(samp.info.ibd.sel[c('condition')], factor) @@ -122,12 +124,12 @@ dds.ibd <- DESeq2::DESeqDataSetFromMatrix( design = ~ condition) ``` -```{.warning} +``` warning Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment' ``` -```r +``` r # calculate the normalised count values using the median-of-ratios method dds.ibd <- dds.ibd %>% DESeq2::estimateSizeFactors() @@ -142,7 +144,7 @@ There are a number of methodologies to calculate the appropriate threshold. One Calculating the Jaccard index between all pairs of samples in a dataset does not however scale well with the dimensionality of the data. Here we use an alternative approach that achieves a very similar result using the Multiset Jaccard index, which is faster to compute. Don't worry too much about the code, the main point is that we find a filter threshold, and filter out all the genes where the max count value is below the threshold, on the basis that these are likely technical noise. Run the code to plot the Multiset Jaccard Index for a series of values for the filter threshold. -```r +``` r # For the purpose of illustration, and to shorten the run time, we sample 5,000 genes from the counts matrix. set.seed(10) @@ -184,6 +186,12 @@ ggplot2::ggplot(data=data.frame(t = t.seq, jacc = ms.jac)) + ggplot2::ylab("Multiset Jaccard Index") ``` +``` warning +Warning in ggplot2::geom_point(ggplot2::aes(x = which.max(ms.jac), y = max(ms.jac)), : All aesthetics have length 1, but the data has 25 rows. +ℹ Please consider using `annotate()` or provide this layer with data containing + a single row. +``` + ::::::::::::::::::::::::::::::::::::: challenge @@ -197,11 +205,11 @@ What value should we use for the low counts threshold? The threshold value is given by the following code, which should return a value close to 10 for this dataset. -```r +``` r (t.hold <- which.max(ms.jac)) ``` -```{.output} +``` output [1] 11 ``` @@ -215,13 +223,13 @@ The threshold value is given by the following code, which should return a value Having determined a threshold, we then filter the raw counts matrix on the rows (genes) that meet the threshold criterion based on the normalised counts. As you can see, around 4,000 genes are removed from the dataset, which is approximately 20% of the genes. These genes are unlikely to contain biologically meaningful information but they might easily bias a machine learning classifier. -```r +``` r counts.mat.ibd.filtered <- counts.mat.ibd[which(apply(counts.ibd.norm, 1, function(x){sum(x > t.hold) >= 1})),] sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd)-nrow(counts.mat.ibd.filtered), nrow(counts.mat.ibd.filtered)) ``` -```{.output} +``` output [1] "Genes filtered: 3712; Genes remaining: 19038" ``` @@ -235,20 +243,20 @@ of the general population and a result of biological heterogeneity and technical Run the following code to view the top 10 values of read counts in the raw counts matrix. Compare this with the histogram above, and the mean read count. The largest read count values range from 2MM to over 3MM counts for a gene in a particular sample. These require investigation! -```r +``` r tail(sort(as.matrix(counts.mat.ibd)),10) ``` -```{.output} +``` output [1] 2037946 2038514 2043983 2133125 2238093 2269033 2341479 2683585 3188911 [10] 3191428 ``` -```r +``` r sprintf("The mean read count value: %f", mean(as.matrix(counts.mat.ibd))) ``` -```{.output} +``` output [1] "The mean read count value: 506.731355" ``` @@ -259,7 +267,7 @@ As with low counts, multiple methods exist to identify influential outlier read Run the following code to create DESeq Data Set object from the filtered raw counts matrix, with condition as the experimental factor of interest. -```r +``` r dds.ibd.filt <- DESeq2::DESeqDataSetFromMatrix( countData = as.matrix(counts.mat.ibd.filtered), colData = data.frame(samp.info.ibd.sel, row.names = 'sampleID'), @@ -269,56 +277,56 @@ dds.ibd.filt <- DESeq2::DESeqDataSetFromMatrix( Run `DESeq2` differential expression analysis, which automatically calculates the Cook's distances for every read count. This may take a few minutes to run. -```r +``` r deseq.ibd <- DESeq2::DESeq(dds.ibd.filt) ``` -```{.output} +``` output estimating size factors ``` -```{.output} +``` output estimating dispersions ``` -```{.output} +``` output gene-wise dispersion estimates ``` -```{.output} +``` output mean-dispersion relationship ``` -```{.output} +``` output final dispersion estimates ``` -```{.output} +``` output fitting model and testing ``` -```{.output} +``` output -- replacing outliers and refitting for 1559 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) ``` -```{.output} +``` output estimating dispersions ``` -```{.output} +``` output fitting model and testing ``` -```r +``` r cooks.mat <- SummarizedExperiment::assays(deseq.ibd)[["cooks"]] ``` We now calculate the cooks outlier threshold by computing the expected F-distribution for the number of samples in the dataset. -```r +``` r cooks.quantile <- 0.95 m <- ncol(deseq.ibd) # number of samples p <- 3 # number of model parameters (in the three condition case) @@ -329,13 +337,13 @@ h.threshold <- stats::qf(cooks.quantile, p, m - p) Filter the counts matrix to eliminate all genes where the cooks distance is over the outlier threshold. Here you can see that a further ~1,800 genes are filtered out based on having outlier read count values for at least one sample. -```r +``` r counts.mat.ibd.ol.filtered <- counts.mat.ibd.filtered[which(apply(cooks.mat, 1, function(x){(max(x) < h.threshold) >= 1})),] sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd.filtered)-nrow(counts.mat.ibd.ol.filtered), nrow(counts.mat.ibd.ol.filtered)) ``` -```{.output} +``` output [1] "Genes filtered: 1776; Genes remaining: 17262" ``` @@ -343,7 +351,7 @@ sprintf("Genes filtered: %s; Genes remaining: %s", nrow(counts.mat.ibd.filtered) Run the same plot as above, and we can see that the genes with very low maximum counts over all samples have been removed. Note that here we are looking at the raw unnormalised counts, so not all maximum counts are below 11, as the filter was applied to the normailsed counts. -```r +``` r data.frame(max_count = apply(counts.mat.ibd.ol.filtered, 1, max, na.rm=TRUE)) %>% ggplot2::ggplot(ggplot2::aes(x = max_count)) + ggplot2::geom_histogram(bins = 200) + diff --git a/episode6.md b/episode6.md index 4c2911f..ee51360 100644 --- a/episode6.md +++ b/episode6.md @@ -39,14 +39,14 @@ download.file(url = "https://zenodo.org/record/8125141/files/counts.mat.ibd.ol.f And now read the files into R... -```r +``` r # suppressPackageStartupMessages(library(dplyr, quietly = TRUE)) # suppressPackageStartupMessages(library(ggplot2, quietly = TRUE)) # suppressPackageStartupMessages(library(tibble, quietly = TRUE)) ``` -```r +``` r samp.info.ibd.sel <- read.table(file="./data/ibd.sample.info.txt", sep="\t", header=T, fill=T, check.names=F) counts.mat.ibd.ol.filtered <- read.table(file="./data/counts.mat.ibd.ol.filtered.txt", sep='\t', header=T, fill=T, check.names=F) @@ -55,7 +55,7 @@ counts.mat.ibd.ol.filtered <- read.table(file="./data/counts.mat.ibd.ol.filtered A simple histogram of the raw counts data illustrates the skewed nature of the distribution. Here we plot a random sample of just 5,000 counts to illustrate the point. -```r +``` r set.seed(seed = 30) `%>%` <- magrittr::`%>%` @@ -75,7 +75,7 @@ counts.mat.ibd.ol.filtered %>% A plot of the mean count against the variance of the counts for a random sample of 5,000 genes (to reduce compute time) illustrates the clear mean variance relationship in the data. The variance is increasing as the mean count increases. -```r +``` r counts.mat.ibd.ol.filtered %>% .[sample(nrow(.), size = 5000, replace = FALSE),] %>% data.frame(row.mean = apply(., 1, mean), @@ -96,7 +96,7 @@ A number of standard transformations exist to change the distribution of RNA-Seq The `vst` transformation is faster to run and more suitable for datasets with a large sample size. Let's conduct the `vst` transformation on our filtered counts data. This may take a few minutes to run on our dataset. -```r +``` r # convert the condition variable to a factor as required by DESeq2 samp.info.ibd.sel[c('condition')] <- lapply(samp.info.ibd.sel[c('condition')], factor) @@ -107,12 +107,12 @@ dds.ibd.filt.ol <- DESeq2::DESeqDataSetFromMatrix( design = ~ condition) ``` -```{.warning} +``` warning Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment' ``` -```r +``` r # calculate the vst count values and extract the counts matrix using the assay() function counts.mat.ibd.vst <- DESeq2::varianceStabilizingTransformation(dds.ibd.filt.ol, blind=FALSE) %>% SummarizedExperiment::assay() @@ -121,7 +121,7 @@ counts.mat.ibd.vst <- DESeq2::varianceStabilizingTransformation(dds.ibd.filt.ol, Plotting the data again, we can see the difference in the distribution of the data following transformation. Although there is still a large number of count values equal to zero, the distribution of vst transformed counts is far less heavily skewed than the original. The dependence of the variance on the mean has essentially been eliminated. -```r +``` r set.seed(seed = 30) counts.mat.ibd.vst %>% @@ -138,7 +138,7 @@ counts.mat.ibd.vst %>% -```r +``` r counts.mat.ibd.vst %>% as.data.frame() %>% .[sample(nrow(.), size = 5000, replace = FALSE),] %>% @@ -160,7 +160,7 @@ Many machine learning algorithms are sensitive to the scale of the data. Even af -```r +``` r counts.mat.ibd.vst %>% as.data.frame() %>% tibble::rownames_to_column("geneID") %>% diff --git a/fig/.DS_Store b/fig/.DS_Store deleted file mode 100644 index 5008ddf..0000000 Binary files a/fig/.DS_Store and /dev/null differ diff --git a/md5sum.txt b/md5sum.txt index fc24f84..d3197e3 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -1,17 +1,17 @@ "file" "checksum" "built" "date" -"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-05-14" -"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-05-14" -"config.yaml" "0dab73289fecfb94fc3103a200936186" "site/built/config.yaml" "2024-05-14" -"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2024-05-14" -"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2024-05-14" -"episodes/episode1.Rmd" "926badff629b1d85cfbf29a12e8c9291" "site/built/episode1.md" "2024-05-14" -"episodes/episode2.Rmd" "ffacead9b9c7c4d5ca3af86ca7f59799" "site/built/episode2.md" "2024-05-14" -"episodes/episode3.Rmd" "5852f58c31251027948529f82784e6f2" "site/built/episode3.md" "2024-05-14" -"episodes/episode4.Rmd" "982b6c815f8a49a2d4a6fe2bea3051a3" "site/built/episode4.md" "2024-05-14" -"episodes/episode5.Rmd" "cd73f8beb86205245cb08c773122cb3c" "site/built/episode5.md" "2024-05-14" -"episodes/episode6.Rmd" "4128e539652f9c6dc6285566fdca5f78" "site/built/episode6.md" "2024-05-14" -"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2024-05-14" -"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2024-05-14" -"learners/setup.md" "d0c4fbb2853d84c779714577c6673c4b" "site/built/setup.md" "2024-05-14" -"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2024-05-14" -"renv/profiles/lesson-requirements/renv.lock" "d8e29c4c02c7634e3dc017e6f2421008" "site/built/renv.lock" "2024-05-14" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-11-05" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-11-05" +"config.yaml" "0dab73289fecfb94fc3103a200936186" "site/built/config.yaml" "2024-11-05" +"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2024-11-05" +"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2024-11-05" +"episodes/episode1.Rmd" "926badff629b1d85cfbf29a12e8c9291" "site/built/episode1.md" "2024-11-05" +"episodes/episode2.Rmd" "ffacead9b9c7c4d5ca3af86ca7f59799" "site/built/episode2.md" "2024-11-05" +"episodes/episode3.Rmd" "5852f58c31251027948529f82784e6f2" "site/built/episode3.md" "2024-11-05" +"episodes/episode4.Rmd" "982b6c815f8a49a2d4a6fe2bea3051a3" "site/built/episode4.md" "2024-11-05" +"episodes/episode5.Rmd" "cd73f8beb86205245cb08c773122cb3c" "site/built/episode5.md" "2024-11-05" +"episodes/episode6.Rmd" "4128e539652f9c6dc6285566fdca5f78" "site/built/episode6.md" "2024-11-05" +"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2024-11-05" +"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2024-11-05" +"learners/setup.md" "d0c4fbb2853d84c779714577c6673c4b" "site/built/setup.md" "2024-11-05" +"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2024-11-05" +"renv/profiles/lesson-requirements/renv.lock" "92598a2e0f21d1a17fa11f956dc840e8" "site/built/renv.lock" "2024-11-05"