04-05-reducedims.Rmd

# PCAs and UMAPs {#reducedims}

[Slides](https://docs.google.com/presentation/d/17-AYqsosmKnJFgv_7DHYNbMQFIXrGAP4/edit#slide=id.p1)
<!-- why pca -->
<!-- how to pick genes for pca -->
<!-- why not pca? the blob of not very usefulness. -->
<!-- Elbow plots -->
<!-- oooh umap. -->


## Identification of highly variable features (feature selection)

#### Why do we need to do this? {- .rational}
Identifying the most variable features allows retaining the real biological variability of the data and reduce noise in the data.                            

####  {-}

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). The Seurat authors and [others](https://www.nature.com/articles/nmeth.2645) have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

The procedure in Seurat is described in detail [here](https://doi.org/10.1016/j.cell.2019.05.031), and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the `FindVariableFeatures()` function. By default, Seurat returns 2,000 features per dataset. These will be used in downstream analysis, like PCA.

```{r var_features}
seurat_object <- FindVariableFeatures(seurat_object, selection.method = 'vst', nfeatures = 2000)
# Identify the 10 most highly variable genes
top_vf <- head(VariableFeatures(seurat_object), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(seurat_object)
plot2 <- LabelPoints(plot1, points = top_vf, repel = TRUE)
plot2
```

<!-- #### Challenge: Labelling Genes of Interest {- .challenge} -->

<!-- What if we wanted to look at genes we are specifically interested in? We can create a character vector of gene names and apply that to this plot. -->

<!-- Make a plot with labels for the genes IL8, IDH2 and CXCL3. -->


## Scaling the data

#### Why do we need to do this? {- .rational}

Highly expresed genes can overpower the signal of other less expresed genes with equal importance. Within the same cell the assumption is that the underlying RNA content is constant. Aditionally, If variables are provided in vars.to.regress, they are individually regressed against each feature, and the resulting residuals are then scaled and centered. This step allows controling for cell cycle and other factors that may bias your clustering.


####  {-}


Next, we apply a linear transformation ('scaling') that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The `ScaleData()` function:

* Shifts the expression of each gene, so that the mean expression across cells is 0
* Scales the expression of each gene, so that the variance across cells is 1
    + This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
* The results of this are stored in `seurat_object$RNA@scale.data`

```{r regress, fig.height=7, fig.width=11, results='hide'}
all.genes <- rownames(seurat_object)
seurat_object <- ScaleData(seurat_object, features = all.genes)
```

<!--omit-begin-->
<details>
  <summary>**This step takes too long! Can I make it faster?**</summary>

Scaling is an essential step in the Seurat workflow, but only on genes that will be used as input to PCA. Therefore, the default in `ScaleData()` is only to perform scaling on the previously identified variable features (2,000 by default). To do this, omit the `features` argument in the previous function call, i.e.
```{r regressvar, fig.height=7, fig.width=11, results='hide',eval = FALSE}
# seurat_object <- ScaleData(seurat_object)
```
Your PCA and clustering results will be unaffected. However, Seurat heatmaps (produced as shown below with `DoHeatmap()`) require genes in the heatmap to be scaled, to make sure highly-expressed genes don't dominate the heatmap. To make sure we don't leave any genes out of the heatmap later, we are scaling all genes in this tutorial. 
</details>
\  
<details>
  <summary>**How can I remove unwanted sources of variation, as in Seurat v2?**</summary>

In `Seurat v2` we also use the `ScaleData()` function to remove unwanted sources of variation from a single-cell dataset. For example, we could 'regress out' heterogeneity associated with (for example) cell cycle stage, or mitochondrial contamination. These features are still supported in `ScaleData()` in `Seurat v3`, i.e.: 
```{r regressvarmt, fig.height=7, fig.width=11, results='hide',eval = FALSE}
# seurat_object <- ScaleData(seurat_object, vars.to.regress = 'percent.mt')
```
However, particularly for advanced users who would like to use this functionality, the Seurat authors strongly recommend the use of their new normalization workflow, `SCTransform()`. The method is described in their [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1), with a separate vignette using Seurat v3 [here](sctransform_vignette.html). As with `ScaleData()`, the function `SCTransform()` also includes a `vars.to.regress` parameter.
</details>
\ 

***
<!--omit-end-->

# Dimensionality reduction

#### Why do we need to do this? {- .rational}

Imagine each gene represents a dimension - or an axis on a plot. We could plot the expression of two genes with a simple scatterplot. But a genome has thousands of genes - how do you collate all the information from each of those genes in a way that allows you to visualise it in a 2 dimensional image. This is where dimensionality reduction comes in, we calculate meta-features that contains combinations of the variation of different genes. From thousands of genes, we end up with 10s of meta-features

####  {-}                    


## Perform linear dimensional reduction

Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using `features` argument if you wish to choose a different subset.

```{r pca,results='hide'}
seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object))
```

Seurat provides several useful ways of visualizing both cells and features that define the PCA, including `VizDimReduction()`, `DimPlot()`, and `DimHeatmap()`

```{r pca_viz, message=TRUE}
# Examine and visualize PCA results a few different ways
print(seurat_object$pca, dims = 1:5, nfeatures = 5)
VizDimLoadings(seurat_object, dims = 1:2, reduction = 'pca')
DimPlot(seurat_object, reduction = 'pca')
```

In particular `DimHeatmap()` allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting `cells` to a number plots the 'extreme' cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.

```{r single-heatmap}
DimHeatmap(seurat_object, dims = 1, cells = 500, balanced = TRUE)
```

```{r multi-heatmap, fig.height=15, fig.width=9}
DimHeatmap(seurat_object, dims = 1:15, cells = 500, balanced = TRUE)
```


## Determine the 'dimensionality' of the dataset

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a 'metafeature' that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?

-----

*Note*: The Seurat developers suggest using a JackStraw resampling test to determine this. See [Macosko *et  al*](http://www.cell.com/abstract/S0092-8674(15)00549-8), and the original [seurat_object3 vignette](https://satijalab.org/seurat/articles/seurat_object3k_tutorial.html#determine-the-dimensionality-of-the-dataset-1). We're going to use an Elbow Plot instead here, because its much quicker.

-----

An alternative heuristic method generates an 'Elbow plot': a ranking of principle components based on the percentage of variance explained by each one (`ElbowPlot()` function). In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs. 

```{r elbow_plot}
ElbowPlot(seurat_object)
```

Identifying the true dimensionality of a dataset -- can be challenging/uncertain for the user. We therefore suggest these three approaches to consider. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff. 

We chose 10 here, but encourage users to consider the following:

* In the original version of this vignette with the PBMC3k dataset, genes strongly associated with PCs 12 and 13 defined rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare, they are difficult to distinguish from background noise for a dataset of this size without prior knowledge.
* We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
* We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.

***


## Run non-linear dimensional reduction (UMAP/tSNE)

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.

<!-- # If you haven't installed UMAP, you can do so via reticulate::py_install(packages = "umap-learn") -->
```{r tsne, fig.height=5, fig.width=7}
seurat_object <- RunUMAP(seurat_object, dims = 1:10)
```

```{r Umapplot, fig.height=5, fig.width=7}
DimPlot(seurat_object, reduction = 'umap')
```


#### Challenge: PC genes { - .challenge}

You can plot gene expression on the UMAP with the `FeaturePlot()` function. 

Try out some genes that were highly weighted in the principal component analysis. How do they look?


####  {-}


## Save 

You can save the object at this point so that it can easily be loaded back in with `readRDS()` without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.

```{r saveobject, eval=FALSE}
saveRDS(seurat_object, file = "seurat_object_tutorial_saved.rds") 
```

Tip: For faster saving and loading, try the "qs" package.