04-03-qc.Rmd

# (PART) Single Cell Analysis {-}  


# QC Filtering {#qc}


<!-- Discuss counts per cell/gene and make plots -->
<!-- Discuss that there's no one threshold -->
<!-- everyone pick a threshold and go filter -->
<!-- check numbers of cells. -->
<!-- save your object -->
<!-- (other qc metrics = Mt gene content, cell cycle asignment, low seq diversity e.t.c) -->


The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.

## QC and selecting cells for further analysis

#### Why do we need to do this? {- .rational}
Low quality cells can add noise to your results leading you to the wrong biological conclusions. Using only good quality cells helps you to avoid this. Reduce noise in the data by filtering out low quality cells such as dying or stressed cells (high mitochondrial expression) and cells with few features that can reflect empty droplets.

####  {-}


Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics [commonly used](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/) by the community include

* The number of unique genes detected in each cell. 
    + Low-quality cells or empty droplets will often have very few genes
    + Cell doublets or multiplets may exhibit an aberrantly high gene count
* Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
* The percentage of reads that map to the mitochondrial genome
    + Low-quality / dying cells often exhibit extensive mitochondrial contamination
    + We calculate mitochondrial QC metrics with the `PercentageFeatureSet()` function, which calculates the percentage of counts originating from a set of features
    + We use the set of all genes starting with `MT-` as a set of mitochondrial genes

```{r mito, fig.height=7, fig.width=13}
# The $ operator can add columns to object metadata. 
# This is a great place to stash QC stats
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
```

  
#### Challenge: The meta.data slot in the Seurat object { - .challenge}

Where are QC metrics stored in Seurat?

* The number of unique genes and total molecules are automatically calculated during `CreateSeuratObject()`
    + You can find them stored in the object meta data

1. What do you notice has changed within the `meta.data` table now that we have calculated mitochondrial gene proportion?

2. Imagine that this is the first of 
several samples in our experiment. Add a `samplename` column to to the `meta.data` table.

####  {-}


In the example below, we visualize QC metrics, and use these to filter cells.

* We filter cells that have unique feature counts over 2,500 or less than 200
* We filter cells that have >5% mitochondrial counts
    
```{r qc2}
#Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)


# visualize the max and min 
VlnPlot(pbmc, features = "nFeature_RNA")+ scale_y_continuous(limits = c(1000,3000))
VlnPlot(pbmc, features = "nFeature_RNA",y.max =1000)


# FeatureScatter is typically used to visualize feature-feature relationships, 
# but can be used for anything calculated by the object, 
# i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") 
plot1 + plot2
```

Lets look at the number of features (genes) to the percent mitochondrial genes plot.

```{r}
plot3 <- FeatureScatter(pbmc, feature1 = "nFeature_RNA", feature2 = "percent.mt") 
plot3
```


<!-- #### Challenge: Ribosomal gene expression as a QC metric {- .challenge} -->
<!-- Ribosomal gene expression could be another factor to look into your cells within your experiment. -->

<!-- Create more columns of metadata using `PercentageFeatureSet` function, this time search for ribosomal genes. We can  calculate the percentage for the large subunit (RPL) and small subunit (RPS) ribosomal genes.  -->

<!-- Use `FeatureScatter` to plot combinations of metrics available in metadata. How is the mitochondrial gene percentage related to the ribosomal gene percentage? What can you see? Discuss in break out. -->

<!-- <details> -->
<!-- <summary>**Code for challenge**</summary> -->
<!-- Create new meta.data columns to contain percentages of the large and small ribosomal genes. -->

<!-- Then plot a scatter plot with this new data. You should find that the large and small ribosomal subunit genes are correlated within cell.  -->

<!-- What about with mitochondria and gene, feature counts? -->

<!-- These are the cells you may want to exclude. -->
<!-- </details> -->

<!-- <details> -->
<!-- <summary>**Advanced Challenge**</summary> -->
<!-- Highlight cells with very low percentage of ribosomal genes, create a new column in the meta.data table and with `FeatureScatter` make a plot of the RNA count and mitochondrial percentage with the cells with very low ribosomal gene perentage. -->
<!-- </details> -->
<!-- ### { - } --> 
you can check different thresholds of mito percentage

```{r}
#Number of cells left after filters
# percentage of cell with less than 10% mito
round(sum(pbmc$percent.mt < 10)/dim(pbmc)[2]*100,2) 

# percentage of cell with less than 20% mito
round((sum(pbmc$percent.mt < 20)/dim(pbmc)[2])*100,2)

# percentage of cell with less than 30% mito
round((sum(pbmc$percent.mt < 30)/dim(pbmc)[2])*100,2) 


```

Okay we are happy with our thresholds for mitochondrial percentage in cells, lets apply them and subset our data. This will remove the cells we think are of poor quality. 

```{r}
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```


Lets replot the feature scatters and see what they look like. 

```{r qc2_sidebar, fig.height=7, fig.width=13}
plot5 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot6 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") 
plot5 + plot6
```

#### Challenge: Filter the cells {- .challenge}

Apply the filtering thresolds defined above. 

* How many cells survived filtering?

The PBMC3k dataset we're working with in this tutorial is quite old. There are a number of other example datasets available from the 10X website, including [this one](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-3-v3-1-chromium-x-with-intronic-reads-3-1-high) - published in 2022, sequencing 10k PBMCs with a newer chemistry and counting method.
 
* What thresholds would you chose to apply to this modern dataset?

```{r eval=FALSE}
pbmc10k_unfiltered <- readRDS("data/10k_PBMC_v3.1ChromiumX_Intronic.rds")
VlnPlot(pbmc10k_unfiltered, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
```