-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-03-qc.Rmd
executable file
·168 lines (101 loc) · 6.81 KB
/
04-03-qc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
# (PART) Single Cell Analysis {-}
# QC Filtering {#qc}
<!-- Discuss counts per cell/gene and make plots -->
<!-- Discuss that there's no one threshold -->
<!-- everyone pick a threshold and go filter -->
<!-- check numbers of cells. -->
<!-- save your object -->
<!-- (other qc metrics = Mt gene content, cell cycle asignment, low seq diversity e.t.c) -->
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
## QC and selecting cells for further analysis
#### Why do we need to do this? {- .rational}
Low quality cells can add noise to your results leading you to the wrong biological conclusions. Using only good quality cells helps you to avoid this. Reduce noise in the data by filtering out low quality cells such as dying or stressed cells (high mitochondrial expression) and cells with few features that can reflect empty droplets.
#### {-}
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics [commonly used](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/) by the community include
* The number of unique genes detected in each cell.
+ Low-quality cells or empty droplets will often have very few genes
+ Cell doublets or multiplets may exhibit an aberrantly high gene count
* Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
* The percentage of reads that map to the mitochondrial genome
+ Low-quality / dying cells often exhibit extensive mitochondrial contamination
+ We calculate mitochondrial QC metrics with the `PercentageFeatureSet()` function, which calculates the percentage of counts originating from a set of features
+ We use the set of all genes starting with `MT-` as a set of mitochondrial genes
```{r mito, fig.height=7, fig.width=13}
# The $ operator can add columns to object metadata.
# This is a great place to stash QC stats
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
```
#### Challenge: The meta.data slot in the Seurat object { - .challenge}
Where are QC metrics stored in Seurat?
* The number of unique genes and total molecules are automatically calculated during `CreateSeuratObject()`
+ You can find them stored in the object meta data
1. What do you notice has changed within the `meta.data` table now that we have calculated mitochondrial gene proportion?
2. Imagine that this is the first of
several samples in our experiment. Add a `samplename` column to to the `meta.data` table.
#### {-}
In the example below, we visualize QC metrics, and use these to filter cells.
* We filter cells that have unique feature counts over 2,500 or less than 200
* We filter cells that have >5% mitochondrial counts
```{r qc2}
#Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# visualize the max and min
VlnPlot(pbmc, features = "nFeature_RNA")+ scale_y_continuous(limits = c(1000,3000))
VlnPlot(pbmc, features = "nFeature_RNA",y.max =1000)
# FeatureScatter is typically used to visualize feature-feature relationships,
# but can be used for anything calculated by the object,
# i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```
Lets look at the number of features (genes) to the percent mitochondrial genes plot.
```{r}
plot3 <- FeatureScatter(pbmc, feature1 = "nFeature_RNA", feature2 = "percent.mt")
plot3
```
<!-- #### Challenge: Ribosomal gene expression as a QC metric {- .challenge} -->
<!-- Ribosomal gene expression could be another factor to look into your cells within your experiment. -->
<!-- Create more columns of metadata using `PercentageFeatureSet` function, this time search for ribosomal genes. We can calculate the percentage for the large subunit (RPL) and small subunit (RPS) ribosomal genes. -->
<!-- Use `FeatureScatter` to plot combinations of metrics available in metadata. How is the mitochondrial gene percentage related to the ribosomal gene percentage? What can you see? Discuss in break out. -->
<!-- <details> -->
<!-- <summary>**Code for challenge**</summary> -->
<!-- Create new meta.data columns to contain percentages of the large and small ribosomal genes. -->
<!-- Then plot a scatter plot with this new data. You should find that the large and small ribosomal subunit genes are correlated within cell. -->
<!-- What about with mitochondria and gene, feature counts? -->
<!-- These are the cells you may want to exclude. -->
<!-- </details> -->
<!-- <details> -->
<!-- <summary>**Advanced Challenge**</summary> -->
<!-- Highlight cells with very low percentage of ribosomal genes, create a new column in the meta.data table and with `FeatureScatter` make a plot of the RNA count and mitochondrial percentage with the cells with very low ribosomal gene perentage. -->
<!-- </details> -->
<!-- ### { - } -->
you can check different thresholds of mito percentage
```{r}
#Number of cells left after filters
# percentage of cell with less than 10% mito
round(sum(pbmc$percent.mt < 10)/dim(pbmc)[2]*100,2)
# percentage of cell with less than 20% mito
round((sum(pbmc$percent.mt < 20)/dim(pbmc)[2])*100,2)
# percentage of cell with less than 30% mito
round((sum(pbmc$percent.mt < 30)/dim(pbmc)[2])*100,2)
```
Okay we are happy with our thresholds for mitochondrial percentage in cells, lets apply them and subset our data. This will remove the cells we think are of poor quality.
```{r}
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```
Lets replot the feature scatters and see what they look like.
```{r qc2_sidebar, fig.height=7, fig.width=13}
plot5 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot6 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot5 + plot6
```
#### Challenge: Filter the cells {- .challenge}
Apply the filtering thresolds defined above.
* How many cells survived filtering?
The PBMC3k dataset we're working with in this tutorial is quite old. There are a number of other example datasets available from the 10X website, including [this one](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-3-v3-1-chromium-x-with-intronic-reads-3-1-high) - published in 2022, sequencing 10k PBMCs with a newer chemistry and counting method.
* What thresholds would you chose to apply to this modern dataset?
```{r eval=FALSE}
pbmc10k_unfiltered <- readRDS("data/10k_PBMC_v3.1ChromiumX_Intronic.rds")
VlnPlot(pbmc10k_unfiltered, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
```