03-01-load.Rmd



# Load data {#load}

[Let's get started with a single cell introduction](https://docs.google.com/presentation/d/1yKxSWL_sYto-alC-1BXIk1WbSnIGxnky/edit#slide=id.p1)

## Setup the Seurat Object

## The data set

The dataset used in this workshop is a *modified* version derived from this study ([see here](https://pubmed.ncbi.nlm.nih.gov/29227470/)). It has been adapted to introduce additional complexity for instructional purposes. *Please refrain from drawing any biological conclusions from this data as it does not represent real experimental results*.

This dataset represents human peripheral blood mononuclear cells (PBMCs), pooled from eight individual donors. Genetic differences among donors enable the identification of some cell doublets, enhancing data complexity. It includes two single-cell sequencing batches, one of which was stimulated with IFN-beta. Additionally, mitochondrial expression levels have been introduced to demonstrate how to interpret and apply mitochondrial thresholds for filtering purposes.


#### Note: What does the data look like? { - .challenge}

What do the input files look like? It varies, but this is the output of the CellRanger pipleine, described [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/gex-outputs)

```
├── analysis
│   ├── clustering
│   ├── diffexp
│   ├── pca
│   ├── tsne
│   └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5.  --> matrix to read in h5 format
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5 
└── web_summary.html
```

### { - }


We start by reading in the data. There are several options for loading the data. The `Read10X()` function reads in the output of the [cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).


We next use the count matrix to create a `Seurat` object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For a technical discussion of the `Seurat` object structure, check out the [GitHub Wiki](https://github.com/satijalab/seurat/wiki). For example, the count matrix is stored in `pbmc@assays$RNA@counts`.

```{r libloader2, results='hide', message=FALSE, warning=FALSE} 
library(dplyr)
library(ggplot2)
library(Seurat)
library(patchwork)
```

## Different ways of loading the data

Example 1. Load your data using the path to the folder: ```filtered_feature_bc_matrix``` that is in the output folder of your cellranger run using the ```Read10X``` function.

```{}
# Load the PBMC dataset
 pbmc.data <- Read10X(data.dir = "outs/filtered_feature_bc_matrix")
# Initialize the Seurat object with the raw (non-normalized data).
 seurat_object <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)
```

Example 2. Load your data directing the ```ReadMtx``` function to each of the relevant files in the ```filtered_feature_bc_matrix``` folder in the outputs from your cellranger run. MTX is a simple text format for sparse matrices.

```{}
expression_matrix <- ReadMtx(
  mtx = "outs/filtered_feature_bc_matrix/count_matrix.mtx.gz", features = "outs/filtered_feature_bc_matrix/features.tsv.gz",
  cells = "outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
)
seurat_object <- CreateSeuratObject(counts = expression_matrix)
```


Example 3. Load your data using the ```Read10X_h5``` function to each of the relevant HDF5 files. HDF5 is an efficient binary format.


```{r}
pbmc.data <- Read10X_h5("data/filtered_feature_bc_matrix.h5")
metadata <- read.table("data/metadata.txt")
  
seurat_object <- CreateSeuratObject(counts = pbmc.data ,
                                 assay = "RNA", project = 'pbmc')

# Use AddMetaData to add new meta data to object
seurat_object  <- AddMetaData(object = seurat_object, metadata = metadata)
```


<details>
  <summary>**What does data in a count matrix look like?**</summary>

```{r, warning=FALSE}
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D","TCL1A","MS4A1"), 1:30]
```

The `.` values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0,  Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.

```{r}
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
sparse.size <- object.size(pbmc.data)
sparse.size
dense.size / sparse.size
```

</details>