Assignment 3 .Rmd

---
title: "Bioinformatics and Molecular Biology Techniques SLE712 Assignment 3"
author: "Dulini Fernando - 220168099"
date: "12/06/2020"
output:
  rmarkdown::html_document:
    toc: true
theme: cosmo
---
  
Source: https://github.com/fernandoh123/Assignment3_part1


# Part 1: Importing files, data wrangling, mathematical operations, plots and saving code on GitHub

## Introduction

Rstudio is an integrated development environment with a source pane; where the R   scripts are created and edited, a console; where the code is evaluated, a environment tab; that contains all data objects of cureent R session and the viewer tab with all helpful information. R studio allows to import data, prcoess raw data to information and visualize the results. Further the scripts can be saved in Github as a backup to be used in a case of lost code. Github allows reasechers and bioinformaticians to share, download and reproduce a code. 

### Q1. The file gene_expression.tsv downloaded from the github repository

To read in file, the file gene_expression.tsv that contains RNA-seq count data for two samples of interest were imported from [SLE712_files](https://github.com/markziemann/SLE712_files/blob/master/bioinfo_asst3_part1_files/gene_expression.tsv) resource page on April 26 2020. The raw data file can be accessed by this [link](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/gene_expression.tsv) and destfile will show the path the file should be saved with the file name. 

```{r, chunk 1, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/gene_expression.tsv", 
              destfile = "gene_expression.tsv")
```

#### Read the file with gene accession numbers as row numbers 
The function `read.table` was used to read the imported file in table format and create a dataframe correspoding to lines and variabels to fields in the file. This is saved as the object "x". As arguments header is set as TRUE and row.names as 1 to indicate that first column in the input is used for the raw names. So that gene accession numbers are saved as row names. 

```{r, chunk 2, echo=TRUE}
x <- read.table("gene_expression.tsv", header = TRUE, row.names = 1)
```

#### Displaying a table of values for the first six genes
The function `head` was used to show a table of first six genes stored in the object x.

```{r, chunk 3, echo=TRUE}
head(x)
```

### Q2. Making a new column which contains the mean of other columns
The function `rowMeans` was used to calculate the mean of the columns of the dataframe stored as x. The result was stored in a new column of data frame x, named as "Mean". The `$` operator address the column by name.

```{r, chunk 4, echo=TRUE}
x$Mean <- rowMeans(x)
```

#### Displaying a table of values for the first six genes
The function `head` was used to show a table of first six genes stored in the object x
```{r, chunk 5, echo=TRUE}
head(x)
```

### Q3. Listing the 10 genes with highest mean expression
The function `order` was used to rearrange the Mean column of the dataframe x. To arrange the Mean column in the descending order, `-` operator was used. The result was stored in the object "ord". The function `head` was used to list the first 10 genes stored in the object ord.

```{r, chunk 6, echo=TRUE}
ord <- x[order(-x$Mean),]
head(ord,10)
```

### Q4. Number of genes with a mean<10
The function `subset` was used to create a subset from the the data frame x that meets the condition of mean<10 in the column Mean. The function `nrow` was used to find the number of rows==genes that is present in the newly formed subset.

```{r, chunk 7, echo=TRUE}
nrow( subset(x, x$Mean<10))
```

### Q5. Histogram plot of the mean values
The function `hist` was used to plot the histogram of Mean column of the data frame x. The histogram was plotted with 50 breakpoints, bars filled with blue colour and x values in the range of 0 to 30000.

```{r, chunk 8, echo=TRUE}
hist(x$Mean, ylab = "Frequency", xlab = "Mean", main = "Histogram Plot of Mean Values" , breaks = 50, col = "blue", xlim = c(0,300000))
```

![](http://118.138.234.73:8787/files/Assignment3_part1/Hist_Mean.png)

### Q6. The file growth_data.csv downloaded from the github repository
The file growth_data.csv that contains measurements for tree circumference growing at two sites; control site and treatment site which were planted 20 years ago was imported from [SLE712_files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part1_files) resource page on 8 May 2020. The raw data can be accessed with this [link](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/growth_data.csv) and destfile will show the path the file should be saved with the file name.

```{r, chunk 9, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/growth_data.csv", 
              destfile = "growth_data.csv")
```
              
The function `read.table` was used to read the downloaded file in table format and to a create  data frame. This was saved as the object "y". The header was set to be TRUE, to indicate the file contains the names of the variables as its first line. The character seperator was used to seperate values in each lines of the file. stringAsFactors was set to be FALSE to prevent vectors been converted factors. 

```{r, chunk 10, echo=TRUE}
y <- read.table("growth_data.csv", header = TRUE, sep = ",",stringsAsFactors = FALSE)
```

To find the column names of the data frame y, the function `colnames` was used.

```{r, chunk 11, echo=TRUE}
colnames(y)
```

### Q7. Calculating the mean and standard deviation of tree circumference at both sites
First the data frame was subset according to the site using the function `subset`.

```{r, chunk 12, echo=TRUE}
northeast <- subset (y, Site == "northeast")
southwest <- subset (y, Site == "southwest")
```

Then the function `mean` was used to calculate the mean at both sites in years 2004 and 2019, using the `$` charcter to address the respective columns from the subset. 

```{r, chunk 13, echo=TRUE}
mean(northeast$Circumf_2004_cm)
mean(northeast$Circumf_2019_cm)
mean(southwest$Circumf_2004_cm)
mean(southwest$Circumf_2019_cm)
```

Finally, the function `sd` was used to calculate standard deviation at both sites in years 2004 and 2019, using the `$` charcter to address the respective columns from the subset

```{r, chunk 14, echo=TRUE}
sd(northeast$Circumf_2004_cm)
sd(northeast$Circumf_2019_cm)
sd(southwest$Circumf_2004_cm)
sd(southwest$Circumf_2019_cm)
```

### Q8. Box plot of tree circumference at the start and end of the study at both sites.
The function `boxplot` was used to produce a box plot of the tree circumference at both sites in years 2004 and 2019. Each box was labelled respectively and the main label was set.

```{r, chunk 15, echo=TRUE}
boxplot(southwest$Circumf_2004_cm,southwest$Circumf_2019_cm,northeast$Circumf_2004_cm,northeast$Circumf_2019_cm,
        names = c("SW2004","SW2019","NE2004","NE2019"),ylab="Cirumference (cm)",
        main="Tree Circumference at Two Plantation Sites in Respective Years")
```

### Q9. Mean growth over the past 10 years at each site
First the growth at each site was calculated seperately by substracting the growth in 2009 from growth in 2019. The values were saved in seperate objects as GrowthSW and GrowthNE.

```{r, chunk 16, echo=TRUE}
GrowthSW <- (southwest$Circumf_2019_cm-southwest$Circumf_2009_cm)      
GrowthNE <- (northeast$Circumf_2019_cm-northeast$Circumf_2009_cm)
```

Then the mean growth over 10 years was calculated at each site using the function `mean` and saved R objects.

```{r, chunk 17, echo=TRUE}
mean(GrowthSW)
mean(GrowthNE)
```

### Q10. t.test and wilcox.test functions to estimate the p-value 
The function `t.test` performs two sample t test on growth over 10 years at both sites using the respective R objects with confidence level of 0.95 to estimate the p value. For variance to be estimated seprately, var.equal is set to FALSE. It is saved into res object.   

```{r, chunk 18, echo=TRUE}
res <- t.test(GrowthSW,GrowthNE, var.equal = FALSE)
res
```
The t.test estimates a small p value indicating there is a big difference in the growth of trees at southwest and northeast in the past 10 years.
According to the mean values southwest has a higher growth rate compared to northeast.

The function `wilcox.test` performs two sample t test on growth over 10 years at both sites using the respective R objects with confidence level of 0.95 to estimate the p value. It is saved into res object.

```{r, chunk 19, echo=TRUE}
res <- wilcox.test(GrowthSW,GrowthNE)
res
```
The wilcox test estimates a higher p value than t test and this might be due to stastical sensitivity of the two methods.

# Part 2: Determine the limits of BLAST

## Introduction
BLAST (Basic Local Alignment Search Tool) is a web tool used to search for query sequences in databases. Using command line in Rstudio, large scale analyses of the databases can be done. An index should be created in order to search for similarity.

In this assignment BLAST search will be done on *E.coli* sequences. *E.coli* has a single circular chromosome along with a circular plasmid. Its chromosomal DNA genome contains approximately 4600 kilobases. There are about 4300 potential coding sequences and only 1800 known *E.coli* proteins.  

*E.coli* genomic information can be accessed by [Ensembl genome browser](https://asia.ensembl.org/index.html) and [GenBank](https://www.ncbi.nlm.nih.gov/protein/AAC75595).

### Prerequisite libraries and sources
Few sources are required before performing a BLAST search. `seqinr` is a package required for exploratory data analysis and visualization of bological sequence data. `rBLAST` will be the R interface to run local BLAST searches. `R.utils` package are required for R utility functions such as to extract compressed files. 

```{r, chunk 20, echo=TRUE}
library("seqinr")
library("rBLAST")
library("R.utils")
```

### Q1. Download the E.coli CDS sequence from the Ensembl FTP page
The *E.coli* genome sequence containing 4140 sequences was downloaded from the [Ensembl Genome](ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz) webserver on 15 May 2020 and destfile will show the path the file should be saved with the file name. The file is in FASTA format, which is a text based format for representing nucleotide sequences as single letter codes.

```{r, chunk 21, echo=TRUE}
download.file("ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz", 
              destfile = "ecoli.fa.gz")
```

#### Using gunzip to decompress the file
The function `gunzip` unzip the dowloaded file saved at destfile.

```{r, chunk 22, echo=TRUE}
gunzip("ecoli.fa.gz")
```

#### Creating a blast database
The function `makeblastdb` was used to create a BLAST database from the FASTA file. dbtype specifies that this is a nucleotide sequence and `-parse_seqids` is required to identify gene IDs in mapping.

```{r, chunk 23, echo=TRUE}
makeblastdb("ecoli.fa",dbtype="nucl", "-parse_seqids")
```

### Q2. Download the sample fasta sequence
The sample fasta seqiences were downloaded from [SLE_712 files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part2_files) webpage.The raw sample sequences are present [here](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/sample.fa).

```{r, chunk 24, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/sample.fa", 
              destfile = "sample.fa")
```

#### Read the sample file into R
As the file is not compressed, it is read directly. The function `read.fasta` to read the file sample in FASTA format is used and saved as the object "d".

```{r, chunk 25, echo=TRUE}
d <- read.fasta("sample.fa")
```

#### Determining the length (in bp) and the proportion of GC bases in my sequence
From the object d containing sample sequences, the character `[[]]` was used to extract my sequence which is number 6. This was then saved as object "my gene".

```{r, chunk 26, echo=TRUE}
mygene <- d[[6]]
mygene
```
To determine the lenth of my gene sequence in bp, the function `length` was used.
```{r, chunk 27, echo=TRUE}
length(mygene)
```
To determine the GC proportion of bases in my gene, the function `seqinr` was used.
```{r, chunk 28, echo=TRUE}
seqinr::GC(mygene)
```

### Q3. Function to perform BLAST search
To perform this the function formed earlier is sourced from [SLE712_files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part2_files). The function can be downloaded [here](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/mutblast_functions.R).

```{r, chunk 29, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/mutblast_functions.R",
              destfile = "mutblast.R")
source("mutblast.R")
```
Blastn is a unix shell tool that search nulceotide databases using a nucleotide query. This will search the best match to my sequence (mygene) from the ensembl database of *E.coli*. The results will be saved as object "res". 

```{r, chunk 30, echo=TRUE}
res <- myblastn_tab(myseq = mygene, db = "ecoli.fa")
```

#### Top 3 hits from the BLAST search
To see the top 3 hits, the function `head` was used which will show the first 6 hits.
```{r, chunk 31, echo=TRUE}
head(res)
```
My sequence has only 1 hit with 100% similarity and no mismatches. The name of my sequence is AAC75595. evalue referes to the probability of the hit being unrelated to the sequence. So the chance of hit being randomly related is 0. Bit score referes to the sequence similarity. Higher the bit score, better the sequence similarity.

### Q4. Introduing point mutations
The function `mutator` was used to replace DNA bases of my sequence with random bases. 20 mutations were introduced to my sequence and was saved as the object "mygene_mutation".

```{r, chunk 32, echo=TRUE}
mygene_mutation <- mutator(mygene,20)
```
Creating a blast index: mygene sequence was changed into fasta fila format to create a database for blastn search.
```{r, chunk 33, echo=TRUE}
write.fasta(mygene,names="mygene", file.out = "mygene.fa")
makeblastdb(file = "mygene.fa", dbtype = "nucl")
```

Blastn search was run to identify the number of mismatches between the mutated sequence and original sequence. The results were saved as "res" object and viewed.
```{r, chunk 34, echo=TRUE}
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```
The mutated sequence has 17 mismatches but 98.5% similar to the original sequence.

### Q5. Testing the limits of BLAST
To identify the limit of BLASTing (null result) more mutations were introduced and blastn was run. Since the mutations are random, a loop was created to return the results as 0 or 1. The function `replicate` was used to repeat this 100 times within the conditions. The final result was the mean of all replicates.

#### Test with 100 mismatches
```{r, chunk 35, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,100)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res

loop <- function(mygene_mutation) {
  sample(c(0,1),1,replace = TRUE)
}

mean(replicate(100,loop(mygene_mutation)))

```
#### Test with 200 mismatches
```{r, chunk 36, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,200)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res

loop <- function(mygene_mutation) {
  sample(c(0,1),1,replace = TRUE)
}

mean(replicate(100,loop(mygene_mutation)))
```
#### Test with 300 mismatches
```{r, chunk 37, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,300)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res

loop <- function(mygene_mutation) {
  sample(c(0,1),1,replace = TRUE)
}

mean(replicate(100,loop(mygene_mutation)))
```
#### Test with 305 mismatches
```{r, chunk 38, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,305)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```
Since the answer is NULL loop function was not carried out.

#### Test with 310 mismatches
```{r, chunk 39, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,310)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```

### Q6. Chart to show how mutation affects BLAST performance
Graph was plotted to show how increasing number of random bases affects BLAST performance.
![Graph of BLAST performance](http://118.138.234.73:8787/files/Assignment3_part1/BLAST%20plot.jpg)