-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment 3 .Rmd
337 lines (256 loc) · 16.6 KB
/
Assignment 3 .Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
---
title: "Bioinformatics and Molecular Biology Techniques SLE712 Assignment 3"
author: "Dulini Fernando - 220168099"
date: "12/06/2020"
output:
rmarkdown::html_document:
toc: true
theme: cosmo
---
Source: https://github.com/fernandoh123/Assignment3_part1
# Part 1: Importing files, data wrangling, mathematical operations, plots and saving code on GitHub
## Introduction
Rstudio is an integrated development environment with a source pane; where the R scripts are created and edited, a console; where the code is evaluated, a environment tab; that contains all data objects of cureent R session and the viewer tab with all helpful information. R studio allows to import data, prcoess raw data to information and visualize the results. Further the scripts can be saved in Github as a backup to be used in a case of lost code. Github allows reasechers and bioinformaticians to share, download and reproduce a code.
### Q1. The file gene_expression.tsv downloaded from the github repository
To read in file, the file gene_expression.tsv that contains RNA-seq count data for two samples of interest were imported from [SLE712_files](https://github.com/markziemann/SLE712_files/blob/master/bioinfo_asst3_part1_files/gene_expression.tsv) resource page on April 26 2020. The raw data file can be accessed by this [link](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/gene_expression.tsv) and destfile will show the path the file should be saved with the file name.
```{r, chunk 1, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/gene_expression.tsv",
destfile = "gene_expression.tsv")
```
#### Read the file with gene accession numbers as row numbers
The function `read.table` was used to read the imported file in table format and create a dataframe correspoding to lines and variabels to fields in the file. This is saved as the object "x". As arguments header is set as TRUE and row.names as 1 to indicate that first column in the input is used for the raw names. So that gene accession numbers are saved as row names.
```{r, chunk 2, echo=TRUE}
x <- read.table("gene_expression.tsv", header = TRUE, row.names = 1)
```
#### Displaying a table of values for the first six genes
The function `head` was used to show a table of first six genes stored in the object x.
```{r, chunk 3, echo=TRUE}
head(x)
```
### Q2. Making a new column which contains the mean of other columns
The function `rowMeans` was used to calculate the mean of the columns of the dataframe stored as x. The result was stored in a new column of data frame x, named as "Mean". The `$` operator address the column by name.
```{r, chunk 4, echo=TRUE}
x$Mean <- rowMeans(x)
```
#### Displaying a table of values for the first six genes
The function `head` was used to show a table of first six genes stored in the object x
```{r, chunk 5, echo=TRUE}
head(x)
```
### Q3. Listing the 10 genes with highest mean expression
The function `order` was used to rearrange the Mean column of the dataframe x. To arrange the Mean column in the descending order, `-` operator was used. The result was stored in the object "ord". The function `head` was used to list the first 10 genes stored in the object ord.
```{r, chunk 6, echo=TRUE}
ord <- x[order(-x$Mean),]
head(ord,10)
```
### Q4. Number of genes with a mean<10
The function `subset` was used to create a subset from the the data frame x that meets the condition of mean<10 in the column Mean. The function `nrow` was used to find the number of rows==genes that is present in the newly formed subset.
```{r, chunk 7, echo=TRUE}
nrow( subset(x, x$Mean<10))
```
### Q5. Histogram plot of the mean values
The function `hist` was used to plot the histogram of Mean column of the data frame x. The histogram was plotted with 50 breakpoints, bars filled with blue colour and x values in the range of 0 to 30000.
```{r, chunk 8, echo=TRUE}
hist(x$Mean, ylab = "Frequency", xlab = "Mean", main = "Histogram Plot of Mean Values" , breaks = 50, col = "blue", xlim = c(0,300000))
```
![](http://118.138.234.73:8787/files/Assignment3_part1/Hist_Mean.png)
### Q6. The file growth_data.csv downloaded from the github repository
The file growth_data.csv that contains measurements for tree circumference growing at two sites; control site and treatment site which were planted 20 years ago was imported from [SLE712_files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part1_files) resource page on 8 May 2020. The raw data can be accessed with this [link](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/growth_data.csv) and destfile will show the path the file should be saved with the file name.
```{r, chunk 9, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part1_files/growth_data.csv",
destfile = "growth_data.csv")
```
The function `read.table` was used to read the downloaded file in table format and to a create data frame. This was saved as the object "y". The header was set to be TRUE, to indicate the file contains the names of the variables as its first line. The character seperator was used to seperate values in each lines of the file. stringAsFactors was set to be FALSE to prevent vectors been converted factors.
```{r, chunk 10, echo=TRUE}
y <- read.table("growth_data.csv", header = TRUE, sep = ",",stringsAsFactors = FALSE)
```
To find the column names of the data frame y, the function `colnames` was used.
```{r, chunk 11, echo=TRUE}
colnames(y)
```
### Q7. Calculating the mean and standard deviation of tree circumference at both sites
First the data frame was subset according to the site using the function `subset`.
```{r, chunk 12, echo=TRUE}
northeast <- subset (y, Site == "northeast")
southwest <- subset (y, Site == "southwest")
```
Then the function `mean` was used to calculate the mean at both sites in years 2004 and 2019, using the `$` charcter to address the respective columns from the subset.
```{r, chunk 13, echo=TRUE}
mean(northeast$Circumf_2004_cm)
mean(northeast$Circumf_2019_cm)
mean(southwest$Circumf_2004_cm)
mean(southwest$Circumf_2019_cm)
```
Finally, the function `sd` was used to calculate standard deviation at both sites in years 2004 and 2019, using the `$` charcter to address the respective columns from the subset
```{r, chunk 14, echo=TRUE}
sd(northeast$Circumf_2004_cm)
sd(northeast$Circumf_2019_cm)
sd(southwest$Circumf_2004_cm)
sd(southwest$Circumf_2019_cm)
```
### Q8. Box plot of tree circumference at the start and end of the study at both sites.
The function `boxplot` was used to produce a box plot of the tree circumference at both sites in years 2004 and 2019. Each box was labelled respectively and the main label was set.
```{r, chunk 15, echo=TRUE}
boxplot(southwest$Circumf_2004_cm,southwest$Circumf_2019_cm,northeast$Circumf_2004_cm,northeast$Circumf_2019_cm,
names = c("SW2004","SW2019","NE2004","NE2019"),ylab="Cirumference (cm)",
main="Tree Circumference at Two Plantation Sites in Respective Years")
```
### Q9. Mean growth over the past 10 years at each site
First the growth at each site was calculated seperately by substracting the growth in 2009 from growth in 2019. The values were saved in seperate objects as GrowthSW and GrowthNE.
```{r, chunk 16, echo=TRUE}
GrowthSW <- (southwest$Circumf_2019_cm-southwest$Circumf_2009_cm)
GrowthNE <- (northeast$Circumf_2019_cm-northeast$Circumf_2009_cm)
```
Then the mean growth over 10 years was calculated at each site using the function `mean` and saved R objects.
```{r, chunk 17, echo=TRUE}
mean(GrowthSW)
mean(GrowthNE)
```
### Q10. t.test and wilcox.test functions to estimate the p-value
The function `t.test` performs two sample t test on growth over 10 years at both sites using the respective R objects with confidence level of 0.95 to estimate the p value. For variance to be estimated seprately, var.equal is set to FALSE. It is saved into res object.
```{r, chunk 18, echo=TRUE}
res <- t.test(GrowthSW,GrowthNE, var.equal = FALSE)
res
```
The t.test estimates a small p value indicating there is a big difference in the growth of trees at southwest and northeast in the past 10 years.
According to the mean values southwest has a higher growth rate compared to northeast.
The function `wilcox.test` performs two sample t test on growth over 10 years at both sites using the respective R objects with confidence level of 0.95 to estimate the p value. It is saved into res object.
```{r, chunk 19, echo=TRUE}
res <- wilcox.test(GrowthSW,GrowthNE)
res
```
The wilcox test estimates a higher p value than t test and this might be due to stastical sensitivity of the two methods.
# Part 2: Determine the limits of BLAST
## Introduction
BLAST (Basic Local Alignment Search Tool) is a web tool used to search for query sequences in databases. Using command line in Rstudio, large scale analyses of the databases can be done. An index should be created in order to search for similarity.
In this assignment BLAST search will be done on *E.coli* sequences. *E.coli* has a single circular chromosome along with a circular plasmid. Its chromosomal DNA genome contains approximately 4600 kilobases. There are about 4300 potential coding sequences and only 1800 known *E.coli* proteins.
*E.coli* genomic information can be accessed by [Ensembl genome browser](https://asia.ensembl.org/index.html) and [GenBank](https://www.ncbi.nlm.nih.gov/protein/AAC75595).
### Prerequisite libraries and sources
Few sources are required before performing a BLAST search. `seqinr` is a package required for exploratory data analysis and visualization of bological sequence data. `rBLAST` will be the R interface to run local BLAST searches. `R.utils` package are required for R utility functions such as to extract compressed files.
```{r, chunk 20, echo=TRUE}
library("seqinr")
library("rBLAST")
library("R.utils")
```
### Q1. Download the E.coli CDS sequence from the Ensembl FTP page
The *E.coli* genome sequence containing 4140 sequences was downloaded from the [Ensembl Genome](ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz) webserver on 15 May 2020 and destfile will show the path the file should be saved with the file name. The file is in FASTA format, which is a text based format for representing nucleotide sequences as single letter codes.
```{r, chunk 21, echo=TRUE}
download.file("ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz",
destfile = "ecoli.fa.gz")
```
#### Using gunzip to decompress the file
The function `gunzip` unzip the dowloaded file saved at destfile.
```{r, chunk 22, echo=TRUE}
gunzip("ecoli.fa.gz")
```
#### Creating a blast database
The function `makeblastdb` was used to create a BLAST database from the FASTA file. dbtype specifies that this is a nucleotide sequence and `-parse_seqids` is required to identify gene IDs in mapping.
```{r, chunk 23, echo=TRUE}
makeblastdb("ecoli.fa",dbtype="nucl", "-parse_seqids")
```
### Q2. Download the sample fasta sequence
The sample fasta seqiences were downloaded from [SLE_712 files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part2_files) webpage.The raw sample sequences are present [here](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/sample.fa).
```{r, chunk 24, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/sample.fa",
destfile = "sample.fa")
```
#### Read the sample file into R
As the file is not compressed, it is read directly. The function `read.fasta` to read the file sample in FASTA format is used and saved as the object "d".
```{r, chunk 25, echo=TRUE}
d <- read.fasta("sample.fa")
```
#### Determining the length (in bp) and the proportion of GC bases in my sequence
From the object d containing sample sequences, the character `[[]]` was used to extract my sequence which is number 6. This was then saved as object "my gene".
```{r, chunk 26, echo=TRUE}
mygene <- d[[6]]
mygene
```
To determine the lenth of my gene sequence in bp, the function `length` was used.
```{r, chunk 27, echo=TRUE}
length(mygene)
```
To determine the GC proportion of bases in my gene, the function `seqinr` was used.
```{r, chunk 28, echo=TRUE}
seqinr::GC(mygene)
```
### Q3. Function to perform BLAST search
To perform this the function formed earlier is sourced from [SLE712_files](https://github.com/markziemann/SLE712_files/tree/master/bioinfo_asst3_part2_files). The function can be downloaded [here](https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/mutblast_functions.R).
```{r, chunk 29, echo=TRUE}
download.file("https://raw.githubusercontent.com/markziemann/SLE712_files/master/bioinfo_asst3_part2_files/mutblast_functions.R",
destfile = "mutblast.R")
source("mutblast.R")
```
Blastn is a unix shell tool that search nulceotide databases using a nucleotide query. This will search the best match to my sequence (mygene) from the ensembl database of *E.coli*. The results will be saved as object "res".
```{r, chunk 30, echo=TRUE}
res <- myblastn_tab(myseq = mygene, db = "ecoli.fa")
```
#### Top 3 hits from the BLAST search
To see the top 3 hits, the function `head` was used which will show the first 6 hits.
```{r, chunk 31, echo=TRUE}
head(res)
```
My sequence has only 1 hit with 100% similarity and no mismatches. The name of my sequence is AAC75595. evalue referes to the probability of the hit being unrelated to the sequence. So the chance of hit being randomly related is 0. Bit score referes to the sequence similarity. Higher the bit score, better the sequence similarity.
### Q4. Introduing point mutations
The function `mutator` was used to replace DNA bases of my sequence with random bases. 20 mutations were introduced to my sequence and was saved as the object "mygene_mutation".
```{r, chunk 32, echo=TRUE}
mygene_mutation <- mutator(mygene,20)
```
Creating a blast index: mygene sequence was changed into fasta fila format to create a database for blastn search.
```{r, chunk 33, echo=TRUE}
write.fasta(mygene,names="mygene", file.out = "mygene.fa")
makeblastdb(file = "mygene.fa", dbtype = "nucl")
```
Blastn search was run to identify the number of mismatches between the mutated sequence and original sequence. The results were saved as "res" object and viewed.
```{r, chunk 34, echo=TRUE}
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```
The mutated sequence has 17 mismatches but 98.5% similar to the original sequence.
### Q5. Testing the limits of BLAST
To identify the limit of BLASTing (null result) more mutations were introduced and blastn was run. Since the mutations are random, a loop was created to return the results as 0 or 1. The function `replicate` was used to repeat this 100 times within the conditions. The final result was the mean of all replicates.
#### Test with 100 mismatches
```{r, chunk 35, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,100)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
loop <- function(mygene_mutation) {
sample(c(0,1),1,replace = TRUE)
}
mean(replicate(100,loop(mygene_mutation)))
```
#### Test with 200 mismatches
```{r, chunk 36, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,200)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
loop <- function(mygene_mutation) {
sample(c(0,1),1,replace = TRUE)
}
mean(replicate(100,loop(mygene_mutation)))
```
#### Test with 300 mismatches
```{r, chunk 37, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,300)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
loop <- function(mygene_mutation) {
sample(c(0,1),1,replace = TRUE)
}
mean(replicate(100,loop(mygene_mutation)))
```
#### Test with 305 mismatches
```{r, chunk 38, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,305)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```
Since the answer is NULL loop function was not carried out.
#### Test with 310 mismatches
```{r, chunk 39, echo=TRUE}
mygene_mutation <- mutator(myseq=mygene,310)
res <- myblastn_tab(myseq = mygene_mutation, db = "mygene.fa")
res
```
### Q6. Chart to show how mutation affects BLAST performance
Graph was plotted to show how increasing number of random bases affects BLAST performance.
![Graph of BLAST performance](http://118.138.234.73:8787/files/Assignment3_part1/BLAST%20plot.jpg)