---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
```
```{r, include = FALSE}
library(seqR)
```
# seqR - fast and comprehensive k-mer counting package
<!-- badges: start -->
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/seqR)](https://cran.r-project.org/package=seqR)
[![R build status](https://github.com/slowikj/seqR/workflows/R-CMD-check/badge.svg)](https://github.com/slowikj/seqR/actions)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![codecov.io](https://codecov.io/github/slowikj/seqR/coverage.svg?branch=master)](https://codecov.io/github/slowikj/seqR?branch=master)
[![Code Quality Status](https://www.code-inspector.com/project/23909/status/svg)](https://www.code-inspector.com/project/23909/status/svg)
[![Code Quality Score](https://www.code-inspector.com/project/23909/score/svg)](https://www.code-inspector.com/project/23909/score/svg)
<!-- badges: end -->
## About
`seqR` is an R package for fast k-mer counting. It provides an implementation that is

* **highly optimized** (the core algorithm is written in C++),
* **in-memory**,
* **probabilistic** (with a configurable dimensionality of the hash value used for storing k-mers internally),
* **multi-threaded** (with a configurable number of sequences, `batch_size`, processed in a single step; if `batch_size` equals 1, the multi-threaded mode is disabled, which may increase computation time),

and that supports

* **various variants of k-mers** (contiguous, gapped, and their positional counterparts),
* **all biological sequences** (e.g., nucleic acids and proteins).

Moreover, the result is returned as a **sparse matrix**
(see the [Matrix package](https://CRAN.R-project.org/package=Matrix)),
which reduces memory consumption and is compatible with machine learning packages
such as [ranger](https://CRAN.R-project.org/package=ranger)
and [xgboost](https://CRAN.R-project.org/package=xgboost).
## How to...
### How to install
To install `seqR` from CRAN:
```{r, eval=FALSE}
install.packages("seqR")
```
Alternatively, if you want to use the latest development version:
```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("slowikj/seqR")
```
### How to use
The package provides two functions for k-mer counting:

* `count_kmers` (counts k-mers of one type),
* `count_multimers` (a wrapper around `count_kmers` that counts k-mers of many types in a single call),

and one function for custom processing of k-mer matrices:

* `rbind_columnwise` (a helper that merges several k-mer matrices that do not have the same sets of columns).

To learn more, see the [features overview vignette](https://slowikj.github.io/seqR/articles/features-overview.html)
and the [reference](https://slowikj.github.io/seqR/reference/index.html).
#### Examples
##### counting 5-mers
```{r}
count_kmers(sequences = c("AAAAAVVAVFF", "DFGSADFGSA"),
            k = 5)
```
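The options listed in the About section can be passed to the same call. The sketch below counts positional 5-mers in single-threaded mode. `batch_size` is named in this README; the `positional` argument name is an assumption taken from my reading of the package reference, so check `?count_kmers` before relying on it.

```r
library(seqR)

sequences <- c("AAAAAVVAVFF", "DFGSADFGSA")

# positional = TRUE (assumed argument name) distinguishes k-mers by their
# start position; batch_size = 1 disables the multi-threaded mode.
positional_5mers <- count_kmers(sequences = sequences,
                                k = 5,
                                positional = TRUE,
                                batch_size = 1)

# One row per input sequence, one column per distinct (position, k-mer) pair.
dim(positional_5mers)
```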
##### counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)
```{r}
count_kmers(sequences = c("AAAAAVVAVFF", "DFGSADFGSA"),
            kmer_gaps = c(0, 1, 0, 2))
```
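To see how the gap vector `(0, 1, 0, 2)` maps to the pattern `XX_XX__X`, here is a minimal base-R sketch that extracts the same gapped 5-mers by hand. The helper `gapped_kmers` is hypothetical and only illustrates the definition; it is not how `seqR` is implemented.

```r
# Hypothetical helper, for illustration only (not part of seqR).
gapped_kmers <- function(sequence, gaps) {
  chars <- strsplit(sequence, "")[[1]]
  # Each selected element lies (gap + 1) positions after the previous one,
  # so gaps (0, 1, 0, 2) give offsets 0, 1, 3, 4, 7, i.e. XX_XX__X.
  offsets <- c(0, cumsum(gaps + 1))
  span <- max(offsets) + 1                 # window length including gaps
  n_windows <- length(chars) - span + 1
  if (n_windows < 1) return(character(0))
  vapply(seq_len(n_windows),
         function(start) paste(chars[start + offsets], collapse = ""),
         character(1))
}

kmers <- gapped_kmers("AAAAAVVAVFF", gaps = c(0, 1, 0, 2))
table(kmers)
```

Tabulating the extracted strings should give the same counts that `count_kmers` reports for the first sequence, up to the column naming used by the package.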
##### counting 1-mers and 2-mers
```{r}
data(CsgA)
CsgA[1:2]
count_multimers(sequences = CsgA,
                k_vector = c(1, 2))
```
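##### merging k-mer matrices with different columns

The `rbind_columnwise` helper described above can be sketched as follows, assuming it takes the two matrices as its arguments (check `?rbind_columnwise` for the exact signature). The input sequences are made up for illustration.

```r
library(seqR)

# Two k-mer matrices whose column sets differ (2-mers vs. 3-mers).
m2 <- count_kmers(sequences = c("AAAAA", "ASASSSSASSA"), k = 2)
m3 <- count_kmers(sequences = c("HWHSHS", "AASDCASD"), k = 3)

# Stack the rows; columns absent from one matrix are filled with zeros.
merged <- rbind_columnwise(m2, m3)
dim(merged)
```

The merged matrix has one row per input sequence from both matrices, which makes it convenient when sequences are processed in separate chunks.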
### How to cite
To cite `seqR`, run:
```{r, eval=FALSE}
citation("seqR")
```
or use:
Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.
## Benchmarks
The `seqR` package has been compared with other existing k-mer counting R packages:
[biogram](https://CRAN.R-project.org/package=biogram),
[kmer](https://CRAN.R-project.org/package=kmer),
[seqinr](https://CRAN.R-project.org/package=seqinr),
and [Biostrings](https://bioconductor.org/packages/Biostrings).
All benchmarks were run on an Intel Core i7-6700HQ (2.60 GHz, 8 cores) using the [microbenchmark](https://CRAN.R-project.org/package=microbenchmark) R package.
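A timing of this kind can be reproduced roughly as sketched below. This is not the authors' benchmark script: the random DNA sequence, the seed, and the number of repetitions are assumptions chosen for illustration.

```r
library(seqR)
library(microbenchmark)

# Assumed input: one random DNA sequence of length 3 000.
set.seed(1)
dna <- paste(sample(c("A", "C", "G", "T"), 3000, replace = TRUE),
             collapse = "")

# Time contiguous 5-mer counting; times = 10 is an arbitrary choice.
timings <- microbenchmark(
  seqR_5mers = count_kmers(sequences = dna, k = 5),
  times = 10
)
print(timings)
```

Absolute numbers depend on the machine, so only relative comparisons between packages measured on the same hardware are meaningful.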
### Contiguous k-mers
#### Changing k
<img src = "https://raw.githubusercontent.com/slowikj/seqR/master/man/img/packages_different_k.png" align = "center" width="100%"/>
The input consists of one `DNA` sequence of length `3 000`.
#### Changing the number of sequences
<img src = "https://raw.githubusercontent.com/slowikj/seqR/master/man/img/packages_different_seq_num.png" align = "center" width="100%"/>
Contiguous 5-mers are counted; each `DNA` sequence has `3 000` elements.
### Gapped k-mers
#### Changing the first contiguous part of a k-mer
<img src = "https://raw.githubusercontent.com/slowikj/seqR/master/man/img/gapped_kmers_changing_the_first_contiguous_part.png" align = "center" width="100%"/>
The input consists of one `DNA` sequence of length `1 000 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.
#### Changing the first gap size
<img src = "https://raw.githubusercontent.com/slowikj/seqR/master/man/img/gapped_kmers_changing_the_first_gap.png" align = "center" width="100%"/>
The input consists of one `DNA` sequence of length `100 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.