Merge branch 'devel'
rnnh committed Nov 11, 2020
2 parents a264222 + 09ed888 commit 0989e37
Showing 3 changed files with 85 additions and 3 deletions.
68 changes: 68 additions & 0 deletions docs/bcftools.md
@@ -0,0 +1,68 @@
---
layout: default
title: Bcftools
parent: 2. Program guides
---

# Bcftools

Bcftools is a set of [utilities for variant calling and manipulating VCFs and BCFs](https://samtools.github.io/bcftools/bcftools.html).

## Generating genotype likelihoods for alignment files using `bcftools mpileup`

`bcftools mpileup` can be used to generate VCF or BCF files containing genotype likelihoods for one or multiple alignment (BAM or CRAM) files as follows:

```bash
$ bcftools mpileup --max-depth 10000 --threads n -f reference.fasta -o genotype_likelihoods.bcf reference_sequence_alignment.bam
```

In this command...

1. **`--max-depth`** or **`-d`** sets the maximum number of reads considered per input file at each position in the alignment. In this case, it is set to 10000.
2. **`--threads`** sets the number (*n*) of processors/threads to use.
3. **`--fasta-ref`** or **`-f`** is used to select the [faidx-indexed FASTA](samtools.md#indexing-a-fasta-file-using-samtools-faidx) nucleotide reference file (*reference.fasta*) used for the alignment.
4. **`--output`** or **`-o`** is used to name the output file (*genotype_likelihoods.bcf*).
5. The final argument given is the input BAM alignment file (*reference_sequence_alignment.bam*). Multiple input files can be given here.
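
As a minimal sketch of passing several alignment files at once, the script below builds the `mpileup` command as a Bash array and prints it. The file names `sample_1.bam` and `sample_2.bam` and the thread count are hypothetical placeholders. The `-O b` option is added here on the assumption that the installed version does not infer the output type from the file name; it requests compressed BCF explicitly. The command is only executed if `bcftools` and the reference file are actually present.

```bash
#!/usr/bin/env bash
# Hypothetical inputs; replace these with your own files.
reference="reference.fasta"
bams=(sample_1.bam sample_2.bam)
threads=4

# Build the command as an array so file names containing spaces stay intact.
# -O b asks for compressed BCF output explicitly.
cmd=(bcftools mpileup --max-depth 10000 --threads "$threads"
     -f "$reference" -O b -o genotype_likelihoods.bcf "${bams[@]}")

# Show the command that would be run.
echo "${cmd[*]}"

# Run it only when bcftools and the reference are available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$reference" ]; then
    "${cmd[@]}"
fi
```

Each BAM file listed at the end of the command is treated as a separate input, so the depth limit set by `--max-depth` applies to each file individually.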

## Variant calling using `bcftools call`

`bcftools call` can be used to call SNP/indel variants from a BCF file as follows:

```bash
$ bcftools call -O b --threads n -vc --ploidy 1 -p 0.05 -o variants_unfiltered.bcf genotype_likelihoods.bcf
```

In this command...

1. **`--output-type`** or **`-O`** is used to select the output format. In this case, *b* for BCF.
2. **`--threads`** sets the number (*n*) of processors/threads to use.
3. **`-vc`** combines `-v` (`--variants-only`), which limits the output to variant sites, and `-c` (`--consensus-caller`), which selects the original [SAMtools](samtools.md) consensus caller.
4. **`--ploidy`** specifies the ploidy of the assembly. In this case, *1* (haploid).
5. **`--pval-threshold`** or **`-p`** is used to set the p-value threshold for variant sites (*0.05*).
6. **`--output`** or **`-o`** is used to name the output file (*variants_unfiltered.bcf*).
7. The final argument is the input BCF file (*genotype_likelihoods.bcf*).
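
The two steps above can also be chained with a pipe, so that the genotype likelihoods are never written to disk. The following is a sketch under assumptions, not a tested pipeline: the file names are those used in this guide, `-O u` (uncompressed BCF) is chosen for the intermediate stream, and `bcftools call` is assumed to read from standard input when no input file is given. The script prints the pipeline and runs it only if `bcftools` and both input files exist.

```bash
#!/usr/bin/env bash
# Fail the pipeline if either side of the pipe fails.
set -o pipefail

reference="reference.fasta"
bam="reference_sequence_alignment.bam"
out="variants_unfiltered.bcf"

# Left side of the pipe: genotype likelihoods as uncompressed BCF on stdout.
pileup=(bcftools mpileup --max-depth 10000 -O u -f "$reference" "$bam")
# Right side of the pipe: call variants from stdin, write compressed BCF.
caller=(bcftools call -vc --ploidy 1 -p 0.05 -O b -o "$out")

# Show the pipeline that would be run.
echo "${pileup[*]} | ${caller[*]}"

# Run it only when bcftools and the inputs are available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$reference" ] && [ -f "$bam" ]; then
    "${pileup[@]}" | "${caller[@]}"
fi
```

Uncompressed BCF is used between the two commands because the stream is consumed immediately, so compressing it would only add work.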

## Filtering variants using `bcftools filter`

`bcftools filter` can be used to filter variants from a BCF file as follows...

```bash
$ bcftools filter --threads n -i '%QUAL>=20' -O v -o variants_filtered.vcf variants_unfiltered.bcf
```

In this command...

1. **`--threads`** sets the number (*n*) of processors/threads to use.
2. **`--include`** or **`-i`** is used to define the expression used to filter sites. In this case, *`%QUAL>=20`* keeps only sites with a quality score greater than or equal to 20.
3. **`--output-type`** or **`-O`** is used to select the output format. In this case, *v* for VCF.
4. **`--output`** or **`-o`** is used to name the output file (*variants_filtered.vcf*).
5. The final argument is the input BCF file (*variants_unfiltered.bcf*).
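
To illustrate what a `QUAL>=20` include expression keeps, the sketch below applies the same threshold to a tiny, made-up VCF using `awk`. This is only a stand-in for `bcftools filter`: it assumes the quality score is the sixth tab-separated column (`QUAL`) and ignores everything else that `bcftools` handles, such as missing values and BCF input.

```bash
#!/usr/bin/env bash
# Tiny made-up VCF (tab-separated); QUAL is the sixth column.
vcf=$(printf '%s\n' \
  '##fileformat=VCFv4.2' \
  $'#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' \
  $'chr1\t100\t.\tA\tG\t12.5\t.\t.' \
  $'chr1\t250\t.\tC\tT\t20\t.\t.' \
  $'chr1\t900\t.\tG\tA\t87.3\t.\t.')

# Keep header lines, plus records whose QUAL is greater than or equal to 20.
filtered=$(printf '%s\n' "$vcf" | awk -F'\t' '/^#/ || $6 >= 20')

printf '%s\n' "$filtered"
```

The record at position 100 (QUAL 12.5) is dropped, while the records at positions 250 and 900 are kept, because the comparison is inclusive.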

## See also

- [File formats used in bioinformatics](file_formats.md)
- [SNP calling script](snp_calling.md)

## Further reading

- [bcftools documentation](https://samtools.github.io/bcftools/bcftools.html)
6 changes: 5 additions & 1 deletion docs/file_formats.md
@@ -23,7 +23,8 @@ A brief introduction to various file formats used in bioinformatics.
- [CRAM](#cram)
- [Stockholm format](#stockholm-format)
- [Example Stockholm file](#example-stockholm-file)
- [VCF](#vcf)
- [VCF](#vcf)
- [BCF](#bcf)
- [Generic Feature Formats](#generic-feature-formats)
- [GFF general structure](#gff-general-structure)
- [GTF](#gtf)
@@ -202,6 +203,9 @@ The last line in the header section begins with `#`; this line gives the headers
9. `FORMAT` An (optional) extensible list of fields for describing the samples.
10. `SAMPLEs` For each (optional) sample described in the file, values are given for the fields listed in FORMAT. If multiple samples have been aligned to the reference sequence, each sample will have its own column.

### BCF

Binary Call Format (BCF) is a binary representation of [VCF](#vcf), containing the same information in binary format for improved performance.

## Generic Feature Formats

14 changes: 12 additions & 2 deletions docs/samtools.md
@@ -51,7 +51,7 @@ In this command...

1. **`sorted_example_alignment.bam`** is the name of the input file.

### Demonstration
### Demonstration 1

In this video, `samtools` is used to convert `example_alignment.sam` into a BAM file, sort that BAM file, and index it.

@@ -71,12 +71,22 @@ In this command...
1. **`example_nucleotide_sequence.fasta`** is the reference genome input.
2. **`example_reads_1.fastq`** and **`example_reads_2.fastq`** are the names of the simulated read output files.

### Demonstration
### Demonstration 2

In this video, `wgsim` is used to simulate reads from `example_nucleotide_sequence.fasta`.

[![asciicast](https://asciinema.org/a/m89gXtx4cKRnKpI6amWj3BEAH.svg)](https://asciinema.org/a/m89gXtx4cKRnKpI6amWj3BEAH?autoplay=1)

## Indexing a FASTA file using `samtools faidx`

SAMtools can be used to index a FASTA file as follows...

```bash
$ samtools faidx file.fasta
```

This command creates the index file `file.fasta.fai` alongside `file.fasta`, which can then be used as a reference by [bcftools](bcftools.md).
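
A minimal sketch of checking for the index before it is needed is shown below. The helper name `needs_index` is hypothetical, and the check relies on `samtools faidx` writing its index to a file named after the FASTA file with `.fai` appended. The indexing step only runs if `samtools` is installed and the FASTA file exists.

```bash
#!/usr/bin/env bash
# Return success (0) if the FASTA file has no .fai index next to it.
needs_index() {
    [ ! -f "$1.fai" ]
}

reference="file.fasta"

# Index the reference only when the index is missing.
if needs_index "$reference" && command -v samtools >/dev/null 2>&1 && [ -f "$reference" ]; then
    samtools faidx "$reference"   # writes file.fasta.fai
fi
```

Skipping the step when the index already exists avoids re-indexing large references each time a script is rerun.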

## See also

- [Alignment formats](file_formats.md#alignment-formats)