Skip to content

Kareem-Mohamed-Wardany/Paired-Reads-Assemble

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Paired-Reads-Assemble

This project demonstrates how to assemble paired reads with a known distance between them using the De Bruijn graph representation and the Eulerian path. The approach reconstructs the original sequence from the paired reads.


Problem Description

Input

  • The program reads a file containing:
    1. Length of the sequences on each side and the gap length.
    2. The paired reads, where each pair is separated by |.

Example Input File:

3 1
GACC|GCGC
ACCG|CGCC
CCGA|GCCG
CGAG|CCGG
GAGC|CGGA

Output

The program outputs the assembled genome sequence.

Example Output:

Genome: GACCGAGCGCCGGA


Explanation

Definitions

  1. Paired Reads: Paired reads consist of two DNA sequences separated by a known gap. Example: GACC|GCGC

  2. De Bruijn Graph:

Represents overlaps between k-mers (substrings of length k).

Nodes are k-1-mers, and edges represent k-mers.

  1. Eulerian Path:

A path through the graph that visits every edge exactly once.

The reconstructed sequence is derived by traversing this path.


Steps to Solve

  1. Input

Parse the input file to extract paired reads into left and right reads.

Example: Input reads:

GACC|GCGC
ACCG|CGCC
CCGA|GCCG
CGAG|CCGG
GAGC|CGGA

Left Reads: GACC, ACCG, CCGA, CGAG, GAGC

Right Reads: GCGC, CGCC, GCCG, CCGG, CGGA


  1. Construct De Bruijn Graph

Create nodes by extracting prefixes and suffixes of each k-mer.

Add directed edges from the prefix to the suffix.

Example Graph:

(S) GAC,GCG --> ACC,CGC
ACC,CGC --> CCG,GCC
CCG,GCC --> CGA,CGC
CGA,CCG --> GAG,CGG
GAG,CGG --> AGC,GGA (E)


  1. Eulerian Path

Find the Eulerian path by traversing every edge of the graph once.

Use the first letter of each k-mer (except the last one) to construct the genome.

Prefix Path: GAC -> ACC -> CCG -> CGA -> GAG -> AGC Prefix = GACCGAGC

Suffix Path: GCG -> CGC -> GCC -> CCG -> CGG -> GGA Suffix = GCGCCGGA


  1. Combine Prefix and Suffix

Merge the prefix and suffix to form the final genome sequence.

Ensure the gap between the left and right reads is consistent.

Final Genome: GACCGAGCGCCGGA


Usage

  1. Input File Format: The first line contains the sequence length (k) and gap length (gap). Subsequent lines contain paired reads separated by |.

Example input file:

3 1 GACC|GCGC ACCG|CGCC CCGA|GCCG CGAG|CCGG GAGC|CGGA

  1. Run the Program Use the following command to execute the program and assemble the genome:

python paired_reads_assemble.py input_file.txt


Output

The output will be the assembled genome sequence, printed to the console:

Genome: GACCGAGCGCCGGA


Future Enhancements

  1. Support for Multiple Substitution Matrices

Allow users to choose from matrices like PAM250 or Blosum45.

  1. Statistical Analysis

Provide e-values and bit scores for alignments.

  1. Graphical Visualization

Add support for visualizing alignments in a graphical interface.

  1. Optimization

Improve performance for large-scale sequence databases.


Contributing

Contributions are welcome!

  1. Fork the repository.

  2. Create a new branch for your feature or bugfix.

  3. Submit a pull request with detailed documentation of your changes.


License

This project is licensed under the MIT License. See the LICENSE file for more details.


Acknowledgments

NCBI BLAST: The inspiration for this project comes from the NCBI BLAST tool.

Open-source contributors for Python libraries like Biopython and NumPy.

About

Paired Reads Assembly

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages