Paired-Reads-Assemble

This project demonstrates how to assemble paired reads with a known distance between them using the De Bruijn graph representation and the Eulerian path. The approach reconstructs the original sequence from the paired reads.

Problem Description

Input

The program reads a file containing:
1. Length of the sequences on each side and the gap length.
2. The paired reads, where each pair is separated by |.

Example Input File:

3 1
GACC|GCGC
ACCG|CGCC
CCGA|GCCG
CGAG|CCGG
GAGC|CGGA

Output

The program outputs the assembled genome sequence.

Example Output:

Genome: GACCGAGCGCCGGA

Explanation

Definitions

Paired Reads: Paired reads consist of two DNA sequences separated by a known gap. Example: GACC|GCGC
De Bruijn Graph:

Represents overlaps between k-mers (substrings of length k).

Nodes are k-1-mers, and edges represent k-mers.

Eulerian Path:

A path through the graph that visits every edge exactly once.

The reconstructed sequence is derived by traversing this path.

Steps to Solve

Input

Parse the input file to extract paired reads into left and right reads.

Example: Input reads:

Left Reads: GACC, ACCG, CCGA, CGAG, GAGC

Right Reads: GCGC, CGCC, GCCG, CCGG, CGGA

Construct De Bruijn Graph

Create nodes by extracting prefixes and suffixes of each k-mer.

Add directed edges from the prefix to the suffix.

Example Graph:

(S) GAC,GCG --> ACC,CGC
ACC,CGC --> CCG,GCC
CCG,GCC --> CGA,CGC
CGA,CCG --> GAG,CGG
GAG,CGG --> AGC,GGA (E)

Eulerian Path

Find the Eulerian path by traversing every edge of the graph once.

Use the first letter of each k-mer (except the last one) to construct the genome.

Prefix Path: GAC -> ACC -> CCG -> CGA -> GAG -> AGC Prefix = GACCGAGC

Suffix Path: GCG -> CGC -> GCC -> CCG -> CGG -> GGA Suffix = GCGCCGGA

Combine Prefix and Suffix

Merge the prefix and suffix to form the final genome sequence.

Ensure the gap between the left and right reads is consistent.

Final Genome: GACCGAGCGCCGGA

Usage

Input File Format: The first line contains the sequence length (k) and gap length (gap). Subsequent lines contain paired reads separated by |.

Example input file:

Run the Program Use the following command to execute the program and assemble the genome:

python paired_reads_assemble.py input_file.txt

Output

The output will be the assembled genome sequence, printed to the console:

Genome: GACCGAGCGCCGGA

Future Enhancements

Support for Multiple Substitution Matrices

Allow users to choose from matrices like PAM250 or Blosum45.

Statistical Analysis

Provide e-values and bit scores for alignments.

Graphical Visualization

Add support for visualizing alignments in a graphical interface.

Optimization

Improve performance for large-scale sequence databases.

Contributing

Contributions are welcome!

Fork the repository.
Create a new branch for your feature or bugfix.
Submit a pull request with detailed documentation of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

NCBI BLAST: The inspiration for this project comes from the NCBI BLAST tool.

Open-source contributors for Python libraries like Biopython and NumPy.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
LICENSE		LICENSE
PairedReads.py		PairedReads.py
README.md		README.md
ReadPairsInput.txt		ReadPairsInput.txt
ReadPairsOutput.txt		ReadPairsOutput.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Paired-Reads-Assemble

Problem Description

Input

About

Releases

Packages

Languages

License

Kareem-Mohamed-Wardany/Paired-Reads-Assemble

Folders and files

Latest commit

History

Repository files navigation

Paired-Reads-Assemble

Problem Description

Input

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages