This project demonstrates how to assemble paired reads with a known distance between them using the De Bruijn graph representation and the Eulerian path. The approach reconstructs the original sequence from the paired reads.
- The program reads a file containing:
- Length of the sequences on each side and the gap length.
- The paired reads, where each pair is separated by
|
.
Example Input File:
3 1
GACC|GCGC
ACCG|CGCC
CCGA|GCCG
CGAG|CCGG
GAGC|CGGA
Output
The program outputs the assembled genome sequence.
Example Output:
Genome: GACCGAGCGCCGGA
Explanation
Definitions
-
Paired Reads: Paired reads consist of two DNA sequences separated by a known gap. Example: GACC|GCGC
-
De Bruijn Graph:
Represents overlaps between k-mers (substrings of length k).
Nodes are k-1-mers, and edges represent k-mers.
- Eulerian Path:
A path through the graph that visits every edge exactly once.
The reconstructed sequence is derived by traversing this path.
Steps to Solve
- Input
Parse the input file to extract paired reads into left and right reads.
Example: Input reads:
GACC|GCGC
ACCG|CGCC
CCGA|GCCG
CGAG|CCGG
GAGC|CGGA
Left Reads: GACC, ACCG, CCGA, CGAG, GAGC
Right Reads: GCGC, CGCC, GCCG, CCGG, CGGA
- Construct De Bruijn Graph
Create nodes by extracting prefixes and suffixes of each k-mer.
Add directed edges from the prefix to the suffix.
Example Graph:
(S) GAC,GCG --> ACC,CGC
ACC,CGC --> CCG,GCC
CCG,GCC --> CGA,CGC
CGA,CCG --> GAG,CGG
GAG,CGG --> AGC,GGA (E)
- Eulerian Path
Find the Eulerian path by traversing every edge of the graph once.
Use the first letter of each k-mer (except the last one) to construct the genome.
Prefix Path: GAC -> ACC -> CCG -> CGA -> GAG -> AGC Prefix = GACCGAGC
Suffix Path: GCG -> CGC -> GCC -> CCG -> CGG -> GGA Suffix = GCGCCGGA
- Combine Prefix and Suffix
Merge the prefix and suffix to form the final genome sequence.
Ensure the gap between the left and right reads is consistent.
Final Genome: GACCGAGCGCCGGA
Usage
- Input File Format: The first line contains the sequence length (k) and gap length (gap). Subsequent lines contain paired reads separated by |.
Example input file:
3 1 GACC|GCGC ACCG|CGCC CCGA|GCCG CGAG|CCGG GAGC|CGGA
- Run the Program Use the following command to execute the program and assemble the genome:
python paired_reads_assemble.py input_file.txt
Output
The output will be the assembled genome sequence, printed to the console:
Genome: GACCGAGCGCCGGA
Future Enhancements
- Support for Multiple Substitution Matrices
Allow users to choose from matrices like PAM250 or Blosum45.
- Statistical Analysis
Provide e-values and bit scores for alignments.
- Graphical Visualization
Add support for visualizing alignments in a graphical interface.
- Optimization
Improve performance for large-scale sequence databases.
Contributing
Contributions are welcome!
-
Fork the repository.
-
Create a new branch for your feature or bugfix.
-
Submit a pull request with detailed documentation of your changes.
License
This project is licensed under the MIT License. See the LICENSE file for more details.
Acknowledgments
NCBI BLAST: The inspiration for this project comes from the NCBI BLAST tool.
Open-source contributors for Python libraries like Biopython and NumPy.