A Python utility to extract closest reference strain data from the NCBI database using assembly identifiers from a TSV input file (gtdbtk.bac120.summary.tsv). The script outputs detailed strain information in a structured TSV format.

vsmicrogenomics/NCBIClosestStrainFetcher

NCBIClosestStrainFetcher

NCBIClosestStrainFetcher.py is a Python utility that queries the NCBI database for closest reference strain information, using genome assembly identifiers extracted from a TSV file (gtdbtk.bac120.summary.tsv) produced by the GTDB-Tk pipeline. It writes the fetched details to a structured TSV file.

Prerequisites:

Python 3.7 or above
Biopython library

Usage: Place the input file in a directory named input. Execute the script using the command:

python NCBIClosestStrainFetcher.py

The script will generate an output TSV file (output_strains.tsv) in the output directory, detailing the user genome, assembly ID, original strain info, and closest reference strain for each entry.
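As a rough illustration of the output step, the sketch below writes four-column rows to a tab-separated file. The column header names and the helper name write_strain_table are assumptions made for this example; the actual script may label its columns differently.

```python
import csv
from pathlib import Path

# Hypothetical header names; the real script's column labels may differ.
COLUMNS = ["user_genome", "assembly_id", "original_strain_info", "closest_reference_strain"]


def write_strain_table(rows, out_path):
    """Write an iterable of 4-item rows to a TSV file, creating the directory if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return out_path
```

A call such as write_strain_table(rows, "output/output_strains.tsv") would place the file where the usage notes above say the script writes it.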

Description: The script begins by reading the fastani_reference column from the specified input TSV file. It then leverages the Biopython library to make queries to the NCBI database, retrieving species names and strain details associated with each assembly identifier. The results are then organized and written to the output TSV file.
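The reading step can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the helper name read_reference_ids is hypothetical, and it assumes the input carries the user_genome and fastani_reference columns that GTDB-Tk summary files normally contain.

```python
import csv


def read_reference_ids(tsv_path):
    """Return (user_genome, fastani_reference) pairs from a GTDB-Tk summary TSV.

    Empty or missing reference values are normalised to "N/A".
    """
    pairs = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            genome = row.get("user_genome", "")
            reference = (row.get("fastani_reference") or "").strip()
            pairs.append((genome, reference or "N/A"))
    return pairs
```

For example, read_reference_ids("input/gtdbtk.bac120.summary.tsv") would yield one pair per genome, ready to be passed to the NCBI lookup.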

The core functionality revolves around the fetch_strain_info function which manages the querying and extraction process for each assembly identifier. Any identifier labeled "N/A" or empty is handled accordingly, outputting "N/A" for strain details.
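The handling described above might look roughly like the following. This is a sketch under stated assumptions, not the repository's fetch_strain_info: the names fetch_strain_info_sketch and lookup_assembly are hypothetical, the e-mail address is a placeholder, and the summary fields read here (SpeciesName, Biosource, InfraspeciesList, Sub_value) are assumed from the usual shape of NCBI assembly summaries and should be checked against a real response.

```python
MISSING = ("N/A", "N/A")


def lookup_assembly(accession, email="you@example.org"):
    """Query the NCBI assembly database for one accession (needs Biopython and network access)."""
    from Bio import Entrez  # imported lazily so the N/A path works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="assembly", term=accession)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return MISSING

    handle = Entrez.esummary(db="assembly", id=ids[0])
    summary = Entrez.read(handle, validate=False)
    handle.close()

    # Assumed layout of the assembly summary document.
    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]
    species = doc.get("SpeciesName", "N/A")
    infraspecies = doc.get("Biosource", {}).get("InfraspeciesList", [])
    strain = infraspecies[0].get("Sub_value", "N/A") if infraspecies else "N/A"
    return species, strain


def fetch_strain_info_sketch(accession, lookup=lookup_assembly):
    """Return (species, strain); skip the query entirely for empty or "N/A" identifiers."""
    cleaned = (accession or "").strip()
    if cleaned in ("", "N/A"):
        return MISSING
    return lookup(cleaned)
```

Passing the lookup as a parameter is a design choice for this example only: it lets the "N/A" branch be exercised offline with a stub instead of a live NCBI query.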

Test files: The test directory contains input and output sub-directories with test files for the NCBIClosestStrainFetcher.py script. Use these files to verify that the script is working correctly.

Acknowledgements: This script, NCBIClosestStrainFetcher.py, is tailored for researchers and bioinformaticians aiming to cross-reference their genome assembly data with the NCBI database, offering a clear and consolidated view of the closest reference strains.

Citation: If you use NCBIClosestStrainFetcher.py in your work, please cite it as follows:

Sharma, V. (2023). NCBIClosestStrainFetcher.py [Python script]. Retrieved from https://github.com/vsmicrogenomics/NCBIClosestStrainFetcher
