A Python script to parse the human proteome by molecular length and visualize results

kwakim1/parse_the_human_proteome

parse_the_human_proteome

A Python script to parse and bin proteins in the human proteome by protein length and visualize results.

Notes

To challenge myself, this project was coded using only base Python (no third-party libraries). The one exception is matplotlib.pyplot, which is used to generate the figures.

Requirements

  1. Python 3
  2. Any text editor
  3. The file called "uniprot-9606-reviewed.fasta" which is included in this repository

How to Run This Program

  1. Download this repository
  2. Run the script "human_proteome_parser.py" from inside the folder called "run_from_this_dict"

Steps Performed

  1. Set parameters
    • User is prompted to enter the following:
      1. Desired lower bound of protein lengths: (e.g. 0 residues)
      2. Desired upper bound of protein lengths: (e.g. 1500 residues)
      3. Desired bin width: (e.g. 50 residues)
        • Note: Please enter only the integer, not the string "residues"
  2. Define helper functions
    • make_dict
      • Parses the .fasta file, separating each protein's name from its sequence
    • identify_bin
      • Bins proteins by number of residues, based on the user-specified bin width and range of protein lengths
    • cumilative_sum
      • Creates a running sum of the elements in a list. For example, the list [1,2,3,4,5] becomes [1,3,6,10,15].
  3. Generate dictionary by calling make_dict
    • (protein name) : (protein length)
  4. Calculate number of proteins and relative frequency of proteins in each bin by calling identify_bin
  5. Calculate cumulative frequency of proteins by calling cumilative_sum
  6. Generate new dictionary based on binned data
    • (cur bin) : [ (protein length) , (relative frequency of occurrence in the human proteome), (cumulative frequency) ]
  7. Display output in command window
  8. Write output to CSV
  9. Plot and save histograms for data visualization
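
The helper functions described in the steps above can be sketched roughly as follows. This is a minimal illustration using base Python only, not the code in human_proteome_parser.py: the function signatures, the half-open bin convention, and the choice to compute relative frequency against every protein in the dictionary are all assumptions, and the three short sequences are made-up sample data rather than entries from uniprot-9606-reviewed.fasta.

```python
import io


def make_dict(fasta_lines):
    """Map each protein name (header line without '>') to its length in residues.

    Assumed signature: takes any iterable of text lines, e.g. an open file.
    """
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # A sequence may span several lines, so accumulate the length.
            lengths[name] += len(line)
    return lengths


def identify_bin(lengths, lower, upper, width):
    """Count proteins per bin and compute each bin's relative frequency.

    Assumption: bin i covers lengths in [lower + i*width, lower + (i+1)*width),
    and relative frequency is taken over all proteins in the dictionary.
    """
    n_bins = -(-(upper - lower) // width)  # ceiling division
    counts = [0] * n_bins
    for length in lengths.values():
        if lower <= length < upper:
            counts[(length - lower) // width] += 1
    total = len(lengths)
    rel_freq = [c / total if total else 0.0 for c in counts]
    return counts, rel_freq


def cumilative_sum(values):
    """Return the running sum of a list (name spelled as in the README)."""
    running = 0
    out = []
    for v in values:
        running += v
        out.append(running)
    return out


# Made-up sample input standing in for the real FASTA file.
sample = io.StringIO(
    ">protein_A\nMKT\nAAL\n"
    ">protein_B\nMKTAYIAKQRQISFVK\n"
    ">protein_C\nMK\n"
)
lengths = make_dict(sample)
counts, rel_freq = identify_bin(lengths, lower=0, upper=20, width=5)
cum_freq = cumilative_sum(rel_freq)

for i, count in enumerate(counts):
    print(i * 5, count, round(rel_freq[i], 3), round(cum_freq[i], 3))
```

With the real file, the same calls would take the open FASTA file handle and the three integers entered at the prompts in place of the sample values.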
