A Python script to parse and bin proteins in the human proteome by protein length and visualize results.
To challenge myself, I wrote the parsing and binning logic in plain Python, with no third-party libraries. The one exception is matplotlib.pyplot, which is used only to generate the figures.
- Python 3
- Any text editor
- The file called "uniprot-9606-reviewed.fasta" which is included in this repository
- Download this repository
- Run the script "human_proteome_parser.py" from inside the folder called "run_from_this_dict"
- Set parameters
- User is prompted to enter the following:
- Desired lower bound of protein lengths: (e.g. 0 residues)
- Desired upper bound of protein lengths: (e.g. 1500 residues)
- Desired bin width: (e.g. 50 residues)
- Note: Please enter only the integer, not the string "residues"
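The prompts above can be sketched as follows. This is a minimal illustration, not the code in "human_proteome_parser.py": the function names `read_int` and `read_parameters` and the prompt wording are hypothetical, and the `reader` argument exists only so the functions can be exercised without a keyboard.

```python
def read_int(prompt, reader=input):
    """Ask until the reply is a whole number, then return it as an int."""
    while True:
        text = reader(prompt).strip()
        try:
            return int(text)
        except ValueError:
            # Triggered by replies such as "50 residues".
            print("Please enter only the integer, e.g. 50")


def read_parameters(reader=input):
    """Collect the three settings in the order listed above (hypothetical helper)."""
    lower = read_int("Lower bound of protein lengths: ", reader)
    upper = read_int("Upper bound of protein lengths: ", reader)
    width = read_int("Bin width: ", reader)
    return lower, upper, width
```

Calling `read_parameters()` with no arguments reads from the keyboard. Re-asking on bad input is a design choice of this sketch; the actual script may behave differently.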
- Define helper functions
- make_dict
- Parses the .fasta file, isolating the protein name from the molecular sequence
- identify_bin
- Based on user-specified input (bin width, range of protein sizes to consider), bin proteins based on number of residues
- cumilative_sum
- Creates running sum of elements in a list. For example, the list [1,2,3,4,5] becomes [1,3,6,10,15].
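One possible shape for the three helpers is sketched below. The names `parse_fasta_lengths`, `bin_start`, and `running_sum` are hypothetical stand-ins, not the script's own functions, and two details are assumptions: the whole header line (minus `>`) is used as the protein name, and bins are half-open, so a bin starting at 50 with width 50 covers lengths 50 to 99.

```python
def parse_fasta_lengths(lines):
    """Map each FASTA header (without '>') to the length of its sequence."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # A sequence may span several lines, so add up the pieces.
            lengths[name] += len(line)
    return lengths


def bin_start(length, lower, upper, width):
    """Return the lower edge of the bin holding `length`, or None if out of range."""
    if length < lower or length >= upper:
        return None
    return lower + ((length - lower) // width) * width


def running_sum(values):
    """Return the running totals of `values`."""
    total = 0
    totals = []
    for value in values:
        total += value
        totals.append(total)
    return totals


print(running_sum([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

To run the parser on the real data, pass an open file handle, for example `parse_fasta_lengths(open("uniprot-9606-reviewed.fasta"))`.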
- Generate dictionary by calling make_dict
- (protein name) : (protein length)
- Calculate number of proteins and relative frequency of proteins in each bin by calling identify_bin
- Calculate cumulative frequency of proteins by calling cumilative_sum
- Generate new dictionary based on binned data
- (cur bin) : [ (protein length) , (relative frequency of occurrence in the human proteome), (cumulative frequency) ]
- Display output in command window
- Write output to CSV
- Plot and save histograms for data visualization
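The binning and CSV steps can be illustrated with a short sketch. Everything here is an assumption made for the example rather than the script's actual behavior: the names `summarize_bins` and `write_csv`, the column headers, storing the protein count as the first value per bin, and computing relative frequency over only the proteins that fall inside the chosen range.

```python
import io


def summarize_bins(lengths, lower, upper, width):
    """Return {bin start: [count, relative frequency, cumulative frequency]}."""
    starts = list(range(lower, upper, width))
    counts = {start: 0 for start in starts}
    for length in lengths:
        if lower <= length < upper:
            counts[lower + ((length - lower) // width) * width] += 1
    total = sum(counts.values())
    table = {}
    cumulative = 0.0
    for start in starts:
        relative = counts[start] / total if total else 0.0
        cumulative += relative
        table[start] = [counts[start], relative, cumulative]
    return table


def write_csv(table, handle):
    """Write one row per bin to an open text file or file-like object."""
    handle.write("bin_start,count,relative_frequency,cumulative_frequency\n")
    for start, (count, relative, cumulative) in table.items():
        handle.write(f"{start},{count},{relative:.4f},{cumulative:.4f}\n")


# Small made-up example; 500 lies outside the range and is ignored.
table = summarize_bins([10, 20, 60, 120, 500], lower=0, upper=150, width=50)
buffer = io.StringIO()
write_csv(table, buffer)
print(buffer.getvalue())
```

Writing to a file instead only requires passing a handle such as `open("bins.csv", "w")`, where the file name is again just an example.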