Safer online QC statistics on the SAM output stream for post-mortem analysis of alignment (or other processes) #37

Open
vinjana opened this issue Nov 15, 2018 · 0 comments

vinjana commented Nov 15, 2018

Motivation: If an alignment job (or any other job) takes exceptionally long to run, requires an exceptionally large amount of memory, or shows similar anomalies, it is time-consuming to identify characteristics of the data itself that may cause the anomaly. The online statistics could be displayed in OTP to point researchers to problems with their sample.

Goal: Compute certain QC statistics "on line" on the output stream of the alignment (SAM; or VCF or whatever for other steps) and save them at regular intervals. "On line" here means that the statistics should be written for individual chunks of reads in the stream, e.g. every 10e6 reads, and/or aggregated over the full sample seen up to that moment. The online statistics need to be saved to disk repeatedly during processing and must not be deleted at the end. Currently, all statistics files stay empty during processing, and the QC scripts just dump their results at the end.
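
A minimal sketch of the chunked checkpointing described above, assuming a pass-through filter sitting on the SAM stream. The names `save_atomically`, `stream_with_checkpoints` and `stats_path` are hypothetical, the JSON layout is an assumption, and "10e6" is read literally as 10,000,000 reads. Only a single counter (unmapped reads) is tracked, to keep the focus on the periodic save:

```python
import json
import os
import tempfile


def save_atomically(path, payload):
    """Write JSON to a temp file, then rename it, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def stream_with_checkpoints(lines, out, stats_path, chunk_size=10_000_000):
    """Pass SAM lines through unchanged; checkpoint counts every chunk_size reads."""
    total = {"reads": 0, "unmapped": 0}
    chunk = {"reads": 0, "unmapped": 0}
    chunks = []
    for line in lines:
        out.write(line)
        if line.startswith("@"):  # header lines carry no alignment record
            continue
        flag = int(line.split("\t", 2)[1])
        unmapped = 1 if flag & 0x4 else 0  # SAM FLAG bit 0x4: segment unmapped
        for counter in (total, chunk):
            counter["reads"] += 1
            counter["unmapped"] += unmapped
        if chunk["reads"] == chunk_size:
            chunks.append(chunk)
            chunk = {"reads": 0, "unmapped": 0}
            save_atomically(stats_path, {"chunks": chunks, "total": total})
    if chunk["reads"]:  # keep the final, partially filled chunk
        chunks.append(chunk)
    save_atomically(stats_path, {"chunks": chunks, "total": total})
    return {"chunks": chunks, "total": total}
```

Because the statistics file is replaced by a rename rather than rewritten in place, a job that is killed mid-run still leaves the last complete checkpoint behind for post-mortem analysis.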

For the alignment and merging steps, the following statistics should be collected from the per-lane SAM stream at regular intervals:

  • Insert-size statistics (min, max, Q1, Q2, Q3) (overly long fragments)
  • Fraction of read pairs aligned to different chromosomes
  • Soft-clipping rate (shorter reads are ambiguous to align)
  • Length distribution of remaining soft-clipped reads (shorter reads are ambiguous to align)
  • Number and proportion of FF, FR, RR read pairs
  • Number and proportion of reads aligned with large gaps
  • Number and proportion of reads aligned with the Smith-Waterman secondary alignment step in BWA (slower; XT attribute)
  • Number and proportion of unaligned reads
  • Distribution parameters of suboptimal hits in BWA (X1 attribute)
  • Distribution parameters of number of best hits (X0 attribute)
  • others? (please add!)
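
Several of the statistics above can be derived from the standard SAM columns (FLAG, CIGAR, RNEXT, TLEN) plus BWA's `XT` tag. The following is a rough sketch rather than a finished QC tool. The function names, the `large_gap=50` threshold, and the label strings are all assumptions. The orientation labels are simplified to strand agreement (FF = both forward, RR = both reverse, FR = opposite strands), and pair-level labels are counted once per pair on the first read. Insert sizes are kept as a histogram so that memory stays bounded by the number of distinct values:

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def classify_record(fields, large_gap=50):
    """Return the labels that apply to one SAM alignment record (list of columns)."""
    flag, rname, cigar, rnext = int(fields[1]), fields[2], fields[5], fields[6]
    tags = {f[:2]: f[5:] for f in fields[11:]}  # "XT:A:M" -> {"XT": "M"}
    if flag & 0x4:  # segment unmapped
        return {"unaligned"}
    labels = set()
    ops = CIGAR_OP.findall(cigar)
    if any(op == "S" for _, op in ops):
        labels.add("soft_clipped")
    if any(op in "IDN" and int(n) >= large_gap for n, op in ops):
        labels.add("large_gap")  # threshold is an assumption
    if tags.get("XT") == "M":  # BWA: XT:A:M marks a mate Smith-Waterman alignment
        labels.add("mate_sw")
    # Pair-level labels: paired, first in pair, and mate mapped.
    if flag & 0x1 and flag & 0x40 and not flag & 0x8:
        if rnext not in ("=", rname):
            labels.add("different_chromosome")
        reverse, mate_reverse = bool(flag & 0x10), bool(flag & 0x20)
        if reverse != mate_reverse:
            labels.add("FR")
        else:
            labels.add("RR" if reverse else "FF")
    return labels


def quantile_from_histogram(hist, q):
    """Smallest value v with at least a fraction q of observations <= v (None if empty)."""
    threshold = q * sum(hist.values())
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= threshold:
            return value
    return None


def summarize(lines):
    """Aggregate label counts, proportions, and insert-size quartiles over SAM lines."""
    counts, insert_sizes, reads = Counter(), Counter(), 0
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        reads += 1
        counts.update(classify_record(fields))
        tlen = int(fields[8])
        if tlen > 0:  # positive TLEN appears once per pair (leftmost segment)
            insert_sizes[tlen] += 1
    points = (("min", 0.0), ("Q1", 0.25), ("Q2", 0.5), ("Q3", 0.75), ("max", 1.0))
    return {
        "reads": reads,
        "counts": dict(counts),
        "proportions": {k: v / reads for k, v in counts.items()} if reads else {},
        "insert_size": {name: quantile_from_histogram(insert_sizes, q) for name, q in points},
    }
```

The `X0` and `X1` distributions could be collected the same way, by feeding `int(tags["X0"])` and `int(tags["X1"])` into additional `Counter` histograms when those tags are present.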

Any statistic that may relate to an exceptionally long runtime or to otherwise failing jobs, during alignment or any of its follow-up processing steps, is of interest.
