Join results in more than 2^31 rows for format_sumstats #164

Open
Snigireva opened this issue Aug 17, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@Snigireva

1. Bug description

Hi! I ran this code to standardize the summary statistics:

data <- data.table::fread('C:/Folder/trait_qc.sumstats.csv.gz')
reformatted <- MungeSumstats::format_sumstats(path = data, ref_genome = "GRCh37", compute_z = TRUE, return_data = TRUE)

It stops with a data.table join error (see the console output below). Any idea what to do about that?

Console output


Formatted summary statistics will be saved to ==>  C:\Users\P70~1\AppData\Local\Temp\RtmpQNDQJX\file371020c95a84.tsv.gz
Standardising column headers.
First line of summary statistics file: 
SNP	CHR	BP	PVAL	A1	A2	N	Z	BETA	SE	NSTUDY	
Summary statistics report:
   - 45,984,943 rows
   - 23,134,502 unique variants
   - 114,938 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
1,391,650 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 24,304,912 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 240 seconds.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
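
The error text suggests checking for duplicate key values, and the report above lists 45,984,943 rows but only 23,134,502 unique variants, so repeated SNP IDs look plausible here. The following is a minimal sketch, not MungeSumstats internals: the tables `ss` and `ref` and all their values are made up for illustration. It estimates how many rows a join on `SNP` would produce before running it, and lists the repeated IDs.

```r
library(data.table)

# Toy summary statistics with one repeated ID (hypothetical data)
ss <- data.table(SNP = c("rs1", "rs2", "rs2", "rs3"),
                 BETA = c(0.1, 0.2, 0.2, 0.3))
# Toy reference table that also repeats one ID (hypothetical data)
ref <- data.table(SNP = c("rs1", "rs2", "rs2", "rs2"),
                  POS = c(10L, 20L, 21L, 22L))

# Count how often each key appears on each side
cnt_ss  <- ss[, .(n_ss = .N), by = SNP]
cnt_ref <- ref[, .(n_ref = .N), by = SNP]

# Inner-join the counts: every shared key contributes n_ss * n_ref rows.
# as.numeric() avoids integer overflow when the real counts are large.
counts <- merge(cnt_ss, cnt_ref, by = "SNP")
expected_rows <- counts[, sum(as.numeric(n_ss) * as.numeric(n_ref))]

# IDs that occur more than once in the summary statistics
dup_ids <- ss[duplicated(SNP), unique(SNP)]

print(expected_rows)
print(dup_ids)
```

If the estimated row count is far above the number of input rows, the duplicated IDs are the first thing to inspect.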

Data (for the first 50 rows)

df = structure(list(SNP = c("rs367896724", "rs145", "rs534229142",
"rs537182", "rs376342519", "rs5586", "rs575272151", "rs544419",
"rs5611", "rs54", "rs62635286", "rs62", "rs53173", "rs538791886",
"rs558318514", "rs55476", "rs574697788", "rs554", "rs546169444",
"rs7", "rs54194", "rs6682385", "rs199856693", "rs3982632", "rs576",
"rs2758118", "rs2758118", "rs53363", "rs564", "rs374", "rs2691317",
"rs2691315", "rs5575142", "rs541172944", "rs548165136", "rs755466349",
"rs539235482", "rs199745162", "rs578", "rs564", "rs533", "rs8",
"rs545414834", "rs54", "rs532819925", "rs1", "rs5677884", "rs553572247",
"rs539322794", "rs542415"), CHR = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), BP = c(10177L, 10352L,
10511L, 10539L, 10616L, 10642L, 11008L, 11012L, 11063L, 13110L,
13116L, 13118L, 13273L, 13289L, 13445L, 13483L, 13494L, 13550L,
14464L, 14599L, 14604L, 14930L, 14933L, 15211L, 15245L, 15274L,
15274L, 15585L, 15644L, 15774L, 15777L, 15820L, 15903L, 16071L,
16142L, 16226L, 16542L, 16949L, 17641L, 18643L, 18849L, 30923L,
46285L, 47159L, 47267L, 49298L, 49315L, 49343L, 49554L, 50891L
), PVAL = c(0.942, 0.682, 0.891, 0.393, 0.383, 0.297, 0.474,
0.474, 0.848, 0.729, 0.545, 0.545, 0.778, 0.0499, 0.109, 0.00465,
0.591, 0.0709, 0.643, 0.328, 0.328, 0.333, 0.901, 0.141, 0.116,
0.201, 0.259, 0.289, 0.689, 0.836, 0.35, 0.0248, 0.333, 0.565,
0.46, 0.497, 0.206, 0.595, 0.773, 0.197, 0.205, 0.684, 0.155,
0.69, 0.821, 0.311, 0.806, 0.745, 0.972, 0.394), A1 = c("AC",
"TA", "A", "A", "CCGCCGTTGCAAAGGCGCGCCG", "A", "G", "G", "G",
"A", "G", "G", "C", "C", "G", "C", "G", "A", "T", "A", "G", "A",
"A", "T", "T", "A", "G", "A", "A", "A", "G", "T", "GC", "A",
"A", "A", "A", "C", "A", "A", "C", "G", "A", "C", "G", "T", "A",
"C", "G", "C"), A2 = c("A", "T", "G", "C", "C", "G", "C", "C",
"T", "G", "T", "A", "G", "CCT", "C", "G", "A", "G", "A", "T",
"A", "G", "G", "G", "C", "T", "A", "G", "G", "G", "A", "G", "G",
"G", "G", "AG", "C", "A", "G", "G", "G", "T", "ATAT", "T", "T",
"C", "T", "T", "A", "T"), N = c(8160L, 8160L, 361237L, 16026L,
372627L, 361266L, 8160L, 8160L, 357928L, 363969L, 8160L, 8160L,
3701L, 378761L, 357928L, 357928L, 358181L, 367239L, 6832L, 8160L,
8160L, 8160L, 358725L, 8160L, 362555L, 3701L, 3701L, 369481L,
362738L, 364049L, 362923L, 2373L, 8160L, 375575L, 367282L, 26547L,
357680L, 364788L, 357928L, 361989L, 368762L, 3701L, 359800L,
364512L, 361256L, 10040L, 362387L, 362834L, 6832L, 367281L),
Z = c(0.0727563581760374, -0.409735480321281, 0.137038959961148,
-0.854189500094597, 0.872382030909752, 1.04288836267464,
-0.715985989610205, -0.715985989610205, 0.19167090224842,
0.346456061065837, -0.605269414941509, -0.605269414941509,
0.281926329587061, -1.96082020683793, 1.60270409055176, -2.83033010490082,
0.537387465090095, 1.80611742223106, -0.463508393356937,
0.978150286262472, 0.978150286262472, -0.968088845878538,
-0.124398198069055, 1.47207731715937, 1.57178681650986, 1.27870772031991,
1.1287578451833, 1.06031789670761, 0.400212511707879, -0.207012623385187,
-0.93458929107348, -2.24450387316539, 0.968088845878538,
-0.575430768607773, -0.738846849185214, 0.679217595655219,
1.26464113566108, 0.531604424103706, 0.288453003564521, -1.29014591650869,
-1.26743441691691, 0.407010876264466, -1.42209043212232,
0.398855065642337, -0.226258980439831, 1.01312595979589,
0.245589523422081, -0.325239256402395, 0.0351000017727088,
0.852385797957575), BETA = c(0.00198916, -0.0109805, 0.00765789,
-0.149708, 0.0225852, 0.148159, -0.0281357, -0.028136, 0.103634,
0.00314893, -0.0212581, -0.0212581, 0.0161786, -0.0745136,
0.139501, -0.0774387, 0.0209628, 0.0577324, -0.0191033, 0.0330887,
0.0330901, -0.025562, -0.00126148, 0.0439155, 0.0906229,
0.0540921, 0.0478291, 0.0255675, 0.0135413, -0.00585945,
-0.0164868, -0.119141, 0.0259418, -0.183099, -0.0257248,
0.0400081, 0.182568, 0.00773019, 0.0147548, -0.0327346, -0.0154651,
0.0315515, -0.0640722, 0.0034205, -0.0238865, 0.0309572,
0.0157055, -0.0169812, 0.00182556, 0.0274896), SE = c(0.0274895,
0.0268163, 0.0558682, 0.175335, 0.0258707, 0.141956, 0.0392787,
0.0392787, 0.542386, 0.00908721, 0.0351191, 0.0351191, 0.0574542,
0.0380054, 0.0869389, 0.0273598, 0.0389586, 0.0319694, 0.0412681,
0.0338204, 0.0338204, 0.0264114, 0.0100911, 0.0298549, 0.0576995,
0.0423158, 0.0423857, 0.0241328, 0.033891, 0.0282659, 0.0176259,
0.0530988, 0.0268215, 0.317943, 0.0348059, 0.0589221, 0.144412,
0.0145595, 0.0512095, 0.0253839, 0.0122108, 0.0776434, 0.0450702,
0.00857457, 0.105857, 0.0305461, 0.0639575, 0.0521867, 0.0527002,
0.0322444), NSTUDY = c(5L, 5L, 2L, 5L, 7L, 2L, 5L, 5L, 2L,
5L, 5L, 5L, 4L, 8L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 4L, 5L,
3L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 2L, 4L, 7L, 2L, 6L, 2L,
4L, 6L, 4L, 3L, 6L, 2L, 7L, 3L, 3L, 4L, 4L)), row.names = c(NA,
-50L), class = c("data.table", "data.frame"))

3. Session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2
[4] BiocGenerics_0.44.0 data.table_1.14.8

@Snigireva added the bug label Aug 17, 2023
@Al-Murphy
Owner

This looks like an issue with the sheer number of SNPs, so I'm not sure a machine with more RAM would help (though it might be worth a try). To test options I would need the full summary statistics, however. Can you share them?

@Snigireva
Author

I found a way around this by dividing the sumstats into smaller tables, formatting them separately, and then joining the results back into one table, but I was hoping there is a more elegant way to handle this.
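
The workaround described above could look roughly like the sketch below. The helper `format_in_chunks`, the choice of `CHR` as the splitting column, and the toy data are assumptions for illustration; the thread does not say how the table was actually divided. A stand-in formatter is used so the example runs without the reference genome packages.

```r
library(data.table)

# Hypothetical helper: apply `format_fn` to each chunk, then stack the results.
format_in_chunks <- function(dt, format_fn, chunk_col = "CHR") {
  chunks <- split(dt, by = chunk_col)     # one data.table per value of chunk_col
  formatted <- lapply(chunks, format_fn)  # format each chunk independently
  rbindlist(formatted, use.names = TRUE, fill = TRUE)
}

# Toy input (hypothetical data) and a stand-in formatter that returns its input
toy <- data.table(SNP = c("rs1", "rs2", "rs3"),
                  CHR = c(1L, 1L, 2L),
                  BP  = c(100L, 200L, 300L))
identity_fmt <- function(chunk) chunk

out <- format_in_chunks(toy, identity_fmt)
print(out)
```

With the real data, `format_fn` might wrap the call from the original report, for example `function(chunk) MungeSumstats::format_sumstats(path = chunk, ref_genome = "GRCh37", compute_z = TRUE, return_data = TRUE)`, assuming each call returns a table that `rbindlist` can stack. The reference data is then loaded once per chunk, so this is likely to be slower than a single call.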

@Al-Murphy
Owner

Yeah, I guess you could check the SNPs in chunks rather than all at once; this would be a good feature enhancement. You would need to work out the cut-off for the number of SNPs and the chunk size, as there would be a time trade-off. If you would like to make a PR with code for this, it would be much appreciated. I don't have time to actively work on a solution for this at the minute.
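
One way to express the cut-off and chunk size mentioned above is a small index helper like the sketch below. The function `chunk_indices` and the numbers used are hypothetical; suitable values for real data would have to be found by timing runs.

```r
# Hypothetical helper: return a list of row-index vectors.
# Tables with at most `cutoff` rows are kept as a single chunk;
# larger tables are cut into pieces of at most `chunk_size` rows.
chunk_indices <- function(n, chunk_size, cutoff = chunk_size) {
  if (n <= cutoff) {
    return(list(seq_len(n)))
  }
  rows <- seq_len(n)
  split(rows, ceiling(rows / chunk_size))
}

# 10 rows, cut-off of 5, chunks of 4 rows
idx <- chunk_indices(10, chunk_size = 4, cutoff = 5)
print(lengths(idx))

# Below the cut-off nothing is split
single <- chunk_indices(3, chunk_size = 4, cutoff = 5)
print(length(single))
```

Each element of `idx` can then be used to subset the table, for example `dt[idx[[1]]]`, before checking that subset against the reference. Smaller chunks keep each join small but repeat the per-call overhead more often, which is the time trade-off noted above.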
