Formatted summary statistics will be saved to ==> C:\Users\P70~1\AppData\Local\Temp\RtmpQNDQJX\file371020c95a84.tsv.gz
Standardising column headers.
First line of summary statistics file:
SNP CHR BP PVAL A1 A2 N Z BETA SE NSTUDY
Summary statistics report:
- 45,984,943 rows
- 23,134,502 unique variants
- 114,938 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
1,391,650 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 24,304,912 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 240 seconds.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
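The error text points at duplicate key values, and the report above lists 45,984,943 rows for only 23,134,502 unique variants. The sketch below shows, on toy data, how repeated IDs on both sides of a join multiply the row count, and how to estimate the join size before running it. The tables `sumstats` and `ref` are illustrative stand-ins; whether this is what actually happens inside `format_sumstats()` is not confirmed here.

```r
library(data.table)

# Toy stand-ins (assumptions, not the real data): a sumstats table and a
# reference lookup, both with repeated SNP IDs.
sumstats <- data.table(
  SNP  = c("rs54", "rs54", "rs564", "rs564", "rs7"),
  PVAL = c(0.729, 0.69, 0.689, 0.197, 0.328)
)
ref <- data.table(
  SNP = c("rs54", "rs54", "rs564", "rs7"),
  BP  = c(13110L, 47159L, 15644L, 14599L)
)

# IDs that occur more than once in the summary statistics
dup_counts <- sumstats[, .N, by = SNP][N > 1]

# Estimate the join size up front: each key contributes
# (rows in sumstats) * (rows in ref). as.numeric() avoids integer overflow
# when the total approaches 2^31.
per_sum <- sumstats[, .(n_sum = .N), by = SNP]
per_ref <- ref[, .(n_ref = .N), by = SNP]
expected_rows <- merge(per_sum, per_ref, by = "SNP")[
  , sum(as.numeric(n_sum) * n_ref)
]

# The actual join produces exactly that many rows
joined <- merge(sumstats, ref, by = "SNP", allow.cartesian = TRUE)

print(dup_counts)
print(expected_rows)
print(nrow(joined))
```

If the estimated row count on the real data is far above the number of input rows, deduplicating the key column before the join (or joining on more columns, such as chromosome and position) is one thing to try.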
This looks like an issue with the sheer number of SNPs, so I'm not sure a machine with more RAM would help (though it might be worth a try). To test options, however, I would need the full summary statistics. Can you share them?
I found a workaround: splitting the sumstats into smaller tables, formatting each one separately, and then joining them back into one. I was just hoping there is a more elegant way to handle this.
Yeah, I guess you could inspect SNPs in chunks rather than checking them all at once; this would be a good feature enhancement. You would need to work out the cut-off for the number of SNPs and the chunk size, as there would be a time trade-off. If you would like to make a PR with code for this, it would be much appreciated; I don't have time to actively work on a solution for this at the minute.
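The chunked workaround discussed above could look roughly like the following. This is a minimal sketch, not code from MungeSumstats: `format_in_chunks()` and `toy_format()` are hypothetical helpers, and the chunk size is an arbitrary choice that would need tuning against the time trade-off mentioned.

```r
library(data.table)

# Hypothetical helper: split a table into row chunks, apply a formatting
# function to each chunk, and bind the results back together.
format_in_chunks <- function(dt, chunk_size, format_fn) {
  stopifnot(is.data.table(dt), chunk_size >= 1)
  if (nrow(dt) == 0L) return(format_fn(dt))
  # Rows 1..chunk_size get id 1, the next chunk_size rows get id 2, and so on
  chunk_id <- ceiling(seq_len(nrow(dt)) / chunk_size)
  chunks <- split(dt, chunk_id)
  formatted <- lapply(chunks, format_fn)
  # fill = TRUE tolerates chunks that come back with different columns
  rbindlist(formatted, use.names = TRUE, fill = TRUE)
}

# Placeholder formatter for illustration only: uppercases the A1 allele.
toy_format <- function(chunk) {
  out <- copy(chunk)
  out[, A1 := toupper(A1)]
  out[]
}

toy <- data.table(
  SNP = paste0("rs", 1:7),
  A1  = c("a", "c", "g", "t", "a", "c", "g")
)
result <- format_in_chunks(toy, chunk_size = 3, format_fn = toy_format)
print(result)
```

For the real data, `format_fn` would wrap the actual formatting call, for example something like `function(chunk) MungeSumstats::format_sumstats(path = chunk, ref_genome = "GRCh37", return_data = TRUE)`, assuming the function accepts an in-memory table and that this genome build matches the data (neither is stated in this thread). Checks that depend on the whole file, such as removing duplicates that span chunks, would have to be repeated after recombining.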
1. Bug description
Hi! I ran this code to standardize the summary statistics:
Any idea what I can do about this?
Console output
Data (for the first 50 rows)
df = structure(list(SNP = c("rs367896724", "rs145", "rs534229142",
"rs537182", "rs376342519", "rs5586", "rs575272151", "rs544419",
"rs5611", "rs54", "rs62635286", "rs62", "rs53173", "rs538791886",
"rs558318514", "rs55476", "rs574697788", "rs554", "rs546169444",
"rs7", "rs54194", "rs6682385", "rs199856693", "rs3982632", "rs576",
"rs2758118", "rs2758118", "rs53363", "rs564", "rs374", "rs2691317",
"rs2691315", "rs5575142", "rs541172944", "rs548165136", "rs755466349",
"rs539235482", "rs199745162", "rs578", "rs564", "rs533", "rs8",
"rs545414834", "rs54", "rs532819925", "rs1", "rs5677884", "rs553572247",
"rs539322794", "rs542415"), CHR = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), BP = c(10177L, 10352L,
10511L, 10539L, 10616L, 10642L, 11008L, 11012L, 11063L, 13110L,
13116L, 13118L, 13273L, 13289L, 13445L, 13483L, 13494L, 13550L,
14464L, 14599L, 14604L, 14930L, 14933L, 15211L, 15245L, 15274L,
15274L, 15585L, 15644L, 15774L, 15777L, 15820L, 15903L, 16071L,
16142L, 16226L, 16542L, 16949L, 17641L, 18643L, 18849L, 30923L,
46285L, 47159L, 47267L, 49298L, 49315L, 49343L, 49554L, 50891L
), PVAL = c(0.942, 0.682, 0.891, 0.393, 0.383, 0.297, 0.474,
0.474, 0.848, 0.729, 0.545, 0.545, 0.778, 0.0499, 0.109, 0.00465,
0.591, 0.0709, 0.643, 0.328, 0.328, 0.333, 0.901, 0.141, 0.116,
0.201, 0.259, 0.289, 0.689, 0.836, 0.35, 0.0248, 0.333, 0.565,
0.46, 0.497, 0.206, 0.595, 0.773, 0.197, 0.205, 0.684, 0.155,
0.69, 0.821, 0.311, 0.806, 0.745, 0.972, 0.394), A1 = c("AC",
"TA", "A", "A", "CCGCCGTTGCAAAGGCGCGCCG", "A", "G", "G", "G",
"A", "G", "G", "C", "C", "G", "C", "G", "A", "T", "A", "G", "A",
"A", "T", "T", "A", "G", "A", "A", "A", "G", "T", "GC", "A",
"A", "A", "A", "C", "A", "A", "C", "G", "A", "C", "G", "T", "A",
"C", "G", "C"), A2 = c("A", "T", "G", "C", "C", "G", "C", "C",
"T", "G", "T", "A", "G", "CCT", "C", "G", "A", "G", "A", "T",
"A", "G", "G", "G", "C", "T", "A", "G", "G", "G", "A", "G", "G",
"G", "G", "AG", "C", "A", "G", "G", "G", "T", "ATAT", "T", "T",
"C", "T", "T", "A", "T"), N = c(8160L, 8160L, 361237L, 16026L,
372627L, 361266L, 8160L, 8160L, 357928L, 363969L, 8160L, 8160L,
3701L, 378761L, 357928L, 357928L, 358181L, 367239L, 6832L, 8160L,
8160L, 8160L, 358725L, 8160L, 362555L, 3701L, 3701L, 369481L,
362738L, 364049L, 362923L, 2373L, 8160L, 375575L, 367282L, 26547L,
357680L, 364788L, 357928L, 361989L, 368762L, 3701L, 359800L,
364512L, 361256L, 10040L, 362387L, 362834L, 6832L, 367281L),
Z = c(0.0727563581760374, -0.409735480321281, 0.137038959961148,
-0.854189500094597, 0.872382030909752, 1.04288836267464,
-0.715985989610205, -0.715985989610205, 0.19167090224842,
0.346456061065837, -0.605269414941509, -0.605269414941509,
0.281926329587061, -1.96082020683793, 1.60270409055176, -2.83033010490082,
0.537387465090095, 1.80611742223106, -0.463508393356937,
0.978150286262472, 0.978150286262472, -0.968088845878538,
-0.124398198069055, 1.47207731715937, 1.57178681650986, 1.27870772031991,
1.1287578451833, 1.06031789670761, 0.400212511707879, -0.207012623385187,
-0.93458929107348, -2.24450387316539, 0.968088845878538,
-0.575430768607773, -0.738846849185214, 0.679217595655219,
1.26464113566108, 0.531604424103706, 0.288453003564521, -1.29014591650869,
-1.26743441691691, 0.407010876264466, -1.42209043212232,
0.398855065642337, -0.226258980439831, 1.01312595979589,
0.245589523422081, -0.325239256402395, 0.0351000017727088,
0.852385797957575), BETA = c(0.00198916, -0.0109805, 0.00765789,
-0.149708, 0.0225852, 0.148159, -0.0281357, -0.028136, 0.103634,
0.00314893, -0.0212581, -0.0212581, 0.0161786, -0.0745136,
0.139501, -0.0774387, 0.0209628, 0.0577324, -0.0191033, 0.0330887,
0.0330901, -0.025562, -0.00126148, 0.0439155, 0.0906229,
0.0540921, 0.0478291, 0.0255675, 0.0135413, -0.00585945,
-0.0164868, -0.119141, 0.0259418, -0.183099, -0.0257248,
0.0400081, 0.182568, 0.00773019, 0.0147548, -0.0327346, -0.0154651,
0.0315515, -0.0640722, 0.0034205, -0.0238865, 0.0309572,
0.0157055, -0.0169812, 0.00182556, 0.0274896), SE = c(0.0274895,
0.0268163, 0.0558682, 0.175335, 0.0258707, 0.141956, 0.0392787,
0.0392787, 0.542386, 0.00908721, 0.0351191, 0.0351191, 0.0574542,
0.0380054, 0.0869389, 0.0273598, 0.0389586, 0.0319694, 0.0412681,
0.0338204, 0.0338204, 0.0264114, 0.0100911, 0.0298549, 0.0576995,
0.0423158, 0.0423857, 0.0241328, 0.033891, 0.0282659, 0.0176259,
0.0530988, 0.0268215, 0.317943, 0.0348059, 0.0589221, 0.144412,
0.0145595, 0.0512095, 0.0253839, 0.0122108, 0.0776434, 0.0450702,
0.00857457, 0.105857, 0.0305461, 0.0639575, 0.0521867, 0.0527002,
0.0322444), NSTUDY = c(5L, 5L, 2L, 5L, 7L, 2L, 5L, 5L, 2L,
5L, 5L, 5L, 4L, 8L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 4L, 5L,
3L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 2L, 4L, 7L, 2L, 6L, 2L,
4L, 6L, 4L, 3L, 6L, 2L, 7L, 3L, 3L, 4L, 4L)), row.names = c(NA,
-50L), class = c("data.table", "data.frame"))
3. Session info
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2
[4] BiocGenerics_0.44.0 data.table_1.14.8