genome reference assembly extraction from variants/reads objects? #13

ttriche · 2014-10-30T00:25:01Z

One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).

Unfortunately, this safeguard can't be enforced when no genome is specified.

In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?

pgrosu · 2014-10-30T02:58:38Z

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less

Then in the the metadata section you will see listed as GRCh37 like this:

...
 "metadata" : [ {
      "key" : "FORMAT",
      "id" : "DS",
      "type" : "float",
      "number" : "1",
      "description" : "Genotype dosage from MaCH/Thunder"
    }, {
      "key" : "reference",
      "value" : "GRCh37",
      "type" : "unknownType"
    }, {
...

Hope it helps,
~p

ttriche · 2014-10-30T03:04:29Z

That is terrific; it's in there, at least.

This sort of thing may become more of an issue (generally) as "variant" calls move towards graph traversals from assembly A to variant assembly representation B. But at least there's a way to extract it and prevent some issues when using the data from R, which avoids certain types of derp.

Thank you!

--t

On Oct 29, 2014, at 7:58 PM, Paul Grosu notifications@github.com wrote:

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less
Then in the the metadata section you will see listed as GRCh37 like this:

...
"metadata" : [ {
"key" : "FORMAT",
"id" : "DS",
"type" : "float",
"number" : "1",
"description" : "Genotype dosage from MaCH/Thunder"
}, {
"key" : "reference",
"value" : "GRCh37",
"type" : "unknownType"
}, {
...
Hope it helps,
~p

—
Reply to this email directly or view it on GitHub.

pgrosu · 2014-10-30T03:36:18Z

Yeah, using a graph representation of the variants will definitely be a different approach, and that's the way GA4GH is also going. It will take a bit of work to do it efficiently, but that's the fun part :)

Also below are the steps if you want to do what I did above in Java through R:

First paste the following after you perform the authenticate("client_secrets.json"):

library(httr)

getMetaData <- function(datasetIds="10473108253681171589", endpoint="https://www.googleapis.com/genomics/v1beta/",
  pageToken=NULL) {

  # Fetch data from the Genomics API
  body <- list(datasetIds=list(datasetIds), pageToken=pageToken)

  message("Fetching read data page")

  #queryParams <- list(fields="nextPageToken,reads(name,cigar,position,originalBases,flags)")
  queryParams <- list(fields="nextPageToken,variantSets(metadata)")
  queryConfig <- config()
  if (GoogleGenomics:::.authStore$use_api_key) {
    queryParams <- c(queryParams, key=GoogleGenomics:::.authStore$api_key)
  } else {
    queryConfig <- config(token=GoogleGenomics:::.authStore$google_token)
  }
  res <- POST(paste(endpoint, "variantsets/search", sep=""),
    query=queryParams,
    body=rjson::toJSON(body), queryConfig,
    add_headers("Content-Type"="application/json"))
  if("error" %in% names(httr::content(res))) {
    print(paste("ERROR:", httr::content(res)$error$message))
  }
  stop_for_status(res)

  message("Parsing read data page")
  res_content <- httr::content(res)
  res_content
}

dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value

Once you run the last two commmands you should see something like this:

> dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value
Fetching read data page
Parsing read data page
> [1] "GRCh37"

~p

deflaux · 2015-01-15T22:30:23Z

This is a great feature request and is something we should do soon.

For reads, its the referenceSetId which can then be used to look up the reference set

java -jar target/genomics-tools-client-java-v1beta2.jar getreadgroupset --id CMvnhpKTFhDnk4_9zcKO3_YB --pretty_print
...
    "id" : "ChhDTXZuaHBLVEZoRG5rNF85emNLTzNfWUIQAQ",
    "name" : "ERR016214",
    "predictedInsertSize" : 465,
    "referenceSetId" : "EOSt9JOVhp3jkwE",
    "sampleId" : "HG00261"
...

For variants, it comes from the VCF the header and we can get it from the variant set metadata.

java -jar target/genomics-tools-client-java-v1beta2.jar getvariantset --id 3049512673186936334 --pretty_print
...
 }, {
    "key" : "reference",
    "type" : "UNKNOWN_TYPE",
    "value" : "file:///illumina/development/Isis/Genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFASTA/genome.fa"
  }, {
...

Per @calbach down the road we should have https://github.com/googlegenomics/api-client-java/issues/66

deflaux assigned siddharthab Jan 15, 2015

ttriche unassigned siddharthab May 26, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genome reference assembly extraction from variants/reads objects? #13

genome reference assembly extraction from variants/reads objects? #13

ttriche commented Oct 30, 2014

pgrosu commented Oct 30, 2014

ttriche commented Oct 30, 2014

pgrosu commented Oct 30, 2014

deflaux commented Jan 15, 2015

genome reference assembly extraction from variants/reads objects? #13

genome reference assembly extraction from variants/reads objects? #13

Comments

ttriche commented Oct 30, 2014

pgrosu commented Oct 30, 2014

ttriche commented Oct 30, 2014

pgrosu commented Oct 30, 2014

deflaux commented Jan 15, 2015