
ChainAPI should support multiple resolutions of data for querying large timespans #28

Closed
nicklee-998 opened this issue Oct 6, 2014 · 6 comments · Fixed by #77 or #83

@nicklee-998 (Member)

It's currently slow for the client to fetch a month or even a year of data. Is it possible to speed up the request, and perhaps let clients choose how much detail they want to acquire?

@ssfrr ssfrr changed the title Hope chain-api can improve the speed of fetching large range of datas. ChainAPI should support multiple resolutions of data for querying large timespans Oct 9, 2014
@ssfrr (Member) commented Apr 18, 2017

Influx supports "continuous queries" (CQs) to maintain data aggregations. I think a reasonable approach is to maintain a few different resolutions of data in the database:

  • hour
  • day
  • week

For each of these aggregation periods we'll store a few different aggregations:

  • min
  • max
  • mean
  • count (the number of measurements in that time window)
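As a sketch of what one of these continuous queries might look like in InfluxQL — the database and measurement names ("chain", "sensordata", "value") are assumptions for illustration, not the actual ChainAPI schema:

```sql
-- Hypothetical 1-hour rollup CQ; names are illustrative only.
CREATE CONTINUOUS QUERY "sensordata_1h_cq" ON "chain"
BEGIN
  SELECT min("value"), max("value"), mean("value"), count("value")
    INTO "sensordata_1h"
    FROM "sensordata"
    GROUP BY time(1h), *
END
```

The trailing `*` in the GROUP BY preserves all tags, which is what would give per-sensor, per-metric grouping.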

@bmayton @d0nd3r3k @dramsay9 @slash6475 does this seem like it captures your use cases?

A few other random issues/ideas:

  • We want to aggregate on a per-metric, per-sensor basis. I think we can handle that with a GROUP BY clause in the CQ
  • It looks like there may have been some changes in backfilling behavior (whether a CQ applies to data already in the DB), so we should confirm that we can backfill on the latest InfluxDB
  • Because our aggregation periods can be nested (24 hours/day, 7 days/week) we should be able to pull each aggregate from the higher-resolution one rather than from the original data, but that implies that the CQs need to run in the right order (smallest-to-largest period). I'm not sure how to express that.
  • I think that the min/max/mean/count aggregates will be written as fields within sensordata_1h, sensordata_1d, etc. measurements, but I'm open to other architectures if they make more sense.
  • when aggregating the 1-week mean from the 1-day means, I'm thinking we should weight each of the 1-day means by their count, so we get the answer we would have gotten by averaging all the individual measurements. There's also an argument though that getting the weekly average from the average from each day might be more meaningful, even if there was an uneven number of measurements in each day.
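To make the count-weighting concrete, here's a minimal sketch of rolling daily aggregates up into a weekly one. The function and field names are illustrative, not ChainAPI code; the point is that weighting each daily mean by its count reproduces the mean over all underlying measurements.

```python
def combine_aggregates(days):
    """Roll a list of per-day aggregates (dicts with 'min', 'max',
    'mean', 'count') up into a single aggregate for the whole span."""
    total = sum(d["count"] for d in days)
    return {
        "min": min(d["min"] for d in days),
        "max": max(d["max"] for d in days),
        # count-weighted mean == the mean of all raw measurements
        "mean": sum(d["mean"] * d["count"] for d in days) / total,
        "count": total,
    }

daily = [
    {"min": 1.0, "max": 4.0, "mean": 2.0, "count": 10},
    {"min": 0.5, "max": 3.0, "mean": 3.0, "count": 30},
]
week = combine_aggregates(daily)
# week["mean"] is (2.0*10 + 3.0*30) / 40 = 2.75, not the unweighted 2.5
```

The unweighted alternative (averaging the daily means directly) would give 2.5 here; which answer is "right" depends on whether a sparsely-sampled day should count as much as a densely-sampled one.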

@ellachao does this seem pretty reasonable to you?

@slash6475 (Member)

That works perfectly for me.
Like you, I'm wondering about the backfilling behavior of CQs when we update past records.
Apparently there is an open issue about this topic, so we can hope it gets addressed in some hypothetical future :) or else??

@davidbramsay (Member) commented Apr 19, 2017 via email

@ssfrr (Member) commented Apr 19, 2017

@dramsay9 do you mean first/last timestamps for the aggregate timeslices?

@ssfrr ssfrr closed this as completed in #77 May 8, 2017
@ssfrr ssfrr reopened this May 8, 2017
@ssfrr (Member) commented May 8, 2017

@slash6475 see #78.

So we have the back-end stuff running now to store the data in 1h, 1d, and 1w aggregates, but we don't have an API for it yet. Here's what I'm thinking:

Option 1

Add optional URL parameters to the ch:dataHistory link in the Sensor resource, and have the data that gets returned look more or less the same as the raw data does now. Currently that link looks like:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689

In the new scheme it would be:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689{&aggtime,aggfunc}

where aggtime can be 1h, 1d, or 1w and aggfunc could be min, max, mean, or count. So an appropriate expansion of this would be:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&aggfunc=max

The data returned would be a list of timestamp/value pairs, just like the current raw data.
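A client would expand the `{&aggtime,aggfunc}` form-style continuation before making the request. Here's a minimal sketch of that expansion in Python (real clients would more likely use a URI Template library; this hand-rolled `expand` is illustrative only and handles just the `{&...}` form):

```python
import re

def expand(template, **params):
    """Expand {&name1,name2} form-style template expressions,
    dropping any variables the caller didn't supply."""
    def repl(match):
        names = match.group(1).split(",")
        pairs = ["%s=%s" % (n, params[n]) for n in names if n in params]
        return "&" + "&".join(pairs) if pairs else ""
    return re.sub(r"\{&([^}]+)\}", repl, template)

url = expand(
    "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689{&aggtime,aggfunc}",
    aggtime="1h", aggfunc="max",
)
# -> .../scalar_data/?sensor_id=3689&aggtime=1h&aggfunc=max
```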

Option 2

Currently we have the ch:dataHistory link to the raw data. We could add an additional link ch:aggregateData:

http://chain-api.media.mit.edu/scalar_data/aggregate/?sensor_id=3689{&aggtime}

where aggtime can be 1h, 1d, or 1w. The response would be a list of timestamped statistic objects, like:

{
  "dataType": "float",
  "data": [
    {
      "timestamp": "2017-05-08T11:20:08.81114496Z",
      "min": 5.3,
      "max": 8.1,
      "mean": 7.2,
      "count": 207
    },
    {
      "timestamp": "2017-05-08T11:20:38.968259072Z",
      "min": 8.1,
      "max": 12.0,
      "mean": 8.5,
      "count": 220
    }
  ],
  "_links": {
    "curies": [
      {
        "href": "http://chain-api.media.mit.edu/rels/{rel}",
        "name": "ch",
        "templated": true
      }
    ],
    "self": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494242399&timestamp__lt=1494263999"
    },
    "next": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494263999&timestamp__lt=1494285599",
      "title": "2017-05-08 17:19:59.137467+00:00 to 2017-05-08 23:19:59.137467+00:00"
    },
    "previous": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494220799&timestamp__lt=1494242399",
      "title": "2017-05-08 05:19:59.137467+00:00 to 2017-05-08 11:19:59.137467+00:00"
    }
  }
}
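To sketch how a client might consume a paginated response like this by following the HAL-style `next` links (a minimal sketch: `fetch` here is a stand-in for an HTTP GET that returns parsed JSON, not an actual chainclient.py API):

```python
def iter_aggregates(fetch, url):
    """Yield every statistic object across all pages, walking the
    _links.next chain until it runs out."""
    while url:
        page = fetch(url)
        for point in page["data"]:
            yield point
        url = page.get("_links", {}).get("next", {}).get("href")
```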

Feedback welcome.

References and General Info

Both of these options make the most sense when combined with adding user-supplied time ranges (#80)

We'll want to add support to chainclient.py: ssfrr/chainclient.py#4

When parsing these links on the server, we should just ignore anything the client leaves inside braces, so that template-unaware clients can keep using the links as opaque strings
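Server-side, that tolerance could be as simple as stripping any leftover template expressions before parsing the query string (a sketch, not the actual server code):

```python
import re

def strip_templates(link):
    # Drop any {...} template expressions a template-unaware client
    # left in, so the remainder parses as a plain URL.
    return re.sub(r"\{[^}]*\}", "", link)
```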

IETF RFC 6570 (URI Template) for templated links

Looks like there are a couple of Python libraries for dealing with templated links.

@ssfrr (Member) commented May 17, 2017

After meeting with @ellachao today:

  1. we'll go with option 2
  2. the URL will be more like:
    http://chain-api.media.mit.edu/aggregate_data/?sensor_id=3689{&aggtime}
    as it will probably be a separate resource (though clients shouldn't care about the URL)
  3. the aggtime parameter will be required
