
ChainAPI should support multiple resolutions of data for querying large timespans #28

Closed
nicklee-998 opened this issue Oct 6, 2014 · 6 comments · Fixed by #77 or #83

@nicklee-998 (Member)

It's currently slow for the client to fetch a month or even a year of data. Is it possible to speed up the request, and perhaps let clients choose how much detail they want to acquire?

@ssfrr ssfrr changed the title Hope chain-api can improve the speed of fetching large range of datas. ChainAPI should support multiple resolutions of data for querying large timespans Oct 9, 2014
@ssfrr (Member) commented Apr 18, 2017

Influx supports "continuous queries" (CQs) to maintain data aggregations. I think a reasonable approach is to maintain a few different resolutions of data in the database:

  • hour
  • day
  • week

For each of these aggregation periods we'll store a few different aggregations:

  • min
  • max
  • mean
  • count (the number of measurements in that time window)
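As a sketch of what one of these continuous queries might look like in InfluxQL — the database and measurement names ("chain", "sensordata", "value") are assumptions for illustration, not the actual ChainAPI schema:

```sql
-- Hypothetical 1-hour rollup CQ; names are illustrative only.
CREATE CONTINUOUS QUERY "sensordata_1h_cq" ON "chain"
BEGIN
  SELECT min("value"), max("value"), mean("value"), count("value")
    INTO "sensordata_1h"
    FROM "sensordata"
    GROUP BY time(1h), *
END
```

The trailing `*` in the GROUP BY preserves all tags, which is what would give per-sensor, per-metric grouping.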

@bmayton @d0nd3r3k @dramsay9 @slash6475 does this seem like it captures your use cases?

A few other random issues/ideas:

  • We want to aggregate on a per-metric, per-sensor basis. I think we can handle that with a GROUP BY clause in the CQ
  • It looks like there may have been some changes in backfilling behavior (whether a CQ applies to data already in the DB), so we should confirm that we can backfill on the latest InfluxDB
  • Because our aggregation periods can be nested (24 hours/day, 7 days/week) we should be able to pull each aggregate from the higher-resolution one rather than from the original data, but that implies that the CQs need to run in the right order (smallest-to-largest period). I'm not sure how to express that.
  • I think that the min/max/mean/count aggregates will be written as fields within sensordata_1h, sensordata_1d, etc. measurements, but I'm open to other architectures if they make more sense.
  • when aggregating the 1-week mean from the 1-day means, I'm thinking we should weight each of the 1-day means by their count, so we get the answer we would have gotten by averaging all the individual measurements. There's also an argument though that getting the weekly average from the average from each day might be more meaningful, even if there was an uneven number of measurements in each day.
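To make the count-weighting concrete, here's a minimal sketch of rolling daily aggregates up into a weekly one. The function and field names are illustrative, not ChainAPI code; the point is that weighting each daily mean by its count reproduces the mean over all underlying measurements.

```python
def combine_aggregates(days):
    """Roll a list of per-day aggregates (dicts with 'min', 'max',
    'mean', 'count') up into a single aggregate for the whole span."""
    total = sum(d["count"] for d in days)
    return {
        "min": min(d["min"] for d in days),
        "max": max(d["max"] for d in days),
        # count-weighted mean == the mean of all raw measurements
        "mean": sum(d["mean"] * d["count"] for d in days) / total,
        "count": total,
    }

daily = [
    {"min": 1.0, "max": 4.0, "mean": 2.0, "count": 10},
    {"min": 0.5, "max": 3.0, "mean": 3.0, "count": 30},
]
week = combine_aggregates(daily)
# week["mean"] is (2.0*10 + 3.0*30) / 40 = 2.75, not the unweighted 2.5
```

The unweighted alternative (averaging the daily means directly) would give 2.5 here; which answer is "right" depends on whether a sparsely-sampled day should count as much as a densely-sampled one.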

@ellachao does this seem pretty reasonable to you?

@slash6475 (Member)

That works perfectly for me.
Like you, I'm wondering about the backfilling behavior of CQs when we update past records.
Apparently there is an open issue about this topic, so we can hope it gets addressed in some hypothetical future :) or else??

@davidbramsay (Member) commented Apr 19, 2017 via email

@ssfrr (Member) commented Apr 19, 2017

@dramsay9 do you mean first/last timestamps for the aggregate timeslices?

@ssfrr ssfrr closed this as completed in #77 May 8, 2017
@ssfrr ssfrr reopened this May 8, 2017
@ssfrr (Member) commented May 8, 2017

@slash6475 see #78.

So we have the back-end stuff running now to store the data in 1h, 1d, and 1w aggregates, but we don't have an API for it yet. Here's what I'm thinking:

Option 1

Add optional URL parameters to the ch:dataHistory link in the Sensor resource, and have the data that gets returned look more or less the same as the raw data does now. Currently that link looks like:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689

In the new scheme it would be:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689{&aggtime,aggfunc}

where aggtime can be 1h, 1d, or 1w and aggfunc could be min, max, mean, or count. So an appropriate expansion of this would be:

http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&aggfunc=max

The data returned would be a list of timestamp/value pairs, just like the current raw data.
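A client would expand the `{&aggtime,aggfunc}` form-style continuation before making the request. Here's a minimal sketch of that expansion in Python (real clients would more likely use a URI Template library; this hand-rolled `expand` is illustrative only and handles just the `{&...}` form):

```python
import re

def expand(template, **params):
    """Expand {&name1,name2} form-style template expressions,
    dropping any variables the caller didn't supply."""
    def repl(match):
        names = match.group(1).split(",")
        pairs = ["%s=%s" % (n, params[n]) for n in names if n in params]
        return "&" + "&".join(pairs) if pairs else ""
    return re.sub(r"\{&([^}]+)\}", repl, template)

url = expand(
    "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689{&aggtime,aggfunc}",
    aggtime="1h", aggfunc="max",
)
# -> .../scalar_data/?sensor_id=3689&aggtime=1h&aggfunc=max
```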

Option 2

Currently we have the ch:dataHistory link to the raw data. We could add an additional link ch:aggregateData:

http://chain-api.media.mit.edu/scalar_data/aggregate/?sensor_id=3689{&aggtime}

where aggtime can be 1h, 1d, or 1w. The response would be a list of timestamped statistic objects, like:

{
  "dataType": "float",
  "data": [
    {
      "timestamp": "2017-05-08T11:20:08.81114496Z",
      "min": 5.3,
      "max": 8.1,
      "mean": 7.2,
      "count": 207
    },
    {
      "timestamp": "2017-05-08T11:20:38.968259072Z",
      "min": 8.1,
      "max": 12.0,
      "mean": 8.5,
      "count": 220
    }
  ],
  "_links": {
    "curies": [
      {
        "href": "http://chain-api.media.mit.edu/rels/{rel}",
        "name": "ch",
        "templated": true
      }
    ],
    "self": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494242399&timestamp__lt=1494263999"
    },
    "next": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494263999&timestamp__lt=1494285599",
      "title": "2017-05-08 17:19:59.137467+00:00 to 2017-05-08 23:19:59.137467+00:00"
    },
    "previous": {
      "href": "http://chain-api.media.mit.edu/scalar_data/?sensor_id=3689&aggtime=1h&timestamp__gte=1494220799&timestamp__lt=1494242399",
      "title": "2017-05-08 05:19:59.137467+00:00 to 2017-05-08 11:19:59.137467+00:00"
    }
  }
}
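To sketch how a client might consume a paginated response like this by following the HAL-style `next` links (a minimal sketch: `fetch` here is a stand-in for an HTTP GET that returns parsed JSON, not an actual chainclient.py API):

```python
def iter_aggregates(fetch, url):
    """Yield every statistic object across all pages, walking the
    _links.next chain until it runs out."""
    while url:
        page = fetch(url)
        for point in page["data"]:
            yield point
        url = page.get("_links", {}).get("next", {}).get("href")
```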

Feedback welcome.

References and General Info

Both of these options make the most sense when combined with adding user-supplied time ranges (#80)

We'll want to add support to chainclient.py: ssfrr/chainclient.py#4

When parsing these links on the server, we should just ignore anything the client leaves inside braces, so that template-unaware clients can keep using the links as opaque strings
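Server-side, that tolerance could be as simple as stripping any leftover template expressions before parsing the query string (a sketch, not the actual server code):

```python
import re

def strip_templates(link):
    # Drop any {...} template expressions a template-unaware client
    # left in, so the remainder parses as a plain URL.
    return re.sub(r"\{[^}]*\}", "", link)
```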

IETF RFC 6570 (URI Template) for templated links

Looks like there are a couple of Python libraries for dealing with templated links.

@ssfrr (Member) commented May 17, 2017

After meeting with @ellachao today:

  1. we'll go with option 2
  2. the URL will be more like:
    http://chain-api.media.mit.edu/aggregate_data/?sensor_id=3689{&aggtime}
    as it will probably be a separate resource (though clients shouldn't care about the URL)
  3. the aggtime parameter will be required
