
I have several big time-series tables with a lot of NULLs (each table may have up to 300 columns), for example:

Time-series table

time                |   a     | b        | c       | d
--------------------+---------+----------+---------+---------
2016-05-15 00:08:22 |         |          |         |         
2016-05-15 13:50:56 |         |          | 26.8301 |
2016-05-15 01:41:58 |         |          |         |            
2016-05-15 00:01:37 |         |          |         |            
2016-05-15 01:45:18 |         |          |         |         
2016-05-15 13:45:32 |         |          | 26.9688 |
2016-05-15 00:01:48 |         |          |         |         
2016-05-15 13:47:56 |         |          |         | 27.1269
2016-05-15 00:01:22 |         |          |         |            
2016-05-15 13:35:36 | 26.7441 | 29.8398  |         | 26.9981
2016-05-15 00:08:53 |         |          |         |         
2016-05-15 00:08:30 |         |          |         |         
2016-05-15 13:14:59 |         |          |         |         
2016-05-15 13:33:36 | 27.4277 | 29.7695  |         |                            
2016-05-15 13:36:36 | 27.4688 | 29.6836  |         |            
2016-05-15 13:37:36 | 27.1016 | 29.8516  |         |            

I want to optimize queries that find the time of the first and last non-null value in every column, e.g.:

select MIN(time), MAX(time) from TS where a is not null

(Those queries can run for several minutes)

I plan to create a metadata table holding column names and pointing to the first and last timestamp:

Metadata table

col_name | first_time          | last_time
---------+---------------------+--------------------
a        | 2016-05-15 13:33:36 | 2016-05-15 13:37:36
b        | 2016-05-15 13:33:36 | 2016-05-15 13:37:36
c        | 2016-05-15 13:45:32 | 2016-05-15 13:50:56
d        | 2016-05-15 13:35:36 | 2016-05-15 13:47:56

This way no NULL search has to happen at query time; I just read the first and last timestamps straight from the metadata table.
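For example, the lookup for column a would then become a single-row read from the metadata table instead of a scan of TS:

select first_time, last_time from metadata where col_name = 'a';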

But I want to avoid having to update the metadata table on every modification of the time-series data. Instead I want to create a generic trigger function which will update the first_time and last_time columns of the metadata table on every INSERT, UPDATE or DELETE on the time-series table. The trigger function should compare the existing timestamps in the metadata table against the inserted / deleted rows.

Any idea if it's possible to create a generic trigger function which does not hard-code the exact column names of the time-series table?

Thanks

  • Rather try and put indexes on (a asc, time asc) and (a asc, time desc) (and the same for b, c and d). If you want you can then go for a "metadata" view. Better try to avoid creating redundancy. Commented Nov 15, 2019 at 0:28
  • Are you referring to a "materialized" view which will be refreshed periodically? I don't see any optimization in a standard view.... Commented Nov 15, 2019 at 0:43
  • No. Just a "normal" view. The optimization is the indexes. Commented Nov 15, 2019 at 0:44
  • I haven't mentioned it initially, but I can have up to 300 columns in the TS table. I suspect that having 300 indexes will affect INSERT performance more than trigger functions.... Commented Nov 15, 2019 at 1:00
  • I also would expect indexes to be slower than the trigger, because the trigger only needs to update a row when a new min or max value is inserted, while the indexes need to be updated for every value inserted. BTW, you might be tempted to write dynamic code in your trigger that loops over all columns, but in my experience it is better to write a script that generates the trigger with specific code for each column. Commented Nov 16, 2019 at 12:24

3 Answers


Creating a dynamic query in a trigger function is possible; see this example from how-to-implement-dynamic-sql-in-postgresql-10:

CREATE OR REPLACE FUNCTION car_portal_app.get_account (predicate TEXT)
RETURNS SETOF car_portal_app.account AS
$$
BEGIN
RETURN QUERY EXECUTE 'SELECT * FROM car_portal_app.account WHERE ' || predicate;
END;
$$ LANGUAGE plpgsql;

The format() function is also helpful for building the query string.

You can implement a trigger that fires once per statement (not for every row); the Postgres docs have a great example: see "Example 43.7. Auditing with Transition Tables" in 43.10. Trigger Functions.

This will work great for inserts.
But when the row holding a column's min/max is updated or deleted, you must check all rows again to find the new min/max, and if that takes several minutes it should not be done in the trigger.
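To make this concrete, here is a minimal sketch of such a statement-level trigger for inserts only, with just two of the columns spelled out (it assumes PostgreSQL 10+, the ts and metadata names from the question, and a unique constraint on metadata.col_name):

CREATE OR REPLACE FUNCTION ts_update_metadata() RETURNS trigger AS
$$
BEGIN
    -- new_rows is the transition table holding every row inserted by the statement
    INSERT INTO metadata (col_name, first_time, last_time)
    SELECT v.col_name, v.first_time, v.last_time
    FROM (
        SELECT 'a', min(time) FILTER (WHERE a IS NOT NULL),
                    max(time) FILTER (WHERE a IS NOT NULL) FROM new_rows
        UNION ALL
        SELECT 'b', min(time) FILTER (WHERE b IS NOT NULL),
                    max(time) FILTER (WHERE b IS NOT NULL) FROM new_rows
        -- ... one block per column ...
    ) AS v(col_name, first_time, last_time)
    WHERE v.first_time IS NOT NULL
    ON CONFLICT (col_name) DO UPDATE
        SET first_time = least(metadata.first_time, EXCLUDED.first_time),
            last_time  = greatest(metadata.last_time, EXCLUDED.last_time);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ts_metadata_trg
    AFTER INSERT ON ts
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE PROCEDURE ts_update_metadata();

For 300 columns, the per-column blocks would have to be generated, either with format() and EXECUTE inside the function or by an external script, as suggested in the comments.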


4 Comments

Sorry if this was not clear - the question was how to write a trigger function that updates the min/max timestamps in a separate table, based on the timestamps available in the rows to insert / delete (and not just how to write a generic trigger function).
@Miro: copied directly from your question: Any idea if it's possible to create a generic Trigger Function.. :D
Maybe I understood your answer incorrectly, or you misunderstood the question..... I don't see any relation between the function you published and the question asked. I am not sure why the trigger would run for several minutes, as it is supposed to go over the inserted data only. Yet, "Transition Tables" is an interesting approach.....
@Miro my answer is only meant to guide you in the right direction, not to provide the exact solution. Using the format function, you can build a dynamic query that can do anything. You can get the column names of your tables from the information_schema. The "several minutes" part only refers to update/delete. For inserts it will be okay.
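As a side note, a sketch of how those column names could be pulled from the information_schema, e.g. to generate the per-column code (table name ts and schema public are assumptions here):

select column_name
from information_schema.columns
where table_schema = 'public'
  and table_name = 'ts'
  and column_name <> 'time'
order by ordinal_position;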

It might be best to do this using multiple columns:

select min(time) filter (where a is not null) as a_min,
       max(time) filter (where a is not null) as a_max,
       min(time) filter (where b is not null) as b_min,
       max(time) filter (where b is not null) as b_max,
       min(time) filter (where c is not null) as c_min,
       max(time) filter (where c is not null) as c_max,
       min(time) filter (where d is not null) as d_min,
       max(time) filter (where d is not null) as d_max
from t;

You can then unpivot after this step:

select v.*
from (select min(time) filter (where a is not null) as a_min,
             max(time) filter (where a is not null) as a_max,
             min(time) filter (where b is not null) as b_min,
             max(time) filter (where b is not null) as b_max,
             min(time) filter (where c is not null) as c_min,
             max(time) filter (where c is not null) as c_max,
             min(time) filter (where d is not null) as d_min,
             max(time) filter (where d is not null) as d_max
      from t
     ) x cross join lateral
     (values ('a', x.a_min, x.a_max),
             ('b', x.b_min, x.b_max),
             ('c', x.c_min, x.c_max),
             ('d', x.d_min, x.d_max)
     ) v(which, min_val, max_val);

Instead of creating a trigger, I would opt for indexes, which can be used with GMB's approach.

3 Comments

Maybe even create partial indices?
What do you mean by partial indices in this case? Thanks
@Miro . . . I think partial indexes on the time column where each of the other four columns is not null (four separate indexes).
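As a sketch, such a partial index for column a could look like this (the index name is illustrative; one index per column):

create index ts_time_a_notnull_idx on ts (time) where a is not null;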

You can unpivot with union all. I would suggest using a view instead of using a trigger. This has the advantage of being much more flexible, simpler to maintain, and of not slowing down your DML statements:

create view metadata_view as
select 'a' col_name, min(time) first_time, max(time) last_time from ts where a is not null
union all select 'b', min(time), max(time) from ts where b is not null
union all select 'c', min(time), max(time) from ts where c is not null
union all select 'd', min(time), max(time) from ts where d is not null

For performance, you want the following indexes:

ts(a, time)
ts(b, time)
ts(c, time)
ts(d, time)
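Spelled out as DDL that would be, for example (index names are illustrative):

create index ts_a_time_idx on ts (a, time);
create index ts_b_time_idx on ts (b, time);
create index ts_c_time_idx on ts (c, time);
create index ts_d_time_idx on ts (d, time);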

8 Comments

Those a, b, c, d in the list of projected columns should rather be in quotes ('s), if I got the OP's intent right. As it stands, that query would throw an error because they are not in a GROUP BY.
I am trying to understand the idea - a query like select 'b', min(time), max(time) from ts where b is not null can run for several minutes. As far as I understand, having a standard view on top of those queries will not create any optimization when I am trying to find the first and last value quickly. So you are suggesting to create a materialized view?
@Miro: ok, you did not initially mention that these queries were slow. Before going for a materialized view, please ensure that you have the indexes that I just added to my answer.
@Miro And: before worrying about performance, maybe reconsider your database design?
What if I have 300 columns? Are you sure that insert performance will still be reasonable with 300 indexes? I am working with Time-Series data, so I am not sure how to reconsider DB design.....
