17

I have a few tables with about 17M rows that all have a date column I would like to be able to utilize frequently for searches. I am considering either just throwing an index on the column and see how things go or sorting the items by date as a one time operation and then inserting everything into a new table so that the primary key ascends as the date ascends.

Since these are both pretty time consuming I thought it might be worth it to ask here first for input.

The end goal is for me to load sql queries into pandas for some analysis if that is relevant here.

1
  • Just create it concurrent the first time so that even go it takes a while it's not going to hinder perf. Commented Oct 29, 2016 at 18:55

2 Answers 2

24

The index on a date column makes sense when you are going to search the table for a given date(s), e.g.:

select * from test
where the_date = '2016-01-01';
-- or

select * from test
where the_date between '2016-01-01' and '2016-01-31';
-- etc

In these queries there is no matter whether the sort order of primary key and the date column are the same or not. Hence rewriting the data to the new table will be useless. Just create an index.

However, if you are going to use the index only in ORDER BY:

select * from test
order by the_date;

then a primary key integer index may be significantly (2-4 times) faster then an index on a date column.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks. I was under the impression adding an index to the date field would speed up orderings as well so this cleared that up for me. On the other hand it seems like having the primary keys increase in tandem with the date field would be more versatile in that I get order by performance as well as the ability to indirectly query date ranges as long as I have a general idea which IDs map to say the first day in a month.
Well, such additional artificial indexing is possible, but it seems too complicated and impractical. In fact, an ordinary index significantly improves performance and should be sufficient.
But what if I use queries for deleting like WHERE date_added > now() - '1 month'? Is index useful in this case? Thnx
@Sonique - yes, the index may be used by the planner in this case.
10

Postgres supports to some extend clustered indexes, which is what you suggest by removing and reinserting the data.

In fact, removing and reinserting the data in the order you want will not change the time the query takes. Postgres does not know the order of the data.

If you know that the table's data does not change. Then cluster the data based on the index you create.

This operation reorders the table based on the order in the index. It is very effective until you update the table. The syntax is:

CLUSTER tableName USING IndexName;

See the manual for details.

I also recommend you use

explain <query>;

to compare two queries, before and after an index. Or before and after clustering.

1 Comment

Just for extra context, clustering only affects the efficiency of disk IO when reading rows. Postgres doesn't actually create a "clustered index" that's used when making query plan decisions even if you never change the table after running cluster

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.