I have a postgres table ("dist_mx") that indicates the distances between two points (geographic space). The points are defined in the "hex_0" and "hex_1" columns. The table will eventually be 10^7 to 10^8 rows. The table is structured as such:

(image of the table not reproduced here)

One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery store (we know how each grocery store corresponds to point ids).

I'm using a UNION statement to run the query. The OR condition is used because the order of the points is arbitrary (i.e., pairs aren't repeated in reverse order). See below:

SELECT MIN(distances) FROM dist_mx
WHERE (point_id_0 = '8829abb139fffff' AND point_id_1 IN ('8829abb555fffff', ...))
   OR (point_id_1 = '8829abb139fffff' AND point_id_0 IN ('8829abb555fffff', ...))
UNION
SELECT MIN(distances) FROM dist_mx
WHERE (point_id_0 = '8829abb469fffff' AND point_id_1 IN ('8829abb555fffff', ...))
   OR (point_id_1 = '8829abb469fffff' AND point_id_0 IN ('8829abb555fffff', ...))
...

The query seems to be working as intended but it is slow. It takes 20 minutes for the query to run on a list of ~4500 points. I have tried chunking the query so I only include 500 queries at a time (i.e., connected by the UNION statement), but this does not significantly change performance.

I'm relatively new to postgres so I am hoping that there is a fairly simple speedup (or a not fairly simple speedup)?

EDIT: adding schema (image not reproduced here)

  • Can you clarify what the ellipses really are in your query? What is the real query that you are working with? Also, it might be better to include hex_0 and hex_1 in the sample data (since they are important to the query). And you should say what the datatypes are of those values that appear to be hexadecimal (are they string data, binary data, numeric data?). Commented Aug 31, 2022 at 16:45
  • Please show us your schema and its indexes. \d dist_mx in psql will do it. Commented Aug 31, 2022 at 16:45
  • @jtam You're likely prematurely optimizing, and you haven't done the basic optimizations (i.e., indexes). Try it with PostGIS and see if it's fast enough. If you need to precalculate, do it with PostGIS. Commented Aug 31, 2022 at 17:27
  • "The ellipses just indicate that there is a long list of point_ids inside the parentheses. In this case, there are ~500 but could be 1000s in future queries." Are you writing these queries by hand (or copy-pasting)? Massive IN clauses suggest the data should be in a table (select * from table1 join table2). I suspect there are performance gains there too. Commented Aug 31, 2022 at 20:41
  • Ah. In MSSQL I would pass in the list as a table-valued parameter. I don't know if there is an equivalent in Postgres. A long list of values in an IN clause does not sound performant to me (the execution plan will probably not be optimal). I generally agree with all the other answers that suggest making sure you have the right indexes. I believe you would need two indexes: one for (point_0, point_1), and another for (point_1, point_0). Commented Sep 1, 2022 at 13:40

2 Answers

1

Without seeing the EXPLAIN ANALYZE output for your query, and also the whole query, I can't give specific advice. There's also probably a better way to write your query, but it's unclear what you're doing.
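For reference, a minimal sketch of how that output could be captured for a single branch of the query. The IN lists are shortened to one ID taken from the question purely for illustration; the real lists are elided in the question and are not reproduced here.

```sql
-- Runs the statement for real and reports the actual plan, timings, and buffer usage.
-- The single-element IN lists are placeholders for the real (elided) ID lists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT MIN(distances) FROM dist_mx
WHERE (point_id_0 = '8829abb139fffff' AND point_id_1 IN ('8829abb555fffff'))
   OR (point_id_1 = '8829abb139fffff' AND point_id_0 IN ('8829abb555fffff'));
```

A "Seq Scan on dist_mx" node in the output would indicate that the whole table is being read for that branch.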

Here's some general advice.


The basic performance tool is indexes. Without indexes, Postgres must scan the whole table, probably repeatedly. See Use The Index, Luke for more.

A multi-column index on (point_id_0, point_id_1) will allow Postgres to quickly find the matching rows without having to scan the whole table.

create index dist_mx_points_idx on dist_mx(point_id_0, point_id_1);

That should help significantly.
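Beyond the index, the comments suggest putting the long ID lists into tables and joining, which would replace the thousands of UNION branches with one statement. A rough, untested sketch of that idea follows. It assumes the point IDs are stored as text, and the temp tables origins and targets, the reverse index, and all their names are hypothetical, not part of the original schema.

```sql
-- Reverse-order index, as suggested in the comments (name is illustrative).
CREATE INDEX dist_mx_points_rev_idx ON dist_mx (point_id_1, point_id_0);

-- Hypothetical helper tables holding the two ID lists.
CREATE TEMP TABLE origins (point_id text PRIMARY KEY);  -- the query points
CREATE TEMP TABLE targets (point_id text PRIMARY KEY);  -- e.g. grocery store points
-- Load both with INSERT or COPY, then refresh planner statistics:
ANALYZE origins;
ANALYZE targets;

-- Pairs are stored in arbitrary order, so look in both directions,
-- then take the minimum per origin point.
SELECT point_id, MIN(distances) AS min_distance
FROM (
    SELECT o.point_id, d.distances
    FROM origins o
    JOIN dist_mx d ON d.point_id_0 = o.point_id
    JOIN targets t ON t.point_id = d.point_id_1
    UNION ALL
    SELECT o.point_id, d.distances
    FROM origins o
    JOIN dist_mx d ON d.point_id_1 = o.point_id
    JOIN targets t ON t.point_id = d.point_id_0
) AS both_directions
GROUP BY point_id;
```

Unlike the original query, this returns the origin point next to each minimum, and origins with no matching pair simply do not appear in the result.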

One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery store (we know how each grocery store corresponds to point ids).

Use PostGIS.


Other notes.

  • Don't store hex as a string, store it as a bigint and convert. This will take less space and is faster.
  • Don't store numbers as text, use an integer.
  • Don't store your points as two columns, use a single point column. Then you can use geometric operators. However, these are 2D calculations and only accurate for GIS over short distances.
  • Since you're doing GIS, don't do this by hand. Use PostGIS.
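As an illustration of the first note, here is one common idiom for converting a hex string to a bigint and back. It is a sketch that assumes every ID has at most 16 hex digits (so it fits in 64 bits), and it relies on the text-to-bit cast accepting an 'x' prefix, which is widely used but, as far as I know, not formally documented.

```sql
-- Hex text -> bigint: left-pad to 16 digits, read as a 64-bit string, cast to bigint.
SELECT ('x' || lpad('8829abb139fffff', 16, '0'))::bit(64)::bigint AS point_id_int;

-- bigint -> hex text, e.g. for display or for matching against external data.
SELECT to_hex(
    ('x' || lpad('8829abb139fffff', 16, '0'))::bit(64)::bigint
) AS point_id_hex;
```

Note that to_hex does not pad its result, so IDs with leading zeros would need lpad again on the way back.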

1 Comment

Thanks for this. I will try indexing the columns. The distances are actually driving distances, so it looks like I will have to use pgRouting if I were going to do the calculations in Postgres. I'm worried the calculations will take too long for my needs, but I'll test it out.
-1

Check out this gem for faster query execution:

https://rubygems.org/gems/pg_query_optimizer

It enables parallel execution for the query and improves performance by 50% compared to the original query execution.

Let's say your query takes around 20 ms after all the optimization; by using this gem, you can reduce the query time to around 8-9 ms.

Blog post to refer for its usage: https://medium.com/@shanmugamjanarthan24/improve-performance-of-your-postgresql-queries-with-pg-query-optimizer-1703da97356e
