
I have the following select, which, on a large database, is slow:

SELECT eventid 
FROM track_event 
WHERE inboundid IN (SELECT messageid FROM temp_message);

The temp_message table is small (100 rows) and has only one column (messageid varchar), with a btree index on that column.
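For reference, the temp_message table as described would look roughly like this (a sketch; the column type comes from the question and the index name from the plan below):

CREATE TABLE temp_message (
    messageid varchar
);
CREATE INDEX temp_message_idx ON temp_message (messageid);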

The track_event table has 19 columns and nearly 13 million rows. The columns used in this query (eventid bigint and inboundid varchar) both have btree indexes.

I can't copy/paste the explain plan from the big database, but here's the plan from a smaller database (only 348 rows in track_event) with the same schema:

 explain analyse SELECT eventid FROM track_event WHERE inboundid IN (SELECT messageid FROM temp_message);
                                                           QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..60.78 rows=348 width=8) (actual time=0.033..3.186 rows=348 loops=1)
   ->  Seq Scan on track_event  (cost=0.00..8.48 rows=348 width=25) (actual time=0.012..0.860 rows=348 loops=1)
   ->  Index Scan using temp_message_idx on temp_message  (cost=0.00..0.48 rows=7 width=32) (actual time=0.005..0.005 rows=1 loops=348)
         Index Cond: ((temp_message.messageid)::text = (track_event.inboundid)::text)
 Total runtime: 3.349 ms
(5 rows)

On the large database, this query takes about 450 seconds. Can anyone see any obvious speed-ups? I notice there's a Seq Scan on track_event in the explain plan - I think I'd like to lose that, but cannot work out which index I could use instead.

EDITS

Postgres 9.0

The track_event table is part of a very large, complicated schema which I can't make significant changes to. Here's the information, including a new index I just added:

            Table "public.track_event"
       Column       |           Type           | Modifiers 
--------------------+--------------------------+-----------
 eventid            | bigint                   | not null
 messageid          | character varying        | not null
 inboundid          | character varying        | not null
 newid              | character varying        | 
 parenteventid      | bigint                   | 
 pmmuser            | bigint                   | 
 eventdate          | timestamp with time zone | not null
 routeid            | integer                  | 
 eventtypeid        | integer                  | not null
 adminid            | integer                  | 
 hostid             | integer                  | 
 reason             | character varying        | 
 expiry             | integer                  | 
 encryptionendpoint | character varying        | 
 encryptionerror    | character varying        | 
 encryptiontype     | character varying        | 
 tlsused            | integer                  | 
 tlsrequested       | integer                  | 
 encryptionportal   | integer                  | 
Indexes:
    "track_event_pk" PRIMARY KEY, btree (eventid)
    "foo" btree (inboundid, eventid)
    "px_event_inboundid" btree (inboundid)
    "track_event_idx" btree (messageid, eventtypeid)
Foreign-key constraints:
    "track_event_parent_fk" FOREIGN KEY (parenteventid) REFERENCES track_event(eventid)
    "track_event_pmi_route_fk" FOREIGN KEY (routeid) REFERENCES pmi_route(routeid)
    "track_event_pmim_smtpaddress_fk" FOREIGN KEY (pmmuser) REFERENCES pmim_smtpaddress(smtpaddressid)
    "track_event_track_adminuser_fk" FOREIGN KEY (adminid) REFERENCES track_adminuser(adminid)
    "track_event_track_encryptionportal_fk" FOREIGN KEY (encryptionportal) REFERENCES track_encryptionportal(id)
    "track_event_track_eventtype_fk" FOREIGN KEY (eventtypeid) REFERENCES track_eventtype(eventtypeid)
    "track_event_track_host_fk" FOREIGN KEY (hostid) REFERENCES track_host(hostid)
    "track_event_track_message_fk" FOREIGN KEY (inboundid) REFERENCES track_message(messageid)
Referenced by:
    TABLE "track_event" CONSTRAINT "track_event_parent_fk" FOREIGN KEY (parenteventid) REFERENCES track_event(eventid)
    TABLE "track_eventaddress" CONSTRAINT "track_eventaddress_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
    TABLE "track_eventattachment" CONSTRAINT "track_eventattachment_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
    TABLE "track_eventrule" CONSTRAINT "track_eventrule_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
    TABLE "track_eventthreatdescription" CONSTRAINT "track_eventthreatdescription_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
    TABLE "track_eventthreattype" CONSTRAINT "track_eventthreattype_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
    TABLE "track_quarantineevent" CONSTRAINT "track_quarantineevent_track_event_fk" FOREIGN KEY (eventid) REFERENCES track_event(eventid)
  • 1) Is there any reason for messageid to be a varchar type? 2) Maybe add a surrogate primary key, use that as a FK, and add a unique constraint on the text column? 3) Maybe prefer EXISTS (...) to IN (...)? (though the plan already shows an index scan) 0) Please add the table definitions to the question. 0a) And the tunables (random_page_cost, work_mem, ...)? Commented Aug 29, 2013 at 10:55
  • BTW: the cast ((temp_message.messageid)::text = (track_event.inboundid)::text) on the index condition is very suspect; I cannot reproduce it here (pg9.3beta: I only get hash joins, even for varchar keys). Postgres version? TABLE definitions? Commented Aug 29, 2013 at 12:11
  • As joop said, add your Postgres version. No answer should be given without the plan and the version. Commented Aug 29, 2013 at 12:30
  • Solved! Some of the answers were quite useful, although none quite hit the nail on the head. I was running into problems because the test DB was so small, so the queries behaved differently between the two DBs, causing some confusion. In the end, running ANALYZE on the track_event table got it to use the index, and a query of several minutes dropped down to milliseconds (a sketch of this fix follows these comments). Commented Aug 29, 2013 at 19:29
  • Next time please post the real query, with the real schema and the real data, and the real tuning. That would save some real people a lot of guesswork. Thank you. Commented Aug 29, 2013 at 19:34
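
For reference, a minimal sketch of the fix described in the "Solved!" comment above: ANALYZE refreshes the planner statistics, after which the index on inboundid gets picked up.

ANALYZE track_event;

-- re-check the plan; it should no longer seq-scan track_event
EXPLAIN ANALYZE
SELECT eventid
FROM track_event
WHERE inboundid IN (SELECT messageid FROM temp_message);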

5 Answers


Your query is doing a full table scan on the larger table. An obvious speed-up is to add an index on track_event(inboundid, eventid). Postgres should be able to use that index on your query as written. You can also rewrite the query as:

SELECT te.eventid
FROM track_event te join
     temp_message tm
     on te.inboundid  = tm.messageid;

which should definitely use the index. (You might need select distinct te.eventid if there are duplicates in the temp_message table.)
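
For completeness, that index could be created like this (a sketch; the index name here is made up, and the question's later edit shows an equivalent index named "foo"):

-- covering index: filter on inboundid, read eventid straight from the index
CREATE INDEX track_event_inboundid_eventid_idx
    ON track_event (inboundid, eventid);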

EDIT:

The last attempted rewrite is to invert the query:

select (select eventid from track_event te WHERE tm.messageid = te.inboundid) as eventid
from temp_message tm;

This should force the use of the index. If there are non-matches, you might want:

select eventid
from (select (select eventid from track_event te WHERE tm.messageid = te.inboundid) as eventid
      from temp_message tm
     ) tm
where eventid is not null;

9 Comments

There's still a Seq Scan on track_event
Could be that the statistics are wrong or absent. Or that the number of expected rows is lower than 10 percent (IIRC). Or the DBMS tuning constants are wrong.
Query plans may be different depending on the size of the tables. You should test it on a large table.
Tried the modified statements. It chokes because the nested select returns multiple results: ERROR: more than one row returned by a subquery used as an expression
The nested query should have used an EXISTS (...) instead.
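
For reference, an EXISTS-based form of the original query might look like this (a sketch only; whether the planner prefers it over IN depends on the version and statistics):

SELECT te.eventid
FROM track_event te
WHERE EXISTS (
    SELECT 1
    FROM temp_message tm
    WHERE tm.messageid = te.inboundid
);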

Try this technique:

SELECT eventid FROM track_event te WHERE inboundid IN (SELECT messageid FROM temp_message where messageid = te.inboundid);   

Or you can use the following query, which may give better results:

SELECT eventid FROM track_event te WHERE (SELECT count(*) FROM temp_message WHERE messageid = te.inboundid) > 0;



It depends on the number of records in the specific tables. If the database is small, its contents are mostly static, and rows are rarely added, then a JOIN is better than IN with a WHERE clause, because after the join the two tables behave as a single table and, on small tables, the join takes microseconds. WHERE and IN have their own execution cost: they remain the better choice on a large database, where they return results quickly, whereas on a small database an IN subquery can take more time.

For Small Database

SELECT t1.column_name, t1.column_name, t2.column_name, t2.column_name
FROM tbl1 t1
INNER JOIN tbl2 t2
  ON t1.column_name = t2.column_name;

For Large Database

SELECT column_name, column_name FROM tbl1 t1 WHERE t1.column_name IN (SELECT column_name FROM tbl2 t2 WHERE t2.column_name = t1.column_name);



It could be that NULLs are not indexed, so if track_event.inboundid (and maybe temp_message.messageid) can be NULL, then the planner can't build an access plan that avoids a scan.

Even with an index there's no guarantee that using it is the best plan if it's not selective.

What are the stats on the track_event table/indexes?
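
For example, those statistics can be inspected with something like the following (a sketch using standard PostgreSQL catalog views; table and column names taken from the question):

-- when the table was last (auto)analyzed and its current row estimate
SELECT relname, last_analyze, last_autoanalyze, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'track_event';

-- per-column statistics the planner relies on
SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'track_event'
  AND attname IN ('inboundid', 'eventid');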



The issue could be the result of (mis)tuning.

I am able to reproduce the behaviour by disabling hashjoin and sorting. With large enough queries, the same behaviour could probably be invoked on larger input data sets once the work_mem limit is exceeded.

The settings below reproduce the OP's query plan here (PG9.3beta):

-- SET work_mem = 64 ;
-- SET enable_material = 0; -- only needed for pg9.3 ?
SET enable_hashjoin = 0 ;
SET enable_sort = 0 ;

effective_cache_size and random_page_cost seem to have no influence (yet).

BTW: the question says (only 348 rows in track_event), which implies that a seqscan will be superior to anything else. The plan expects 348 rows: there is no selectivity to be gained by using an index (the main table is still needed to fetch the eventid value).

So, my guess is that the problem (on the production DB) could probably be solved by setting work_mem and effective_cache_size to usable values. This will of course only work if the index is selective; for a query that retrieves 100% of the rows, an index will barely be useful.
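
For illustration, checking and adjusting those tunables could look like this (a sketch; the values are placeholders, not recommendations, and depend entirely on the machine's RAM):

-- inspect the current settings
SHOW work_mem;
SHOW effective_cache_size;
SHOW random_page_cost;

-- example per-session overrides with placeholder values
SET work_mem = '64MB';
SET effective_cache_size = '4GB';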

