postgresql two nested joins and arrays in join

Question

I've never had to ask an SO question before, I always eventually stumble upon what I'm looking for through searching, but there's a lot going on here and I've gotten to the end of my searching prowess... Thanks for your help / suggestions.

The problem

I need to generate reports on the fly about how many 'students' in my organization have taken each 'assessment', and the standards associated with each 'assessment'.

Student totals are calculated ahead of time and stored in assessments_report
- The report row needs to be joined with the "assessment" to display info
Standards have sub-standards, but assessments only know about the sub-standard.
Assessments have multiple sub-standards and store them in an array of ids.

I'm trying to do this in the most performant way that minimizes the load on Postgres but I don't believe I've achieved that goal...

The Setup

Server: Amazon EC2 running Ubuntu

The App (Server): Node.js (0.10.26)

The App (Client): Angular.js (1.2.13)

Database (Queries): PostgreSQL (9.3.1)

Database (Cache): MongoDB (Hope to cache report in future)

The Tables

SQL FIDDLE

assessments

_id                int4
description        text
category           varchar(60)
sub_standards      int4 ARRAY
...

assessments_report (pre-calculated sums)

assessment_id(FK)  int4
client_id(FK)      int4
students           int4
completed          int4
incomplete         int4
...

sub_standards

_id
standard_id(FK)    int4
description        varchar(255)
...

standards

_id                int4
name               varchar(60)
description        varchar(255)
...

The Query

//Stored as array for readability in Node-Postgres use
SELECT r.*, as.* FROM assessments_report r
INNER JOIN (
  SELECT a._id AS assessment_id, a.description, a.category, a.states,
  array_to_json(array_agg(ss.*)) AS standards
  FROM assessments a LEFT JOIN (
    SELECT ss.*, s.name AS parent_name, s.description AS parent_description
    FROM sub_standards ss
    INNER JOIN standards s ON ss.standard_id = s._id
  ) ss ON ss._id = ANY (a.sub_standards) GROUP BY a._id
) as
ON as.assessment_id = r.assessment_id
WHERE r.client_id = $1

The Desired Output For Each Row (shown as JSON)

assessment_id: 2,
students: 2,
complete: 1
incomplete: 1,
description: "...",
category: "...",
states: ["AL","AZ",...],
standards: [
  {
    _id: 1,
    standard_id: 3,
    description: "...",
    parent_name: "...",
    parent_description: "..."
  },
  {
    _id: 2,
    standard_id: 4,
    description: "...",
    parent_name: "...",
    parent_description: "..."
  },
]

SQL FIDDLE

To usefully help optimize the query, usable schema + sample data (ideally a sqlfiddle.com) would be strongly preferable. You should also show explain analyze output for your current query. — Craig Ringer
– Craig Ringer, Commented Apr 16, 2014 at 23:48
@CraigRinger Thanks for the call out! I've created the fiddle and added it to the question. In it I got a little more specific about my setup and future improvements/changes, let me know if this helps. — joel.software
– joel.software, Commented Apr 17, 2014 at 1:52

Kirk Roybal · Accepted Answer · 2014-04-24 00:34:54Z

EXPLAIN of your query produces a max cost of 537. You can take that down to 59 (factor 10 improvement) by just creating an index:

CREATE INDEX idx_sstandard_sid ON sub_standards (standard_id);

But this is even faster with a cost of 21:

WITH r AS (SELECT * FROM assessments_report WHERE client_id = $1)
, ss AS (SELECT r.*, a.description, a.category, a.states , a.sub_standards
    FROM r INNER JOIN assessments a ON a._id = r.assessment_id)
, ss3 AS (SELECT ss2.*, s.name parent_name, s.description parent_description
            FROM sub_standards ss2
            INNER JOIN standards s
            ON ss2.standard_id = s._id)
, ss1 AS (SELECT ss.*, array_to_json(array_agg(ss3.*)) standards
    FROM ss
    LEFT JOIN ss3
    ON ss3._id = ANY (ss.sub_standards)
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14 ) 
SELECT _id, client_id, region_id, group_id, assessment_id, students, basic, 
    proficient, advanced, in_progress, description, category, states, standards
FROM ss1;

You could probably get rid of the grouping and go somewhat faster than this.

The basic idea is to start with the most limiting constraint (client_id), expand that by the next criteria (sub_standards) and never expand the recordset any more than necessary after that. Lookups against primary keys are faster than complex joins. Hope this helps.

Final query plan:

CTE Scan on ss1  (cost=21.79..21.81 rows=1 width=434)
CTE r
->  Seq Scan on assessments_report  (cost=0.00..1.06 rows=1 width=40)
      Filter: (client_id = 4)
CTE ss
->  Nested Loop  (cost=0.14..8.19 rows=1 width=434)
      ->  CTE Scan on r  (cost=0.00..0.02 rows=1 width=40)
      ->  Index Scan using assesments_pkey on assessments a  (cost=0.14..8.16 rows=1 width=398)
            Index Cond: (_id = r.assessment_id)
CTE ss3
->  Hash Join  (cost=1.07..12.34 rows=3 width=1346)
      Hash Cond: (s._id = ss2.standard_id)
      ->  Seq Scan on standards s  (cost=0.00..10.90 rows=90 width=818)
      ->  Hash  (cost=1.03..1.03 rows=3 width=532)
            ->  Seq Scan on sub_standards ss2  (cost=0.00..1.03 rows=3 width=532)
CTE ss1
->  HashAggregate  (cost=0.19..0.20 rows=1 width=462)
      ->  Nested Loop Left Join  (cost=0.00..0.15 rows=1 width=462)
            Join Filter: (ss3._id = ANY (ss.sub_standards))
            ->  CTE Scan on ss  (cost=0.00..0.02 rows=1 width=434)
            ->  CTE Scan on ss3  (cost=0.00..0.06 rows=3 width=32)

Thanks for the helpful answer! The query has evolved a bit since I originally asked this question but I will evaluate your solution as soon as I can and accept the answer once I can confirm my intended output. Also my mistake in not creating the index on my SQL Fiddle, I do have that index in my local DB.

Collectives™ on Stack Overflow

postgresql two nested joins and arrays in join

The problem

The Setup

The Tables

The Query

The Desired Output For Each Row (shown as JSON)

1 Answer 1

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

The problem

The Setup

The Tables

The Query

The Desired Output For Each Row (shown as JSON)

1 Answer 1

1 Comment

Related