Aggregate rows to get unique arrays without subsets

Question

Initial data (actual table contains more than 2,000,000 rows):

+--------+--------+-------+
| note   | factor | label |
+--------+--------+-------+
| note_1 | 1      | 2     |
+--------+--------+-------+
| note_1 | 1      | 3     |
+--------+--------+-------+
| note_1 | 2      | 4     |
+--------+--------+-------+
| note_2 | 123    | 2     |
+--------+--------+-------+
| note_2 | 123    | 3     |
+--------+--------+-------+
| note_2 | 2      | 4     |
+--------+--------+-------+
| note_3 | 456    | 4     |
+--------+--------+-------+
| note_4 | 434    | 5     |
+--------+--------+-------+
| note_5 | 456    | 3     |
+--------+--------+-------+
| note_5 | 456    | 4     |
+--------+--------+-------+

What I want to get (further final table):

+----+-----------------+
| id | notes           |
+----+-----------------+
| 1  | {note_1,note_2} |
+----+-----------------+
| 2  | {note_4}        |
+----+-----------------+
| 3  | {note_3,note_5} |
+----+-----------------+

More clearly:

I need to group notes by factor and label columns. Note can be in the result table only once. Result table should contains two columns: id - row number, notes - array of notes.

I have written a query to group by factor and label:

select row_number() over (order by factor) as id
     , array_agg(note order by note) as notes
from test_brand
group by factor, label

It gives these results:

+---+-----------------+
| 1 | {note_1}        |
+---+-----------------+
| 2 | {note_1}        |
+---+-----------------+
| 3 | {note_2}        |
+---+-----------------+
| 4 | {note_2}        |
+---+-----------------+
| 5 | {note_1,note_2} |
+---+-----------------+
| 6 | {note_4}        |
+---+-----------------+
| 7 | {note_5}        |
+---+-----------------+
| 8 | {note_3,note_5} |
+---+-----------------+

But I do not know how to get the final table proceeding from here.

If we omit identifiers and return to ordinary numbers, then this task looks like a union of sets (which in fact it is).
Let's say we have 8 sets: {1}, {1}, {2}, {2}, {1,2}, {4}, {5}, {3,5}. We need to get three sets: {1,2}, {4}, {3,5}.

How it happened in my opinion:
Sets {1}, {1}, {2}, {2}, {1,2} merged into one set {1,2}, because there is intersection between {1} and {2} with {1,2}.
Sets {3,5}, {5} merged into one set {3,5}, because there is intersection between {5} and {3,5}.
Set {4} does not intersect with anyone, so it remains as it is.

Erwin Brandstetter · Accepted Answer · 2021-04-16 04:12:30Z

There may be more efficient ways, but this does it:

WITH cte AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn  -- ORDER BY factor, label?!
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   )
SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   cte c
WHERE  NOT EXISTS (
   SELECT FROM cte c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;

db<>fiddle here

After your initial query, remove duplicates in the CTE and remember minimum row number.

In the final SELECT, eliminate all rows where the set is contained in another set (except itself). Compact row numbers with another instance of row_number().
Voilá.

Optimize performance

more than 2,000,000 rows.

If note can be an integer instead of the string type, computation will be substantially faster, even more so after installing the additional module intarray, which provides a faster implementation of @> operator for integer arrays.

If the derived table from the CTE is still big, it might pay to create a temporary table, add an index (and ANALYZE!), and run the outer SELECT based on that temp table:


CREATE TEMP TABLE tmp AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   );

CREATE INDEX ON tmp USING gin (notes gin__int_ops);
ANALYZE tmp;

SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   tmp c
WHERE  NOT EXISTS (
   SELECT FROM tmp c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;

See:

thanks for the answer! I'm not sure, but something went wrong. You can see run example here: fiddle I just add two new rows: note_7, 123, 2 and note_8, 656, 8. Correct result will be: {note_1,note_2,note_7} {note_4} {note_3,note_5} {note_8} But running the script gave the following result: {note_1,note_2} {note_2,note_7} {note_4} {note_3,note_5} {note_8}
@ErwinBrandstetter . . . Somehow your starting an answer with "there may be more efficient ways" is cognitively dissonant. Almost by definition, I expect your answers to be the most efficient way of doing something in Postgres.
@Moon But note_1 and note_7 do not share (factor, label)?
@Gordon: Wasn't so sure of it this time. @> is an expensive operation without index. The added bit about optimization made it better for me. Still, I feel like there may be more potential to speed it up. (And, thank you.)
@ErwinBrandstetter, yes, they are not. But note_1 share (factor, label) with note_2, and note_2 share (factor, label) with note_7. And than we need to group them all together. In other words: {1}, {1,2}, {2,7} should create set of {1,2,7}.

Collectives™ on Stack Overflow

Aggregate rows to get unique arrays without subsets

1 Answer 1

Optimize performance

8 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Optimize performance

8 Comments

Linked

Related