
I have a table containing the following fields:

email - the logged-in user's email

allowed_id - the ID of another user

The table contains multiple entries for the same email, each one containing a different allowed_id.

I'm trying to aggregate this into an array in order to save it to Redis to speed up one of our internal processes.

Usually I'd use ARRAY_AGG, but it is not available in Redshift. Redshift has a LISTAGG function that works somewhat similarly, but it turns everything into a string and has a 64k length limit, which I've already hit in my first tries. When I move this to production I'll face an even larger dataset.

It's important to know that query time is not really a concern; it will run as a cron job every day around 2:00 AM.

I've been trying to use the ARRAY function, but it returns something like:

email, [id]
same_email, [another_id]

And this is not what I'm looking for.

This is my query:


    SELECT
      email,
      ARRAY(allowed_id) AS user_ids
    FROM
      sec_table
    GROUP BY
      email, allowed_id;

Just to make it clearer, this is the type of result I'm trying to achieve:

email, [id1, id2, id3]

  • Can you access the table(s) through something like Python? (i.e. to avoid the LISTAGG restriction and build proper arrays) Commented Aug 18, 2023 at 1:33
  • I have a Node server; the issue is that manipulating this data raises memory issues, because it's really large. That's why I was trying to use SQL only. (A streaming sketch along these lines follows these comments.) Commented Aug 18, 2023 at 11:22
  • You're saying that you have one email associated with, let's do a back-of-envelope calc: an ID of 9 chars + comma ≈ 10 bytes, so 64K / 10 ≈ 6,400 users per email. It seems like your design might be wrong. Commented Aug 19, 2023 at 3:10
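
For illustration, a streaming approach along the lines of these comments might look like the following. This is only a sketch: it assumes the psycopg2 and redis packages (Redshift speaks the Postgres wire protocol), the table and column names from the question, placeholder connection details, and an allowed:<email> Redis key scheme invented for the example. Because the rows arrive ordered by email, only one email's ids are in memory at a time, which is what sidesteps the Node memory problem.

    # A sketch only, untested: stream (email, allowed_id) rows ordered by
    # email and flush each finished group to Redis, so the full result set
    # is never held in memory at once.
    # Assumes the psycopg2 and redis packages; connection details and the
    # Redis key scheme are placeholders invented for the example.
    import psycopg2
    import redis

    conn = psycopg2.connect(
        host="your-cluster.example.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="dev", user="user", password="password",
    )
    r = redis.Redis(host="localhost", port=6379)

    # A named cursor makes psycopg2 use a server-side cursor (DECLARE/FETCH),
    # so rows arrive in batches of itersize instead of all at once.
    cur = conn.cursor(name="stream_allowed_ids")
    cur.itersize = 10_000
    cur.execute("SELECT email, allowed_id FROM sec_table ORDER BY email")

    current_email, ids = None, []
    for email, allowed_id in cur:
        if email != current_email and current_email is not None:
            r.rpush(f"allowed:{current_email}", *ids)  # group done: flush it
            ids = []
        current_email = email
        ids.append(allowed_id)
    if current_email is not None and ids:
        r.rpush(f"allowed:{current_email}", *ids)  # flush the last group

    cur.close()
    conn.close()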

1 Answer


I believe the 64k LISTAGG limit is just that - a hard limit.

See: how to handle Listagg size limit in redshift? (NB: adjust the NTILE(10000) used below to suit your data.)

    -- NTILE(10000) spreads each email's rows across up to 10,000 buckets,
    -- so each per-chunk LISTAGG sees only a fraction of the rows and stays
    -- under the 64k limit.
    WITH numbered_rows AS (
      SELECT
        email,
        allowed_id,
        NTILE(10000) OVER (PARTITION BY email ORDER BY allowed_id) AS chunk
      FROM your_table
    )
    SELECT
      email,
      chunk,
      LISTAGG(allowed_id, ',') WITHIN GROUP (ORDER BY allowed_id) AS allowed_ids
    FROM numbered_rows
    GROUP BY email, chunk;

Following this approach you'd arrive at far fewer rows, but some of those would still need stitching back together - perhaps using Python, though I'm not sure that solves the memory issue. A sketch of the stitching follows.
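
For illustration, the stitching itself might look like the generator below, reusing the same streaming pattern as the sketch under the question's comments. It assumes a cursor that has executed the chunked query above with ORDER BY email, chunk appended; like everything else here, it is untested.

    # Sketch: stitch the chunked LISTAGG output back into one list per email.
    # Assumes `cur` is a cursor that has executed the query above with
    # "ORDER BY email, chunk" appended (e.g. the server-side cursor from the
    # streaming sketch under the question's comments).
    def stitched_groups(cur):
        current_email, ids = None, []
        for email, chunk, allowed_ids in cur:
            if email != current_email and current_email is not None:
                yield current_email, ids  # one finished email at a time
                ids = []
            current_email = email
            ids.extend(allowed_ids.split(","))  # undo LISTAGG's comma-join
        if current_email is not None:
            yield current_email, ids

Each yielded pair can be written straight to Redis, so memory stays bounded by the largest single email.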

Alternatively - and I almost never suggest this - try a procedural approach.

Create a summary table with a super column e.g.:

    CREATE TABLE email_summary (
        email VARCHAR(256),
        allowed_ids SUPER
    );

Now use a stored procedure to populate that table, e.g.:

    CREATE OR REPLACE PROCEDURE create_summary()
    LANGUAGE plpgsql
    AS $$
    DECLARE
        cur_email VARCHAR(256);
        cur_allowed_id VARCHAR(256);
        cur_allowed_ids SUPER := JSON_PARSE('[]');  -- empty SUPER array (a plain ''::SUPER cast would give a string, not an array)
        prev_email VARCHAR(256) := NULL;
    BEGIN
        FOR cur_email, cur_allowed_id IN SELECT email, allowed_id FROM your_existing_table ORDER BY email
        LOOP
            IF prev_email IS NOT NULL AND cur_email != prev_email THEN
                -- Insert the previous email and its allowed_ids into the summary table
                INSERT INTO email_summary (email, allowed_ids) VALUES (prev_email, cur_allowed_ids);
                -- Reset the allowed_ids array for the next email
                cur_allowed_ids := JSON_PARSE('[]');
            END IF;
            -- Append the current allowed_id to the allowed_ids array
            cur_allowed_ids := ARRAY_CONCAT(cur_allowed_ids, ARRAY(cur_allowed_id));
            -- Remember the current email for the next iteration
            prev_email := cur_email;
        END LOOP;
        -- Don't forget to insert the last email and its allowed_ids
        IF prev_email IS NOT NULL THEN
            INSERT INTO email_summary (email, allowed_ids) VALUES (prev_email, cur_allowed_ids);
        END IF;
    END;
    $$;
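
If the procedure compiles, populating the table should just be `CALL create_summary();`, and a quick `SELECT email, allowed_ids FROM email_summary LIMIT 10;` will show whether the SUPER arrays look right (the LIMIT is only for eyeballing the result).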

Caveats: try this on a small scale first, as what you see above is utterly untested and, even if it works, may be slow. Then you face the issue of getting that summary table out - that's possibly another question, and not something I'm trying to cover here.
