Remove Duplicates Based Off of Two Columns in PostgreSQL

Question

So let's say I have a table named Class with the following fields: userid, time, and score. The table looks like this:

+--------+------------+-------+
| userid |    time    | score |
+--------+------------+-------+
|      1 | 08-20-2018 |    75 |
|      1 | 10-25-2018 |    50 |
|      1 | 02-01-2019 |    88 |
|      2 | 04-23-2019 |    98 |<remove
|      2 | 04-23-2019 |    86 |
|      3 | 06-05-2019 |    71 |<remove
|      3 | 06-05-2019 |    71 |
+--------+------------+-------+

However, I would like to remove records where the userid and the time are the same (since it doesn't make sense for someone to give another score on the same day). This would also take care of the records where the userid, time, and score are all the same. So in this table, rows 4 and 6 should be removed.

The following query gives me a list of the duplicated records:

select userid, time
FROM class
GROUP BY userid, time
HAVING count(*)>1;

However, how do I remove the duplicates while still keeping the userid, time, and score column in the outcome?

There is a "hidden" ctid column in each table that physically identifies each row. In case you have only two duplicates, you can delete them like this (brief test case): create table t(x int); insert into t values(1), (1); select distinct on (x) ctid, x from t order by x; /* just for test */ delete from t where ctid in (select distinct on (x) ctid from t order by x);
– Abelisto, Commented Dec 9, 2019 at 19:20

sticky bit · Accepted Answer · 2019-12-09 19:06:48Z

You can use the row_number() window function to assign a number to each record in the order of score for each userid and time and then select only the rows where this number is equal to one.

SELECT userid,
       time,
       score
       FROM (SELECT userid,
                    time,
                    score,
                    row_number() OVER (PARTITION BY userid,
                                                    time
                                       ORDER BY score) rn
                    FROM class) x
       WHERE rn = 1;
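
To try the query above, here is a minimal sketch that loads the question's sample rows into a temporary table. The date column type and the ISO date literals are assumptions, since the question does not show the table definition. Because the window sorts by score ascending, the lowest score in each (userid, time) group gets rn = 1; for userid 2 that is 86, which is the row the question wants to keep. Sorting by score DESC instead would keep 98.

```sql
-- Assumed definition: the question does not show the real column types.
CREATE TEMP TABLE class (
    userid integer,
    time   date,
    score  integer
);

INSERT INTO class (userid, time, score) VALUES
    (1, '2018-08-20', 75),
    (1, '2018-10-25', 50),
    (1, '2019-02-01', 88),
    (2, '2019-04-23', 98),
    (2, '2019-04-23', 86),
    (3, '2019-06-05', 71),
    (3, '2019-06-05', 71);

-- Number the rows inside each (userid, time) group, lowest score first,
-- and keep only the first row of every group.
SELECT userid, time, score
FROM (SELECT userid,
             time,
             score,
             row_number() OVER (PARTITION BY userid, time
                                ORDER BY score) AS rn
      FROM class) x
WHERE rn = 1
ORDER BY userid, time;
-- The duplicated groups collapse to (2, 2019-04-23, 86) and (3, 2019-06-05, 71).
```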

albielin · Accepted Answer · 2019-12-09 19:27:12Z

First, you need some criterion to distinguish between two rows that have different scores (unless you want to choose randomly between the two). E.g., you could pick the highest score (like the SATs) or the lowest.

Assuming you want the highest score per day, you can do this:

SELECT DISTINCT ON (userid, time)
       userid, time, score
FROM class
ORDER BY userid, time, score DESC;

Some key things: the columns in your distinct on must appear in the left-most positions of your order by, but the magic is in the field that comes next in the order by. It'll pick the first row among dupes of (userid, time) when ordered by score desc.
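
If the goal is to match the question's expected output, where the 86 is kept and the 98 is removed, the same pattern should work with the sort direction flipped. The sketch below also shows one possible way to store the result; the table name class_dedup is hypothetical and not part of the question.

```sql
-- Keep the lowest score per (userid, time): DISTINCT ON returns the first
-- row of each group, and sorting by score ascending puts the lowest first.
SELECT DISTINCT ON (userid, time)
       userid, time, score
FROM class
ORDER BY userid, time, score;

-- Optionally store the deduplicated rows in a new table.
-- class_dedup is a hypothetical name used only for illustration.
CREATE TABLE class_dedup AS
SELECT DISTINCT ON (userid, time)
       userid, time, score
FROM class
ORDER BY userid, time, score;
```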

Gordon Linoff · Accepted Answer · 2019-12-09 20:01:53Z

You have a real problem with your data model. This is easy enough to fix in a select query, as the other answers suggest (I would recommend distinct on for this).

For actually deleting the row, you can use ctid (as mentioned in a comment). The approach is:

delete from class t
    where exists (select 1
                  from class t2
                  where t2.userid = t.userid and t2.time = t.time and
                        t2.ctid < t.ctid
                 );

That is, delete any row where there is a smaller ctid for the userid/time combination.
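
The ctid comparison keeps whichever row has the smallest physical location, which is not necessarily the score you want to keep. A possible variation, sketched here against the question's class table, combines ctid with row_number() so the surviving row is chosen by score, and then adds a unique constraint so the duplicates cannot come back. The constraint name is made up for illustration, and the ascending sort (keep the lowest score) is an assumption based on the question's example.

```sql
BEGIN;

-- Delete every row that is not ranked first within its (userid, time) group.
-- Rows with identical scores are ranked arbitrarily, so exactly one survives.
DELETE FROM class
WHERE ctid IN (
    SELECT ctid
    FROM (SELECT ctid,
                 row_number() OVER (PARTITION BY userid, time
                                    ORDER BY score) AS rn
          FROM class) ranked
    WHERE rn > 1
);

-- Hypothetical constraint name; rejects a second score for the same
-- user on the same day from now on.
ALTER TABLE class
    ADD CONSTRAINT class_userid_time_key UNIQUE (userid, time);

COMMIT;
```

Since a row's ctid can change when the row is updated or the table is rewritten, it is safest to look up and use the ctid values inside a single statement, as done here, rather than saving them for later.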
