So let's say I have a table named Class with the following fields: userid, time, and score. The table looks like this:

+--------+------------+-------+
| userid |    time    | score |
+--------+------------+-------+
|      1 | 08-20-2018 |    75 |
|      1 | 10-25-2018 |    50 |
|      1 | 02-01-2019 |    88 |
|      2 | 04-23-2019 |    98 |<remove
|      2 | 04-23-2019 |    86 |
|      3 | 06-05-2019 |    71 |<remove
|      3 | 06-05-2019 |    71 |
+--------+------------+-------+

However, I would like to remove records where the userid and the time are the same (since it doesn't make sense for someone to give another score on the same day). This would also take care of the records where the userid, time, and score are all the same. So in this table, rows 4 and 6 should be removed.

The following query gives me a list of the duplicated records:

SELECT userid, time
FROM class
GROUP BY userid, time
HAVING count(*) > 1;

However, how do I remove the duplicates while still keeping the userid, time, and score columns in the result?

2
  • There is a "hidden" ctid column in every table that physically identifies each row. If each duplicated value appears only twice, you can delete the extra rows like this (brief test case): create table t(x int); insert into t values(1), (1); select distinct on (x) ctid, x from t order by x; /* just for test */ delete from t where ctid in (select distinct on (x) ctid from t order by x); Commented Dec 9, 2019 at 19:20
  • Kindly please accept the best answer Commented Apr 2, 2022 at 10:26

3 Answers

5

You can use the row_number() window function to number the records within each (userid, time) group in order of score, and then select only the rows where this number equals one. With the ascending ORDER BY score used here, the row kept in each group is the one with the lowest score.

SELECT userid,
       time,
       score
       FROM (SELECT userid,
                    time,
                    score,
                    row_number() OVER (PARTITION BY userid,
                                                    time
                                       ORDER BY score) rn
                    FROM class) x
       WHERE rn = 1;
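
To try this out, a minimal test setup might look like the sketch below. The column types (integer, date) are assumptions, since the question does not state them, and the dates are written in ISO format to avoid depending on the DateStyle setting. The final query is a variant, not the query above: it flips the sort to score DESC so that the highest score per (userid, time) is kept instead.

```sql
-- Assumed schema: the question does not give the column types.
CREATE TABLE class (
    userid integer,
    time   date,
    score  integer
);

-- Sample rows from the question.
INSERT INTO class (userid, time, score) VALUES
    (1, '2018-08-20', 75),
    (1, '2018-10-25', 50),
    (1, '2019-02-01', 88),
    (2, '2019-04-23', 98),
    (2, '2019-04-23', 86),
    (3, '2019-06-05', 71),
    (3, '2019-06-05', 71);

-- Variant: keep the highest score per (userid, time).
SELECT userid, time, score
FROM (SELECT userid, time, score,
             row_number() OVER (PARTITION BY userid, time
                                ORDER BY score DESC) AS rn
      FROM class) x
WHERE rn = 1
ORDER BY userid, time;
```

With the sample data this should return five rows, one per distinct (userid, time) pair.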

2

First, you need some criterion to choose between two rows that have different scores (unless you want to pick one arbitrarily). E.g., you could pick the highest score (like the SATs) or the lowest.

Assuming you want the highest score per day, you can do this:

SELECT DISTINCT ON (userid, time)
       userid, time, score
FROM class
ORDER BY userid, time, score DESC;

Some key things: the columns in your distinct on have to be in the left-most positions of your order by. The magic is in the field that comes next in the order by: it'll pick the first row among dupes of (userid, time) when ordered by score desc.
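
If the deduplicated result should be kept rather than just queried, one option is to write it to a new table. This is only a sketch: class_dedup is a made-up table name, and whether a copy is preferable to deleting in place depends on your setup.

```sql
-- class_dedup is a hypothetical name used only for illustration.
-- Keeps the highest score for each (userid, time) pair.
CREATE TABLE class_dedup AS
SELECT DISTINCT ON (userid, time)
       userid, time, score
FROM class
ORDER BY userid, time, score DESC;
```

Changing score DESC to score would keep the lowest score per pair instead.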

2

You have a real problem with your data model. This is easy enough to fix in a select query, as the other answers suggest (I would recommend distinct on for this).

For actually deleting the rows, you can use ctid (as mentioned in a comment). The approach is:

delete from class c
    where exists (select 1
                  from class c2
                  where c2.userid = c.userid and c2.time = c.time and
                        c2.ctid < c.ctid
                 );

That is, delete any row where there is a smaller ctid for the userid/time combination.
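
Note that the smallest ctid reflects physical row location, not score, so which score survives is not controlled by that query. If a specific row should be kept, one possible variant combines ctid with row_number(). The sketch below assumes the table and column names from the question (class, userid, time, score) and keeps the highest score. The constraint name class_userid_time_key is hypothetical. Adding the unique constraint afterwards is one way to address the data model problem, because it stops new duplicates from being inserted.

```sql
BEGIN;

-- Delete every row except the highest-scoring one per (userid, time).
-- For exact ties (same score), row_number() keeps one of them arbitrarily.
DELETE FROM class
WHERE ctid IN (SELECT ctid
               FROM (SELECT ctid,
                            row_number() OVER (PARTITION BY userid, time
                                               ORDER BY score DESC) AS rn
                     FROM class) ranked
               WHERE rn > 1);

-- Hypothetical constraint name; prevents future duplicates.
ALTER TABLE class
    ADD CONSTRAINT class_userid_time_key UNIQUE (userid, time);

COMMIT;
```

Running both statements in one transaction means the table is left unchanged if the constraint cannot be created.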
