Removing duplicates in Postgres

Question

Using Postgres 9.5, I have a table properties:

CREATE TABLE properties (
    id serial PRIMARY KEY,
    property_id integer,
    state character(2),
    record_type character(1),
    ...
);

id is my unique internal id.
property_id comes from a third party. Properties in different states may share the same property_id, but within a state each property_id identifies exactly one property. This is because the properties table holds all states together instead of one table per state, and the property_id counter starts from 1 for each state.
state is the US state abbreviation (e.g. MA, CA, NY). When concatenated with property_id it references one property, e.g. 12345NY.
record_type can be A (add), C (change), or D (delete).

When new properties are added to the table their record_type is A. Over time a property's details change, and new rows are added to the table with C as their record_type.

Example:

id,   property_id, state, record_type, ...
7353, 6001,        'MA',  'A',         ...
7354, 6001,        'MA',  'C',         ...
7355, 6001,        'MA',  'C',         ...
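
To experiment with the example, a minimal test table can be sketched as follows. It assumes only the four columns shown above (the columns elided with ... are left out), and the NY row is a hypothetical addition to show the same property_id occurring in another state.

```sql
-- Minimal reproduction; columns elided with "..." in the question are omitted.
CREATE TABLE properties (
    id          serial PRIMARY KEY,
    property_id integer,
    state       character(2),
    record_type character(1)
);

-- Explicit ids are supplied to match the example; this does not advance
-- the serial sequence, which is fine for a throwaway test table.
INSERT INTO properties (id, property_id, state, record_type) VALUES
    (7353, 6001, 'MA', 'A'),
    (7354, 6001, 'MA', 'C'),
    (7355, 6001, 'MA', 'C'),
    (7356, 6001, 'NY', 'A');  -- hypothetical row: same property_id, different state
```

After a correct cleanup, only ids 7355 and 7356 should remain in this test table.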

Here's the problem: I want to only keep the most recent row for the property (doesn't matter what record_type) and delete all the older ones. So in the example, just keep the last row. There's no date column but we can assume the higher the id, the newer the record. As a side note, all the rows with D record types have been previously removed so we're only dealing with add and change record types.
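
One way to see how many properties are affected before deleting anything is to group by the (property_id, state) pair. This is only a sketch; the aliases row_count and newest_id are illustrative names, not part of the original schema.

```sql
-- List every property that has more than one row,
-- with the id of the row that should survive.
SELECT property_id,
       state,
       count(*) AS row_count,
       max(id)  AS newest_id
FROM properties
GROUP BY property_id, state
HAVING count(*) > 1
ORDER BY state, property_id;
```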

It's not clear if the record_type (and state) is relevant for the question. Do you just want to keep one record per property_id, the rest of the field being irrelevant?
– leonbloy, Commented Mar 11, 2016 at 22:18
@leonbloy The state is relevant because when combined with the property_id, references one unique property. I guess the record_type could be considered extraneous information if we're only concerned with keeping the latest record irrespective of record_type.
– Tyler, Commented Mar 11, 2016 at 22:48

Mihai · Accepted Answer · 2016-03-11 23:09:02Z

2

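-- Number each property's rows from newest (rn = 1) to oldest.
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY property_id, state
                              ORDER BY id DESC) AS rn
    FROM properties
)
-- Delete every row except the newest one per (property_id, state).
DELETE FROM properties
WHERE id IN (SELECT id FROM cte WHERE rn > 1);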
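
Since the delete is destructive, it may be worth previewing the affected rows and running everything inside a transaction first. The following is a sketch under that assumption; the CTE name ranked is illustrative, and the final ROLLBACK would be swapped for COMMIT once the result looks right.

```sql
BEGIN;

-- Preview: rows that are NOT the newest for their (property_id, state).
WITH ranked AS (
    SELECT id, property_id, state, record_type,
           ROW_NUMBER() OVER (PARTITION BY property_id, state
                              ORDER BY id DESC) AS rn
    FROM properties
)
SELECT id, property_id, state, record_type
FROM ranked
WHERE rn > 1
ORDER BY state, property_id, id;

-- Delete the same set of rows.
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY property_id, state
                              ORDER BY id DESC) AS rn
    FROM properties
)
DELETE FROM properties
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);

-- Check: this should now return no rows.
SELECT property_id, state
FROM properties
GROUP BY property_id, state
HAVING count(*) > 1;

ROLLBACK;  -- replace with COMMIT to keep the change
```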

edited Mar 11, 2016 at 23:09

answered Mar 11, 2016 at 22:53

3 Comments

leonbloy Over a year ago

This is much more elegant (and probably efficient) than my answer.

Jean Jung Over a year ago

What a beautiful solution

Tyler Over a year ago

@Mihai I'm getting ERROR: relation "cte" does not exist. Any ideas?
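
The thread does not record what caused this error. Two ways to run into it in PostgreSQL are sketched below as guesses, not as a diagnosis: using the CTE itself as the DELETE target (a form that SQL Server accepts), or running the WITH part and the DELETE as two separate statements. A CTE is visible only inside the single statement its WITH clause belongs to.

```sql
-- Guess 1 (fails): the CTE is used as the DELETE target.
-- PostgreSQL looks for a real table named cte and does not find one.
--   WITH cte AS (SELECT id, ROW_NUMBER() OVER (...) AS rn FROM properties)
--   DELETE FROM cte WHERE rn > 1;

-- Guess 2 (fails): a semicolon ends the WITH statement,
-- so the DELETE that follows can no longer see cte.
--   WITH cte AS (...) SELECT * FROM cte;
--   DELETE FROM properties WHERE id IN (SELECT id FROM cte WHERE rn > 1);

-- Working form: one statement, with the real table as the DELETE target
-- and the CTE referenced only in the subquery.
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY property_id, state
                              ORDER BY id DESC) AS rn
    FROM properties
)
DELETE FROM properties
WHERE id IN (SELECT id FROM cte WHERE rn > 1);
```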

leonbloy · Accepted Answer · 2016-03-11 22:50:19Z

1

If you just want to keep one record per (property_id, state) pair, irrespective of the other fields, this should be enough:

DELETE FROM properties p1
WHERE p1.id != (
    SELECT max(p2.id)
    FROM properties p2
    WHERE p2.property_id = p1.property_id
      AND p2.state = p1.state
);
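
The same idea can also be written as a self-join using PostgreSQL's DELETE ... USING, which removes every row for which a newer row of the same property exists. This is a sketch of an alternative, not the form given in the answer; the aliases older and newer are illustrative.

```sql
-- Delete a row whenever another row with the same (property_id, state)
-- has a higher id, i.e. is assumed to be newer.
DELETE FROM properties AS older
USING properties AS newer
WHERE newer.property_id = older.property_id
  AND newer.state       = older.state
  AND newer.id          > older.id;
```

One difference worth keeping in mind: property_id and state are nullable in the table definition shown in the question. Equality comparisons never match NULL, so this form and the correlated subquery above leave rows with a NULL property_id or state untouched, whereas a PARTITION BY groups NULLs together and would treat them as duplicates of each other.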

edited Mar 11, 2016 at 22:50

answered Mar 11, 2016 at 22:25

3 Comments

Jean Jung Over a year ago

Sorry I am on smartphone and just touched the wrong place, already undone it (:

Tyler Over a year ago

@leonbloy I'm not the downvoter, but I think there is an issue with that query where it can unintentionally delete properties from other states. The true property id is really the property_id and state combined. There may be multiple different properties that share the same property_id.

leonbloy Over a year ago

@Tyler It seems you are right. I think it's fixed now.
