How to get rid of duplicates with T-SQL

Question

Hi, I have a login table that has some duplicated usernames. Yes, I know I should have put a constraint on it, but it's a bit too late for that now!

So essentially what I want to do is to first identify the duplicates. I can't just delete them since I can't be too sure which account is the correct one. The accounts have the same username and both of them have roughly the same information with a few small variances.

Is there any way to efficiently script it so that I can add "_duplicate" to only one of the accounts per duplicate?

Have you identified the duplicates? Do you have a query so far?

– Marcelo Origoni, Commented Dec 18, 2017 at 0:18

Gottfried Lesigang · Accepted Answer · 2017-12-18 11:35:46Z

You can use ROW_NUMBER with a PARTITION BY in the OVER() clause to find the duplicates and an updateable CTE to change the values accordingly:

-- Sample data: Peter appears three times, Jane twice
DECLARE @dummyTable TABLE(ID INT IDENTITY, UserName VARCHAR(100));
INSERT INTO @dummyTable VALUES('Peter'),('Tom'),('Jane'),('Victoria')
                             ,('Peter')        ,('Jane')
                             ,('Peter');

-- Number the rows within each UserName; the row with the lowest ID gets 1 and keeps its name
WITH UpdateableCTE AS
(
    SELECT t.UserName AS OldValue
          ,t.UserName + CASE WHEN ROW_NUMBER() OVER(PARTITION BY t.UserName ORDER BY t.ID)=1 THEN '' ELSE '_duplicate' END AS NewValue
    FROM @dummyTable AS t
)
-- Updating the CTE writes through to the underlying table
UPDATE UpdateableCTE SET OldValue = NewValue;

SELECT * FROM @dummyTable;

The result

ID  UserName
1   Peter
2   Tom
3   Jane
4   Victoria
5   Peter_duplicate
6   Jane_duplicate
7   Peter_duplicate

You might include ROW_NUMBER() as another column to see each duplicate's ordinal. If you have a sort clause that numbers the earliest (or most current) row with 1, it should be easy to find and correct the duplicates.
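As a minimal sketch of that idea, the query below rebuilds the same sample data and exposes ROW_NUMBER() as its own column. The names Numbered, DupOrdinal and DupCount are made up for illustration, and ORDER BY t.ID only stands in for whatever column tells you which account is the earliest or most current in your real table.

```sql
-- Same sample data as in the answer above
DECLARE @dummyTable TABLE(ID INT IDENTITY, UserName VARCHAR(100));
INSERT INTO @dummyTable VALUES('Peter'),('Tom'),('Jane'),('Victoria'),('Peter'),('Jane'),('Peter');

WITH Numbered AS
(
    SELECT t.ID
          ,t.UserName
          -- Position of the row within its group of equal user names
          ,ROW_NUMBER() OVER(PARTITION BY t.UserName ORDER BY t.ID) AS DupOrdinal
          -- Size of the group, so unique names can be filtered out
          ,COUNT(*)     OVER(PARTITION BY t.UserName)               AS DupCount
    FROM @dummyTable AS t
)
SELECT n.ID, n.UserName, n.DupOrdinal, n.DupCount
FROM Numbered AS n
WHERE n.DupCount > 1          -- only names that occur more than once
ORDER BY n.UserName, n.DupOrdinal;
```

Rows with DupOrdinal = 1 are the ones the UPDATE leaves unchanged; everything with a higher ordinal is a candidate for manual review.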

Once you've cleaned this mess, you should ensure not to get new dups. But you know this already :-D
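A sketch of how new duplicates could be blocked once the data is clean. The temp table #Logins and the constraint name UQ_Logins_UserName are placeholders, since the question does not give the real table definition; on the real table only the ALTER TABLE statement would be needed.

```sql
-- Hypothetical stand-in for the real login table, already cleaned up
CREATE TABLE #Logins (ID INT IDENTITY PRIMARY KEY, UserName VARCHAR(100) NOT NULL);
INSERT INTO #Logins (UserName) VALUES ('Peter'), ('Tom'), ('Peter_duplicate');

-- This statement itself fails if duplicate UserName values are still present
ALTER TABLE #Logins ADD CONSTRAINT UQ_Logins_UserName UNIQUE (UserName);

BEGIN TRY
    -- 'Peter' already exists, so the constraint rejects this insert
    INSERT INTO #Logins (UserName) VALUES ('Peter');
END TRY
BEGIN CATCH
    SELECT ERROR_MESSAGE() AS ErrorMessage;
END CATCH;

DROP TABLE #Logins;
```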

Alex Kudryashev · Accepted Answer · 2017-12-18 04:38:20Z

There is no easy way to get rid of this nightmare. Some manual work is required.
First, identify the duplicates.

-- all rows whose username occurs more than once
select * from dbo.users
where username in
(select username from dbo.users
   group by username
   having count(*) > 1)
order by username, userId

Next, identify "useless" users (for example, those who registered but never placed an order).
Rerun the query above. From this list, find duplicates that are really the same account (by email, for example) and combine them into a single record. If they did something useful previously (for example, placed orders), first reassign those orders to the user that survives, then remove the others.
Continue with other criteria until you get rid of all duplicates.
Then set a unique constraint on the username field. It is also a good idea to set a unique constraint on the email field.
Again, it is not easy and not automatic.
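The "combine into a single record" step could look roughly like the sketch below. It assumes that rows with the same username and email are the same account, that the lowest userId is the one to keep, and that orders live in a table with a userId column. The temp tables #users, #orders and #map and all their columns are invented for illustration, so adapt them to the real schema before running anything similar.

```sql
-- Hypothetical sample data
CREATE TABLE #users  (userId INT PRIMARY KEY, username VARCHAR(100), email VARCHAR(200));
CREATE TABLE #orders (orderId INT PRIMARY KEY, userId INT);
INSERT INTO #users  VALUES (1,'peter','p@example.com'),(2,'peter','p@example.com'),(3,'tom','t@example.com');
INSERT INTO #orders VALUES (10,1),(11,2),(12,3);

-- Map every user to the survivor of its (username, email) group: the lowest userId
CREATE TABLE #map (userId INT PRIMARY KEY, survivorId INT);
INSERT INTO #map (userId, survivorId)
SELECT u.userId, MIN(u.userId) OVER(PARTITION BY u.username, u.email)
FROM #users AS u;

BEGIN TRANSACTION;

-- Reassign the orders of merged-away accounts to the survivor
UPDATE o
SET userId = m.survivorId
FROM #orders AS o
JOIN #map AS m ON m.userId = o.userId
WHERE m.userId <> m.survivorId;

-- Remove the accounts that were merged away
DELETE u
FROM #users AS u
JOIN #map AS m ON m.userId = u.userId
WHERE m.userId <> m.survivorId;

COMMIT TRANSACTION;

SELECT * FROM #users;   -- users 1 and 3 remain
SELECT * FROM #orders;  -- order 11 now belongs to user 1

DROP TABLE #map; DROP TABLE #orders; DROP TABLE #users;
```

Keeping the mapping in its own table makes it possible to review which account absorbs which before anything is updated or deleted.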

Pawan Kumar · Accepted Answer · 2017-12-18 05:04:00Z

In this case, where the duplicates and the original names have some variance, it is practically impossible to select the non-duplicate rows, since you do not know which row is real and which is the duplicate.

I think the best thing to do is to correct your data and then fix the source these slightly varying duplicates come from.

1 Comment

Gottfried Lesigang Over a year ago

If you read the question, the OP's need is exactly what you've described (identify the duplicates and rework them manually). But the question is: How can this be done?
