SQL Server / Remove duplicates

Question

I have a table with 3 million records. The table looks like this:

id          phones
----------- -----------------
0           1234; 5897;
1           0121; 7875; 5455;
2           0121; 5455; 7875;
3           999;
4           0121;
5           5455; 0121;

Records with id 1, 2, 4 and 5 are duplicates. I would like to keep only one of them: the record with the longest phone string and, among those, the highest id.

So in my example, after running the query, my table should be:

id          phones
----------- -----------------
0           1234; 5897;
2           0121; 5455; 7875;
3           999;

How would I accomplish this?

NB: there is no blank space in the phone strings.

Though actually the task of finding exact duplicates wouldn't be much easier if properly designed in first normal form.
– Martin Smith, Commented Dec 22, 2013 at 16:01
@MitchWheat - You'd need some hairy exact relational division query and it would probably just be quicker (better performing across all possible data distributions) and easier to use XML PATH on the normalised data and compare the concatenated strings. Similar question about finding recipes with identical ingredients.
– Martin Smith, Commented Dec 22, 2013 at 16:26
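
A minimal sketch of the idea in the comment above, assuming the data has already been normalised into a hypothetical table variable @Phone(id, phone) with one row per number. It builds a sorted, concatenated string per id with FOR XML PATH and groups on that string. Note that this only finds exact duplicates (ids 1 and 2 in the question's data); ids 4 and 5 hold subsets of those phones and are not matched by this comparison.

```sql
-- Hypothetical normalised table: one row per (id, phone).
declare @Phone table (id int, phone varchar(20));

insert into @Phone (id, phone) values
    (0, '1234'), (0, '5897'),
    (1, '0121'), (1, '7875'), (1, '5455'),
    (2, '0121'), (2, '5455'), (2, '7875'),
    (3, '999'),
    (4, '0121'),
    (5, '5455'), (5, '0121');

-- Build one canonical string per id with the phones sorted,
-- so the same set of phones always yields the same string.
with Canonical as (
    select ids.id,
           (select p.phone + ';'
            from @Phone p
            where p.id = ids.id
            order by p.phone
            for xml path('')) as phoneSet
    from (select distinct id from @Phone) ids
)
-- Ids that share a phoneSet are exact duplicates; keep the highest id.
select max(id) as keptId, phoneSet, count(*) as copies
from Canonical
group by phoneSet
order by keptId;
```

With the sample rows, ids 1 and 2 collapse into one group (keptId 2, copies 2); every other id forms a group of its own.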

Fredou · Accepted Answer · 2013-12-23 13:52:32Z

This should be a good starting point.

You will need to create two temp tables (the demo below uses table variables): one that holds the ids (column Id), and one that holds the phones (columns Id, Phone), so the relationship is one-to-many.

Then insert the whole original table into these two tables.

When this is done, sort and compare the rows to reconstruct the filtered result.

Here is a demo (the code is not optimized, but it works):

declare @AllPhones table (id int, phones varchar(max))

insert into @AllPhones select 0, '1234; 5897;'
insert into @AllPhones select 1, '0121; 7875; 5455;'
insert into @AllPhones select 2, '0121; 5455; 7875;'
insert into @AllPhones select 3, '999;'
insert into @AllPhones select 4, '0121;'
insert into @AllPhones select 5, '5455; 0121;'
insert into @AllPhones select 6, '222;'
insert into @AllPhones select 7, '888;'
insert into @AllPhones select 8, '222; 888;'
insert into @AllPhones select 9, '888; 222;'


select * from @AllPhones

declare @IdPhone table (id int, done bit)
declare @Phone table (id int, phone varchar(max), insertOrder int)

insert into @IdPhone
select id, 0
from   @AllPhones

declare @Id int
declare @ConcatPhone varchar(max)

declare @idx int       
declare @slice varchar(max)
declare @insertOrder int

-- Split each row's phone string into one @Phone row per number.
while exists(select * from @IdPhone where done=0)
begin
    -- Pick any row that has not been split yet.
    select top 1 @Id = ap.id
               , @ConcatPhone = ap.phones
    from @IdPhone ip inner join @AllPhones ap on ip.id = ap.id
    where ip.done=0

    select @idx = 1
    select @insertOrder = 1
    if len(@ConcatPhone)> 0 and @ConcatPhone is not null
    begin
        while @idx!= 0
        begin
            -- Take everything up to the next ';' (or the rest of the string).
            set @idx = charindex(';',@ConcatPhone)
            if @idx!=0
                set @slice = left(@ConcatPhone,@idx - 1)
            else
                set @slice = @ConcatPhone

            if(len(@slice)>0)
                insert into @Phone(id, phone,insertOrder) values(@Id, rtrim(ltrim(@slice)),@insertOrder)

            -- Drop the consumed part; stop once nothing is left.
            set @ConcatPhone = right(@ConcatPhone,len(@ConcatPhone) - @idx)
            if len(@ConcatPhone) = 0 break

            select @insertOrder = @insertOrder+1
        end
    end

    update @IdPhone
    set done=1
    where id = @Id
end

-- One row per (id, phone), plus c = how many phones that id has in total.
declare @UniquePhone table (id int, c int, phone varchar(max),insertOrder int, done int)

insert into @UniquePhone
    select p.id
         , (select count(pCount.id) from @Phone pCount where pCount.id=p.id) as c
         , p.phone
         ,p.insertOrder
         ,0
    from @Phone p
    group by p.id, p.phone, p.insertOrder

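-- Visit ids from the longest phone list down (highest id first on ties) and
-- remove that id's phones from every other id. An id whose phones are all
-- covered this way disappears from @UniquePhone.
while exists(select * from @UniquePhone where done=0)
begin
    select top 1 @Id = up.id
    from @UniquePhone up
    where done=0
    order by c desc
           , id desc

    delete from @UniquePhone
    where id <> @Id and phone in (select phone from @UniquePhone pp where pp.id=@Id)

    print @Id  -- debug output: the id currently being processed

    update @UniquePhone
    set done=1
    where id = @Id
end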

-- Rebuild one phone string per surviving id, in the original phone order.
select FinalTable.id,
       ltrim(rtrim(FinalTable.Phones)) As Phones
from(select distinct up2.id, 
           (select up1.phone + '; ' as [text()]
            from @UniquePhone up1
            where up1.id = up2.id
            order by up1.id, insertOrder
            for XML PATH ('')) Phones
     from @UniquePhone up2) [FinalTable]
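
The row-by-row split loop in the demo above could also be written as a single set-based statement, which may matter on a table with millions of rows. The following is only a sketch of one common technique (turning the delimited string into XML and shredding it with nodes()), not part of the demo. It assumes the phone strings contain no XML special characters such as < or &, and it does not record the insertOrder column that the demo uses.

```sql
-- Same shape as the demo's source table, with two sample rows.
declare @AllPhones table (id int, phones varchar(max));

insert into @AllPhones (id, phones) values
    (0, '1234; 5897;'),
    (1, '0121; 7875; 5455;');

-- Wrap every ';'-separated piece in a <p> element, then read the
-- elements back as rows. The trailing ';' yields an empty element,
-- which the WHERE clause filters out.
select ap.id,
       ltrim(rtrim(x.p.value('.', 'varchar(50)'))) as phone
from @AllPhones ap
cross apply (select cast('<p>' + replace(ap.phones, ';', '</p><p>') + '</p>' as xml) as doc) d
cross apply d.doc.nodes('/p') as x(p)
where ltrim(rtrim(x.p.value('.', 'varchar(50)'))) <> '';
```

For the two sample rows this returns five (id, phone) pairs, two for id 0 and three for id 1.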

You're leaving finding exact duplicate groups in the phones(column Id, Phone) as an exercise for the reader?
Well you haven't answered how at all. You've advised him to restructure the table then just been hand wavy on the tricky bit!
@Fredou: Thanks for this code. This works fine with the values in our example. But I can't get it to work with values taken from my test table. Here is a screenshot.
@Fredou, it works fine! Except that the output phones are in the wrong order. Is it possible to replace the old table with the new one? Many thanks!
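
On the last question, one possible way to apply the result to the original table is to delete every row whose id did not survive. This is only a sketch under stated assumptions: @AllPhones stands in for the real table, and @Kept is a hypothetical table holding the surviving ids (in the demo these would be the distinct ids left in @UniquePhone). On real data it would be prudent to run it inside a transaction and check the row counts before committing.

```sql
-- Stand-in for the real table (the question's sample data).
declare @AllPhones table (id int, phones varchar(max));

insert into @AllPhones (id, phones) values
    (0, '1234; 5897;'),
    (1, '0121; 7875; 5455;'),
    (2, '0121; 5455; 7875;'),
    (3, '999;'),
    (4, '0121;'),
    (5, '5455; 0121;');

-- Hypothetical list of ids that survived the de-duplication.
declare @Kept table (id int primary key);

insert into @Kept (id) values (0), (2), (3);

-- Remove every row whose id is not in the kept list.
delete from @AllPhones
where id not in (select id from @Kept);

select id, phones from @AllPhones order by id;
```

With the sample data, the final select returns the three rows with ids 0, 2 and 3, matching the expected table in the question.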
