Removing duplicate rows from SQL Server Table

Question

I have a SQL Server 2012 table with some redundant data, as follows:

ColumnA(varchar) | ColumnB(varchar)
---------------- | ---------------
name1            | name2
name3            | name4
name2            | name1
name5            | name6

I need to select distinct rows from this table so that the result is either

ColumnA(varchar) | ColumnB(varchar)
---------------- | ---------------
name3            | name4
name2            | name1
name5            | name6

or

ColumnA(varchar) | ColumnB(varchar)
---------------- | ---------------
name1            | name2
name3            | name4
name5            | name6

Basically, name1 & name2 should be considered as unique if it is also present as name2 & name1 (irrespective of the order of the columns in which they are present).

I have no idea how to filter rows based on the strings being equal across different columns.

Can someone help me with this?

Gordon Linoff · Accepted Answer · 2016-09-16 21:03:44Z

You can remove the data with logic like this:

delete from t
    where t.columnB > t.columnA and
          exists (select 1
                  from t t2
                  where t2.columnA = t.columnB and t2.columnB = t.columnA
                 );
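
Before running the delete, it can help to preview which rows it would remove. The following is only a sketch: `@t` is a hypothetical table variable standing in for `t`, loaded with the sample rows from the question, and the `WHERE` clause is the same predicate as the delete.

```sql
-- Hypothetical stand-in for table t, filled with the question's sample rows.
DECLARE @t TABLE (columnA varchar(10), columnB varchar(10));
INSERT INTO @t (columnA, columnB) VALUES
    ('name1', 'name2'),
    ('name3', 'name4'),
    ('name2', 'name1'),
    ('name5', 'name6');

-- Same predicate as the DELETE, run as a SELECT to list the rows it would remove.
SELECT t.columnA, t.columnB
FROM @t AS t
WHERE t.columnB > t.columnA
  AND EXISTS (SELECT 1
              FROM @t AS t2
              WHERE t2.columnA = t.columnB AND t2.columnB = t.columnA
             );
```

With the sample rows, only (name1, name2) qualifies, because it is the only row that has columnB > columnA and also has a reversed counterpart.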

If you don't want to actually delete the records, but simply want to return a result set without duplicates, you can use a similar query:

select t.columnA, t.columnB
from t
where t.columnA <= t.columnB  -- <= so rows with the same value in both columns are kept
union all
select t.columnA, t.columnB
from t
where t.columnA > t.columnB and
      not exists (select 1
                  from t t2
                  where t2.columnA = t.columnB and t2.columnB = t.columnA
                 );

2 Comments

Matt Over a year ago

One nuance if the goal is to remove all duplicates: if the data also contains exact duplicates, so that both (name1, name2) and (name2, name1) appear twice in the dataset, these statements won't remove one set of those duplicates.

Gordon Linoff Over a year ago

@Matt . . . It seems pretty clear that the OP's intention is to remove "duplicates" where that is defined as the value in the two columns being in reversed order: "Basically, name1 & name2 should be consider as unique if it is present as name2 & name1 (irrespective of order of column in which they are present)."

Mike · Accepted Answer · 2016-09-16 22:15:32Z

with TabX as(
 select 'name1' as ColumnA, 'name2' as ColumnB
 union all
 select 'name3' as ColumnA, 'name4' as ColumnB
 union all
 select 'name2' as ColumnA, 'name1' as ColumnB
 union all
 select 'name5' as ColumnA, 'name6' as ColumnB
)

select min(ColumnA) as ColumnA,max(ColumnB) as ColumnB
  from tabX
 group by case when ColumnA > ColumnB then ColumnA+ColumnB else ColumnB+ColumnA end
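
One caveat, offered as a side note rather than as part of the query above: grouping on the concatenated string can merge pairs that are actually different. For example, ('a', 'ba') and ('aa', 'b') both concatenate to 'baa'. A possible variant, sketched here with two extra made-up rows added to show the difference, groups on the smaller and the larger value as two separate expressions:

```sql
with TabX as (
    -- Two hypothetical rows whose concatenations collide ('ba' + 'a' = 'b' + 'aa')
    select 'a' as ColumnA, 'ba' as ColumnB
    union all
    select 'aa' as ColumnA, 'b' as ColumnB
    union all
    -- A reversed pair from the question
    select 'name1' as ColumnA, 'name2' as ColumnB
    union all
    select 'name2' as ColumnA, 'name1' as ColumnB
)
select min(ColumnA) as ColumnA, max(ColumnB) as ColumnB
  from TabX
 -- Group on (smaller value, larger value) instead of one concatenated string
 group by case when ColumnA < ColumnB then ColumnA else ColumnB end,
          case when ColumnA < ColumnB then ColumnB else ColumnA end;
```

This should return three rows: the two made-up pairs stay separate, and the name1/name2 pair collapses to a single row.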

1 Comment

Matt Over a year ago

Great Answer Mike!

Matt · Accepted Answer · 2016-09-16 23:51:07Z

;WITH cte AS (
    SELECT *
       ,ROW_NUMBER() OVER (PARTITION BY
          CASE WHEN ColumnA < ColumnB THEN ColumnA + ColumnB ELSE ColumnB + ColumnA END
          ORDER BY (SELECT 0)) as RowNumber
    FROM
       @Table
)

DELETE FROM cte
WHERE
    RowNumber > 1

If you want to select rather than delete, change the final statement to:

SELECT * FROM cte WHERE RowNumber = 1

Or you can also use a method similar to that of @mike and just do straight case expressions with DISTINCT to get the unique combinations:

SELECT DISTINCT 
    CASE WHEN ColumnA < ColumnB THEN ColumnA ELSE ColumnB END as ColumnA
    ,CASE WHEN ColumnA < ColumnB THEN ColumnB ELSE ColumnA END as ColumnB
FROM
    @Table

Here is some test data:

DECLARE @Table AS TABLE (ColumnA VARCHAR(10),ColumnB VARCHAR(10))
INSERT INTO @Table VALUES
('name1','name2')
,('name3','name4')
,('name2','name1')
,('name2','name1')
,('name5','name6')
,('name1','name2')
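
Since a table variable has to be declared before it is used in the same batch, here is a sketch of how the pieces above could be assembled into one script. It only reorders the statements already shown and adds a final check; nothing else is new.

```sql
-- Test data first, so @Table exists before the CTE references it.
DECLARE @Table AS TABLE (ColumnA VARCHAR(10), ColumnB VARCHAR(10));
INSERT INTO @Table VALUES
 ('name1','name2')
,('name3','name4')
,('name2','name1')
,('name2','name1')
,('name5','name6')
,('name1','name2');

-- Number the rows within each unordered pair, then delete all but the first.
WITH cte AS (
    SELECT *
       ,ROW_NUMBER() OVER (PARTITION BY
          CASE WHEN ColumnA < ColumnB THEN ColumnA + ColumnB ELSE ColumnB + ColumnA END
          ORDER BY (SELECT 0)) AS RowNumber
    FROM
       @Table
)
DELETE FROM cte
WHERE
    RowNumber > 1;

-- Check what is left in the table variable.
SELECT ColumnA, ColumnB FROM @Table;
```

Three rows should remain, one per unordered pair. Because of ORDER BY (SELECT 0), which orientation of the name1/name2 pair survives is not guaranteed.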

SlimsGhost · Accepted Answer · 2016-09-16 23:59:27Z

Here's a simple way to get a totally de-duped set of rows (per your criteria for dupes):

select t.columnA, t.columnB
from (
    select t.columnA, t.columnB, 
    row_number() over (
        partition by 
            case when t.columnA >= t.columnB then t.columnA + t.columnB 
            else t.columnB + t.columnA end 
        order by t.columnA) as rseq 
        /* order of "dupes" decided above, only first one gets rseq = 1 */
    from t
) t
where t.rseq = 1
