I've got a table that has duplicate data that needs to be cleaned up. Consider the following example:

CREATE TABLE #StackOverFlow
(
    [ctrc_num] int, 
    [Ctrc_name] varchar(6),
    [docu] bit, 
    [adj] bit, 
    new bit, 
    [some_date] datetime
);
    
INSERT INTO #StackOverFlow
    ([ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date])
VALUES
    (12345, 'John R', null, null, 1, '2023-12-11 09:05:13.003'),
    (12345, 'John R', 1, null, 0, '2023-12-11 09:05:12.987'),
    (12345, 'John R', null, null, 1, '2023-12-11 09:05:12.947'),
    (56789, 'Sam S', null, null, 1, '2023-12-11 09:05:13.003'),
    (56789, 'Sam S', null, null, 1, '2023-12-11 09:05:12.987'),
    (56789, 'Sam S', 1, null, 0, '2023-12-11 09:05:12.947'),
    (78945, 'Pat P', null, null, 1, '2023-12-11 09:05:13.003'),
    (78945, 'Pat P', null, null, 1, '2023-12-11 09:05:12.987'),
    (78945, 'Pat P', null, null, 1, '2023-12-11 09:05:12.947');

This gives me:

[ctrc_num]  [Ctrc_name] [docu]  [adj]   [new]   [some_date]
-----------------------------------------------------------------------
12345        John R     NULL    NULL    1       2023-12-11 09:05:13.003
12345        John R     1       NULL    0       2023-12-11 09:05:12.987
12345        John R     NULL    NULL    1       2023-12-11 09:05:12.947
56789        Sam S      NULL    NULL    1       2023-12-11 09:05:13.003
56789        Sam S      NULL    NULL    1       2023-12-11 09:05:12.987
56789        Sam S      1       NULL    0       2023-12-11 09:05:12.947
78945        Pat P      NULL    NULL    1       2023-12-11 09:05:13.003
78945        Pat P      NULL    NULL    1       2023-12-11 09:05:12.987
78945        Pat P      NULL    NULL    1       2023-12-11 09:05:12.947

What I need to do is delete the duplicates from the table. If any record for a ctrc_num has new = 0, delete that ctrc_num's records where new is 1. If all of its records have new = 1, keep the newest record and delete the older ones.

The result should look like this:

[ctrc_num]  [Ctrc_name] [docu]  [adj]   [new]   [some_date]
-----------------------------------------------------------------------
12345        John R     1       NULL    0       2023-12-11 09:05:12.987
56789        Sam S      1       NULL    0       2023-12-11 09:05:12.947
78945        Pat P      NULL    NULL    1       2023-12-11 09:05:13.003

I've tried ROW_NUMBER:

;WITH RankedByDate AS
(
    SELECT 
        ctrc_num, Ctrc_name,
        docu, adj, new, some_date,
        ROW_NUMBER() OVER (PARTITION BY Ctrc_num, Ctrc_name, [docu],[adj], [new] 
                           ORDER BY some_date DESC) AS rNum
    FROM 
        #StackOverFlow
)
SELECT * 
FROM RankedByDate

This separates out the rows with new = 0, but the rows with new = 1 still end up numbered within their own partition, so I can't tell which of them to delete.

Grouping shows me which records are duplicated, but gives me no way to delete the ones that need to go:

SELECT [ctrc_num]
    ,[Ctrc_name]
    ,[docu]
    ,[adj]
    ,[new]
FROM 
    #StackOverFlow
GROUP BY 
    [ctrc_num]
    ,[Ctrc_name]
    ,[docu]
    ,[adj]
    ,[new]
HAVING 
    COUNT(*) > 1
  • What constitutes a duplicate? Same [ctrc_num] and [Ctrc_name]? Commented Dec 15, 2023 at 16:09
  • Post the query you've tried, even if not working. Commented Dec 15, 2023 at 16:10
  • There are no duplicate rows since no two rows are equal. Therefore you must specify what you mean by duplicate. Also, which value of the rows not building the duplicate do you want to keep? Commented Dec 15, 2023 at 16:18
  • Unless there can be more than one new = 0, your logic can be summarized as remove all rows partitioned by ctrc_num order by new, some_date desc where row_number > 1. It shouldn't be very hard to come up with sql corresponding to the above. Commented Dec 15, 2023 at 16:21
  • Duplicates are rows with the same [ctrc_num] and [Ctrc_name]. Commented Dec 15, 2023 at 16:21

2 Answers

Break the problem down into its parts.

  1. "If new is 0, delete the records where new is 1"

    delete from #StackOverFlow
    where [new] = 1
    and [ctrc_num] in (select [ctrc_num]
                       from #StackOverFlow
                       where [new] = 0);
    
  2. "If all records have new = 1 keep the newest record and delete the older ones". Use a CTE to add a row number, partitioned by [ctrc_num] and ordered by date (newest first), so that the first row in each group is the one you want to keep. If a group has only one row, that is the one you want to keep anyway. Then delete everything else.

    ;with cte as
    (
        select 
             [ctrc_num]  
             -- number the rows of each [ctrc_num], newest first
             ,ROW_NUMBER() OVER (PARTITION BY [ctrc_num] ORDER BY [some_date] DESC) as rw
        from #StackOverFlow
    )
    -- rw = 1 is the newest row in each group; delete the rest
    DELETE FROM cte where rw <> 1;
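Note that step 2 only gives the right answer after step 1 has run: in the sample data the newest row for 12345 has new = 1, so running step 2 on its own would discard the new = 0 row. As a rough sketch (the transaction wrapper and the final SELECT are additions for illustration, not part of the two steps above), both statements can be run as one unit so the result can be inspected before it becomes permanent:

```sql
BEGIN TRANSACTION;

-- Step 1: where a ctrc_num has a new = 0 row, remove its new = 1 rows
DELETE FROM #StackOverFlow
WHERE [new] = 1
  AND [ctrc_num] IN (SELECT [ctrc_num]
                     FROM #StackOverFlow
                     WHERE [new] = 0);

-- Step 2: of what remains, keep only the newest row per ctrc_num
WITH cte AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [ctrc_num]
                              ORDER BY [some_date] DESC) AS rw
    FROM #StackOverFlow
)
DELETE FROM cte WHERE rw <> 1;

-- Inspect the outcome; there should be one row per ctrc_num
SELECT * FROM #StackOverFlow ORDER BY [ctrc_num];

-- Use ROLLBACK TRANSACTION instead if the result is not what you expected
COMMIT TRANSACTION;
```

If a ctrc_num can have more than one row with new = 0, step 2 keeps only the newest of those as well.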
    

6 Comments

This is exactly what I was looking for. I was hoping I would be able to eliminate the duplicate without having to break it into more than one part, but this works.
You can write this as a subquery too; there's no need for a CTE.
@TN - Why? In step 1 I deleted any records where new = 1 if there was a subsequent new = 0. So either there is only a single record per [ctrc_num] and new = 1 OR there is/are 1+ records for a [ctrc_num] where new = 0. Sorting by new only becomes relevant if trying to do both steps at once.
@kool_kris - as you will see from siggemannen's solution, it is possible to do what you want in a single query. But when you are trying to figure out how to do something it is good practice to break it down first. See also "SQL Antipatterns" by Bill Karwin - Chapter 18 "Spaghetti Query" - "Solve a Complex Problem in One Step". You can always merge the "bits" together afterwards - once you have something working. Personally, I'd rather have three simple queries I can follow than one complex one that has me puzzled :-)
@TN - Phew. I did scratch my head for a while though - at least you made me think :-)

It is possible to do what you want in a single query.

;with cte as(
    select [ctrc_num], [Ctrc_name], [docu],[adj], [new], [some_date]
    ,ROW_NUMBER() over(partition by [ctrc_num] -- group by [ctrc_num]
        order by [new], --0 then 1
        [some_date] desc --newest first
        ) rn
    from #StackOverFlow
)
delete cte
where rn>1
;

select * from #StackOverFlow
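Before running the delete, it may help to preview what it would remove. The following is only a sketch: it reuses the same ROW_NUMBER ordering as the query above in a derived table, and the [action] column and the ranked alias are names made up for this illustration.

```sql
-- Run this BEFORE the delete: it labels each row instead of removing it
SELECT [ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date],
       CASE WHEN rn = 1 THEN 'keep' ELSE 'delete' END AS [action]
FROM (
    SELECT [ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date],
           ROW_NUMBER() OVER (PARTITION BY [ctrc_num]
                              ORDER BY [new],             -- 0 sorts before 1
                                       [some_date] DESC   -- newest first
                             ) AS rn
    FROM #StackOverFlow
) AS ranked
ORDER BY [ctrc_num], rn;
```

Because [new] is sorted first, a new = 0 row always takes rn = 1 when one exists. If a ctrc_num has several new = 0 rows, only the newest of them is kept.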
