Environment:

  • OS: Windows Server 2012 DataCenter
  • DBMS: SQL Server 2012
  • Hardware (VPS): Xeon E5530 4 cores + 4GB RAM

Question:

I have a large table with 140 million rows. Some rows are considered duplicates, and I want to remove them. For example:

id   name   value   timestamp
---------------------------------------
001  dummy1 10      2015-7-27 10:00:00
002  dummy1 10      2015-7-27 10:00:00    <-- duplicate
003  dummy1 20      2015-7-27 10:00:00

The second row is deemed a duplicate because it has the same name, value and timestamp as the first row, even though its id differs.

Note: the first two rows are duplicates not because all of their columns are identical, but because of a self-defined rule.

I tried to remove the duplicates with a window function:

select 
    id, name, value, timestamp
from
   (select 
        id, name, value, timestamp,
        DATEDIFF(SECOND, lag(timestamp, 1) over (partition by name order by timestamp),
        timestamp) [TimeDiff]
    from table) tab

But after an hour of execution, the lock resources were used up and an error was raised:

Msg 1204, Level 19, State 4, Line 2
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

How could I remove such duplicate rows in an efficient way?

  • Do you want to delete the duplicates or select all non-duplicates? Commented Jul 27, 2015 at 16:55
  • @ForguesR Sorry about the ambiguity. I want to select all non-duplicates for other queries and leave the original table intact. Commented Jul 27, 2015 at 17:08
  • Do you have a covering index with leading columns name, timestamp? What isolation level is the query running at? Commented Jul 27, 2015 at 17:25
  • This is going to be slow unless you have an index on the columns used by your self-defined rules. Also, if this is a run-once query, it might be better to put all the non-duplicates into a new temporary table instead of working with such a huge result set. Commented Jul 27, 2015 at 17:56
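
The index and temporary-table suggestions in the comments above could be sketched roughly as follows. This assumes the table is called dbo.SomeTable (the question does not give its real name); the index name IX_SomeTable_Dedupe and the temp table #NonDup are made up for illustration. It also assumes the key columns are narrow enough to fit within SQL Server's index key size limit, and building the index on 140 million rows will itself take time.

```sql
-- Hypothetical names: dbo.SomeTable, IX_SomeTable_Dedupe, #NonDup.
-- Index on the columns that define a duplicate, so grouping can read
-- rows in key order instead of sorting the whole table.
CREATE NONCLUSTERED INDEX IX_SomeTable_Dedupe
    ON dbo.SomeTable ([name], [value], [timestamp], id);

-- Materialize one row per (name, value, timestamp) group, keeping the
-- lowest id. The original table is left untouched.
SELECT MIN(id) AS id, [name], [value], [timestamp]
INTO #NonDup
FROM dbo.SomeTable
GROUP BY [name], [value], [timestamp];
```

Later queries in the same session can then read from #NonDup instead of repeating the de-duplication.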

3 Answers

What about using a CTE? Something like this:

with DeDupe as
(
    select id
        , [name]
        , [value]
        , [timestamp]
        , ROW_NUMBER() over (partition by [name], [value], [timestamp] order by id) as RowNum
    from SomeTable
)

Delete DeDupe
where RowNum > 1;
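
Since the asker clarified in the comments that the original table should stay intact, a read-only variant of the same idea might look like the sketch below. It reuses the placeholder table name SomeTable from the query above and keeps the row with the lowest id in each group; which row to keep is an assumption, as the question does not say.

```sql
-- Same CTE as above, but selecting the first row of each group
-- instead of deleting the others. SomeTable is a placeholder name.
WITH DeDupe AS
(
    SELECT id
        , [name]
        , [value]
        , [timestamp]
        , ROW_NUMBER() OVER (PARTITION BY [name], [value], [timestamp] ORDER BY id) AS RowNum
    FROM SomeTable
)
SELECT id, [name], [value], [timestamp]
FROM DeDupe
WHERE RowNum = 1;   -- RowNum = 1 is the lowest id per (name, value, timestamp)
```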

If you only need to select the non-duplicate rows from the table, consider this query:

SELECT MIN(id), name, value, timestamp FROM table GROUP BY name, value, timestamp

If you need to delete duplicate rows:

DELETE FROM table  WHERE id NOT IN ( SELECT MIN(id) FROM table GROUP BY name, value, timestamp)

or

DELETE t FROM table t INNER JOIN 
table t2  ON
t.name=t2.name AND 
t.value=t2.value AND 
t.timestamp=t2.timestamp AND 
t2.id<t.id
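
Because the original error was about running out of lock resources, it may be worth running a delete like the ones above in small batches, so that each statement touches a bounded number of rows. The following is only a sketch under assumptions: the table name SomeTable, the temp table #ToDelete, its index name, and the batch size of 4000 are all hypothetical and would need tuning. The first step still scans the whole table once.

```sql
-- Hypothetical names: SomeTable, #ToDelete, IX_ToDelete_id.
-- Step 1: collect the ids of every row that is not the first in its group.
SELECT d.id
INTO #ToDelete
FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY [name], [value], [timestamp]
                              ORDER BY id) AS RowNum
    FROM SomeTable
) AS d
WHERE d.RowNum > 1;

CREATE CLUSTERED INDEX IX_ToDelete_id ON #ToDelete (id);

-- Step 2: delete in batches; each DELETE commits on its own.
DECLARE @BatchSize int = 4000;   -- assumed value, tune for your system

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) t
    FROM SomeTable AS t
    WHERE EXISTS (SELECT 1 FROM #ToDelete AS d WHERE d.id = t.id);

    -- Stop once no matching rows remain.
    IF @@ROWCOUNT = 0 BREAK;
END;
```

Computing the duplicate ids once and then looping over them avoids re-evaluating the window function on every batch.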


Try something like this - determine the lowest ID for each set of values, then delete rows that have an ID other than the lowest one.

Select Name, Value, TimeStamp, min(ID) as LowestID
into #temp1
From MyTable
group by Name, Value, TimeStamp

Delete a
from MyTable a
inner join #temp1 b
on a.Name = b.Name 
  and a.Value = b.Value 
  and a.Timestamp = b.timestamp 
  and a.ID <> b.LowestID
