Friday, January 16, 2015

[T-SQL] A more efficient DELETE using TOP, CTEs, OUTPUT, and somebody else's clever idea.

I would totally give credit to whomever did this, but I can't figure out who it is (otherwise I would have gone to their site, stolen their code, run it in my environment, and then gone off on my merry way... and my spouse wouldn't yell at me for coming home late because I blogged it)

ALSO: There's almost assuredly a better way to do this, but it's already 5:30 so I rushed it.


When you delete from a large table, you delete in batches.  No blowing out logs, less contention, etc, etc.

DELETE TOP (5000) FROM Table_X

But , it gets slower as it progresses, because each time it has to scan until it finds them.   And so your IO goes up each time it runs.
One potential way around that is to use Simon Sabin's trick...
(http://sqlblogcasts.com/blogs/simons/archive/2009/05/22/DELETE-TOP-x-rows-avoiding-a-table-scan.aspx)

delete t1
from (select top (10000) *
        from t1 order by a) t1

Which uses a derived table to force it to use an indexed order.  But the problem I ran into is that it still gets slower as it progresses.  Both because there are other rows in the table that I'm not deleting, and because it has a LOT of rows to delete. 
 
Someone (please tell me who and I'll happily change it!) had a really clever idea - figure out each time what the most recent record deleted was, and use that to modify the WHERE clause to only get more recent.  

E.G. your table is indexed (clustered, whatever) on the inserted_datetime.  So you DELETE TOP 5000 ... ORDER BY (which is why I have the CTE; could probably do the derived trick above), and the most recent one deleted was 1/1/2014 12:23:34.567.  You know that because you saved the deleted values to a table (via the OUTPUT), then grabbed the most recent to a variable, which is in the WHERE clause so the next delete starts looking at that time (1/1/2014 12:23:34.567)

Does it work? Like a champ.  My "stupid" delete was taking 30 seconds when I stopped it an hour in.  
This one is still doing batches every 2-5 seconds.  Of course, for this post I had to stop it... and so when it started back up it needed 8 minutes to figure out where it was (and then each subsequent batch took 2-5 seconds. 

Enjoy, and thanks again to whomever had the great idea!!

--mdb 2016/08/18 v1.10. Added a few things to make it faster and not delete the entire table. And some docs!
--No idea where the idea to walk the table comes from (Bertrand? Strate?); the premise was that if you knew
-- what you were currently deleting, and it was ordered, you could tell the optimizer to skip the part you've already deleted.
-- The overall delete for the top 5k from a CTE is from Simon Sabin & the SQLCat team.
--https://web.archive.org/web/20090630064721/http://blogs.msdn.com/sqlcat/archive/2009/05/21/fast-ordered-delete.aspx?
--http://sqlblogcasts.com/blogs/simons/archive/2009/05/22/DELETE-TOP-x-rows-avoiding-a-table-scan.aspx
IF OBJECT_ID('tempdb..#mydeleted') IS NOT NULL
DROP TABLE #mydeleted
CREATE TABLE #mydeleted
(insert_datetime DATETIME)
CREATE CLUSTERED INDEX cidx_mydeleted_insert_datetime ON #mydeleted (insert_datetime)
DECLARE @msg NVARCHAR(50),
@insert_datetime DATETIME,
@delete_before DATETIME
SELECT @delete_before = DATEADD(dd, DATEDIFF(dd,0,getdate()-14), 0)
SELECT @insert_datetime = ISNULL(MAX(INSERT_datetime), '1/1/2000') --ISNULL for the first run
FROM #mydeleted
SELECT 'Starting delete of data from ' +CONVERT(VARCHAR(20),@insert_datetime,120) + ' up to ' + CONVERT(VARCHAR(20),@delete_before,120) --DO NOT CHANGE TO PRINT
--sets @@ROWCOUNT
WHILE (
@@ROWCOUNT <> 0
AND GETDATE() < '20160818 23:50:00' --so it stops before backups run, preventing a TLOG blowout
AND @insert_datetime <= @delete_before
)
BEGIN
--get that most recent deleted!
SELECT @insert_datetime = ISNULL(MAX(INSERT_datetime), '1/1/2000') --ISNULL for the first run
FROM #mydeleted
TRUNCATE TABLE #mydeleted
--informative so you know where you are in the process.
PRINT @insert_datetime
--this SHOULD force SSMS to tell you almost instantly when it's doing another batch.
--http://thebakingdba.blogspot.com/2011/11/use-raiserror-to-return-results-in-ssms.html
SET @msg = ( SELECT CONVERT( VARCHAR (19), GETDATE (), 120) )
RAISERROR (@msg, 0, 1) WITH NOWAIT
--1 second between batches to prevent too much contention
WAITFOR DELAY '00:00:01';
WITH cte
AS (
SELECT TOP (5000) *
--my EPR saves this daily so I can do trending
FROM dm_db_index_usage_stats_hist
--databases that I can't change and that are REALLY chatty
WHERE database_name IN ('TableA', 'TableB')
AND Insert_Datetime >= @insert_datetime
AND insert_datetime < @delete_before
ORDER BY insert_datetime
)
DELETE FROM cte
OUTPUT deleted.Insert_Datetime
INTO #mydeleted
END

No comments: