The 'smart' delete

Deleting rows from a SQL Server database table is not as simple as it sounds.

The ‘straightforward’ delete: go delete, no matter what!

Yes, there is the DELETE statement:

DELETE FROM table_1 WHERE ColumnName = 'some_value';

What is the problem with that?

There are several problems:

  • transactions and logging
  • disk space
  • execution time
  • locking and concurrency
  • IO throughput
  • parallelism, if present

So, it depends on how many rows are affected by the DELETE statement above, but generally it is almost never a good idea to run a ‘straightforward’ delete. SQL Server works transactionally, which means that the rows are deleted inside a transaction which may have to be rolled back. The rollback information is at least as big as the data being deleted: every row to be deleted has to be logged in the transaction log, and if there is not enough disk space for the log, then we have a real problem. The log is full, no other transactions can run… you get the idea.

The ‘straightforward’ delete is a time-consuming and dangerous operation. Of course, as with any query, the SQL Server engine has to generate an execution plan, figure out which indexes to use, check the constraints and so on. But what is important is that once the delete transaction has started, if we decide to terminate it before it finishes (by pressing the stop button, killing the connection, shutting down the server :) etc.), the rollback will take at least as much time as has already elapsed since the start of the DELETE.
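
If you have already killed a long-running delete and are wondering how much longer the rollback will take, SQL Server can report its progress. A small illustration (53 is a hypothetical session id; look up the real one with sp_who2 or sys.dm_exec_requests):

-- Report rollback progress (estimated % complete and time remaining)
-- for a session that has been killed and is rolling back
KILL 53 WITH STATUSONLY;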

Also, a ‘straightforward’ delete locks the rows being deleted and may cause lock escalations, which can have a serious impact on the system and on every other connection trying to access data in the same table.
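
As a side note, if lock escalation on a particular table is a known problem, SQL Server 2008 and later let you control it per table. A hedged example, reusing the table name from above (the trade-off is that holding many individual row locks costs more memory):

-- Prevent SQL Server from escalating row/page locks on this table
-- to a full table lock (AUTO and TABLE are the other options)
ALTER TABLE table_1 SET (LOCK_ESCALATION = DISABLE);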

And finally, there is a serious impact on IO and CPU resources, especially if we are trying to delete a large number of rows (let’s say ‘large’ means above 100,000; this is not a fixed value, merely a guideline).

So we get IO pressure because the rows have to be read and logged, and we get CPU pressure because the engine has to process every row. Remember that deleting is a logged operation regardless of the database recovery model. In other words, it does not matter whether your database is in SIMPLE, BULK-LOGGED or FULL recovery mode: when you delete rows from a table, the rollback information is always written in full to the log file, and the log will grow until the transaction is committed or rolled back.
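
Before running a big delete, it is worth checking how much headroom the transaction log actually has. One simple way to watch the log while the delete runs:

-- Show log size and the percentage of log space used,
-- for every database on the instance
DBCC SQLPERF(LOGSPACE);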

The ‘batch delete’: a spoonful at a time, please!

Another way to do the DELETE is to run a batch delete:

WHILE (1 = 1)
BEGIN
    -- Delete the next batch of matching rows
    DELETE TOP (10000) FROM table_1
    WHERE ColumnName = 'some_value';

    -- When the last batch affected no rows, we are done
    IF @@ROWCOUNT = 0
        BREAK;
END

What we have is an ‘endless’ loop which deletes 10,000 rows at a time and then checks whether there are more rows to delete; if there are not, the loop terminates.

This method is better and more efficient for obvious reasons: the log does not grow too much and the log space is reused; the delete is easy to terminate quickly, since only the current batch’s transaction has to be reverted; and each small batch holds far fewer locks, so lock escalations do not have a huge impact on the entire system.
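
A common refinement of the loop above, shown here as a minimal sketch rather than a fixed recipe, is to wrap each batch in its own explicit transaction and pause briefly between batches, so that concurrent workloads and log backups get a chance to run (the 10,000 batch size and the one-second delay are illustrative values to tune for your system):

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (10000) FROM table_1
    WHERE ColumnName = 'some_value';

    -- Capture the count before COMMIT resets @@ROWCOUNT
    SET @rows = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Breathing room for other sessions and for log backups
    IF @rows > 0
        WAITFOR DELAY '00:00:01';
END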

The ‘swap me’ method: nifty hack or last resort?

Yet another way to delete data is to ‘swap’ it. It does not always work for live production systems, especially if there is no way to get a maintenance window, but it is good to know about this method. (Sometimes it may turn out to be the only good one.) For example, if we have a table with 1 billion rows and we need to delete one third of them… what do we do? A ‘straightforward’ delete is out of the question, and a batch delete will probably take ages.

In this case, if we get a maintenance window, we could perform the following steps (a rough T-SQL sketch follows the list):

  1. make sure you BACKUP your database before this
  2. make sure no one alters the data during the following steps (for example by putting the database in single-user mode, or by other means)
  3. script the prod table and create an identical table with a slightly different name, let’s say prod_recordsToKeep
  4. script the prod table and create an identical table with a slightly different name, let’s say prod_recordsToDelete
  5. run the import-export wizard or an SSIS package with bulk load to copy the data from the prod table into the newly created tables
    1. prod_recordsToKeep table will have all records which should remain in the prod table
    2. prod_recordsToDelete will have all records which should be deleted
  6. after making sure that the data was imported into both tables, truncate the prod table. Yes, you got it right: TRUNCATE. This is very fast because it is minimally logged (only the page deallocations are recorded, not the individual rows).
  7. use the import-export wizard to import all data from prod_recordsToKeep table to the prod table
  8. truncate the prod_recordsToKeep table, keep the prod_recordsToDelete table around for some time as a reference, and then drop it
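
To make the steps concrete, here is a minimal T-SQL sketch of the same idea. It cuts corners: SELECT … INTO stands in for scripting the tables (so indexes, constraints and identity settings are not carried over), the prod / ColumnName names are this article’s hypothetical examples, and minimal logging of the final insert only applies under the right recovery model and table structure:

-- 1) Copy out the rows to keep and the rows to delete
SELECT * INTO prod_recordsToKeep
FROM prod
WHERE ColumnName <> 'some_value';   -- rows that must survive

SELECT * INTO prod_recordsToDelete
FROM prod
WHERE ColumnName = 'some_value';    -- rows that are going away

-- 2) Sanity check: the two copies must add up to the original
--    (compare COUNT(*) of all three tables before going further)

-- 3) Empty the original table; very fast, minimally logged
TRUNCATE TABLE prod;

-- 4) Put the surviving rows back
INSERT INTO prod WITH (TABLOCK)     -- TABLOCK allows minimal logging
SELECT * FROM prod_recordsToKeep;   -- under SIMPLE/BULK-LOGGED recovery

-- 5) Clean up; keep prod_recordsToDelete as a reference for a while
TRUNCATE TABLE prod_recordsToKeep;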

You get the idea. The last method is not perfect, but sometimes it’s a ‘good enough’ option to clean up data.
