SQL Server Corruption Checklist

October 25, 2022

SQL Server Corruption Checklist

You may already know how corruption happens and how to look for it. But what do you do when you actually find corruption?

Well, firstly, you should not start with trying to repair the corruption because it may be that the storage system itself is corrupting the data. Any attempts to repair the database may cause further corruption.

Your primary plan, at this point, should be to communicate to your manager or team lead and any other key stakeholders that there is corruption, and the plan is to get off that storage as soon as possible. If you have a secondary, such as a warm standby which restored the transaction logs from the primary through log shipping, you can point your applications to query that database instead.

The Corrution Checklist

It is important to have a checklist of tasks to perform when corruption is found:

Turn off backup and delete jobs.
Turn off the backup jobs because you don’t want to keep on backing up corrupted data which can spread the corruption even more. Turn off the delete jobs because you may need to restore the full backup before the corruption occurred but you don’t know when that it is.This is why we keep backups longer than the CHECKDB intervals.

For example, if you do checkdb once a week, you should have a clean backup that is older than 7 days along with all the transaction logs which would have been created since the full backup.

Alert the key stakeholders within the IT Department, such as your manager and the systems administrator, that there has been corruption and you are about to start your recovery procedures.

If you have a warm standby, you can failover to that server so that end-users are not affected.
Then you can alert the users to let them know that you have failed over to the standby server and some of the most recent data may not be there. If you don’t have a standby server, you still need to communicate to the end-users that the database is unavailable for the period of time agreed upon in the Recovery Time Objective

Query the msdb.dbo.suspect_pages table
This will give you an idea of how many databases got corrupted and when it started occurring. If corruption first happened 2 days ago, then we will have to find all of the full backups that were taken 3 days ago and all of the transaction logs since then.
```
select * from msdb.dbo.suspect_pages;
```

Open up the Job History Monitor and look at the CHECKDB and backup jobs
This assumes that you have a CHECKDB job set up.
In SQL Server Management Studio:
Expand Management => expand Maintenance Plans => right-click the CHECKDB plan for the database => View History

Here, you can find out:
- When was the most recent successful CHECKDB job run against the database without errors?
- Do we have a full backup available on or before the date the CHECKDB was run?
- Do we have all of the transaction log backups since that date?
If you can answer Yes for all of these questions, then you can begin the recovery process. Finding the last good backup may require going through old tape backups.

Restore to another server
Remember, you don’t want to reuse the same server where corruption has occurred. If you don’t have a warm standby, you will have to recover a full backup and then all of the transaction logs since then. This is where automating the restores can save so much time.

Further investigations
We went through the CHECKDB results to find out how many databases were affected and when they started happening. A more thorough investigation should be done to find the root of the problem meaning what caused the corruption.

…If the checklist fails

If at any point the checklist fails or you cannot complete a restore of a clean backup, you have 2 options:

Contact Microsoft
Some companies may already have a support agreement in place with Microsoft. If not, you can open a support ticket.

Repair the corruption
You can use DBCC CHECKDB with the REPAIR_REBUILD or REPAIR_ALLOW_DATA_LOSS which MAY work. However, you would still be fixing the issue on storage which you suspect as corrupted or dealing with a bug in SQL Server itself. The best option is to restore the backup in another SQL Server instance until the primary instance is back to a proper state (meaning new drives and a clean instance).

If you insist on trying to repair the corrupted database, let’s look at an example. We are going to attempt to repair a database called class3_DR.

Firstly, we need to put the database in single user mode which means only one user can connect at a time:
```
ALTER DATABASE class3_DRSET SINGLE_USER;
```
Now, if we include the REPAIR_REBUILD command, which attempts to rebuild corrupted data pages without any data loss, in DBCC CHECKDB:
```
DBCC CHECKDB('class3_DR', REPAIR_REBUILD) WITH NO_INFOMSGS;
```
You can see that it is unsuccessful. It says that “other errors are required to be corrected first”.

Let’s try the other option called REPAIR_ALLOW_DATA_LOSS, which attempts to rebuild corrupt data pages but doesn’t try to preserve the data. So, you can experience data loss. At this point, it is a last resort.
```
DBCC CHECKDB('class3_DR', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
```
The new message says that the error has been repaired.

Put the user back in MULTI_USER mode:
```
ALTER DATABASE class3_DR SET MULTI_USER;
```
Afterwards, when I ran a regular CHECKDB corruption check, it ran successfully and reported no corruption errors:
```
DBCC CHECKDB('class3_DR') WITH NO_INFOMSGS, EXTENDED_LOGICAL_CHECKS;
```
However, when I queried the table for the corrupted row, it was no longer there because we experienced data loss.

Safraz Hosein