NUMERIC vs INT Clustered Index

When using the SQL Server Migration Assistant for Oracle (OracleToSQL) to convert the tables in an Oracle database to tables in a SQL Server database; it automatically converts the Oracle NUMBER datatype to the SQL Server NUMERIC datatype.

The Oracle tables used a column with the datatype and precision of NUMBER(10, 0) as the PRIMARY KEY. This column gets its default value from a sequence (currently Oracle does not have an auto increment feature like SQL Server and MySQL). The columns with the NUMBER(10, 0) datatype are converted to NUMERIC(10, 0) in SQL Server.

The dilemma is whether we should leave the primary key columns as SSMA converted them or to manually change the columns datatypes from NUMERIC(10, 0) to INT.

Two things will have to be evaluated:

  1. The performance impact of using NUMERIC(10, 0) versus INT
  2. The impact of using SEQUENCES to generate new values for the Primary Key versus using SQL Server’s auto-increment feature of IDENTITY(1, 1). This will be done in another blog post.

 
To test, I will be using the StackOverflow2010 database which is 10GB in size. This is a test database that ships with only Primary Keys. No Foreign Key constraints or Non-clustered indexes are defined.

To test performance, we will be measuring:

  1. Logical reads
  2. CPU Time
  3. Cost of the Query (Estimated Subtree Cost)

 
Note: CPU Time and Cost of Query of a query can be different on the same query depending on the current workload of the server. Therefore, these tests must be run in the same session rather than at different times of the day. Logical Reads, however, will remain the same.
We will first do benchmark tests with the existing tables using INT as the Primary Key.

Then the following tests will begin:

  1. Test a query with one table that has a Primary Key of NUMERIC(10, 0)
  2. Test a query with two joined tables that both have Primary Keys of NUMERIC(10, 0)

 
Each of these results will be compared to the benchmark test:

USE [StackOverflow2010]
GO
set statistics io on;
set statistics time on;
/* START BENCHMARK TEST */
/* logical reads 6547
   CPU time = 125 ms (This will be different on your server)
   Estimated Subtree Cost 6.04
*/
select * from dbo.Badges;

/* logical reads 7405
   CPU time = 47 ms (This will be different on your server)
   Estimated Subtree Cost 5.79
*/
select * from dbo.Users;

/* Badges table logical reads: 6547
   Users table logical reads: 7405
   CPU time = 250 ms (This will be different on your server)
   Estimated Subtree Cost for Hash Match: 14.44
*/
select U.id as user_id, U.DisplayName, B.Name AS Badge
from dbo.Users U
inner join dbo.Badges B
on U.Id = B.UserId;

Now, let’s create a copy of the dbo.Badges table and populate the table with the same data:

	
/* create dbo.NewBadges table */
USE [StackOverflow2010]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[NewBadges](
	[Id] numeric(10,0) NOT NULL,
	[Name] [nvarchar](40) NOT NULL,
	[UserId] [int] NOT NULL,
	[Date] [datetime] NOT NULL,
 CONSTRAINT [PK_NewBadges__Id] PRIMARY KEY CLUSTERED 
(
	[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

/* insert data from dbo.Badges to dbo.NewBadges table */
insert into dbo.NewBadges 
select Id, Name, UserId, Date from dbo.Badges;


We can run the following query and compare to the results from the dbo.Badges table:

use StackOverflow2010;
set statistics io on;
set statistics time on;

/* logical reads 7244
   CPU time = 140 ms
   Estimated Subtree Cost 6.55 */
select * from dbo.NewBadges;


In all 3 measurements, the INT datatype on the dbo.NewBadges table performed better:

dbo.Badges dbo.NewBadges
Logical Reads 6547 7244
CPU Time 125 ms 140 ms
Estimated Subtree Cost 6.04 6.55

Let’s see how the query performs in a JOIN with the existing Users table:

select U.id as user_id, U.DisplayName, B.Name AS Badge
from dbo.Users U
inner join dbo.NewBadges B
on U.Id = B.UserId;
dbo.Badges dbo.NewBadges
Logical Reads 6547 7244
CPU Time (Total time) 250 ms 407 ms
Estimated Subtree Cost (Hash Match) 14.44 26.79

Now, lets make a copy of the dbo.Users table and populate it with data:

CREATE TABLE [dbo].[NewUsers](
	[Id] numeric(10,0) NOT NULL,
	[AboutMe] [nvarchar](max) NULL,
	[Age] [int] NULL,
	[CreationDate] [datetime] NOT NULL,
	[DisplayName] [nvarchar](40) NOT NULL,
	[DownVotes] [int] NOT NULL,
	[EmailHash] [nvarchar](40) NULL,
	[LastAccessDate] [datetime] NOT NULL,
	[Location] [nvarchar](100) NULL,
	[Reputation] [int] NOT NULL,
	[UpVotes] [int] NOT NULL,
	[Views] [int] NOT NULL,
	[WebsiteUrl] [nvarchar](200) NULL,
	[AccountId] [int] NULL,
 CONSTRAINT [PK_NewUsers_Id] PRIMARY KEY CLUSTERED 
(
	[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

/* populate the dbo.NewUsers table with data from dbo.Users table */
INSERT INTO dbo.NewUsers
SELECT Id, AboutMe, Age, CreationDate, DisplayName, DownVotes, EmailHash, LastAccessDate, Location,	Reputation, UpVotes, Views, WebsiteUrl, AccountId
FROM dbo.Users;

Run a query which joins the dbo.NewUsers table and the dbo.NewBadges table and compare to the benchmark query:

select U.id as user_id, U.DisplayName, B.Name AS Badge
from dbo.NewUsers U
inner join dbo.NewBadges B
on U.Id = B.UserId;
dbo.Badges, dbo.Users dbo.NewBadges, dbo.NewUsers
Logical Reads 6547, 7405 7244, 7591
CPU Time (Total time) 250 ms 859 ms
Estimated Subtree Cost (Hash Match) 14.44 34.7

The query with the new tables using NUMERIC performed significantly worse in the Estimated Subtree Cost and the CPU Time when joined than the tables with the INT datatype as the clustered index.
The reason for this is the difference in size of the columns. For example, the size of the Id column in the dbo.Badges table is 4303 kb.

-- 4304 kb
select concat(sum(datalength(Id)) / 1024, ' kb') as column_size
from Badges;

The Id column of the dbo.NewBadges table has the exact same data as the dbo.Badges table yet the column is larger simply because it uses a larger datatype:

-- 5380 kb
select concat(sum(datalength(Id)) / 1024, ' kb') as column_size
from NewBadges;

Conclusion: Use INT as the clustered index if you can wherever possible so that it saves on server resources. However, if you can’t avoid it, NUMERIC works just fine as a clustered index.

Leave a Reply

Your email address will not be published. Required fields are marked *