Why Do We Need Statistics and to be more specific SQL Server statistics

 

This post is written by Ivan Mukiibi, a young SQL Server professional based in Kampala, Uganda. Ivan is devoted to SQL Server performance tuning, SQL Server internals and also he is passionate about ETL solutions. At this early stage of his DBA carreer he has done an outstanding job researching and interpreting the facts in the area of SQL Server statistics in this guest post. Enjoy!

 

Before a football match, a coach chooses players that are going to face an opposing team; he/she will choose players based on their stats. The coach will get each player’s stats during training sessions or matchup games.

In these sessions, the coach will usually divide the players into small teams of say 5 out of the 21 players (Lookup). The coach would not risk getting each player’s stats minutes before the game (full table scan).

Just like a coach who needs players’ stats before any game, SQL Server needs to have stats in order for it to work efficiently.

Why Do We Need Statistics and to be more specific SQL Server statistics?

The reason is that we can’t build a good plan to get the rows we need without having an idea of how many rows we’re going to get!SQL Server has a method that we can use to get an approximation of how many rows will be returned – that is our “statistics”. That allows us to ‘guess’ with decent accuracy.

SQL Server tries to use statistics to “react intelligently” in its query optimization. Statistics are used to give the optimizer information about how many rows will satisfy a given predicate in a WHERE clause or an ON clause.

Knowing the number of records, density of pages, histogram, or available indexes help the SQL Server optimizer “guess” more accurately how it can best retrieve data. A common misconception is that if you have indexes, SQL Server will use those indexes to retrieve records in your query. That may not be necessarily the case.

If you create, let’s say, an index to a column LastName and <90% of the values are ‘Smith’, SQL Server will most likely opt for a table scan instead of using the index if it knows these stats. By default, SQL Server automatically creates statistics every time an unindexed column is referenced in a query’s WHERE or ON clause. This default applies only when the auto_create_statistics database option is set to ON, which is the default setting in the model database and thus in all new databases you create.

Let’s consider the two queries listed below- Will the two queries use a similar query plan?

A)    

select * from person.Contact where  LastName like 'S%‘

B)     

select * from person.Contact where  LastName like 'SM%'

C)      Query A returns 2694 rows

D)     Query B return 669 rows

at some “tipping point” the plan will change from a key lookup to a scan.

LastName Like S% = Scan

For Query A, we see that SQL has decided to do a table scan

LastName Like Sn% = Lookup

SQL has decided to take a lighter weight approach – a bookmark lookup

Bookmark lookup is the generic term for using an index to find the matching rows, then doing a logical join to the referenced table to retrieve any columns that were referenced in the SELECT but not included in the index.

 

 

 

How Statistics are created – Automatic

The default setting for each database is to have automatic creation of stats enabled, allowing SQL to create a new stat if a query uses a column in a where clause or join that doesn’t have a stat.

 

Example: System created statistic

 

Example: System created statistic

Example: System created statistic

-If the database setting autocreate stats is on, then SQL Server will automatically create statistics for non-indexed columns that are used in queries.


SELECT DISTINCT OCCUPATION FROM EMPLOYEE WHERE EMPLOYEENAME LIKE ‘S%

How Statistics are created – Manually

It’s not common, but you might need to create a statistic manually. Here is an example:


create statistics EmailAddress on person.contact (EmailAddress) WITH FULLSCAN

-Statistics are automatically created for each index key you create.


CREATE INDEXEmailAddress on person.Contact(EmailAddress)

 

 

Statistics created by a user

Statistics created by a user

 

Note-The automatically created statistics have a system-generated name that always starts with the eight characters “_WA_Sys_.” These eight characters are followed by a hexadecimal value indicating the ordinal position of the column in the table (the fourth column, in this case) and a hex value representing the table’s object_id.

 

How can one view statistics in their database

Option 1 – you can go to your Statistics node in your SSMS, right click > Properties, then go to Details.

Option 2 – you can use DBCC SHOW_STATISTICS WITH HISTOGRAM

dbcc show_statistics (‘person.contact’, ‘ndxlastname’) WITH HISTOGRAM

 

How are Statistics Updated?

Unlike indexes, statistics are a batch operation. That decreases the load on the system, but it means that over time the accuracy of the stats can decrease as the distribution of the data changes from what it was at the time we built our statistic.

 

The default settings in SQL Server are to autocreate and autoupdate statistics

When you check the properties of your database, there two (2) options with the Auto Update statistics.
Auto Update Statistics basically means, if there is an incoming query but statistics are stale, SQL Server will update statistics first before it generates an execution plan.
Auto Update Statistics Asynchronously on the other hand means, if there is an incoming query but statistics are stale, SQL Server uses the stale statistics to generate the execution plan, then updates the statistics afterwards.

 

The fix is to periodically update our statistics:

  •        By association when we rebuild indexes
  •        Directly, either manually or via a job
  •        Based on thresholds if auto update enabled

Updating Stats Via Rebuild

  •        This only works if they created the index with the default behavior to create/maintain stats (STATISTICS_NORECOMPUTE = OFF)
  •        This only works for index related stats. Stats created manually or auto created are not changed as part of an index rebuild even if one of the columns is part of an index

Updating Stats Directly

The most definite approach to updating stats is to use UPDATE STATISTICS which allows us to:

  •        Update a single statistic, or all stats on a table
  •        Specify the sampling rate or reuse the previous sample rate
  •        Update index based stats, other stats, or both
  •        Disable automatic statistics update on a stat

 

If you need to update all the stats in a database, look at sp_updatestats or maintenance plans

Updating Stats Directly

Examples:

  •        update statistics Person.contact(ndxemail) with fullscan
  •        update statistics Person.contact(ndxemail) with sample 50 PERCENT
  •        update statistics Person.contact with columns
  •        update statistics Person.contact with index

 

NOTE: To maintain the statistical information in an as up-to-date fashion as possible, SQL Server introduces AutoStat, which, through SQL Server’s monitoring of table modifications, is capable of automatically updating the statistics for a table when a certain change threshold has been reached.

AutoStat will automatically update the statistics for a particular table when a “change threshold” has been reached. The sysindexes.rowmodctr column maintains a running total of all modifications to a table that, over time, can adversely affect the query processor’s decision making process.

This counter is updated each time any of the following events occurs:

    A single row insert is made.

    A single row delete is made.

    An update to an indexed column is made.

In cases where large amount of data in an indexed column has been added, changed, or removed (that is, if the distribution of key values has changed), or the table has been truncated using the TRUNCATE TABLE statement and then repopulated one needs to use UPDATE STATISTICS to ensure that the statistics are up-to-date.This is due to the fact that these events do not update rowmodctr.

Consider thefollowing:

 

 

Create Table SampleTable (a int, b char(1), c int);
 
GO
 
Declare @i int
 
set @i = 1
 
while @i &lt; 10000
 
begin
 
insert SampleTable values(@i, char(@i%65 + 65), @i - 1)
 
set @i = @i + 1
 
end
 
GO
 
create statistics statsA on SampleTable (a);
 
create statistics statsB on SampleTable (b);
 
GO
 
dbcc show_statistics ('SampleTable', ' statsB '); -- 41 rows
 
GO
 
Truncate Table SampleTable;
 
GO
 
dbcc show_statistics ('SampleTable', 'statsB'); -- still 41 rows. truncate leaves stats intact
 
GO
 
insert into SampleTable values(1, 'a', 0);
 
GO
 
dbcc show_statistics ('SampleTable', 'statsB'); -- still 41 rows, insert doesn't affect them either
 
GO
 
SELECT * FROM SampleTable;
 
GO
 
dbcc show_statistics ('SampleTable', 'statsB'); -- still 41 rows (select * doesn't care about stats)
 
GO
 
-- now execute a query that will utilize statistics
 
select * from SampleTable where b = 'J';
 
GO
 
dbcc show_statistics ('SampleTable', 'statsB'); -- 1 row (header row only)
 
GO

 

  •        Sp_helpstats is a quick way to return stats information about a table, but it has been deprecated. By default, sp_helpstats shows only the column statistics, but when you add a second parameter, ‘ALL’, the procedure will return both column statistics and index statistics.

EXEC sp_helpstats ‘dbo.Contact’, ‘ALL’;

Instead, use the sys.stats and sys.stats_columns tables to get the same info. Sys.Stats and Sys.Stats_Columns let you see all the available statistics.

 

Consider thefollowing that uses sys.stats_columns and sys.columns to get Statistics

 

SELECT OBJECT_NAME(sc2.object_id) AS TableName ,
 
s.name AS StatisticsName ,
 
s.stats_id ,
 
s.auto_created ,
 
ColList = SUBSTRING(( SELECT ( ', ' + c1.name )
 
FROM sys.stats_columns sc1
 
JOIN sys.columns c1 ON sc1.object_id = c1.object_id
 
AND sc1.column_id = c1.column_id
 
WHERE sc1.object_id = sc2.object_id
 
AND sc1.stats_id = s.stats_id
 
ORDER BY sc1.stats_id ,
 
sc1.stats_column_id ,
 
c1.name
 
FOR
 
XML PATH('')
 
), 3, 4000)
 
FROM sys.stats_columns sc2
 
JOIN sys.columns c2 ON sc2.object_id = c2.object_id
 
AND sc2.column_id = c2.column_id
 
JOIN sys.stats s ON sc2.object_id = s.object_id
 
AND sc2.stats_id = s.stats_id
 
WHERE sc2.object_id = OBJECT_ID('TableName') --substitute Tablename
 
GROUP BY sc2.object_id ,
 
s.name ,
 
s.stats_id ,
 
s.auto_created
 
ORDER BY s.stats_id

 

If we have no stats for a column, we force the query optimizer to guess, which is not a good idea. There are cases where statistics may not be useful –Having a significant mismatch in actual vs estimated often indicates stale stats.

 

Summary

Statistics are what SQL Server uses to determine the viability of how to get data and reduce sub-optimal execution plans. Most importantly, you can rely on the automatic creation and update of statistics to help ensure good query plans in most cases. When a database has tables with millions or billions of records, sometimes statistics are not updated often enough; this calls for manual updating of statistics.

 

 

 

 

Comments are closed.