A few days ago I read Tony Davis’ blog post called “SQL Server Optimizer Malfunction?“.
In short, the blog post is about Adam Machanic’s presentation during the PASS conference claiming that the Query Optimizer is stuck in 1997.
Even though it is controversial whether PASS would be the place for expressing similar thoughts and ideas, I guess the time and place does not really matter on the background of reality.
I wrote a comment to Tony Davis’ blog post, and since the topic has been on my mind for a few days now, I would like to elaborate a bit more. (I will paste my comment below, just in case someone cares to read it in full.)
To summarize my thoughts on the Query Optimizer situation, I would say that it seems like it is stuck in 1977 because of a small hurdle called ‘no clue about context and patterns‘.
In other words, the Query Optimizer is taking the requests as they come, and even though it tries to use semi-smart mechanisms of caching and cost based analysis, it still lacks the power of context and pattern analysis of the workloads over time.
Why is this important anyway?
Recently I wrote an article called “How come the Hourglass? Why database applications slow down“, which explains the ‘big picture’ of data going to and from the database engine. The fact is that the context of the workload and the performance patterns are one of the most important prerequisites for handling the load.
And finally, let me ask this: if the Optimizer needs statistics on the data in order to properly handle a request, would not it need statistics of the execution context and its patterns in order to perform better?
If you find an answer to this question, you can post a comment.
Here is the original comment to Tony Davis’ blog post.
It is a very valid point that the improvements in hardware are changing the game of the optimizer almost constantly. Here are a few real world examples that I have encountered over the years:
1. the problem of using the same query / stored procedure for different amounts of data: the problem becomes obvious when we have a query which may have to process 50 rows or 100 million rows. And depending on which plan is cached first, the other one will not be optimal. This can be simply solved by creating two different queries / procedures – one for each case – and passing hints.
2. The query optimizer does not handle the historical hardware utilization context very well, for example if the pattern in a system is that it gets high resources utilization every 5 minutes (due to some scheduled job) and this event lasts 2 minutes, then the chances are that the plans generated during the peak times are not going to be the same as if they were created during the time of a calmer system, and they may not be optimal. This is not so easy to detect and fix, and it takes a complete overview of the workloads in the system.
3. the worst problem is when there is a database table in use in a very highly transactional system and millions of rows ‘pass through’ the table per minute (being inserted, updated and deleted) and at times it may have 1 row or 1 million rows. In such case, if someone rebuilds the indexes and decides to also do “auto recompute statistics”, then the statistics will be updated based on what data there is at the moment in the table, and the chances are that the next time the plan is generated, the plan may not be optimal and potentially may cause a serious CPU pressure.
These are all valid situations and I have dealt with them at one point of time. The reality is that the Optimizer has no clue about the ‘big picture’ and about workload and hardware utilization patterns over time.
The bottom line is that special attention needs to be paid to the workload patterns at all times, and the query tuning or even the solution approaches have to be based on the workload patterns and the underlying hardware and not based on blindly hoping that the Optimizer will do what we want it to do.