Can partitioning solve it all?

Recently I got a question:

We are thinking of using partitioning for our DW database, since we have monthly incoming data of 30 million rows and partitioning will make the reads and writes much faster; we are using a 2-CPU machine with 16GB of RAM and 2 hard drives of 500GB each.

Could you recommend a good strategy for the partitioning?

Well, to be honest, let me state it right at the beginning: partitioning is far from being the cure for every performance evil in SQL Server. Partitioning:

  • is relatively expensive, i.e. it requires some serious investment in hardware, software and manpower
  • requires time to set up and maintain
  • does not solve all performance problems, since each partition hosts its own B-tree, and traversing it still costs extra ‘power’ to get to the needed record


So, where do we start?

Hardware considerations:

Think of the storage as the slowest unit in your system (unless you have a RamSan or unlimited access to Fusion-io storage, for example :) ).

In this particular case we have only 2 physical disks. Keep in mind that the OS needs to reside on its own physical disk, and your data needs to be distributed over at least 1, and preferably 2 or more, dedicated physical disks for performance and reliability.

The memory…

… is next, since the query performance of a DW depends on how much data can be cached at the same time.

Let’s say that we have a table with a 50GB index which we query often, and our system has only 16GB of memory. What will happen in this case is that our disk system will eventually have to work really hard to read data from disk into memory, since there is not enough room to retain the data in cache. (Also, out of the 16GB we will need to give up about 2GB for the OS, another 500MB for cached plans, etc.)

In the end, our disk system will be working very hard to read the data into memory all the time, and we will end up with an IO bottleneck.
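A quick way to see who is actually occupying the buffer pool is the sys.dm_os_buffer_descriptors DMV. This is just a sketch: each cached page is 8KB, so the page count per database translates directly into megabytes of cache.

    SELECT DB_NAME(database_id) AS database_name,
           COUNT(*) * 8 / 1024  AS cached_mb   -- 8KB per page
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY cached_mb DESC;

If our 50GB index dominates this list while everything else gets evicted, we know where the cache pressure comes from.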

The CPU…

… SQL Server is not so much CPU aware as core aware. The question says there are 2 CPUs, but how many cores are there? This is especially important for a DW, since we can take advantage of parallelism (or, if we are not careful – the parallelism can take advantage of us! :) ).
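As a quick check – and this is only a sketch, the MAXDOP value of 4 below is a hypothetical starting point, not a recommendation – we can ask SQL Server how many schedulers it sees, and cap the degree of parallelism if runaway parallel plans become a problem:

    -- How many logical cores (schedulers) does SQL Server see?
    SELECT cpu_count, scheduler_count
    FROM sys.dm_os_sys_info;

    -- Cap parallelism instance-wide (4 is a hypothetical value; tune for your cores).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;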

The database and the table design:

Think of the database as a container – we cannot ‘put all our eggs in the same basket’. We need to design the filegroups and the files to serve our reliability and disaster recovery needs.
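For example – a minimal sketch, with hypothetical names, paths and filegroup layout – separating current and archive data into their own filegroups lets us place them on different disks and back them up or restore them independently:

    CREATE DATABASE SalesDW
    ON PRIMARY
        (NAME = SalesDW_sys,  FILENAME = 'D:\Data\SalesDW_sys.mdf'),
    FILEGROUP FG_Current
        (NAME = SalesDW_cur,  FILENAME = 'D:\Data\SalesDW_cur.ndf'),
    FILEGROUP FG_Archive
        (NAME = SalesDW_arch, FILENAME = 'E:\Data\SalesDW_arch.ndf')
    LOG ON
        (NAME = SalesDW_log,  FILENAME = 'E:\Log\SalesDW_log.ldf');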

The table design: make sure we have the proper data types. For example, there is a very small difference between int and bigint in a table with 100 rows, but a significant one in a table with 1 billion rows: an int takes 4 bytes and a bigint takes 8, so over 1 billion rows a single column costs roughly 4GB extra. The difference is expressed in performance, memory usage, IO load, optimization, storage space etc.

Also, keep in mind that since our data is kept in 8KB pages on disk and in memory, there is a limit to how much data fits on a page – with roughly 8,060 bytes of in-row space, a 100-byte row gives us about 80 rows per page, a 400-byte row only about 20 – and so on.

Also, depending on our data types and on the ratio of reads to writes, we will have to deal with fragmentation.
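Fragmentation is easy enough to measure. A sketch using the sys.dm_db_index_physical_stats DMV against the current database (the 10% threshold below is an arbitrary cut-off for illustration):

    SELECT OBJECT_NAME(ips.object_id)       AS table_name,
           i.name                           AS index_name,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id  = ips.index_id
    WHERE ips.avg_fragmentation_in_percent > 10
    ORDER BY ips.avg_fragmentation_in_percent DESC;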

This brings us to the choice of SQL Server edition.

Let me put it simply: if we have 30 million rows incoming every month, we will not be able to deal with our data unless we get the Enterprise edition. The Enterprise edition offers online index rebuilds, improved optimization and partitioning (among other benefits).
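Online index rebuilds alone can justify it at this data volume, since the table stays available while the rebuild runs. A sketch, with a hypothetical index and table name:

    -- Enterprise-only: rebuild without taking the table offline.
    ALTER INDEX IX_FactSales_LoadDate ON dbo.FactSales
    REBUILD WITH (ONLINE = ON);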

Now, for the partitioning:

in our case it might help us with performance to partition the data by month and to rely on partition elimination in our queries. But keep in mind that the more partitions we have, the bigger the maintenance overhead is; and each partition has its own B-tree structure, which means that you still need to deal with fragmentation and with extra reads in order to get to the leaf levels of the index.
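Here is roughly what that looks like – a minimal sketch with hypothetical object names, boundary dates and a single filegroup (in practice the partitions would likely be spread over several filegroups):

    -- Monthly RANGE RIGHT partitioning: each boundary value starts a new month.
    CREATE PARTITION FUNCTION pf_MonthlyLoad (datetime)
    AS RANGE RIGHT FOR VALUES ('20110101', '20110201', '20110301', '20110401');

    -- Map all partitions to one filegroup here for brevity.
    CREATE PARTITION SCHEME ps_MonthlyLoad
    AS PARTITION pf_MonthlyLoad ALL TO ([PRIMARY]);

    -- A fact table created on the scheme; the partitioning column
    -- must be part of the clustered key.
    CREATE TABLE dbo.FactSales
    (
        SalesID  bigint   NOT NULL,
        LoadDate datetime NOT NULL,
        Amount   money    NOT NULL,
        CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (LoadDate, SalesID)
    )
    ON ps_MonthlyLoad (LoadDate);

With this in place, a query filtered on a single month of LoadDate touches only one partition – that is the partition elimination mentioned above.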

We have to calculate beforehand how big our indexes will be per month. Here are two links on how to estimate the size of the clustered and nonclustered indexes:

Estimating the Size of a Clustered Index

Estimating the Size of a Nonclustered Index 

There are plenty of other considerations, but this is a good start.

Ah yes, back to the first question: Can partitioning solve it all?

The answer: hardly. There is plenty to consider before partitioning can help us with performance and reliability.

