In my earlier blog post I wrote about setting up the WikiDB database and importing the files.
Initially I was inclined to import the XML files as they came – the files had different sizes ranging from 140 Mb to 7 Gb.
I managed to import the first 12 files, and thought I would stop there.
But on a second though, yesterday I decided to write a C# class which would resize the XML files and split them into chunks of 150Mb each.
The development took about 1 hour – I used C# and an online compiler http://compilr.com/.
With my current computing power (dual core desktop with a single hard disk) it took about 10 minutes per file to be sliced into chunks.
The bottom line is that now I have about 250 files with size of 150 Mb each.
The next step is to import all of them into the WikiDB database.
In my previous blog post I had set up a script to manually insert the files, but since the number of files is no longer 27 but it is 250, I needed to automate the import.
I wrote a SSIS package which does precisely that – imports the XML files, cleans up the reference, and turns the articles into one-article-per-row in the prod table.
The first insert:
So far so good – I have the files, I have the SSIS package, let’s import.
Not so fast, not so fast!
The first file took 12 minutes to be inserted, and I noticed some serious disk activity.
A minute later I realized that I had forgotten to change the default settings of the database file size in SQL Server – it was set to grow 1 Mb at a time (why did Microsoft leave these silly default settings even in SQL 2012?!).
What this really means is that when the database has reached its initial allocated size, it will have to grow 1 Mb at a time. In this case it was quite silly to do so, because the SQL Server was halting after each row insert and expanding the data file. And of course, this is a lonely and time-consuming task. (No other tasks are carried out while the data file is expanded. If you would like to know how many times and how long it takes for your data files to grow – I wrote a SSMS set of reports called The Default Trace audit reports.)
Anyhow, I pre-grew the database to 100 Gb and set the growth to be 100 Mb at a time after that.
I also added my SQL Server service account to the ‘Perform volume maintenance tasks’ security policy. This makes the allocated space NOT to be zeroed out, i.e. it is not recommended for security reasons since if the pages are not zeroed out someone can read what was on the disk before the database was created. But since this is my own desktop and I have nothing to hide from myself, then I can not zero out the pages and the file can grow from 50 to 110 Gb in a few seconds.
Note to self: I really wonder how this security vs. performance issue is solved in SQL Azure. After all, Windows Azure is a shared environment and no one knows what disks are used and how they are shared. So there are two options: either everything is zeroed out every time a database is created and every time it is grown, OR the pages are not zeroed out. In the first case, there would be a performance penalty, in the second case there will be a serious security issue.
I have to test it.
Does anyone want to share their Azure account with me so I can test this?
Anyhow, I can now import 150 Mb file with XML data in about 4 minutes.
And right now my database is 55 Gb, but by tonight I am expecting the database to be over 100 Gb.
Big data, eh? Or not so big?
As some guy said once – ‘There is no such thing as a Big Data. People call Big Data only the data they don’t understand or when they don’t know what to do with it. ‘
Stay tuned for the next blog post when I will be explaining how to setup the Full Text search and the indexing for the database.