Azure vs. in-house SQL Server - the Wiki project results

As I blogged yesterday, the competition is on: in-house SQL Server vs. Windows Azure.

It does feel like the war of the robots!

Yesterday I started importing the data, and today I have some results.

Here are some notes on the process, though.

The Hardware:

As I mentioned, I am using a home desktop with a Pentium Dual Core E5400 @ 2.70 GHz and a total of 8 GB of RAM, of which SQL Server is allowed to use only 4 GB.

The hard disk I am using is WDC WD10EADS-22M2B0.

And its specs are:

Product Identifiers
  • Brand: Western Digital
  • Model: Caviar Green
  • MPN: WD10EADS

Key Features
  • Enclosure: Internal
  • Capacity: 1 TB
  • Buffer Size: 32 MB
  • Hard Drive Type: HDD (Hard Disk Drive)
  • Spindle Speed: 7200 RPM
  • Interface: SATA I, SATA II, Serial ATA
  • Designation: Desktop Computer

Technical Features
  • Platform: PC
  • External Data Transfer Rate: 300 Mbps
  • Form Factor: 3.5″

This is the only drive in my computer: the OS, SQL Server and all databases reside on it, and write caching is disabled.

The Software:

I am using 64-bit Windows 7 and SQL Server 2012 SP1.

The SQL Server settings are at their defaults, except for the max server memory setting, which I have changed to 4000 MB, i.e. the SQL Server instance is not using more than 4 GB of RAM.
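
For reference, that cap can be applied with the standard sp_configure procedure (the value 4000 matches the setting described above):

-- allow advanced options to be changed, then cap max server memory at 4000 MB
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4000;
RECONFIGURE;
GO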

The Test Scripts:

Here is the script I am using to create the database:

 

USE [master]
GO
 
/****** Object:  Database [WikiDB]    Script Date: 12/24/2012 10:06:19 AM ******/
CREATE DATABASE [WikiDB]
 CONTAINMENT = NONE
 ON  PRIMARY 
( NAME = N'WikiDB', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\WikiDB.mdf' , SIZE = 4096KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ), 
 FILEGROUP [UserData1]  DEFAULT 
( NAME = N'WikiDB_UserData', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\WikiDB_UserData.ndf' , SIZE = 512000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 102400KB )
 LOG ON 
( NAME = N'WikiDB_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\WikiDB_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO
 
ALTER DATABASE [WikiDB] SET COMPATIBILITY_LEVEL = 110
GO
 
IF (1 = FULLTEXTSERVICEPROPERTY('IsFullTextInstalled'))
BEGIN
	EXEC [WikiDB].[dbo].[sp_fulltext_database] @action = 'enable'
END
GO

And here is the script to create the database objects:

USE [WikiDB]
GO
 
-- create a temp load table
CREATE TABLE TempXML (IntCol int, XmlCol xml);
GO
 
-- create prod table
 
SET ANSI_NULLS ON
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE TABLE [dbo].[WikiDocuments](
	[ID] [bigint] IDENTITY(1,1) NOT NULL,
	[WikiDocument] [xml] NOT NULL,
 CONSTRAINT [PK_WikiDocuments] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
 
GO
 
-- create an XML index
 
USE [WikiDB]
 
GO
 
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET NUMERIC_ROUNDABORT OFF
 
GO
 
CREATE PRIMARY XML INDEX [PrimaryXmlIndex] ON [dbo].[WikiDocuments]
(
	[WikiDocument]
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
 
GO

And here is the script to insert the data:

-- insert all articles into 1 row as xml
INSERT INTO TempXML(XmlCol)
SELECT * FROM
OPENROWSET(BULK 'D:\WIKIfiles\enwiki-latest-pages-articles1.xml', SINGLE_BLOB) AS ExternalFile
 
-- remove the reference
UPDATE TempXML
SET XmlCol = REPLACE(CONVERT(nvarchar(max), XmlCol), '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">', '<mediawiki>')
 
-- insert the split XML
INSERT INTO [dbo].[WikiDocuments]([WikiDocument])
SELECT Col.query('.')
FROM dbo.TempXML
CROSS APPLY XmlCol.nodes('/mediawiki/page') AS Tbl(Col)

There are a few issues with the above script:

  • the idea is to import each Wiki XML file into a temp table in the database, and then to shred every article into a production table in the Wiki database as a separate row
  • the problem is the size of the XML: there are 27 XML files to work with, and each file is bigger than the previous one. For example, the first file is 140 MB, the second is 240 MB, the third is 360 MB and so on, up to the last one at 7.8 GB.
  • the XML files declare a default namespace (the mediawiki export schema reference) which has to be removed before the XML can be shredded into separate articles; a sketch of an alternative that keeps the namespace follows this list
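
Instead of string-replacing the <mediawiki> root tag, the default namespace could also be declared directly in the shredding query. Here is a sketch of that alternative (not the route I took for the numbers below), using the standard WITH XMLNAMESPACES clause:

-- declare the export namespace instead of stripping it out of the XML first
WITH XMLNAMESPACES (DEFAULT 'http://www.mediawiki.org/xml/export-0.8/')
INSERT INTO [dbo].[WikiDocuments]([WikiDocument])
SELECT Col.query('.')
FROM dbo.TempXML
CROSS APPLY XmlCol.nodes('/mediawiki/page') AS Tbl(Col);
GO

Note that the shredded fragments then keep the namespace declaration, so any later XQuery against them has to account for it.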

Now, my poor desktop is not made for inserting, updating and parsing huge XML files, but it still managed to insert the first 12 files without many hiccups (the 12th file is 1.3 GB).

If I were doing this project for real, I would definitely go for a C# app which breaks the XML down into files of about 50 MB each. This would make the inserts much faster.

For example, I measured the insert, update and parse performance for the first file and it looks like this:

Insert 1:

SQL Server parse and compile time:
CPU time = 16 ms, elapsed time = 233 ms.
Table ‘TempXML’. Scan count 0, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 170627, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 14, physical reads 0, read-ahead reads 0, lob logical reads 479535, lob physical reads 0, lob read-ahead reads 249312.

SQL Server Execution Times:
CPU time = 4352 ms,  elapsed time = 18720 ms.

Update 1:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
Table ‘TempXML’. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 331317, lob physical reads 0, lob read-ahead reads 487380.
Table ‘Worktable’. Scan count 0, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 1539873, lob physical reads 0, lob read-ahead reads 1037882.

SQL Server Execution Times:
CPU time = 25538 ms,  elapsed time = 37777 ms.

Prod Insert 1:

SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 38 ms.
Table ‘xml_index_nodes_261575970_256000’. Scan count 0, logical reads 583512, physical reads 0, read-ahead reads 0, lob logical reads 67630, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 2, logical reads 20619, physical reads 0, read-ahead reads 0, lob logical reads 6, lob physical reads 0, lob read-ahead reads 0.
Table ‘fulltext_index_docidstatus_261575970’. Scan count 0, logical reads 38261, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘WikiDocuments’. Scan count 0, logical reads 18613, physical reads 0, read-ahead reads 0, lob logical reads 212030, lob physical reads 0, lob read-ahead reads 0.
Table ‘TempXML’. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 288966, lob physical reads 973, lob read-ahead reads 338317.

SQL Server Execution Times:
CPU time = 6428 ms,  elapsed time = 13389 ms.

(6336 row(s) affected)

 

So, for a 140 MB file it takes 19 seconds to insert it from disk into the database, 37 seconds to remove the XML reference, and about 14 seconds to shred the XML and insert all articles into the production table, one article per row.
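
The measurements above are the standard STATISTICS TIME / STATISTICS IO output; to reproduce them, it is enough to turn both on in the session before running the import script:

-- report CPU/elapsed times and logical/physical reads for every statement that follows
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
GO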

Keep in mind that if I had the time, I would write a C# program to break the XML files into smaller pieces and to remove the reference. In this case I had to go for a brute-force UPDATE of the XML inside SQL Server – not a good idea, but it worked here.

So, I ran the above script for each file (in this case, for the first 12 files), and currently I have 1,316,164 articles inserted.
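
For a quick sanity check of the row count and the space the table occupies, something along these lines does the job (sp_spaceused is the standard system procedure):

-- how many articles landed in the production table, and how much space they take
SELECT COUNT(*) AS ArticleCount FROM dbo.WikiDocuments;
EXEC sp_spaceused N'dbo.WikiDocuments';
GO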

This seems fine for now, considering that the 12th file took 39 minutes to insert, and that I am using a single hard drive which hosts the OS, SQL Server, the database files and the paging file.

The database objects:

Now let’s do something interesting with the data.

To begin with, let’s create a full-text catalog and a full-text index:

 

-- create fulltext index
 
-- create FT catalog
USE [WikiDB]
GO
CREATE FULLTEXT CATALOG [WikiDocumentFT] WITH ACCENT_SENSITIVITY = ON
 
GO
 
-- create the FT index
USE [WikiDB]
GO
CREATE FULLTEXT INDEX ON [dbo].[WikiDocuments] KEY INDEX [PK_WikiDocuments] ON ([WikiDocumentFT]) WITH (CHANGE_TRACKING AUTO)
GO
USE [WikiDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[WikiDocuments] ADD ([WikiDocument])
GO
USE [WikiDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[WikiDocuments] ENABLE
GO
 
-- populate the FT index manually 
 
USE [WikiDB];
GO
ALTER FULLTEXT INDEX ON [dbo].[WikiDocuments] START UPDATE POPULATION;
GO
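
Since the population is kicked off manually, it is worth checking when the crawl actually finishes; the built-in FULLTEXTCATALOGPROPERTY function reports the population status and the number of items indexed so far:

-- PopulateStatus 0 means idle (population finished); other values mean a crawl is still running
SELECT FULLTEXTCATALOGPROPERTY('WikiDocumentFT', 'PopulateStatus') AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY('WikiDocumentFT', 'ItemCount') AS ItemCount;
GO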

After setting up the database, the objects and the import, we will see what we can do with queries (like the one sketched below) in the next blog post. Well, it turns out that we will need to do some performance tuning first.
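
Just to give an idea of the kind of query the full-text index is meant to serve, here is a hypothetical example (the real queries and their tuning are the subject of the next post):

-- find the first 10 articles whose XML contains the word "database"
SELECT TOP (10) ID
FROM dbo.WikiDocuments
WHERE CONTAINS(WikiDocument, N'database');
GO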

 
