SQL Server 2012 Full-Text search and XML Search - the Wiki project

Finally, here are some results from the Wiki project.

After a few days of development, here is the accounting for the wiki project:

  • Desktop PC with 8Gb RAM, single home use HDD
  • SQL Server 2012, using XML and Full-Text search
  • the total amount of data to be imported: 39.2 GB
  • the original XML files had to be sliced in 150Mb chunks for the import to be easier – 1 hour for development of the XML slicing tool
  • design of the database, filegroups and schema – 1 hour
  • the SSIS package development to import the XML – 1 hour
  • import of the XML data – 2 minutes per 150Mb file = 262 files
  • 12 million imported XML documents in a single SQL Server table
  • 6 hours to create a primary XML index
  • 11 hours to build the Full Text Search catalog
  • the size of the database is 207Gb
  • the index space used is 80Gb for the clustered index, 95Gb for the XML index and 32Gb for the Full-Text index

So far we have total of 3 days of development and object creation.

And now let’s look at the price:

The entire Desktop PC is about $600 as it was new, but since it is 3 years old, I don’t think anyone would pay more than $300 for it.

The development time for the project would cost about $400.

The electricity bill would be another $5 per month.

The SQL Server 2012 license is free for now, I am using the evaluation version.

The performance:

There are several ways to search through the Wiki articles. (Yes, I know that I can do so much more than I have done so far!)

But for this case I have chosen to work with a combination of XML and Full-Text indexes, which gives pretty good performance from the current silly hardware.

Let’s start with the XML. I can search the XML nodes like this:

 

SELECT top 10 * 
FROM   [dbo].[WikiDocuments] 
WHERE   WikiDocument.exist('/page/revision/text/text()[contains(.,"music")]') =1

The execution results look like this:

SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 15 ms.

(10 row(s) affected)
Table ‘WikiDocuments’. Scan count 0, logical reads 45, physical reads 0, read-ahead reads 0, lob logical reads 1216, lob physical reads 16, lob read-ahead reads 603.
Table ‘xml_index_nodes_261575970_256000’. Scan count 1, logical reads 80, physical reads 0, read-ahead reads 0, lob logical reads 20568, lob physical reads 28, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 78 ms,  elapsed time = 345 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

Now I can combine the XML and Full-Text index searches like this:

 

SELECT top 10 * 
FROM   [dbo].[WikiDocuments] 
WHERE  CONTAINS(WikiDocument,'music') 
AND    WikiDocument.exist('/page/revision/text/text()[contains(.,"music")]') =1

The execution results look like this:

SQL Server parse and compile time:
CPU time = 31 ms, elapsed time = 70 ms.

(10 row(s) affected)
Table ‘xml_index_nodes_261575970_256000’. Scan count 13, logical reads 95, physical reads 0, read-ahead reads 0, lob logical reads 6352, lob physical reads 37, lob read-ahead reads 0.
Table ‘WikiDocuments’. Scan count 0, logical reads 58, physical reads 0, read-ahead reads 0, lob logical reads 1862, lob physical reads 15, lob read-ahead reads 921.

SQL Server Execution Times:
CPU time = 250 ms, elapsed time = 4312 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

Now let’s get to a bit more complex searches:

 

SELECT top 10 ID, [WikiDocument]
FROM   [dbo].[WikiDocuments] 
WHERE CONTAINS(WikiDocument , 'NEAR((landing,moon), 5, TRUE)')

This script uses proximity search with the CONTAINS function and the execution results look like this:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(10 row(s) affected)
Table ‘WikiDocuments’. Scan count 0, logical reads 481, physical reads 6, read-ahead reads 1360, lob logical reads 1144, lob physical reads 18, lob read-ahead reads 538.

SQL Server Execution Times:
CPU time = 141 ms, elapsed time = 13359 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

Here is a search with all forms of the word:

SELECT top 10 ID, [WikiDocument]
FROM   [dbo].[WikiDocuments] 
WHERE CONTAINS(WikiDocument , ' FORMSOF (INFLECTIONAL, walk) ');

The execution results look like this:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 102 ms.

(10 row(s) affected)
Table ‘WikiDocuments’. Scan count 0, logical reads 176, physical reads 0, read-ahead reads 40, lob logical reads 752, lob physical reads 15, lob read-ahead reads 368.

SQL Server Execution Times:
CPU time = 203 ms, elapsed time = 5479 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

And here is one more search – using the freetext:

SELECT top 10 ID, [WikiDocument]
FROM   [dbo].[WikiDocuments]
WHERE FREETEXT (WikiDocument, 'beatles and music' );

The execution results look like this:

SQL Server parse and compile time:
CPU time = 31 ms, elapsed time = 53 ms.
Informational: The full-text search condition contained noise word(s).

(10 row(s) affected)
Table ‘WikiDocuments’. Scan count 0, logical reads 45, physical reads 0, read-ahead reads 0, lob logical reads 910, lob physical reads 0, lob read-ahead reads 449.

SQL Server Execution Times:
CPU time = 874 ms, elapsed time = 4393 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

The bottom line:

All the above performance results are from ‘cold cache’ and keep in mind that there is a single hard drive in the PC, which is used for the OS, the Database files, TempDB and so on.

The searches are between 4 and 14 seconds, depending on the search words and depending on the kind of search – only FT or FT and XML search.

If I use a SSD or a separate drive I could actually speed up the searches at least a few times.

 

 

1 comment to SQL Server 2012 Full-Text search and XML Search – the Wiki project