Making Data Analytics Simpler: SQL Server and R

After my recent post, Making Data Analytics Simpler: SQL Server and R, a reader asked a question which seems quite interesting: “… Still i am trying to understand how R benefits SQL Server…”.

I answered the question in the comments section of the article; however, today I had another interesting discussion with a friend, and it led me to the idea of writing this blog post with some comments on the topic.

My friend’s train of thought was that his company’s customers could rarely afford the analytical / data mining part of their solution, since it was built on a fairly expensive RDBMS and they had data in the TBs.

So, here we go. I think we have a very interesting train of thought here. First of all, the question is a bit clumsily formed. “… Still i am trying to undestand how R benefits SQL Server…” is a bit of a strange question: R does not have to benefit SQL Server. R is a standalone language with all of its perks and downsides, and it works quite well in its own niche.

SQL Server, on the other hand, is a fairly good RDBMS, which has everything a user might need, from ACID guarantees to the storage of fairly large data volumes. What SQL Server lacks, and in my opinion will not gain anytime soon, is a sophisticated way to crunch data for analysis and a way to do machine learning, let alone deep learning. It is just not designed for this. As a matter of fact, many RDBMS systems are just not designed for this.

The other problem is that SQL Server costs. A lot, if you don’t know what you are doing, or if you know what you are doing but have a lot of data and big intentions.

And now we get to the second discussion, the one about the analytical part of the RDBMS. It does not have to be hard to do; it is just a matter of strategy. Think about it this way: how often does a user actually need to analyze ALL of their data? Not very often. And the RDBMS should not be scaled for this either, since most of them either can’t scale up or cost a lot to do so. (Yep, some vendors make big money out of this. Think about SQL Server, for example: functional columnstores in PDW, and just silly remains of that functionality in the boxed SQL Server product.)

So, here is an idea: keep your data in the RDBMS, if you need to. Fine, the RDBMS will guarantee you transactions, recovery, speed and storage. For the analytical part, do this:

  • Get an appliance server (it is like a toaster: it does one thing, but it does it well) and make sure you load plenty of memory on it. Memory is cheap.
  • Do some data cleaning and pre-aggregation in your RDBMS; after all, you are paying crazy amounts of money for it, so you might as well make it work for you.
  • Then load the prepared data into memory on the cheap server and do the real heavy lifting there with R. If needed, write some results back to your RDBMS for later use.
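As a minimal sketch of that last step, here the RDBMS is stood in for by an in-memory SQLite database via DBI/RSQLite (so the example is self-contained), and the table and column names are made up for illustration:

```r
# Sketch of step 3: pull pre-aggregated data into memory and crunch it in R.
# RSQLite stands in for the real RDBMS; table and column names are hypothetical.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# pretend this table was pre-aggregated inside the expensive RDBMS
dbWriteTable(con, "monthly_sales",
             data.frame(month = 1:12,
                        total = c(10, 12, 9, 14, 20, 22, 25, 24, 18, 15, 11, 13)))

monthly <- dbGetQuery(con, "SELECT month, total FROM monthly_sales")
dbDisconnect(con)

# the real heavy lifting happens here, in memory, on the cheap server
fit <- lm(total ~ month, data = monthly)
coef(fit)
```

Against a real SQL Server instance you would swap the connection line for an odbc one; everything after `dbGetQuery()` stays the same.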

The idea is fairly simple: give the data owners a fairly cheap way to crunch the data and they will be happy. Of course, another question is “OK, but what about cloud computing?”. Sure, you can count on it, but knowing Microsoft, it will take a long while before it’s useful, and with other providers you still have to do your own heavy lifting. You might as well do the heavy lifting here and now, and get some competitive advantage (instead of waiting for it to come to you later).

I don’t want to merely repeat my comment from the article, even though it has some good reasoning related to SQL Server’s licensing and ideas about strategies. This is why I will copy the comments into this post for whoever is interested in reading them.

Subject: SQL Server & R
Posted by: Anonymous (not signed in)
Posted on: Monday, May 4, 2015 at 7:00 AM
Message: Still i am trying to undestand how R benefits SQL Server.
This article shows how to read data & display in graph which not new can be done in SSRS or SharePoint. If you can elobrate more in detail what R can do that SQL Server cannot and what kind of data that R can store in SQL Server, that will be great & beneficial to readers.



RE: SQL Server & R

Posted by: Feodor (not signed in)
Posted on: Tuesday, May 5, 2015 at 3:09 AM
Message: Anyone who has worked with SQL Server and data analysis for a while will have it in the clear that SQL Server has quite a few costs. We are talking about time and money. The time part is related to data gathering, modelling and writing the reporting part. The money part is the licensing costs of SQL Server – you pay for the hardware and for the software, whereas for R you have only the hardware to pay for. And for R you have no hardware limitations, like you do for SQL Server. (in 2014 you have free Express edition limited to 1GB memory and 10GB data size and max 4 cores, Standard edition is fairly lousy for BI and it costs too much, the Enterprise edition costs 4 times the Standard edition – you get the picture).
You are right about the fact that the article is missing the perspective of the scalability comparison between SQL Server and R; the advantage of using R would be quite obvious to a fairly experienced data user, though. As mentioned in the article, R does not have a storage engine of its own, hence it needs to somehow get the data from a source. In reality this source can be simply flat files, or it can be any RDBMS. The cheaper the better.
When it comes to the time factor, R is very beneficial to SQL Server because you can do a proof of concept and some data mining in a few minutes, whereas in SQL Server you’d have to do some modelling and reporting separately, and quite a few data mining / exploration features are quite costly.

Here is a real life scenario, which can be beneficial to quite a few startups or non-profit orgs: get a few appliances (think even laptops or some cheap old servers) with as much memory as possible, and install SQL Server express edition. Use the Express editions to store relational data, just as in the relational source system, then use as much memory as you can buy to do data analysis in R and gain competitive advantage. This way it will cost as little as possible.

After all, the article is using SQL Server as an example, because the audience is quite familiar with it and can easily follow the examples.

Hope this helps.
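To illustrate the “proof of concept and some data mining in a few minutes” point from the comment above, here is a minimal exploratory sketch; k-means on the built-in iris data stands in for whatever mining technique you would actually use:

```r
# quick exploratory data mining: cluster the built-in iris measurements
set.seed(42)                            # make the clustering reproducible
km <- kmeans(iris[, 1:4], centers = 3)  # numeric columns only

# cross-tabulate the found clusters against the known species labels
table(km$cluster, iris$Species)
```

Getting the equivalent exploration out of SQL Server would mean building a model and a report first; in R it is a couple of lines against data already in memory.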


How to get events per month barplot in R

# generate some random data
library(plyr)     # provides count()
library(ggplot2)  # provides ggplot() and geom_bar()

dat = data.frame(timestamp = Sys.time() + sort(round(runif(1000, (24*3600), (2*365*24*3600)))))

# create new columns to identify the month and year
dat$month = strftime(dat$timestamp, "%b")
dat$year = strftime(dat$timestamp, "%Y")

# count the timestamps per month for each year
timestamps_month = count(dat, vars = c("month", "year"))

# create the plot
ggplot(data = timestamps_month) +
  geom_bar(aes(x = month, y = freq, fill = year), stat = "identity", position = "dodge")


The result is a dodged barplot with the event counts per month, split by year.



How to remove (almost) everything from the work environment


This command will clear the variables from the working environment. Almost everything goes: the objects are removed, but the loaded libraries stay attached.

If we run ?rm we get the following info:

## remove (almost) everything in the working environment.
## You will get no warning, so don't do this unless you are really sure.
rm(list = ls())
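The “almost” is worth spelling out: rm() removes objects, but attached packages survive it and have to be detached explicitly. A small sketch (assuming ggplot2 is installed):

```r
# rm() clears objects from the global environment, but not loaded packages
x <- 42
rm(list = ls())
print(exists("x"))   # FALSE once the workspace is cleared

# attached packages survive rm(); detach them explicitly if you want them gone
library(ggplot2)
print("package:ggplot2" %in% search())   # TRUE while attached
detach("package:ggplot2", unload = TRUE)
print("package:ggplot2" %in% search())   # FALSE after detaching
```

Even then, detach(..., unload = TRUE) may leave a package’s compiled code loaded, so “almost” still applies.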