It has been widely reported that data growth is increasing year on year with more data being created in the past two years than in our entire history before this point.
Along with the increase in information comes the requirement to report, analyse, manipulate and interrogate the data set.
With traditional on-premises solutions, you aim to size the Data Warehouse for peak workloads so that you have enough compute and storage performance to deal with demand. This capital investment then leads to further questions.
- What about off-peak times, can I make use of the hardware for other purposes?
- Who is going to manage and maintain the firmware and operating systems?
- How are we going to expand the environment or are we sizing it for the future now?
- What if our predicted future requirement calculations are wrong? Can we expand compute and storage, and will we have the budget available?
To answer these questions and more, Microsoft introduced Azure SQL Data Warehouse in July 2016.
What Is Azure SQL Data Warehouse?
Azure SQL Data Warehouse is a PaaS offering, meaning there is no requirement to manage the hardware or operating system; you simply consume what you need at any given time.
It’s built on a distributed architecture which decouples storage from compute, meaning there is no hard tie between compute power and storage capacity. Therefore, if you require more storage capacity but your compute resources are adequate, you only pay for the extra storage.
Microsoft uses a Massively Parallel Processing (MPP) system, which means that each compute node contributes to the overall workload, as shown in the diagram below. This contrasts with a traditional Symmetric Multiprocessing (SMP) system, such as a single SQL Server, in which you scale resources up.
Each node takes a slice of the data from the database. When a query is run, a co-ordination task is undertaken to ensure that the query execution is performed in the most optimal way.
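The co-ordination pattern described above can be sketched in a few lines. This is purely illustrative: the node count and the simple sum aggregate are hypothetical, not the service's real internals.

```python
# Illustrative sketch of the MPP pattern: each compute node aggregates
# its own slice of the data, and a co-ordination step combines the
# partial results. Node count and workload are hypothetical.

def mpp_sum(rows, num_nodes=4):
    # Split the rows into one slice per compute node.
    slices = [rows[i::num_nodes] for i in range(num_nodes)]
    # Each node computes a partial aggregate over its own slice.
    partials = [sum(s) for s in slices]
    # The co-ordination step combines the partial results.
    return sum(partials)

print(mpp_sum(list(range(1, 101))))  # same answer as a single-node sum: 5050
```

The key point is that the final combine step is cheap compared with the per-slice work, which is why adding nodes speeds up the whole query.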
How Is It Priced?
Microsoft uses the concept of a Data Warehousing Unit (DWU), which is an underlying measure of the compute power of the database.
DWUs are charged per hour, and you can scale up or down between different tiers within seconds.
For example, if it took 15 minutes to load three tables and run a report at 100 DWU, then I would expect it to take 5 minutes to load the same tables and run the report at 300 DWU.
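That expectation assumes runtime shrinks in direct proportion to the DWU increase, which is an idealisation (real workloads rarely scale perfectly linearly). As a quick sketch of the arithmetic:

```python
# Hypothetical linear-scaling estimate: runtime shrinks in proportion
# to the DWU increase. Real workloads rarely scale this cleanly.

def estimated_minutes(baseline_minutes, baseline_dwu, new_dwu):
    return baseline_minutes * baseline_dwu / new_dwu

print(estimated_minutes(15, 100, 300))  # 5.0 minutes at 300 DWU
```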
As well as the DWU cost, there is also a storage cost, charged per GB on Azure Premium Storage.
One of the excellent features is that you can ‘pause’ your SQL Data Warehouse, which means you don’t pay for any compute resources during this time. Let’s look at a practical example:
8am Monday – Normal working conditions ‘un-pause SQL Data Warehouse at 300 DWU’
6pm Monday – ‘Pause SQL Data Warehouse’
8am Tuesday – End of month reporting ‘un-pause SQL Data Warehouse and increase to 3000 DWU’
10am Tuesday – End of month reporting complete ‘scale down to 300 DWU’
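To see why pausing and scaling matter for cost, here is a rough sketch of the compute billed for the schedule above. The price per 100 DWU per hour is an assumed figure for illustration only, not a real Azure rate.

```python
# Sketch of compute billing for the pause/scale schedule above.
# PRICE_PER_100_DWU_HOUR is a hypothetical figure, not a real rate.
PRICE_PER_100_DWU_HOUR = 1.50

def compute_cost(periods):
    """periods: list of (hours_running, dwu) tuples; paused time costs nothing."""
    return sum(hours * dwu / 100 * PRICE_PER_100_DWU_HOUR
               for hours, dwu in periods)

# Monday 8am-6pm at 300 DWU, then Tuesday 8am-10am at 3000 DWU;
# everything in between is paused and billed zero compute.
print(compute_cost([(10, 300), (2, 3000)]))
```

Only the hours the warehouse is un-paused appear in the calculation; the Premium Storage charge (not modelled here) continues while paused.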
For more details on pricing, please see SQL Data Warehouse Pricing.
Azure SQL Data Warehouse is designed differently, and behaves differently, from an on-premises environment.
Those of us used to making load-distribution design decisions in a VMware network environment will find the concept of a distribution key vaguely familiar. A distribution key determines how Azure SQL Data Warehouse spreads data across nodes.
When data is loaded into a table in Azure SQL Data Warehouse, it is split into 60 distributions (buckets of data), which are attached to different compute nodes.
Hash Distribution – A chosen column within the table is used to distribute data. The value in that column is hashed, and rows returning the same hash value are placed on the same compute node, for example:
Row 1 Trainer would go to Compute Node 1
Row 2 Sock would go to Compute Node 2
Row 3 Trainer would go to Compute Node 1
Row 4 Shoe would go to Compute Node 3
The main drawback of Hash Distribution is that you could end up with data skew, causing an uneven workload spread across your distributions.
Round-Robin Distribution – Each row within the table is distributed to a different compute node in turn, for example:
Row 1 Trainer would go to Compute Node 1
Row 2 Sock would go to Compute Node 2
Row 3 Trainer would go to Compute Node 3
Row 4 Shoe would go to Compute Node 4
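The difference between the two methods can be sketched as below. This is a simplified simulation, not the service's actual hashing scheme: the node count is hypothetical (the real service uses 60 distributions), and CRC32 stands in for whatever hash function Azure uses internally.

```python
# Simplified simulation of the two distribution methods. The node
# count and the use of CRC32 as the hash are assumptions for
# illustration; the real service uses 60 distributions.
import zlib

def hash_distribute(keys, num_nodes):
    # Each row's key is hashed; equal keys always land on the same node.
    return [zlib.crc32(key.encode()) % num_nodes for key in keys]

def round_robin_distribute(keys, num_nodes):
    # Rows are dealt out to nodes in turn, regardless of their values.
    return [i % num_nodes for i, _ in enumerate(keys)]

rows = ["Trainer", "Sock", "Trainer", "Shoe"]
h = hash_distribute(rows, 4)
r = round_robin_distribute(rows, 4)

# Under hash distribution, the two "Trainer" rows share a node...
print(h[0] == h[2])  # True
# ...whereas round-robin spreads every row across the nodes in order.
print(r)  # [0, 1, 2, 3]
```

This also shows where the skew risk comes from: if most rows share one key, hash distribution piles them onto one node, while round-robin always spreads rows evenly.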
Azure SQL Data Warehouse offers three different table types:
- Clustered Columnstore – Organises records by columns and has a high compression ratio, but doesn’t support secondary indexes. Note this is the default table type in Azure SQL Data Warehouse.
- Heap – No index on the data, but much faster at loading as it has no compression. It does allow secondary indexes.
- Clustered B-Tree Index – A table organised on a sorted column, which is often the clustering key. Enables fast singleton lookups but doesn’t support compression. It does allow secondary indexes.
Azure SQL Data Warehouse uses the concept of Table Partitioning, where a top-level table is split into different data buckets, as shown in the diagram below.
This approach has a number of benefits which include the ease of loading and removing data, the ability to target specific table partitions for maintenance operations and overall increased performance.
Within Azure SQL Data Warehouse it is best to use lower-granularity partition schemes, for example weekly or monthly rather than daily.
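A quick back-of-the-envelope count shows why granularity matters. Because every partition is further split across the 60 distributions, fine-grained schemes fragment the data into a huge number of small buckets (the five-year retention window below is assumed for illustration):

```python
# Rough partition-count comparison over an assumed five-year retention
# window. Every partition is further split across the 60 distributions,
# so fine partitions fragment the data into many small buckets.
DISTRIBUTIONS = 60
YEARS = 5

for scheme, parts_per_year in [("daily", 365), ("weekly", 52), ("monthly", 12)]:
    partitions = parts_per_year * YEARS
    print(scheme, partitions, "partitions ->",
          partitions * DISTRIBUTIONS, "buckets")
```

Daily partitions over five years produce over 100,000 buckets, most of them far too small to compress well, whereas monthly partitioning keeps the count in the low thousands.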
Loading Data into Azure SQL Data Warehouse
When loading data into Azure SQL Data Warehouse, bear in mind that it is a Massively Parallel Processing system, which means we want to load as much data in parallel as possible.
The number of reader threads available to you depends on the DWU you are paying for. These are the threads that read data in parallel. Note that the number of writer threads into the MPP system remains the same.
As you can see your DWU has a direct impact on how fast you can load data.
When loading data into Azure SQL Data Warehouse, you can choose single-client loading methods such as SSIS and Azure Data Factory. Alternatively, you can use PolyBase for parallel loading.
Loading data using a single-client methodology means that your ‘Control Node’ becomes your bottleneck. A ‘Control Node’ is a special category of node that receives all the connections and sends the queries on to the compute nodes.
The ‘Control Node’ is a static item and doesn’t scale with the amount of DWU you add or remove, meaning that if you load with SSIS, the ‘Control Node’ could become a bottleneck.
Loading data with PolyBase, which is recommended for parallel loading, means that data goes straight into the compute nodes without passing through the ‘Control Node’.
Data is usually exported from SQL Server and then loaded into Azure Blob Storage as a CSV, before being imported into the compute nodes using an automation routine.
This means that using PolyBase loading scales more efficiently and is usually much faster.
Built into Azure SQL Data Warehouse is backup functionality that provides a consistent data warehouse backup every four to eight hours. This means you will receive between three and six backups per day, which are stored for seven days.
You can also trigger snapshots of the Blob Storage on which the Azure SQL Data Warehouse resides if you want a one-off or more frequent backup schedule. Note this would be triggered by PowerShell and automated by Azure Automation; see Using Azure PowerShell with Azure Storage.
Microsoft provides a Migration Utility that supports SQL Server 2012 or above and Azure SQL Database. It produces a report highlighting possible migration issues that could be encountered when moving to Azure SQL Data Warehouse, and can help with schema and data migrations.