This post is about the new feature, i.e., Column Store Index which is available since SQL 2012 version. Microsoft has released column store index to improve the performance by 10x. In today’s blog, we will discuss about the architecture, limitations and dependencies of this new feature.
What is Column Store index?
This is a new type of index which will store data in column wise format, unlike the traditional B-tree structures used for clustered and non-clustered row store indexes. This is basically designed to speed up the analytical processing and data-warehouse queries. When used appropriately, it can reduce disk I/O and use memory more efficiently.
It is based on xVelocity (formerly known as Microsoft VertiPaq).
Architecture of Column Store Index
Column store index stores each column in a separate set of disk pages, rather than storing multiple rows per page as data traditionally was being stored. The difference between column store and row store approaches is illustrated below:
Let’s take an example to elaborate the working of Column Store Index.
The following T-SQL code shows the table definition:
CREATE TABLE Autotype
AutoID INT PRIMARY KEY,
Make VARCHAR (20) NOT NULL,
Model VARCHAR (20) NOT NULL,
Color VARCHAR (15) NOT NULL,
ModelYear SMALLINT NOT NULL
But note that AutoIDcolumn is configured as the primary key.
Figure 1: For retrieving data from cluster index.
Here data is split in three pages and each page contains five rows. Also, the entire row is stored in one page.
Now when we run the below query, It will retrieve all the data pages in memory. During this type of processing I/O will be consumed more since we are including everything, the data that might not be required too.
SELECT Make, Year FROM AutoType;
Now, let’s see what happens when we create column store index on the table.
Herein, we are including all the columns in the index and each column is its own segment. Each segment contains value from only one column as shown in fig2, so that each column’s data is accessible independently. However, a column can span multiple segments and each segment can be made up of multiple data pages. Data is transferred from the disk to memory by segment, not by page. It is not necessary to keep all the columns in index, you can decide as per your requirement also.
Figure 2: For retrieving data from column store index.
A segment is a highly compressed Large Object (LOB) that can contain up to one million rows. The data within each column’s segment matches row-by-row and the rows can be assembled correctly. For example, the third row in each segment in Figure 2, all point to the same car: the gray 2010 Lexus GX460 with an ID of 103.
Now if we run the same query again after creating the columns store index, the query processor will only use the column store index instead of cluster index. Due to this only those segments will pull the data which are associated with the columns Make and ModelYear into memory, this will limit the resource to process the query and majorly will help in disk I/O consumption.
When should I use a column store index?
Recommended use cases:
- Use a clustered column store index to store fact tables and large dimension tables for data warehousing workloads. This improves query performance and data compression by up to 10x.
- Use a non-clustered column store index to perform analysis in real-time on an OLTP workload.
Limitation of Column Store Index
Non-clustered column store indexes –
- Cannot have more than 1024 columns.
- Cannot be created on a view or indexed view.
- Cannot be changed by using the ALTER INDEX statement. To change the non-clustered index, you must drop and re-create the column store index instead. You can use ALTER INDEX to disable and rebuild a columnstore index
- Cannot be created by using the INCLUDE keyword.
- Cannot include the ASC or DESC keywords for sorting the index. Column Store Indexes are ordered according to the compression algorithms. Sorting would eliminate many of the performance benefits.
Column Store Index cannot be combined with following features.
- Computed columns
- Page and row compression
Note: These limitations are till SQL 2014.
Column store index gives the best performance when used in data warehouse. The performance benefit depends on two major factors. One is I/O saving by the new index structure and another is batch mode execution. Batch mode execution is limited in SQL server 2012 but it over comes in SQL 2014 and Column store index requires significant amount of memory to store the complete object. Apart from this, additional memory is required for the global dictionaries for each string column. There are a lot of system views which help to give all the information related to column store index.
- SQL Server – Reporting Services Migration
A while back, I got a task in my project, which was to migrate SQL Server Reporting Services from SQL 2008 to SQL 2012. Now, I had two ways of…
- KAFKA-Druid Integration with Ingestion DIP Real Time Data
The following blog explains how we can leverage the power of Druid to ingest the DIP data into Druid (a high performance, column oriented, distributed data store), via Kafka Tranquility…
- HAWQ/HDB and Hadoop with Hive and HBase
Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. HBase: Apache HBase™ is the Hadoop database, a distributed, scalable, big…
- Real Time Data Ingestion (DiP) – Spark Streaming (co-dev opportunity)
This blog is an extension to that and it focuses on integrating Spark Streaming to Data Ingestion Platform for performing real time data ingestion and visualization. The previous blog DiP (Storm Streaming) showed how…
- Working with SQL Server Database on Microsoft Azure (Part 2)
The huge volume of data generated around us, today, requires optimum storage and management to have any benefit for organizations. With a host of advantages that MS Azure possesses, especially…
- Teradata Intelligent Memory (TIM)
Overview of Teradata Teradata is a RDBMS (Relational Database Management System). This system is solely based on off-the-shelf (readymade) symmetric multiprocessing (SMP) technology combined with communication networking, connecting SMP systems…