Today, every piece of information, whether it is simple text, a large document or an image, holds some importance for businesses and needs to be stored. While most businesses rely on Relational Database Management Systems (RDBMS) to create, retrieve, update and manage data, evolving needs require them to look beyond such databases to something that can store and retrieve data in forms other than tabular relations. This is where Apache Cassandra, a highly scalable, high-performance distributed database, comes into the picture.
Cassandra was developed at Facebook and released as open source in 2008, and the Apache Incubator accepted it the following year. Though Cassandra took its time to make a mark, today its acceptance by businesses is growing by leaps and bounds, owing to its different approach to data management. In this post, we look at what makes Apache Cassandra one of the most sought-after databases, along with its implementation and basic functions.
Apache Cassandra is an open source, distributed and decentralized storage system. It was designed as a highly scalable, high-performance database to handle large volumes of data across commodity servers, with high availability and no single point of failure. It is a NoSQL database system, providing a mechanism to store and retrieve data in ways other than the tabular relations used in relational databases. In addition, it is schema-free, supports easy replication, has a simple API, and can handle huge volumes of data.
A distributed DBMS has three primary concerns: consistency, availability, and partition tolerance. However, according to the CAP theorem, stated by computer scientist Eric Brewer, a distributed database system can fully provide only two of the three at the same time.
The following figure used by Nathan Hurst in his blog post on a visual guide to NoSQL systems is a simple illustration of the CAP theorem. As we can see in the diagram, Apache Cassandra falls under the AP side of the triangle.
Apache Cassandra uses a peer-to-peer distributed architecture, which helps distribute data among all nodes in the cluster. Each node is independent and interconnected with the other nodes, so any node can accept read/write requests irrespective of where the data is located in the cluster. When a node goes down, this interconnectivity allows read/write requests to be served by other nodes in the network. It relies on the Gossip Protocol, which runs in the background and allows nodes to communicate with each other and detect faulty nodes within the cluster.
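To make the gossip idea a little more concrete, here is a toy simulation in Python. It is only a sketch under our own assumptions and not Cassandra's actual implementation (which uses a more elaborate failure detector): the class GossipNode, the function gossip_round, the fixed peer rotation and the timeout value are all hypothetical. Each live node bumps a heartbeat counter every round, swaps its view with one peer, and suspects any node whose heartbeat has not advanced for a while.

```python
class GossipNode:
    """Hypothetical toy node; needs a cluster of at least two nodes."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.heartbeat = 0
        # peer name -> (highest heartbeat seen, round in which it last increased)
        self.view = {}

    def tick(self, rnd):
        # A live node advances its own heartbeat once per round.
        self.heartbeat += 1
        self.view[self.name] = (self.heartbeat, rnd)

    def merge(self, remote_view, rnd):
        # Keep the freshest heartbeat known for every peer.
        for peer, (beat, _) in remote_view.items():
            known_beat, _ = self.view.get(peer, (-1, rnd))
            if beat > known_beat:
                self.view[peer] = (beat, rnd)

    def suspected_down(self, rnd, timeout):
        # A peer is suspected when its heartbeat has been stale for too long.
        return sorted(p for p, (_, seen) in self.view.items() if rnd - seen > timeout)


def gossip_round(nodes, rnd):
    """One round: live nodes tick, then each exchanges views with one peer."""
    for node in nodes:
        if node.alive:
            node.tick(rnd)
    # Assumption: a deterministic peer rotation instead of random peer choice.
    step = 1 + rnd % (len(nodes) - 1)
    for i, node in enumerate(nodes):
        peer = nodes[(i + step) % len(nodes)]
        if node.alive and peer.alive:
            snapshot = dict(node.view)
            node.merge(peer.view, rnd)
            peer.merge(snapshot, rnd)


if __name__ == "__main__":
    cluster = [GossipNode(n) for n in ("a", "b", "c", "d")]
    for rnd in range(6):
        gossip_round(cluster, rnd)
    cluster[2].alive = False  # node "c" goes down
    for rnd in range(6, 16):
        gossip_round(cluster, rnd)
    print(cluster[0].suspected_down(15, timeout=5))
```

The point of the sketch is that no node plays a special role: every node learns about every other node indirectly, so a failure is noticed without any central coordinator.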
Key Components of Cassandra
- Node – The place where data is stored.
- Data Center – Collection of related nodes.
- Cluster – The component that contains one or more data centers.
- Commit Log – It is the place where every write operation is written.
- Mem-table – It is a memory-resident data structure, where the data is written after the commit log. Sometimes there are multiple mem-tables for a single column family.
- Row Cache – It holds the entire content of a row in the memory. It caches data instead of reading it from the disk on every occasion.
- Key Cache – It holds, in memory, the location of keys for every column family.
- SSTable – It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
- Bloom Filter – It is a quick, probabilistic structure for testing whether an element is a member of a set; it can return false positives but never false negatives. A Bloom filter is a special kind of cache and is consulted on every query.
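The way the commit log, mem-table and SSTable from the list above work together can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not Cassandra's storage engine: the class TinyStore and its flush_threshold parameter are hypothetical, everything lives in memory, and commit log cleanup and compaction are left out.

```python
class TinyStore:
    """Hypothetical toy model of the write path described in the list above."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # flushed tables, oldest first (stand-in for disk files)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write in the commit log
        self.memtable[key] = value            # 2. then write it to the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once the threshold is reached

    def flush(self):
        # Store the mem-table contents sorted by key and start a fresh mem-table.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # The newest data wins: check the mem-table first, then newer SSTables.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


if __name__ == "__main__":
    store = TinyStore(flush_threshold=2)
    store.write("a", 1)
    store.write("b", 2)   # reaches the threshold and triggers a flush
    store.write("a", 3)   # newer value stays in the mem-table for now
    print(store.read("a"), store.read("b"), len(store.sstables))
```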
Figure 1: source (http://www.tutorialspoint.com/cassandra/cassandra_quick_guide.htm)
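The Bloom filter entry in the component list is easier to grasp with a small example. The following Python sketch is a generic Bloom filter, not the one Cassandra ships: the class name, the bit-array size, the number of hash functions and the use of SHA-256 are all our own assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter for illustration."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("row-1")
    print(bf.might_contain("row-1"))
```

A "no" from the filter is always reliable, which is what lets a read skip an SSTable without touching the disk; a "yes" still has to be confirmed by actually looking in the file.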
Installing Apache Cassandra is a simple step-by-step process. We just need to check the main prerequisites: the latest version of Java 8 (either Oracle Java Standard Edition 8 or OpenJDK 8) and Python 2.7 for using cqlsh.
Run the following command to update Ubuntu Repository:
sudo apt update
Run the following command to install Python:
sudo apt install python-software-properties -y
Add the new Java PPA repository to the system using the following command:
sudo add-apt-repository -y ppa:webupd8team/java
Now, update all Ubuntu Repositories with the following command:
sudo apt update
Run the following command to install Java:
sudo apt install oracle-java8-installer -y
You will see the following screen. Click YES to proceed with installation.
Run the following command to verify that Java 8 is installed:
java -version
Installing Apache Cassandra
We can install Cassandra using either binary tarball files or Debian packages from the Apache Repository. We followed the Debian package method for installation.
The first thing we did was to add the Apache Cassandra Repository, followed by adding a new key, and then finally install Apache Cassandra.
Run the following command to add the new Apache Cassandra Repository:
echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a
Run the following command to add and sign the software developer key:
wget -qO- https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
Update Ubuntu Repositories with the following command:
sudo apt update
Then install Apache Cassandra:
sudo apt install cassandra -y
After the installation is complete, start the Cassandra service and then enable it to run at boot time using the following systemctl commands:
systemctl start cassandra
systemctl enable cassandra
In our case, the commands failed with an error message indicating that we did not have the necessary permissions, so we needed to log in with:
sudo sh -
Now, we retry the systemctl commands to start and enable the service.
To check the service status, use the following command:
systemctl status cassandra
Hence, we see that Apache Cassandra is successfully installed and running on the system.
Cassandra: Data Model
In Cassandra, a Keyspace is the outermost container for data. Its basic attributes are:
Replication factor
It is the number of machines in the cluster that will receive copies of the same data.
Replica placement strategy
It is the strategy to place replicas in the ring. There are three strategies for it:
- Rack-unaware strategy or simple strategy
- Rack-aware strategy or old network topology strategy
- Datacenter-shared strategy or network topology strategy
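To see how the replication factor and the simple strategy interact, consider the following Python sketch. It assumes the usual description of the simple strategy (the first replica goes to the node that owns the key's position on the ring, and further replicas go to the next nodes clockwise). The functions token and replicas_for, the node names and the use of MD5 as a stand-in hash are hypothetical; real clusters use a configurable partitioner and token assignment.

```python
import hashlib
from bisect import bisect_left


def token(value):
    # Assumption: MD5 is only a stand-in for the cluster's partitioner.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def replicas_for(key, nodes, replication_factor):
    """Return the nodes that hold copies of `key` under a rack-unaware placement."""
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    # First replica: the first node whose token is >= the key's token (wrapping around).
    start = bisect_left(tokens, token(key)) % len(ring)
    # There cannot be more replicas than nodes.
    count = min(replication_factor, len(ring))
    # Remaining replicas: the next nodes clockwise on the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d"]
    print(replicas_for("user:42", cluster, replication_factor=3))
```

With a replication factor of 3 on a four-node ring, every key ends up on three distinct machines, so losing any single node still leaves two copies available.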
Column families
Column families represent the structure of data. A Keyspace serves as a container for a number of column families, which, in turn, are containers of a collection of rows. Each row in the column family contains ordered columns.
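The nesting described above (a Keyspace holding column families, which hold rows of ordered columns) can be pictured with plain Python dictionaries. This is only a mental model, not how Cassandra stores data; the helper names new_keyspace and put, and the sample keyspace, family and column names, are hypothetical.

```python
def new_keyspace(name, replication_factor):
    # Hypothetical in-memory stand-in for a Keyspace and its attributes.
    return {
        "name": name,
        "replication_factor": replication_factor,
        "column_families": {},
    }


def put(keyspace, family, row_key, column, value):
    """Set one column of one row, keeping the row's columns ordered by name."""
    rows = keyspace["column_families"].setdefault(family, {})
    row = rows.setdefault(row_key, {})
    row[column] = value
    rows[row_key] = dict(sorted(row.items()))


if __name__ == "__main__":
    ks = new_keyspace("demo_keyspace", replication_factor=3)
    put(ks, "users", "u1", "name", "Ada")
    put(ks, "users", "u1", "email", "ada@example.com")
    print(ks["column_families"]["users"]["u1"])
```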
Cassandra provides the Cassandra Query Language shell (cqlsh) by default for users to communicate with it. The shell is used to execute Cassandra Query Language (CQL) statements.
To access cqlsh, run the following command in the terminal (we used PuTTY):
cqlsh
First, we need to create a Keyspace to store the data. The syntax for creating a Keyspace is as follows:
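As a rough guide to what such a statement looks like, the Python helper below assembles a CREATE KEYSPACE statement as a string that can be pasted into cqlsh. It is a sketch, not the exact statement we ran: the helper create_keyspace_cql and the keyspace name demo_keyspace are hypothetical, and the replication settings are example values.

```python
def create_keyspace_cql(name, replication_factor, strategy="SimpleStrategy"):
    """Build a CREATE KEYSPACE statement (hypothetical helper for illustration)."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': '{strategy}', "
        f"'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # demo_keyspace is a made-up name; substitute your own.
    print(create_keyspace_cql("demo_keyspace", 1))
```

The two pieces of the replication map correspond to the Keyspace attributes discussed earlier: the replica placement strategy and the replication factor.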
Verify that the Keyspace was created with the following command:
DESCRIBE KEYSPACES;
Now that the Keyspace is created, we can go ahead with adding data to it. We begin by creating a table and inserting a few rows in it:
Command to create table:
Command for bulk insert:
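The general shape of these two commands can be sketched as follows. The Python helpers below only build CQL text for cqlsh; they are hypothetical, as are the keyspace, table and column names and the sample rows, and they do not reproduce the exact commands we ran. Plain string formatting like this is fine for a demo but should not be used with untrusted input.

```python
def create_table_cql(keyspace, table):
    # Hypothetical schema: an integer key plus two text columns.
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        "(id int PRIMARY KEY, name text, city text);"
    )


def batch_insert_cql(keyspace, table, rows):
    """Wrap several INSERT statements in a single CQL batch."""
    lines = ["BEGIN BATCH"]
    for row_id, name, city in rows:
        lines.append(
            f"  INSERT INTO {keyspace}.{table} (id, name, city) "
            f"VALUES ({row_id}, '{name}', '{city}');"
        )
    lines.append("APPLY BATCH;")
    return "\n".join(lines)


if __name__ == "__main__":
    sample_rows = [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai")]  # made-up data
    print(create_table_cql("demo_keyspace", "employees"))
    print(batch_insert_cql("demo_keyspace", "employees", sample_rows))
```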
That completes the basics of Apache Cassandra, which was the scope of this blog post. There is much more to learn about Cassandra from here on, and we hope this article encourages you not only to learn but also to implement this DBMS and resolve your business challenges.
Do let us know your thoughts in the comments below.
Until next time!