Amazon Athena: An Overview

Posted by Akansha Agarwal

It is difficult to fathom the progress of the computing world. Only a couple of decades ago, the 1.44 MB offered by a floppy disk seemed like ample storage. Today, we carry smartphones with 128 GB!

This is because the amount of data that we process on a regular basis has grown exponentially in a short time. The primary driver is organizations' need to process and query enormous volumes of structured or semi-structured data for insights, and to do so at high speed. Organizations today do not want to spend time preprocessing, scanning, loading, or indexing data. Instead, they want to get started in a matter of minutes – identifying the data, running queries, getting the results, and acting on them.

This blog post introduces Amazon Athena, a new-age serverless query service for analyzing large volumes of data.

Introduction

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. With just a few actions in the AWS Management Console, we can point Athena at data stored in Amazon S3 and start executing standard SQL queries for quick results.
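As a minimal sketch of what that looks like, the query below assumes a table named access_logs has already been defined over log files in S3 (the table and column names are purely illustrative):

```sql
-- Hypothetical table and columns; assumes access_logs is defined over S3 data.
SELECT status_code,
       COUNT(*) AS requests
FROM access_logs
WHERE request_date = DATE '2018-01-15'
GROUP BY status_code
ORDER BY requests DESC;
```

There is nothing Athena-specific in the query itself – any standard SQL client skillset carries over directly.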

Image Source: https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/

Features

Pay Per Query

Athena is serverless, which means you neither have to set up nor maintain any infrastructure. You pay only for the queries you execute, billed on the volume of data scanned per query.

Easily Scalable

Athena scales automatically and executes queries in parallel. Thus, results come back quickly, irrespective of the size of the dataset and the complexity of the queries.

Quick Querying

Athena supports ANSI SQL, including a set of data definition language (DDL) statements along with standard SQL functions and operators.
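For example, DDL statements such as SHOW TABLES sit alongside ordinary queries using standard functions and operators (the access_logs table below is hypothetical):

```sql
-- DDL: inspect the catalog.
SHOW TABLES;

-- Standard functions and operators in a query.
SELECT lower(user_agent) AS agent,
       date_add('day', 30, request_date) AS expires_on
FROM access_logs
LIMIT 10;
```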

Highly Secure

Amazon Athena uses AWS Identity and Access Management (IAM) policies to regulate access to its operations. Encryption options are available both for query result files and for the underlying data in Amazon S3; accessing and decrypting those files requires the appropriate permissions.
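As a sketch, an IAM policy granting query access might combine Athena actions with read access to the underlying bucket (the bucket name below is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```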

Advantages

  • It supports multiple data formats, such as plain text, CSV, TSV, JSON, web logs, and AWS service logs
  • It can query data directly in Amazon S3, so no ETL is required
  • It is fine-tuned to deliver high performance
  • It automatically parallelizes queries
  • It streams results to the console and also stores them in S3
  • It supports data compression
  • It supports columnar formats
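The last two points matter for cost as well as speed: because billing is based on data scanned, compressed columnar data combined with column pruning directly reduces the bill. A sketch (the table name is illustrative):

```sql
-- With a columnar format, Athena reads only the referenced columns,
-- so this scans far less data than SELECT *.
SELECT status_code, request_date
FROM access_logs_parquet;   -- hypothetical Parquet-backed table
```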

Limitation

Athena is built on open-source Presto (version 0.172), but it does not support all of its features; for example, user-defined functions (UDFs) and stored procedures are not available.

Understanding Tables and Databases in Athena

In Athena, tables and databases contain metadata with definitions of a schema for an underlying data source. A table exists in Athena for each dataset. The metadata enables Athena to know the location of the data in Amazon S3, along with the data structure (column names, data types, and table name). Databases are a logical grouping of tables and store the metadata and schema information for a dataset.

Athena uses HiveQL DDL statements such as CREATE DATABASE, CREATE TABLE, and DROP TABLE to create tables and databases in the AWS Glue Data Catalog, or in its internal data catalog in Regions where AWS Glue is not available.
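A minimal sketch of those DDL statements, assuming an illustrative bucket and schema:

```sql
CREATE DATABASE IF NOT EXISTS weblogs;

-- Tables in Athena are external: the data itself stays in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs.access_logs (
  request_date DATE,
  status_code  INT,
  user_agent   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/access-logs/';
```

Dropping the table removes only the metadata; the files in S3 remain untouched.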

The following figure depicts the Database, Tables and Query Editor where we can perform ANSI SQL operations.

AWS Service Integrations with Athena

We can integrate Athena with Amazon QuickSight for data visualization. This makes it easy to generate reports, or to explore data with business intelligence tools or SQL clients connected via a JDBC or an ODBC driver.

The following figure depicts, step by step, such a serverless analytics setup.

Image Source: https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/

Final Few Words

Amazon Athena is a tool that caters to the varied needs of organizations. It empowers companies to analyze data via standard SQL without having to manage any infrastructure, makes accessing data through a business intelligence tool via the JDBC driver the norm, and charges based on the volume of data scanned per query.

As was evident from this blog, converting your data into open-source columnar formats not only saves costs but also improves performance.
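One way to do that conversion within Athena itself is a CREATE TABLE AS SELECT (CTAS) statement that rewrites the data as Parquet (the table and bucket names are illustrative):

```sql
-- Rewrite a CSV-backed table into compressed, columnar Parquet.
CREATE TABLE weblogs.access_logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-example-bucket/access-logs-parquet/'
) AS
SELECT * FROM weblogs.access_logs;
```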

That is all from our side! Let us know your thoughts in the comments below.

Until next time.
