
Real-time data ingestion – Easy and Simple (co-dev opportunity)


This work is based on Xavient's co-dev initiative, where your engineers can work alongside our team to contribute and build your own platform to ingest any kind of data in real time.

All you need is a running Hadoop cluster with Kafka, Storm, Hive, and HBase. You can deploy the application on top of your existing cluster and ingest any kind of data.

The platform described in this blog is a prototype. If you are interested in contributing and getting involved, reach out to us on Twitter @FollowXavient.

DiP High-Level Process Workflow

DiP

“Data Ingestion Platform utilizes the true power of the latest cutting-edge technologies in the big data ecosystem to achieve near real-time data analytics and visualization.”

DiP, scalable up to thousands of nodes, can take in data from multiple sources and in different formats, store it in multiple platforms, and give you the ability to query the data on the go.

 Multiple Sources

DiP can take data from multiple sources: you can push data manually, upload files, or use a scheduler to automate the workflow.

 Multiple File Formats

DiP can process different file formats such as XML, JSON, and CSV using an implicit data-handling mechanism that is transparent to the client.
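As an illustration of what such a mechanism could look like, the hypothetical Java helper below guesses the format from the first non-whitespace character of the payload. It is a sketch of the idea only, not the actual DiP code.

public class MessageTypeDetector {

    public enum MessageType { XML, JSON, CSV }

    // Guess the format from the first non-whitespace character of the payload
    public static MessageType detect(String payload) {
        String trimmed = payload.trim();
        if (trimmed.startsWith("<")) {
            return MessageType.XML;          // XML documents start with a tag
        }
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return MessageType.JSON;         // JSON objects or arrays
        }
        return MessageType.CSV;              // fall back to delimited text
    }

    public static void main(String[] args) {
        System.out.println(detect("{\"id\": 1}"));                  // prints JSON
        System.out.println(detect("<record><id>1</id></record>"));  // prints XML
        System.out.println(detect("1,John,NY"));                    // prints CSV
    }
}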

 Easy to use UI

DiP comes with an easy to use, aesthetic user interface to start data processing.

 Data Persistence

DiP stores data at lightning-fast speed into multiple structured and unstructured storage platforms.

 Data Visualization

Visualize data in near real time using different reporting styles such as graphs, charts, etc.

  • Input to the application can be fed from a user interface that lets you either enter data manually or upload data in XML, JSON, or CSV format for bulk processing
  • Ingested data is published by the Kafka broker, which streams it to the Kafka spout acting as the consumer for the topology
  • Once the message type is identified, the content of the message is extracted and sent to different bolts for persistence – the HBase bolt or the HDFS bolt (a minimal topology sketch follows this list)
  • A Hive external table provides access to the data stored in HDFS, while Apache Phoenix provides a SQL interface to the HBase tables
  • Reporting and visualization of the data is done through Apache Zeppelin
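To make the workflow concrete, here is a minimal sketch of how such a topology could be wired together using the storm-kafka KafkaSpout from Apache Storm 1.x. The bolt class names, topic name, and ZooKeeper address are illustrative assumptions standing in for DiP's own components, which are not shown in this post.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class DipTopology {
    public static void main(String[] args) throws Exception {
        // Kafka spout consumes raw messages from an assumed "dip-input" topic
        BrokerHosts zk = new ZkHosts("zk-host:2181");
        SpoutConfig spoutConfig = new SpoutConfig(zk, "dip-input", "/dip", "dip-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);

        // Filter bolt identifies the message type (XML/JSON/CSV) and extracts the content.
        // MessageFilterBolt, HBasePersistenceBolt and HdfsPersistenceBolt are placeholders
        // for DiP's own bolt implementations.
        builder.setBolt("filter-bolt", new MessageFilterBolt(), 2)
               .shuffleGrouping("kafka-spout");
        builder.setBolt("hbase-bolt", new HBasePersistenceBolt(), 2)
               .shuffleGrouping("filter-bolt");
        builder.setBolt("hdfs-bolt", new HdfsPersistenceBolt(), 2)
               .shuffleGrouping("filter-bolt");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("dip-topology", conf, builder.createTopology());
    }
}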

Technology Stack

 Source System – Web Client

Messaging System – Apache Kafka

Target System – HDFS, Apache HBase, Apache Hive

Reporting System – Apache Phoenix, Apache Zeppelin

Topology Builder – Apache Storm

Programming Language – Java

IDE – Eclipse

Build tool – Apache Maven

Operating System – CentOS 7

DiP Front End

Screen 1 – Use the message box to feed data to the Data Ingestion Platform

Screen 2 – Alternatively, upload files to feed data to the Data Ingestion Platform

DiP Execution Flow

Below is a snapshot of the DiP topology, which runs across many worker nodes on different machines. The Kafka spout passes the input stream to the filter bolt, which transforms the incoming data; the other bolts then persist the data into the various target systems.

 

DiP Data Visualization

 

Using Apache Zeppelin, the data ingested into HBase can be viewed as reports and graphs by simply using the Phoenix interpreter, which provides a SQL-like interface to the HBase tables. These graphs can also be embedded into other applications using iframes.
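Because Phoenix exposes HBase through a standard JDBC driver, the same data can also be queried programmatically from Java. The snippet below is a hedged example; the ZooKeeper quorum, the table name DIP_MESSAGES, and the column names are placeholders rather than DiP's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL format: jdbc:phoenix:<zookeeper quorum>:<port>:<hbase root znode>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT MSG_TYPE, COUNT(*) AS CNT FROM DIP_MESSAGES GROUP BY MSG_TYPE")) {
            while (rs.next()) {
                // Print one row per message type with its record count
                System.out.println(rs.getString("MSG_TYPE") + " : " + rs.getLong("CNT"));
            }
        }
    }
}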

Demo (Gautam Marya)

https://youtu.be/-QRR6qiFL_U

Happy Hadooping!!

Credits:

Xavient Information Systems

Technical team:

Gautam Marya

Puneet Singh

Sumit Chauhan

Mohin Khan

Mukesh Kumar
