This work is based on the Xavient co-dev initiative, where your engineers work alongside our team to contribute and build your own platform for ingesting any kind of data in real time.
All you need is a running Hadoop cluster with Kafka, Storm, Hive and HBase. You can deploy the application on top of your existing cluster and start ingesting any kind of data.
This blog describes a prototype. If you are interested in contributing and getting involved, reach out to us on Twitter @FollowXavient.
DiP High Level Process Workflow
DiP
“Data Ingestion Platform harnesses the true power of the latest cutting-edge technologies in the big data ecosystem to achieve near real-time data analytics and visualization.”
DiP scales to thousands of nodes, takes in data from multiple sources and in different formats, stores it across multiple platforms, and gives you the ability to query the data on the go.
Multiple Sources
DiP can take data from multiple sources: you can push data manually, upload files, or use a scheduler to automate the workflow.
Multiple File Formats
DiP can process different file formats such as XML, JSON and CSV using a built-in format-detection mechanism that is transparent to the client.
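As a rough illustration, such detection can be as simple as sniffing the first non-whitespace character of the payload. The sketch below is a hypothetical example, not DiP's actual implementation:

```java
// Hypothetical sketch of routing an incoming payload by format.
// The class and method names are illustrative only.
public final class FormatDetector {

    public enum Format { XML, JSON, CSV }

    public static Format detect(String payload) {
        String trimmed = payload.trim();
        if (trimmed.startsWith("<")) {
            return Format.XML;                  // e.g. <record>...</record>
        }
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return Format.JSON;                 // e.g. {"id": 1, ...}
        }
        return Format.CSV;                      // fall back to delimited text
    }
}
```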
Easy to use UI
DiP comes with an easy-to-use, aesthetic user interface to start data processing.
Data Persistence
DiP persists data rapidly into multiple structured and unstructured storage platforms.
Data Visualization
Visualize data in near real time using different reporting styles such as graphs and charts.
- Input to the application is fed from a user interface that lets you either enter data manually or upload data in XML, JSON or CSV format for bulk processing
- Ingested data is published to a Kafka topic; the broker streams it to the Kafka spout, which acts as the consumer across the topology (a minimal producer sketch follows this list)
- Once the message type is identified, the content of the message is extracted and sent to different bolts for persistence: the HBase bolt or the HDFS bolt
- A Hive external table provides access to the data stored in HDFS, while Phoenix provides a SQL interface over the HBase tables
- Reporting and visualization of data is done through Zeppelin
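For reference, here is a minimal sketch of how a client could publish a message to the topology's Kafka topic. The broker address and the topic name "dip-topic" are assumptions for illustration, not DiP's actual configuration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DipProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // "dip-topic" is an illustrative topic name, not necessarily DiP's own.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("dip-topic",
                    "{\"id\": 1, \"name\": \"sample record\"}"));
        }
    }
}
```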
Technology Stack
Source System – Web Client
Messaging System – Apache Kafka
Target System – HDFS, Apache HBase, Apache Hive
Reporting System – Apache Phoenix, Apache Zeppelin
Topology Builder – Apache Storm
Programming Language – Java
IDE – Eclipse
Build tool – Apache Maven
Operating System – CentOS 7
DiP Front End
Screen 1 – Use message box to feed data to Data Ingestion Platform
Screen 2 – Alternatively, upload files to feed data to Data Ingestion Platform
DiP Execution Flow
Below is a snapshot of the DiP topology, which runs across many worker nodes on different machines. The Kafka spout passes the input stream to a filter bolt, which transforms the incoming data; other bolts then persist it into the various target systems.
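To make the wiring concrete, below is a minimal sketch of such a topology using the Storm 1.x API. The ZooKeeper address, topic name, zkRoot and the FilterBolt stand-in are illustrative assumptions, and the HBase/HDFS persistence bolts are omitted for brevity:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DipTopology {

    // Minimal stand-in for DiP's filter bolt: in the real topology this is
    // where the message type (XML/JSON/CSV) is identified and transformed.
    public static class FilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getStringByField("str")));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }

    public static void main(String[] args) throws Exception {
        // ZooKeeper address, topic name and zkRoot are illustrative assumptions.
        SpoutConfig spoutConfig =
                new SpoutConfig(new ZkHosts("localhost:2181"), "dip-topic", "/dip", "dip-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        builder.setBolt("filter-bolt", new FilterBolt(), 2).shuffleGrouping("kafka-spout");
        // In DiP the filter bolt would also feed HBase and HDFS persistence bolts
        // (e.g. from the storm-hbase and storm-hdfs modules); omitted for brevity.

        new LocalCluster().submitTopology("dip-topology", new Config(), builder.createTopology());
    }
}
```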
DiP Data Visualization
Using Apache Zeppelin, the data ingested into HBase can be viewed as reports and graphs simply by using the Phoenix interpreter, which provides a SQL-like interface over the HBase tables. These graphs can be embedded into any other application using iframes.
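The same Phoenix SQL that backs a Zeppelin paragraph can also be run from plain JDBC. The sketch below assumes a hypothetical DIP_MESSAGES table and a local ZooKeeper quorum:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixReport {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper quorum and the DIP_MESSAGES table are illustrative assumptions.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT MSG_TYPE, COUNT(*) FROM DIP_MESSAGES GROUP BY MSG_TYPE")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```

Typed into a Zeppelin paragraph under the Phoenix interpreter, the same query renders as a chart instead of console output.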
Demo (Gautam Marya)
Happy Hadooping!!
Credits:
Technical team: