This work is based on the Xavient co-dev initiative, in which your engineers can work with our team to contribute to and build your own platform for ingesting any kind of data in real time.
The previous blog, DiP (Storm Streaming), showed how we can leverage the power of Apache Storm and Kafka for real-time data ingestion and visualization.
This blog extends that work and focuses on using Apache Apex for real-time data ingestion.
All you need is a running Hadoop cluster with Apache Apex, Kafka, Hive, HBase and Zeppelin. You can deploy the application on top of your existing cluster and ingest any kind of structured or semi-structured data.
You can download the code base from GitHub.
Apache Apex vs Spark Streaming:
Apache Apex and features:
A DAG, or Directed Acyclic Graph, expresses the processing logic. It consists of operators (vertices) and streams (edges) that together constitute an Apache Apex application. Operators are the nodes of the graph, and they are connected by streams of events called tuples.
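To make the operator and stream vocabulary concrete, here is a minimal sketch in plain Java. It is not the Apex API: `MiniDag`, `addOperator`, `addStream` and `emit` are hypothetical names chosen for illustration, and each operator is assumed to map one input tuple to exactly one output tuple.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy model of a DAG: named operators (vertices) joined by streams (edges). Not the Apex API. */
class MiniDag {
    // Operator name -> logic applied to each tuple (assumption: one tuple in, one tuple out).
    private final Map<String, Function<String, String>> operators = new LinkedHashMap<>();
    // Operator name -> names of the downstream operators its output stream feeds.
    private final Map<String, List<String>> streams = new LinkedHashMap<>();

    MiniDag addOperator(String name, Function<String, String> logic) {
        operators.put(name, logic);
        streams.put(name, new ArrayList<>());
        return this;
    }

    MiniDag addStream(String from, String to) {
        if (!operators.containsKey(from) || !operators.containsKey(to)) {
            throw new IllegalArgumentException("unknown operator in stream " + from + " -> " + to);
        }
        // Keep the graph acyclic: reject the edge if 'to' can already reach 'from'.
        if (reaches(to, from)) {
            throw new IllegalArgumentException("stream " + from + " -> " + to + " would create a cycle");
        }
        streams.get(from).add(to);
        return this;
    }

    /** Pushes one tuple into 'source' and returns what the leaf operators emit. */
    List<String> emit(String source, String tuple) {
        List<String> out = new ArrayList<>();
        process(source, tuple, out);
        return out;
    }

    private void process(String name, String tuple, List<String> out) {
        String result = operators.get(name).apply(tuple);
        List<String> downstream = streams.get(name);
        if (downstream.isEmpty()) {
            out.add(name + ":" + result); // leaf operator: record its output
            return;
        }
        for (String next : downstream) {
            process(next, result, out);   // forward the tuple along each outgoing stream
        }
    }

    private boolean reaches(String start, String target) {
        if (start.equals(target)) {
            return true;
        }
        for (String next : streams.get(start)) {
            if (reaches(next, target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MiniDag dag = new MiniDag()
            .addOperator("input", String::trim)
            .addOperator("upper", String::toUpperCase)
            .addOperator("console", s -> s)
            .addStream("input", "upper")
            .addStream("upper", "console");
        System.out.println(dag.emit("input", "  hello ")); // prints [console:HELLO]
    }
}
```

In a real Apex application the platform, not the application code, moves tuples between operators and distributes the operators across the cluster; the sketch only shows the graph shape.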
Because Apache Apex is built on top of Apache YARN, it inherits built-in support for fault tolerance, scalability and operability. Apex is a true stream processor in the sense that each incoming record is processed and sent to the next stage of processing as soon as it arrives. It also supports micro-batching, like Spark Streaming.
In the future, Apex will provide support for streaming machine learning algorithms.
The Demo API has been tested with the following HDP 2.4 components:
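The difference between per-record processing and micro-batching can be sketched in a few lines of plain Java. This is an illustration only, not how Apex or Spark Streaming implement it: `DeliveryModes`, `perTuple` and `microBatch` are hypothetical names, and the batch is cut by record count for simplicity, whereas real engines typically cut batches by time interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Two ways of handing records to a downstream stage (illustrative names, not a real API). */
class DeliveryModes {
    /** Per-tuple: every record is forwarded the moment it arrives. */
    static <T> Consumer<T> perTuple(Consumer<List<T>> downstream) {
        return record -> {
            List<T> single = new ArrayList<>();
            single.add(record);
            downstream.accept(single);
        };
    }

    /** Micro-batch: records are buffered and forwarded once batchSize of them have arrived. */
    static <T> Consumer<T> microBatch(int batchSize, Consumer<List<T>> downstream) {
        List<T> buffer = new ArrayList<>();
        return record -> {
            buffer.add(record);
            if (buffer.size() >= batchSize) {
                downstream.accept(new ArrayList<>(buffer)); // hand over a copy of the batch
                buffer.clear();
            }
        };
    }
}
```

With `microBatch(2, ...)`, three incoming records produce a single downstream call carrying two records, and the third record waits in the buffer; with `perTuple(...)` the same three records produce three downstream calls immediately.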
– Apache Hadoop 2.7.1.2.4
– Apache Kafka 0.9.0.2.4
– Apache Apex 3.4.0
– Apache HBase 1.1.2.2.4
– Apache Hive 1.2.1.2.4
– Apache Zeppelin 0.6.0.2.4
– Apache Tomcat Server 8.0
– Apache Phoenix 4.4.0.2.4
– Apache Maven
– Java 1.7 or later
High Level Process Flow:
- Input to the application can be fed from a user interface that lets you either enter data manually or upload it as an XML, JSON or CSV file for bulk processing
- Ingested data is published to a Kafka broker, which streams the data to the Kafka operator
- A custom Apex operator (the Classifier operator) identifies the message type, extracts the message and sends it to different operators for further processing or persistence
- Operators receive the data and persist it to a storage layer such as NoSQL, HDFS, etc.
- A Hive external table exposes the data stored in HDFS, and Phoenix provides an SQL interface for HBase tables
- Reporting and visualization of data is done through Zeppelin
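The classification step in the flow above can be sketched as follows. This is a plain-Java illustration, not the actual Classifier operator from the DiP code base: `MessageClassifier`, `Format`, `classify` and `route` are hypothetical names, and detecting the format from the first non-blank character is an assumed heuristic.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative classifier: guesses a message's format and groups messages by it. */
class MessageClassifier {
    enum Format { XML, JSON, CSV, UNKNOWN }

    /** Assumed heuristic: inspect the first non-blank character, then look for a comma. */
    static Format classify(String message) {
        String m = message == null ? "" : message.trim();
        if (m.isEmpty()) {
            return Format.UNKNOWN;
        }
        char first = m.charAt(0);
        if (first == '<') {
            return Format.XML;
        }
        if (first == '{' || first == '[') {
            return Format.JSON;
        }
        if (m.indexOf(',') >= 0) {
            return Format.CSV;
        }
        return Format.UNKNOWN;
    }

    /** Groups messages by detected format, standing in for sending them to separate operators. */
    static Map<Format, List<String>> route(List<String> messages) {
        Map<Format, List<String>> routed = new EnumMap<>(Format.class);
        for (Format f : Format.values()) {
            routed.put(f, new ArrayList<>());
        }
        for (String msg : messages) {
            routed.get(classify(msg)).add(msg);
        }
        return routed;
    }
}
```

In the real pipeline each bucket would correspond to a downstream operator that parses the message and persists it to HBase or HDFS; a production classifier would also need to validate the payload rather than trust its first character.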
DiP Front End:
Application Package Archive – DAG:
The submitted application's Directed Acyclic Graph (DAG) looks like this:
DiP Data Visualization:
With Apache Zeppelin, the data ingested into HBase can be viewed as reports and graphs simply by using the Phoenix interpreter, which provides an SQL-like interface to HBase tables. These graphs can be embedded in other applications using iframes.