Real Time Data Ingestion (DiP) – Spark Streaming (co-dev opportunity)

The previous blog, DiP (Storm Streaming), showed how we can leverage the power of Apache Storm and Apache Kafka to do real time data ingestion and visualization.

This blog is an extension of that work and focuses on integrating Spark Streaming with the Data Ingestion Platform (DiP) to perform real time data ingestion and visualization.

Besides Spark Streaming, DiP currently supports three other data streaming engines: Apache Storm, Apache Apex, and Apache Flink.

This work is based on the Xavient co-dev initiative, where your engineers can work with our team to contribute and build your own platform to ingest any kind of data in real time.

All you need is a running Hadoop cluster with Kafka, Spark, HBase, Hive, and Zeppelin.

You can download the code base for DiP from GitHub and start ingesting and visualizing data on the go.

Spark Streaming Features

  • High-level API
  • Fault tolerant
  • Deep integration with the Spark ecosystem (MLlib, Spark SQL, GraphX)
  • Java and Scala client bindings
  • Very high throughput

Technology Stack

  • Source System – Web Client
  • Messaging System – Apache Kafka
  • Target System – HDFS, Apache HBase, Apache Hive
  • Reporting System – Apache Phoenix, Apache Zeppelin
  • Streaming API – Apache Spark
  • Programming Language – Java
  • IDE – Eclipse
  • Build Tool – Apache Maven
  • Operating System – CentOS 7

High-Level Process Workflow

[Figure: DiP Spark Streaming architecture]

  • Input to the application can be fed from a user interface that lets you either enter data manually or upload it in XML, JSON, or CSV file format for bulk processing
  • Ingested data is published to the Kafka broker, which streams it to the Kafka consumer process
  • Once the message type is identified, the content of the message is extracted and sent to different executors (see the consumer sketch after this list)
  • A Hive external table provides data storage through HDFS, and Phoenix provides a SQL interface to the HBase tables
  • Reporting and visualization of data is done through Zeppelin
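To make the consumer side concrete, here is a minimal sketch of such a pipeline using the Spark Streaming Kafka direct-stream API. This is an illustration, not the actual DiP code: the broker address, topic name, and group id are placeholder assumptions, and the parse/write step is left as a comment.

    import java.util.*;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.*;
    import org.apache.spark.streaming.kafka010.*;

    public class DipSparkConsumer {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DiP-SparkStreaming");
        // micro-batch interval of 5 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumption
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "dip-consumer");            // assumption

        // subscribe to the ingestion topic (hypothetical name)
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("dip-topic"), kafkaParams));

        // extract the message payload; each partition is processed on an executor
        stream.map(record -> record.value()).foreachRDD(rdd ->
            rdd.foreachPartition(messages -> {
              while (messages.hasNext()) {
                String payload = messages.next();
                // identify message type (XML/JSON/CSV), parse it, and
                // UPSERT the rows into HBase via the Phoenix JDBC driver
              }
            }));

        jssc.start();
        jssc.awaitTermination();
      }
    }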

DiP Front End

You can download the DataIngestUI web application from GitHub and use it to feed data to your streaming application.

You can enter data in two ways:

  • Message box

[Screenshot: DiP front end – message box]

  • Upload the file

[Screenshot: DiP front end – file upload]
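Behind the scenes, the front end hands each submitted message to Kafka. A minimal sketch of that producer step is shown below; the broker address, topic name, and sample payload are placeholder assumptions, not the actual DataIngestUI code.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.*;

    public class DipKafkaProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
          // a single JSON message, as if typed into the DiP message box
          producer.send(new ProducerRecord<>("dip-topic",
              "{\"id\":1,\"name\":\"example\"}"));
        }
      }
    }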

Job Submission and Status:

By following the steps given at the GitHub link, you can submit your Spark application to the cluster. A typical submit command is sketched below; once the job is running, you can see its status on the Spark UI, as shown in the screenshot that follows.
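The class and jar names in this command are placeholders, so use the exact values from the DiP GitHub README:

    spark-submit \
      --class com.xavient.dip.spark.SparkStreamingApp \
      --master yarn \
      --deploy-mode client \
      dip-spark-streaming.jar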

[Screenshot: Spark UI showing the running DiP application]

DiP Data Visualization

Using Apache Zeppelin, data ingested into HBase can be viewed as reports/graphs by simply using the Phoenix interpreter, which provides a SQL-like interface to the HBase tables. These graphs can also be embedded in other applications using iframes.
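For instance, a Zeppelin paragraph that queries the ingested data through the Phoenix interpreter might look like the following; the table and column names here are hypothetical, and Zeppelin renders the result as a table or chart:

    %phoenix
    SELECT MSG_TYPE, COUNT(*) AS TOTAL
    FROM DIP_DATA
    GROUP BY MSG_TYPE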

  • Example 1:

[Screenshot: Zeppelin graph over DiP data – example 1]

  • Example 2:

[Screenshot: Zeppelin graph over DiP data – example 2]

Credits:

Xavient Information Systems

Technical team:

Neeraj Sabharwal

Mohiuddin Khan Inamdar

Gautam Marya

Puneet Singh

Sumit Chauhan

