Content Data Store
Content Data Store (CDS) is a system that provides storage for massive data sets in the form of images, PDFs, documents and scanned documents; this data is processed, organized and managed by CDS. CDS is a fast ingestion and lookup system for heterogeneous data sets and their content. Businesses want to store heterogeneous data from various sources in Hadoop or NoSQL databases and run analytics on its contents. This calls for a platform that helps build a scalable, structured data store, and that is what CDS is meant for.
- High Speed Image Store
- Content Store
- Real Time Analytics
- Text detection & extraction
- Document Analysis
- Actions & Decisions
The API has been tested on the following HDP 2.5 components:
- Apache Hadoop 2.7.3
- Apache Kafka 0.10.0.1
- Apache HBase 2.0.0-SNAPSHOT
- Apache Solr 6.2.1
- Apache Maven
- Java 1.7 or later
CDS Technology Stack
- Storage System – HDFS, NoSQL (HBase), Apache Hive
- Scheduling System – Apache Oozie
- Reporting System – Zeppelin, Custom UI, WebHDFS, Hue
- Indexing – Apache Solr
- Programming Language – Java
- IDE – Eclipse
- Operating System – CentOS, Ubuntu
- UI – Banana
Content Data Store (CDS) harnesses the power of Apache Hadoop, Tesseract, a NoSQL database and Apache Solr. CDS comes with an automated pipeline: once data reaches HDFS, an Oozie-scheduled event triggers the CDS agent to pick up the file and its content and process it further. Any data file can be processed; the CDS agent automatically detects the file's format and size and stores it in the appropriate database.
- Ingestion System: Any ingestion system can be used; DiP is considered the best fit.
- Agent: Tesseract, Leptonica, Oozie, Java.
- Image Store: HDFS, Hbase/MongoDB
- Indexing: Apache Solr
- UI: Zeppelin, Hue or any custom UI that supports integration with the underlying NoSQL store
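To make the agent's behavior concrete, the sketch below shows one common way a file's format could be detected from its leading "magic" bytes before routing it to the appropriate store. The class and method names are hypothetical illustrations, not part of the actual CDS code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: detect a file's format from its first few bytes.
// Magic-byte signatures shown here are the standard, documented ones for
// each format; the routing logic itself is an assumption for this example.
public class FormatDetector {

    public static String detect(byte[] header) {
        if (startsWith(header, new byte[]{(byte) 0x89, 'P', 'N', 'G'})) {
            return "png";
        }
        if (startsWith(header, new byte[]{(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "jpeg";
        }
        if (startsWith(header, "%PDF".getBytes(StandardCharsets.US_ASCII))) {
            return "pdf";
        }
        if (startsWith(header, new byte[]{'P', 'K', 0x03, 0x04})) {
            return "zip-based"; // docx, pptx and xlsx are all ZIP containers
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

An agent could call `detect` on the first few bytes of each incoming file and pick the target store (image store vs. document store) from the result.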
Keep some text, PDF, PPT and image files somewhere on the cluster and note down their paths. Go to $KAFKA_INSTALL_DIR/bin and create a Kafka topic named “kafka_topic” using the command below:
./kafka-topics.sh --create --topic kafka_topic --zookeeper zookeeper-server:port --replication-factor 1 --partitions 5
Download the CDS source code from “https://github.com/XavientInformationSystems/CDS.git” and compile it using the following commands:
Decompress the zip file
Compile the code
cd cds
mvn clean package
Once the code has been successfully compiled, go to the target directory and locate the jar by the name “uber-cds-1.0.0.jar”
Submit the jar to the Hadoop cluster to start the process, passing the required paths as arguments:
java -jar uber-cds-1.0.0.jar -s “path to ppt/pdf/image file” -d “path to destination” -f “path to text file” -c “path to core-site.xml” -h “path to hdfs-site.xml”
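The flag/value pairs passed above (-s, -d, -f, -c, -h) could be parsed with a few lines of plain Java. The parser below is a minimal sketch for illustration, not the actual CDS implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of parsing flag/value argument pairs such as
// "-s <source> -d <destination>". Hypothetical helper, shown only to
// clarify how the command line above maps to configuration values.
public class ArgParser {

    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag, got: " + args[i]);
            }
            opts.put(args[i], args[i + 1]); // e.g. "-s" -> "/path/to/file.pdf"
        }
        return opts;
    }
}
```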
Once the jar has been submitted, you can check the metadata of the file stored in the HBase table with the following commands:
Launch the HBase shell:
./hbase shell
Scan the HBase table (substitute the actual table name):
scan '<table_name>'
Finally, the message reaches the Kafka consumer:
./kafka-console-consumer.sh --zookeeper 10.5.3.196:2181 --topic kafka_topic --from-beginning
Once the message is consumed from Kafka, it is indexed by Solr and finally sent to Banana, a rich and flexible UI that enables users to rapidly develop end-to-end applications leveraging the power of Apache Solr.
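For a sense of what the indexing step involves, the sketch below builds the kind of JSON document that could be POSTed to a Solr collection's /update handler so the extracted text becomes searchable in Banana. The field names (id, file_name, content) are assumptions for this example, not the actual CDS schema.

```java
// Illustrative builder for a Solr JSON update payload. Solr's /update
// handler accepts a JSON array of documents; the specific fields used
// here are hypothetical.
public class SolrDocBuilder {

    public static String toJson(String id, String fileName, String content) {
        return "[{"
                + "\"id\":\"" + escape(id) + "\","
                + "\"file_name\":\"" + escape(fileName) + "\","
                + "\"content\":\"" + escape(content) + "\""
                + "}]";
    }

    // Escape backslashes and quotes so the payload stays valid JSON.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

In practice a client such as SolrJ would usually be used instead of hand-built JSON; the payload is shown here only to make the data flow visible.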
- Healthcare: Online storage and sharing of medical prescriptions by doctors in the form of images is becoming increasingly important for medical organizations and large hospitals. This application presents a distributed architecture based on Hadoop and HBase to support online storage and sharing for medical/prescription images. An experimental system called CloudDICOM is designed and realized based on this architecture.
- This application architecture focuses on designing the architecture, workflow and data schema, and on analyzing the components in CloudDICOM. First, DICOM messages sent by clients are received, converted and stored into Hadoop and HBase. These messages are then indexed to build a query and index database, and DICOM query components built on this index provide online DICOM query to clients. Test results demonstrate that CloudDICOM can provide online storage and sharing for large-scale medical images and support standard DICOM Query.
- E-Government: E-Government has entered a new age of mass data. Storing images and managing them at scale supports handling emergency events, sharing e-government information resources, and providing intelligent, personalized one-stop information services.
- Heterogeneous Sensor Data: The Internet of Things (IoT) plays an increasingly important role in modern agriculture. However, efficiently storing and reasoning over the massive heterogeneous sensor data collected from a variety of sensing equipment must be addressed before IoT can be implemented in agriculture. This application explores the architecture of an agricultural IoT with heterogeneous sensor data and proposes a cloud-computing-based implementation. The design rests on a two-tier storage structure built on HBase, a distributed database with high scalability, together with a distributed programming framework, and thus provides scalable storage, efficient data access and easier downstream processing of sensor data.
- Equipment Data Store: As the technology content of large machinery and electronic equipment continuously increases, systems become more and more complex. Large volumes of data are generated during tests, and many kinds of data are widely distributed. Data management problems become more prominent as the design, simulation, calculation and test data generated during development become key system content. Effectively managing this huge volume of data is one of the key initiatives in integrating IT applications with industrialization.
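The two-tier HBase storage mentioned in the sensor-data use case typically hinges on row-key design. The sketch below shows a common HBase pattern, assumed here for illustration rather than taken from the application above: prefix the key with the sensor id and append a reversed timestamp so a scan returns the newest readings first.

```java
// Illustrative HBase row key for time-series sensor data. The layout
// (sensorId + reversed timestamp) is a well-known HBase idiom; the exact
// format is an assumption for this example.
public class SensorRowKey {

    public static String build(String sensorId, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis; // newest rows sort first
        return String.format("%s#%019d", sensorId, reversed);
    }
}
```

Because HBase stores rows in lexicographic key order, a prefix scan on a sensor id then yields its most recent readings without a full-table sort.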