Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis.
HBase: Apache HBase™ is the Hadoop database: a distributed, scalable, big data store.
PXF: PXF is an extensible framework that allows HAWQ to query data in external systems.
Let's learn about query federation.
This topic describes how to access Hive data using PXF.
Previously, in order to query Hive tables using HAWQ and PXF, you needed to create an external table in PXF that described the target table’s Hive metadata. Since HAWQ is now integrated with HCatalog, HAWQ can use metadata stored in HCatalog instead of external tables created for PXF. HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. This provides several advantages:
- You do not need to know the table schema of your Hive tables.
- You do not need to manually enter information about Hive table location or format.
- If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.
With HCatalog integration, a query against a Hive table proceeds as follows:
- HAWQ retrieves table metadata from HCatalog using PXF.
- HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.
- PXF queries Hive using the table metadata stored in the HAWQ in-memory catalog tables. The table metadata is dropped at the end of the transaction.
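Thanks to the steps above, a Hive table can be queried directly through the reserved `hcatalog` schema, with no PXF external table definition. This is a minimal sketch; the database and table names (`default.sales_info`) and its columns are illustrative assumptions:

```sql
-- Query a Hive table through HAWQ's HCatalog integration.
-- Pattern: hcatalog.<hive-db-name>.<hive-table-name>
-- "default", "sales_info", and the columns below are hypothetical.
SELECT location, SUM(number_of_orders) AS total_orders
FROM hcatalog.default.sales_info
GROUP BY location;
```

If the Hive table's schema later changes, the same query picks up the updated metadata automatically, since nothing is statically defined on the HAWQ side.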
Follow these steps to create HBase tables.
Create a table in HAWQ to access the HBase table.
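A sketch of such a PXF external table definition, assuming a hypothetical HBase table named `sales` with a column family `cf1`; replace `pxf-host` with your PXF service host. Note the PXF service port 51200 in the LOCATION URI:

```sql
-- Map an HBase table into HAWQ via the PXF HBase profile.
-- Table name, column family, and qualifiers below are assumptions.
CREATE EXTERNAL TABLE hbase_sales (
    recordkey TEXT,           -- the HBase row key
    "cf1:saleid" INTEGER,     -- column-family:qualifier mapping
    "cf1:comments" TEXT
)
LOCATION ('pxf://pxf-host:51200/sales?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Once defined, the HBase data is queryable like any HAWQ table:
SELECT * FROM hbase_sales;
```

Each HBase column is exposed as a quoted `"family:qualifier"` column in HAWQ, while `recordkey` maps to the HBase row key.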
Note: The PXF service port is 51200, not 50070 (the HDFS NameNode web UI port).
Zeppelin interpreter settings