Post4

Teradata Query Grid : With Database, NoSQL, Hadoop

Posted by

Teradata 15.0 has come up with various exciting features and enhanced capabilities, Teradata Query Grid being one of them.

Teradata database is now able to connect Hadoop with this Query Grid, so it’s called as Teradata Database-to-Hadoop or Teradata-to-Hadoop connector.

Teradata QueryGrid is a form of data virtualization and its key feature is to put data in the Data Lake much quicker, across foreign servers.

What is Query Grid?

Query Grid works to connect a Teradata and Hadoop system to massive scale, with no effort, and at speeds of 10TB/second.

  • It provides an SQL interface for transferring data between Teradata Database and remote Hadoop hosts.
  • Import Hadoop data into a temporary or permanent Teradata table.
  • Export data from temporary or permanent Teradata tables into existing Hadoop tables.
  • Create or drop tables in Hadoop from Teradata Database.
  • Reference tables on the remote hosts in SELECT and INSERT statements.
  • Select Hadoop data for use with a business tool.
  • Select and join Hadoop data with data from independent data warehouses for analytical use.
  • Leverage Hadoop resources, Reduce data movement
  • Bi-directional to Hadoop
  • Query push-down
  • Easy configuration of server connections

Query Grid Process Flow?

  • Query through Teradata
  • Sent to Hadoop through Hive
  • Results returned to Teradata
  • Additional processing joins data in Teradata
  • Final results sent back to application/user

With QueryGrid, We can add a clause in a SQL statement that says –

“Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join it to the data warehouse tables.”

 Let’s not ignore that all the database security, advanced SQL functions, and system management in the Teradata system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems.

  • QueryGrid
  1. Server grammar
    1. Simplify via “server name”
  2. Hadoop import operator
    1. Load_from_hcatalog
    2. Added server grammar
  3. Hadoop export operator (new)
    1. Load_to_hcatalog
    2. Supports files:
      1. Delimited Text, JSON, RCFile
      2. Sequence File, ORCfile, Avro
  • Query push-down
  • Bi-directional data transfer
  • Provide access rights

     

                 12

 

Parallel Performance:
For years, data virtualization tools have lacked the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata Query Grid is built to link to remote systems in parallel and interchange data through many network connections at once.

Now, what if we want to move a terabyte per minute? With the right configurations, that can be done too. Parallel processing via both the systems, makes this process extremely fast. There are many a data virtualization systems that do this today.

Without a doubt, the Hadoop cluster will have a different number of servers compared to the Teradata or any MPP systems. The Teradata systems start parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the workers match-up changes.

But Teradata Query Grid does it all for us, completely invisible to the user.

Query Grid Teradata to Hadoop Server Configuration:

CREATE    FOREIGN SERVER Hadoop_sysd_xav USING HOSTTYPE(‘hadoop’) SERVER (‘sysd.labs.teradata.com’) PORT (‘9083’) HIVESERVER (‘sysd.labs.teradata.com’) HIVEPORT (‘10000’) USERNAME(‘Hive’) DEFAULT_STRING_SIZE(‘2048’) HADOOP_PROPERTIES(‘org.apache.hadoop.io.compress.GzipCodec’);

 

DO IMPORT WITH    syslib.load_from_hcatalog_hdp1_3_2,

DO EXPORT WITH    syslib.load_to_hcatalog_hdp1_3_2 Merge_hdfs_files(‘True’) Compression_codec(‘org.apache.hadoop.io.compress.GzipCodec’;

 

Server name = Hadoop_sysd_xav

Table Name = xav_hdp_tbl@Hadoop_sysd_xav

 

SELECT    source, session

FROM    xav_hdp_tbl@Hadoop_sysd_xav

WHERE    session_ts = ‘2017-01-01’;

 

QueryGrid Server Objects and Privileges:

  • TD_SERVER_DB contains all servers objects
  • Servers are global objects
  • Users have SELECT and INSERT granted to them

a)    GRANT SELECT ON hdp132_svr TO Pankaj; b)    GRANT INSERT ON hdp143_svr TO Abid;

  • Being able to create and drop a server is a privilege

a)    GRANT CREATE SERVER  b)    GRANT DROP SERVER

 

Remote SQL Execution :

  • Push SQL to remote Hive system
  • Hive filters data on non-partitioned columns
  • Foreign table ‘Select’ executed on remote system

 

SELECT    source, session FROM    FOREIGN TABLE(select    session, source from    xav_hdp_tbl where    source = “Mozilla” )@Hadoop_sysd_xav WHERE    session = current_date AS dt;

 

QueryGrid Data Transfer:

Import

SELECT    source, session FROM    xav_hdp_tbl@Hadoop_sysd_xav WHERE    session_ts = ‘2017-01-01’; “insert/select” & “create table as” to instantiate data locally.

Export

INSERT INTO emp_xav@Hadoop_sysd_xav SELECT    emp_xav_id, emp_xav_zip FROM    emp_xav_data WHERE    last_update = current_date;

 

QueryGrid Insert Explained :

EXPLAIN INSERT INTO xav_data@hdp132_svr SELECT * FROM newcars; ***Success: Activity Count = 41 Explanation ————————————————————————— 1) First, we lock a distinct ut1.”pseudo table” for read on a RowHash to prevent global deadlock for ut1.tab1. 2) Next, we lock ut1.tab1 for read. 3) We do an all-AMPs RETRIEVE step from ut1.newcars by way of an allrows scan with no residual conditions executing table operator SYSLIB.load_to_hcatalog with a condition of (“(1=1)”) into Spool 2 (used to materialize view, derived table, table function or table operator drvtab_inner) (all_amps), which is built locally on the AMPs. The size of Spool 2 is estimated with low confidence to be 8 rows (11,104 bytes). The estimated time for this step is 0.16 seconds.4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of an allrows scan into Spool 3 (used to materialize view, derived table, table function or table operator TblOpInputSpool) (all_amps), which is redistributed by hash code to all AMPs. The size of Spool 3 is estimated with low confidence to be 8 rows ( 11,104 bytes). The estimated time for this step is 0.16 seconds. 5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of an allrows scan executing table operator SYSLIB.load_to_hcatalog with a condition of (“(1=1)”) into Spool 4 (used to materialize view, derived table, table function or table operator h4) (all_amps), which is built locally on the AMPs. < BEGIN EXPLAIN FOR REMOTE QUERY –> TD: 3 column(s); Hadoop: 3 column(s), with 2 partition column(s); doors(INTEGER) -> doors(STRING); make(VARCHAR) -> make*(STRING); model(VARCHAR) -> model*(STRING); * denotes partition column; <— END EXPLAIN FOR REMOTE QUERY > The size of Spool 4 is estimated with low confidence to be 8 rows (200 bytes). The estimated time for this step is 0.16 seconds.  

Create and Drop Hadoop Tables:

  • Stored procedures to create and drop Hadoop tables
  • Allows SQL scripts to export data in stand alone fashion

CALL SYSLIB.HDROP(‘t3′,’hdp132_svr’);  CALL SYSLIB.HCTAS(‘t3′,’c2,c3’,’LOCATION “/user/hive/table_t12″‘,’hdp132_svr’,’default’);

Connection Flow:

  • The client connects to the system through the PE in node 1. The query is parsed in the PE. During the parsing phase, the table operator’s contract function contacts the HCatalog component through the External Access Handler (EAH), which is a one-per-node Java Virtual Machine Connection Flow.
  • The HCatalog returns the metadata about the table, the number of columns, and the types for the columns. The parser uses this information and also uses this connection d to obtain the Hadoop splits of data that underlie the Hadoop table.
  • The splits are assigned to the AMPs in a round-robin fashion so that each AMP gets a split.
  • The parser phase completes and produces an AMP step containing the table operator. This is sent to all the AMPs in parallel.
  • Each AMP then begins to execute the table operator’s execute function providing a parallel import of Hadoop data.
  • The execute function opens and reads the split data reading in Hadoop rows. These are converted to Teradata data types in each column, and the rows are written to spool.
  • When all the data has been written, the spool file is redistributed as input into the next part of the query plan.

3

 

Performance/Speed:

Imagine complex systematic analytics using R or SAS are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop Data Lake where it is combined with temporary data from the company being acquired. With reasonably simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in the Teradata Query Grid 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.

What is Push-down Processing:
To minimize data movement, Teradata Query Grid sends the remote system SQL filters that eliminate records and columns that aren’t needed.

This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.

Business Benefits:

  • No hassle analytics with a seamless data fabric across all of our data and analytical engines
  • Get the most out of your data by taking advantage of specialized processing engines, operating as a cohesive analytic environment
  • Transparently harness the combined power of multiple analytic engines, to address a business question
  • Self-service data and analytics across all systems through SQL

IT Benefits

  • Automate and optimize the use of your analytic systems, through “push-down” processing across platforms
  • Minimize data movement and process data where it resides
  • Minimize data duplication
  • Transparently automate analytic processing and data movement between systems
  • Enable easy bi-directional data movement
  • Integrated processing without administrative challenges
  • Leverage the analytic power and value of your Teradata Database, Teradata Aster Database, open-source Presto and Hive for Hadoop, Oracle Database, and powerful languages such as SAS, Perl, Python, Ruby, and R.
  • High performance query plans using data from other sources while using systems within the Teradata Unified Data Architecture, such as passing workload priorities makes the best use of available resources

Requirements for Query Grid to Hadoop:

  • Teradata 15.0 +
  • Node memory > 96GB
    1. Network > All Teradata nodes able to connect to all Hadoop data nodes
    2. Proxy user on Hadoop

 

Credits:                                                                

Xavient Information Systems

Technical team:

Pankaj Chahar         


References:

http://www.teradata.com

https://en.wikipedia.org

http://in.teradata.com/products-and-services/query-grid/?LangType=16393&LangSelect=true

http://www.info.teradata.com/download.cfm?ItemID=1001944

Related Posts

  • Understanding Teradata Wallet

    Teradata Wallet is a facility for storage of sensitive/secret information, such as Teradata Database user passwords. Users are able to save and retrieve items by using this facility. Teradata wallet…

  • Teradata and JSON – Monetizing the Internet of Things (IoT)Teradata and JSON – Monetizing the Internet of Things (IoT)

    The prevalent influence of technology has resulted in a widespread use of a variety of devices such as cell phones, sensors, web API’s and browsers. These devices generate huge volume…

  • Teradata IoT capabilities & Teradata ListenerTeradata IoT capabilities & Teradata Listener

    What is Teradata: Teradata is a relational database management system (RDBMS) that is: an open system, running on a UNIX MP-RAS or Windows server platform. capable of supporting many concurrent users from various…

  • HAWQ/HDB and Hadoop with Hive and HBaseHAWQ/HDB and Hadoop with Hive and HBase

    Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. HBase: Apache HBase™ is the Hadoop database, a distributed, scalable, big…

  • Oracle Goldengate

    Oracle GoldenGate is an Oracle proprietary software for real-time data integration and replication that supports different databases- Oracle, DB2, SQL Server, Ingres, MySQL etc. Even the source and target database…

  • Column Store Index in SQL Server 2012

    This post is about the new feature, i.e., Column Store Index which is available since SQL 2012 version. Microsoft has released column store index to improve the performance by 10x.…

Leave a Reply

Your email address will not be published. Required fields are marked *