Bye Bye MapReduce

Posted by Sauhard Jain


Wait! What? Really? In the era of Big Data, how can you say “bye bye MapReduce”? We can, because Informatica just did.

MapReduce is a framework for developing applications that process large data sets in a distributed computing environment. It splits big chunks of data into smaller pieces and processes those pieces in parallel via map tasks. This is somewhat similar to the parallel processing we see in Teradata, a relational database (RDB); here, however, it is done on the Hadoop Distributed File System (HDFS). The MapReduce framework then sorts the output of the map tasks and passes it to reduce tasks.
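To make the map/shuffle/reduce flow concrete, here is a minimal, single-machine word-count sketch in plain Python. It only illustrates the phases; a real MapReduce job would run each phase as distributed tasks over HDFS blocks.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle/sort: group values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: aggregate all values emitted for one key."""
    return key, sum(values)

# Input split into chunks, the way HDFS splits a large file into blocks.
chunks = ["big data big", "data big deal"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

Each chunk is mapped independently, which is where the parallelism comes from; only the shuffle step requires moving data between tasks.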

Informatica's Big Data Management (BDM) product uses the MapReduce framework to push mapping logic down to Hadoop clusters. BDM translates the mappings into HiveQL and MapReduce programs, which are executed on a Hadoop cluster. In simpler words, you just build a traditional mapping, and Informatica – the first vendor to provide this ability – implicitly converts the mapping logic to HiveQL (much as it does when we apply pushdown optimization, PDO, on an RDB) and then passes the processing on to the Hadoop cluster.

This eliminates the need for developers to learn MapReduce programming. They simply select the “Hive” checkbox in the run-time properties of the mapping, and it will run in MapReduce mode. Hundreds of BDM customers can thus reuse their traditional data integration jobs and onboard them to the Hadoop ecosystem.

Steps to Run the Mapping in MapReduce Mode

Suppose we have built the following mapping:

Now, when you are about to run this mapping, just select the “Hive” checkbox in the run-time properties, as highlighted in the following image.

This enables developers to run the mapping in MapReduce mode. Developers can now run old code in MapReduce mode without putting extra effort and time into understanding the existing legacy code and then translating it into MapReduce programs.

Not Just MapReduce

Besides MapReduce, many more frameworks are being widely adopted by vendors and customers alike. Spark is one of them.

As was the case with Hive, developers don’t have to make any changes to their code to run it on Spark. They simply change the execution engine from Hive to Spark, as highlighted below:

The Data Integration Service will now implicitly convert the mapping logic into Spark Scala code instead of MapReduce.

The Big News

Informatica recently announced the end of life for MapReduce in the BDM 2019 Spring release (BDM 10.2.2). Hadoop distribution vendors have also started to move away from MapReduce; it is no longer supported as an execution engine in HDP 3.0.

This end of life primarily affects mappings that use MapReduce as the run-time engine, not components that rely on MapReduce internally.

Migrating from MapReduce

To migrate from MapReduce, all developers need to do is select all the available Hadoop run-time engines, as shown in the image below.

With all Hadoop run-time engines selected, Informatica will choose the appropriate execution engine at run time to process each request. If Spark is present as a preferred engine, the mapping will run in Spark mode by default.

If mappings have only Hive (MapReduce) selected as the preferred engine, we can modify them in bulk to leverage Spark via infacmd commands.
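A rough sketch of such a bulk change is below. Note this is an illustration only: the exact command names and option flags vary by BDM version and should be verified against the infacmd command reference for your release; the domain, service, project, user, and password values are all placeholders.

```shell
# Hypothetical sketch -- verify command and option syntax against the
# infacmd reference for your BDM version before running.

# Enable Spark as a validation environment for all mappings in a project
# (MyDomain, MyModelRepoService, MyProject, and the credentials are placeholders).
infacmd mrs enableMappingValidationEnvironment \
    -dn MyDomain -sn MyModelRepoService \
    -un Administrator -pd '<password>' \
    -pn MyProject -sve Spark

# Then set the execution environment of those mappings so they run on
# the Hadoop/Spark engine instead of Hive (MapReduce).
infacmd mrs setMappingExecutionEnvironment \
    -dn MyDomain -sn MyModelRepoService \
    -un Administrator -pd '<password>' \
    -pn MyProject -ee Hadoop
```

Running these against a project saves editing each mapping's run-time properties by hand in the Developer tool.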

Closing Lines

The big takeaway from this blog is that the Big Data Management 2019 Spring release (BDM 10.2.2) no longer supports Hive execution mode (including MapReduce). Customers must therefore migrate to Spark or other execution engines.

That covers everything from our end! Do you have anything to add? Let us know in the comments below.

Until next time!

