Data Dump Utility

With data becoming a key factor in business decision-making, the ability to store and process large volumes of data has become crucial to an organization's sustainability and growth. Data can vary in type, e.g. log files, user behavior, clickstreams, etc. It can be stored on a local file system or in a data warehouse, or it can be distributed across many machines using distributed file systems such as HDFS, S3, or GCS.

This Spark-based, command-line-driven utility can be used to fetch and store data across various source and destination file systems, including S3, GCS, HDFS, and the local file system.
It can also create a table in Amazon Athena when the destination is S3.



File Formats   

Input Data Type

The utility takes CSV files, or a directory containing CSV files, as input.

Output Data Type

This utility stores the data in one of two column-oriented formats: ORC or Parquet.

These formats were chosen for their smaller size and faster querying compared to row-based file formats.

Destination Schema

The destination schema can be inferred from the source file, or a custom schema can be supplied through the command-line switch “e”.

Without External Schema

·         If no custom schema is provided, the utility infers the schema from the source file, validates that the column names contain only allowed characters, and creates the schema using lower-case column names

·         Alphanumeric names containing underscores ( _ ) are allowed
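The header validation described above can be sketched in Python as follows. This is an illustrative sketch of the rule (alphanumeric plus underscore, lower-cased), not the utility's actual code; the function name `validate_columns` is an assumption.

```python
import re

# Allowed column names: alphanumeric characters, optionally with underscores.
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def validate_columns(headers):
    """Validate CSV header names and lower-case them for the destination schema."""
    columns = []
    for name in headers:
        if not VALID_NAME.match(name):
            raise ValueError(f"Invalid column name: {name!r}")
        columns.append(name.lower())
    return columns

print(validate_columns(["UserId", "Event_Time", "clicks"]))
# ['userid', 'event_time', 'clicks']
```

A header such as `click-rate` would be rejected here, since the hyphen is not an allowed character.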

With External Schema

The following points need to be considered while creating an external schema:

  • The number of columns in the source file and in the external schema must be the same
  • The columns in the external schema must be ordered to match the source data and have compatible data types (the same type, or a type to which the source type can be implicitly converted)
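The two checks above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the implicit-conversion table is a plausible example, not the utility's actual rules, and `check_external_schema` is a hypothetical name.

```python
# Assumed set of implicit conversions (source type, destination type).
# The real utility's conversion rules are not documented here.
IMPLICIT_CONVERSIONS = {
    ("int", "bigint"),
    ("int", "double"),
    ("float", "double"),
    ("int", "string"),
}

def check_external_schema(source_schema, external_schema):
    """Both arguments are ordered lists of (column_name, type) pairs."""
    if len(source_schema) != len(external_schema):
        raise ValueError("Column counts differ between source and external schema")
    for (src_col, src_t), (dst_col, dst_t) in zip(source_schema, external_schema):
        if src_t != dst_t and (src_t, dst_t) not in IMPLICIT_CONVERSIONS:
            raise ValueError(f"Incompatible types for {dst_col}: {src_t} -> {dst_t}")

# Passes: int -> bigint is an allowed implicit conversion.
check_external_schema(
    [("id", "int"), ("name", "string")],
    [("id", "bigint"), ("name", "string")],
)
```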

Command Line Options available

This generic utility relies on command-line parameters to pass the information needed to interact with the supported file systems (S3, GCS, local, and HDFS). On top of the regular Spark command-line options, the utility provides switches for retrieving and storing data on a specific file system. These are:

Generic Options

  • s : Source location
  • d : Destination location. Defaults to s + f
  • f : Destination data format. Defaults to ORC
  • e : External schema location. If not provided, the schema is created from the source file headers


S3 Related Options

  • s3ak : Access Key for the AWS System
  • s3sk : Secret Key for the AWS System

Google Cloud Related Options

  • gsi : Google Project Id
  • gss : Service Account for the GCS System
  • gsp : Path to the P12 file

Athena Related Options

  • adb : Athena Database
  • at : Athena Table Name
  • as : Athena Staging Directory
  • act : Create table, true or false. Defaults to false
  • acs : Athena Connection String
  • p : Create Partitioned Data

How to

Build application

Unzip the project and perform a maven build in its root directory

mvn clean package

Use with the generic options

spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv -f parquet -d destinationDirectory -e hdfs://<<pathToExternalSchema>>

or it can be used without the destination, format, or external schema options:

spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv

For the S3 file system

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s <<SourceDirectory/File>> -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>>

Creating Athena data

With S3 as the destination system, an Athena table can also be created by passing the Athena-related options. The Athena JDBC driver jar is available from AWS.

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s <<SourceDirectory/File>> -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at <<table_name>> -adb <<existing_db_name>> -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>>

For Google Cloud File System

spark-submit --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s <<SourceDirectory/File>> -d gs://<<destination>> -gsi <<google project id>> -gss <<google service account>> -gsp <<path to .p12 file>> -f parquet

Creating Partitioned Data

For faster querying, data is often partitioned on the columns most frequently used in WHERE clauses. The utility can write the data into partitions by setting the “p” switch to true.


  • Partitions can currently be created only on the first column of the data set, which must contain an epoch time in microseconds (16 digits), e.g. 1488275919372280
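Deriving a date partition from such a 16-digit microsecond timestamp can be sketched as follows. This is an illustrative Python sketch; `partition_value` is a hypothetical name, and the `yyyy-MM-dd` partition format is an assumption rather than the utility's documented layout.

```python
from datetime import datetime, timezone

def partition_value(epoch_micros):
    """Derive a date partition (yyyy-MM-dd, UTC) from a 16-digit
    epoch timestamp in microseconds, as found in the first column."""
    if len(str(epoch_micros)) != 16:
        raise ValueError("Expected a 16-digit epoch time in microseconds")
    # Convert microseconds to seconds before building the UTC datetime.
    ts = datetime.fromtimestamp(epoch_micros / 1_000_000, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")

print(partition_value(1488275919372280))
# 2017-02-28
```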

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s <<SourceDirectory/File>> -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at <<table_name>> -adb <<db_name>> -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>> -p true

If the table created is partitioned, execute the following command in the Athena console before viewing the data:

MSCK REPAIR TABLE <<dbname>>.<<tablename>>



Xavient Information Systems

Technical Team:

Mohiuddin Khan Inamdar

Gautam Marya

Sumit Chauhan

