On-Premises Data Platform for Analyzing 2 Petabytes of Data Was Costly

Learn how an American data storage company achieved 60% cost savings by replacing Hortonworks Data Platform with Amazon EMR.


The company’s factory data came from testing devices and was processed on an enterprise Hadoop cluster. Custom SerDe requirements and HDFS dependencies hurt both cost and performance.

Their ETL was difficult to manage, and their analytical queries needed optimization.


As part of the lift-and-shift project in their IT transformation, they migrated their Hadoop cluster to Amazon EMR, using Spark and Presto.

They worked with an AWS Consulting Partner to replatform the cluster, moving ETL from Tez to Spark and adopting Presto for analytics.

The company started using Apache Airflow as their ETL manager and scheduler.

The AWS Partner used Snowball Edge to migrate the existing data from the on-premises cluster to Amazon S3, and used the AWS CLI S3 copy command to upload incremental data from multiple factories.
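
The incremental upload step could look roughly like the sketch below. It only assembles an `aws s3 cp --recursive` command line for one factory's batch; the bucket name, the local path, and the `raw/factory=.../dt=.../` key layout are assumptions made for illustration and are not details from the project.

```python
import shlex
from typing import List


def build_incremental_copy_command(
    local_dir: str, bucket: str, factory: str, batch_date: str
) -> List[str]:
    """Build the argv for copying one factory's daily batch to S3.

    The key layout (raw/factory=<id>/dt=<date>/) is a hypothetical
    convention chosen for this example.
    """
    destination = f"s3://{bucket}/raw/factory={factory}/dt={batch_date}/"
    # `aws s3 cp <src> <dst> --recursive` copies a whole directory tree.
    return ["aws", "s3", "cp", local_dir, destination, "--recursive"]


# Example: show the command that would be run for one batch.
command = build_incremental_copy_command(
    "/data/outgoing/2020-01-31", "example-factory-data", "f01", "2020-01-31"
)
print(shlex.join(command))
```

Returning the command as a list keeps it safe to hand to `subprocess.run(command, check=True)` on a factory host without any shell quoting concerns.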

Apache Superset was used for querying and exploration. All of the collected data was used for machine learning with Spark ML and TensorFlow on EMR.



-60% cost savings for Hadoop
-Because of Airflow, ETL manageability improved
-Presto on Hive made analytical queries 20-30x faster
-Spark ETL improved performance substantially, reducing operations cost and Spot Instance usage


-Their Data Analytics and Sciences team can now scale any number of jobs up and down based on the budget and importance of each job
-The robust ETL framework gave them 100% code reusability and the agility to add new tables quickly
-All ETL jobs are now metadata driven
-All clusters are transient
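
To make "metadata driven" and "transient" concrete, here is a minimal sketch of how table metadata might be turned into Spark steps for a cluster that shuts itself down when the work is done. It is not the framework described above: the metadata fields, S3 paths, instance types, release label, and the single shared `etl.py` script are all assumptions. The dictionary follows the request shape accepted by boto3's EMR `run_job_flow` call, but the sketch only builds it and never calls AWS.

```python
from typing import Dict, List

# Hypothetical metadata: one entry per table. Adding a table means adding
# a row here, not writing new job code.
TABLE_METADATA = [
    {"table": "device_tests",
     "source": "s3://example-bucket/raw/device_tests/",
     "target": "s3://example-bucket/curated/device_tests/"},
    {"table": "factory_yield",
     "source": "s3://example-bucket/raw/factory_yield/",
     "target": "s3://example-bucket/curated/factory_yield/"},
]


def build_steps(metadata: List[Dict], script_uri: str) -> List[Dict]:
    """Create one spark-submit step per metadata entry, reusing one script."""
    steps = []
    for entry in metadata:
        steps.append({
            "Name": f"etl-{entry['table']}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri,
                         "--source", entry["source"],
                         "--target", entry["target"]],
            },
        })
    return steps


def build_transient_cluster_request(
    name: str, steps: List[Dict], task_nodes: int
) -> Dict:
    """Describe a cluster that terminates once its steps have finished."""
    def group(role: str, market: str, count: int) -> Dict:
        return {"Name": role.lower(), "InstanceRole": role, "Market": market,
                "InstanceType": "m5.xlarge",  # assumed instance type
                "InstanceCount": count}

    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("MASTER", "ON_DEMAND", 1),
                group("CORE", "ON_DEMAND", 2),
                group("TASK", "SPOT", task_nodes),  # scale per job budget
            ],
            # False is what makes the cluster transient.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


steps = build_steps(TABLE_METADATA, "s3://example-bucket/jobs/etl.py")
request = build_transient_cluster_request("nightly-etl", steps, task_nodes=4)
```

With AWS credentials in place, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. Changing `task_nodes` per run is one way to match capacity to the budget and importance of a specific job.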

Mactores is an AWS Advanced Consulting Partner with a proven track record helping their customers with Big Data / Analytics, AWS Cloud Engineering, Machine Learning, and more. Find out how they can help you with your challenges by visiting their website.