Course Overview
Building Batch Data Analytics Solutions on AWS
More than 70% of big data workloads now run in the cloud, and Amazon EMR is one of the most popular services used to support them.
In the Building Batch Data Analytics Solutions on AWS course, you’ll learn how to design, build, and manage scalable batch data pipelines using Amazon EMR, Apache Spark, and Hadoop. You’ll explore how EMR integrates with services like AWS Glue, Lake Formation, and Step Functions, as well as open-source tools like Hive, Hue, and HBase. The course covers the full pipeline, from ingestion and transformation to security and cost control, with hands-on labs that help you translate concepts into real-world skills and actionable insights.
Who Should Attend
- Data platform engineers
- Architects and operators who build and manage data analytics pipelines
Course Objectives
This instructor-led course provides technical professionals with the tools and knowledge to build, manage, and optimize scalable data analytics solutions using Amazon EMR. Participants gain practical skills to run secure and efficient data processing workflows on AWS.
You’ll learn how to:
- Launch and configure clusters using Amazon EMR for batch workloads
- Transform and analyze batch data using Spark, Hive, and AWS Glue
- Secure data in transit and at rest using AWS-native tools
- Monitor and optimize performance using built-in EMR tools
- Apply cost management strategies to large-scale workloads
Course Outline
Module A: Introduction to Data Analytics and Pipelines
- Review batch data workflows
- Define components of a modern AWS-based data pipeline
- Identify analytics use cases across business functions
Module 1: Using Amazon EMR for Batch Analytics
- Understand how Amazon EMR supports Spark, Hadoop, Hive, and HBase
- Interactive Demo: Launching an EMR cluster
- Explore cost management and auto scaling options
Module 2: Data Ingestion and Storage Optimization
- Compare techniques for data ingestion
- Optimize data storage with S3, compression, and tiering
- Integrate with AWS Glue and AWS Lake Formation
Module 3: Apache Spark on EMR for Data Processing
- Implement transformation and analytics with Apache Spark
- Interactive Demo: Run Spark commands using Spark shell
- Practice Lab: Use EMR Notebooks for low-latency analytics
Module 4: Batch Data Processing with Hive
- Query and transform structured data using Hive on Amazon EMR
- Practice Lab: Run Hive jobs for batch processing tasks
Module 5: Serverless Data Orchestration and Glue Integration
- Automate workflows with AWS Step Functions
- Catalog and transform data using AWS Glue
- Practice Lab: Orchestrate Spark jobs using Step Functions
Module 6: Securing and Monitoring EMR Clusters
- Protect data using EMRFS encryption and IAM
- Interactive Demo: Enable client-side encryption in EMRFS
- Monitor performance using logs, CloudWatch, and Spark History Server
Module 7: Designing Batch Analytics Solutions
- Apply cost, performance, and security tradeoffs to pipeline design
- Activity: Design a real-world batch data analytics solution
Module B: Building Modern Data Architectures on AWS
- Combine open-source and AWS services in flexible architectures
- Use Hive, HBase, and Redshift for complex batch analytics
- Integrate EMR with AWS Glue and Lake Formation
- Practice Lab: Process and analyze batch data using Hive and HBase
- Practice Lab: Coordinate Spark jobs using AWS Step Functions
- Explore real-world scenarios for enterprise-scale analytics pipelines
- Discuss how to structure architectures to support data lakes and data warehouses