Course Overview
How do you build data pipelines that scale without locking yourself into one platform? With Apache Beam and Google Cloud Dataflow, you can run serverless data processing at scale—without compromising flexibility or performance.
This 3-day course teaches data engineers and analysts how to use Apache Beam with Dataflow to build resilient, scalable, and portable pipelines for batch and streaming applications. You’ll learn to optimize performance, implement secure deployments, monitor your jobs, and apply best practices across the pipeline lifecycle—from development to CI/CD.
Whether you’re processing terabytes of batch data or building real-time pipelines, this course gives you the tools to simplify operations and build faster, more cost-effective solutions with Google Cloud Dataflow.
Who Should Attend
This course is designed for data engineers, as well as data analysts and data scientists looking to develop hands-on data engineering skills. It is ideal for anyone building batch or streaming pipelines on Google Cloud.
Course Objectives
- Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.
- Summarize the benefits of the Beam Portability Framework and enable it for your Dataflow pipelines.
- Enable Dataflow Shuffle for batch pipelines and Streaming Engine for streaming pipelines to maximize performance.
- Enable Flexible Resource Scheduling for more cost-efficient performance.
- Select the right combination of IAM permissions for your Dataflow job.
- Implement best practices for a secure data processing environment.
- Select and tune the I/O of your choice for your Dataflow pipeline.
- Use schemas to simplify your Beam code and improve the performance of your pipeline.
- Develop a Beam pipeline using SQL and DataFrames.
- Perform monitoring, troubleshooting, testing, and CI/CD on Dataflow pipelines.
Course Outline
Introduction
- Course objectives and overview
- Apache Beam and Dataflow integration
Beam Portability and Compute Options
- Beam Portability Framework and use cases
- Custom containers and cross-language transforms
- Shuffle, Streaming Engine, and Flexible Resource Scheduling (see the options sketch below)
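As a quick orientation, the sketch below shows how these service features are typically switched on from the Python SDK's pipeline options. It is a minimal sketch: the project, region, and bucket names are placeholders, and exact flag names can vary by SDK version and job type, so verify them against the release you run.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming job: Streaming Engine offloads shuffle and state to the Dataflow service.
streaming_opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # hypothetical project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # hypothetical staging bucket
    streaming=True,
    enable_streaming_engine=True,
)

# Batch job: FlexRS trades scheduling delay for cheaper, preemptible-friendly execution.
batch_opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",
)
```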
IAM, Quotas, and Security
- Selecting IAM roles and managing quotas
- Zonal strategies for data processing
- Best practices for secure environments
Beam Concepts and Streaming Foundations
- Apache Beam review: PCollections, PTransforms, DoFn lifecycle
- Windows, watermarks, and triggers for streaming data
- Handling late data and defining trigger types (see the windowing sketch below)
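To make these streaming concepts concrete, here is a minimal Beam Python sketch (the keys, values, and timestamps are invented) that applies fixed windows with a watermark trigger, a late firing, and an allowed-lateness policy:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    _ = (
        p
        | "Create" >> beam.Create([("alice", 3), ("bob", 5), ("alice", 7)])
        | "Stamp" >> beam.Map(
            lambda kv: window.TimestampedValue(kv, 1700000000))   # fake event times
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                      # 60-second fixed windows
            trigger=trigger.AfterWatermark(               # fire when the watermark passes...
                late=trigger.AfterCount(1)),              # ...then once per late element
            allowed_lateness=120,                         # accept data up to 2 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```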
Sources, Sinks, and Schemas
- Writing and tuning I/O for performance
- Creating custom sources and sinks with Splittable DoFn (SDF)
- Using schemas to express structured data and improve performance (see the schema sketch below)
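The schema material can be previewed with a small sketch: a NamedTuple registered with RowCoder gives the PCollection a schema, so downstream transforms can refer to fields by name. The class and field names here are made up for illustration.

```python
import typing
import apache_beam as beam

class Purchase(typing.NamedTuple):
    user: str
    amount: float

# Registering a RowCoder tells Beam to treat Purchase as a schema'd row type.
beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([Purchase("alice", 12.5), Purchase("bob", 3.0),
                       Purchase("alice", 4.5)]).with_output_types(Purchase)
        # Schema-aware GroupBy references fields by name instead of key lambdas.
        | beam.GroupBy("user").aggregate_field("amount", sum, "total_amount")
        | beam.Map(print)
    )
```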
State, Timers, and Best Practices
- When and how to use state and timer APIs (see the sketch after this list)
- Choosing the right type of state for your pipeline
- Development and design best practices
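For a feel of the state and timer APIs, the following sketch buffers values per key and flushes them when an event-time timer fires. The buffering policy, class name, and 60-second delay are assumptions chosen only to illustrate the pattern; apply it to a keyed PCollection, e.g. `keyed | beam.ParDo(BufferPerKey())`.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.utils.timestamp import Duration

class BufferPerKey(beam.DoFn):
    """Buffers integer values per key, emitting the batch when an event-time timer fires."""
    BUFFER = BagStateSpec("buffer", VarIntCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element
        buffer.add(value)
        # Hypothetical policy: flush 60 seconds of event time after this element.
        flush.set(timestamp + Duration(seconds=60))

    @on_timer(FLUSH)
    def on_flush(self, key=beam.DoFn.KeyParam,
                 buffer=beam.DoFn.StateParam(BUFFER)):
        yield key, list(buffer.read())
        buffer.clear()
```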
Developing with SQL, DataFrames, and Notebooks
- Using Beam SQL and DataFrames to build pipelines (see the SQL sketch below)
- Prototyping in Beam notebooks with Beam magics
- Launching jobs to Dataflow from notebooks
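As a taste of Beam SQL, the sketch below queries a schema'd PCollection with SqlTransform. Note that SqlTransform is a cross-language transform, so a Java environment must be available when the pipeline is built; the Order type and query here are placeholders.

```python
import typing
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

class Order(typing.NamedTuple):
    item: str
    amount: float

beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([Order("book", 12.0), Order("pen", 1.5),
                       Order("book", 8.0)]).with_output_types(Order)
        # The main input is exposed to the query as the table PCOLLECTION.
        | SqlTransform(
            "SELECT item, SUM(amount) AS total FROM PCOLLECTION GROUP BY item")
        | beam.Map(print)
    )
```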
Monitoring, Logging, and Troubleshooting
- Navigating the Dataflow Job Details UI
- Setting alerts with Cloud Monitoring
- Troubleshooting with diagnostics widgets and error reports
- Structured debugging and common failure patterns
Performance, Testing, and CI/CD
- Performance tuning and data shape considerations
- Testing strategies and automation (see the unit-test sketch below)
- Streamlining CI/CD workflows for Dataflow
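The testing topic builds on a pattern like the one below: a direct-runner TestPipeline plus assert_that to unit-test a transform in isolation. The transform under test here is a trivial stand-in; in practice you would exercise your own composite transforms.

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_word_lengths():
    # Runs locally on the direct runner, so it slots naturally into a CI job.
    with TestPipeline() as p:
        lengths = (
            p
            | beam.Create(["beam", "dataflow"])
            | beam.Map(len)
        )
        assert_that(lengths, equal_to([4, 8]))
```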
Reliability and Flex Templates
- Designing for reliability in production pipelines
- Using Flex Templates to standardize and reuse code
Summary
- Course recap and next steps