Serverless Data Processing with Dataflow

Course Overview

How do you build data pipelines that scale without locking yourself into one platform? With Apache Beam and Google Cloud Dataflow, you can run serverless data processing at scale—without compromising flexibility or performance.

This 3-day course teaches data engineers and analysts how to use Apache Beam with Dataflow to build resilient, scalable, and portable pipelines for batch and streaming applications. You’ll learn to optimize performance, implement secure deployments, monitor your jobs, and apply best practices across the pipeline lifecycle—from development to CI/CD.

Whether you’re processing terabytes of batch data or building real-time pipelines, this course gives you the tools to simplify operations and build faster, more cost-effective solutions with Google Cloud Dataflow.

Who Should Attend

This course is designed for data engineers, as well as data analysts and data scientists who want to develop hands-on data engineering skills. It is ideal for anyone working with batch or streaming pipelines on Google Cloud.

Course Objectives

    • Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.
    • Summarize the benefits of the Beam Portability Framework and enable it for your Dataflow pipelines.
    • Enable Dataflow Shuffle for batch pipelines and Streaming Engine for streaming pipelines to maximize performance.
    • Enable Flexible Resource Scheduling for more cost-efficient performance.
    • Select the right combination of IAM permissions for your Dataflow job.
    • Implement best practices for a secure data processing environment.
    • Select and tune the I/O of your choice for your Dataflow pipeline.
    • Use schemas to simplify your Beam code and improve the performance of your pipeline.
    • Develop a Beam pipeline using SQL and DataFrames.
    • Perform monitoring, troubleshooting, testing, and CI/CD on Dataflow pipelines.

Course Outline

Introduction

  • Course objectives and overview
  • Apache Beam and Dataflow integration (see the sketch below)
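
As a minimal sketch of that integration: the Python pipeline below runs as-is on the local DirectRunner, and moving it to Dataflow is only a matter of pipeline options (shown in the next module's sketch), because Beam separates pipeline logic from the runner that executes it.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # With no options set, this runs locally on the DirectRunner; pointing it
    # at Dataflow is purely a matter of pipeline options.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Create" >> beam.Create(["apache beam", "dataflow", "apache beam"])
         | "Count" >> beam.combiners.Count.PerElement()
         | "Format" >> beam.MapTuple(lambda phrase, n: f"{phrase}: {n}")
         | "Print" >> beam.Map(print))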

Beam Portability and Compute Options

  • Beam Portability Framework and use cases
  • Custom containers and cross-language transforms
  • Shuffle, Streaming Engine, and Flexible Resource Scheduling (options sketch below)
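
A sketch of how these compute options are typically switched on from the Python SDK; the project, region, and bucket values are placeholders, and FlexRS is paired with Dataflow Shuffle here because it applies to batch jobs only:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Batch job: Dataflow Shuffle plus Flexible Resource Scheduling (FlexRS).
    batch_options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",                # placeholder
        "--region=us-central1",                # placeholder
        "--temp_location=gs://my-bucket/tmp",  # placeholder
        "--experiments=shuffle_mode=service",  # Dataflow Shuffle
        "--flexrs_goal=COST_OPTIMIZED",        # FlexRS (batch only)
    ])

    # Streaming job: Streaming Engine.
    streaming_options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--temp_location=gs://my-bucket/tmp",
        "--streaming",
        "--enable_streaming_engine",           # Streaming Engine
    ])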

IAM, Quotas, and Security

  • Selecting IAM roles and managing quotas
  • Zonal strategies for data processing
  • Best practices for secure environments (security options sketch below)
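
A sketch of security-related pipeline options in Python, with every resource name a placeholder: a dedicated least-privilege worker service account, no public worker IPs, a specific subnetwork, and a customer-managed encryption key (CMEK):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Every identifier below is a placeholder for your own resources.
    secure_options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--temp_location=gs://my-bucket/tmp",
        # Run workers as a dedicated, least-privilege service account.
        "--service_account_email=dataflow-worker@my-project.iam.gserviceaccount.com",
        # Keep workers off public IPs, on a specific subnetwork.
        "--no_use_public_ips",
        "--subnetwork=regions/us-central1/subnetworks/my-subnet",
        # Encrypt pipeline state with a customer-managed key (CMEK).
        "--dataflow_kms_key=projects/my-project/locations/us-central1/"
        "keyRings/my-keyring/cryptoKeys/my-key",
    ])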

Beam Concepts and Streaming Foundations

  • Apache Beam review: PCollections, PTransforms, DoFn lifecycle
  • Windows, watermarks, and triggers for streaming data
  • Handling late data and defining trigger types (windowing sketch below)
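
A minimal sketch of these streaming concepts in Python: fixed one-minute windows, a watermark trigger that re-fires once per late element, and a ten-minute lateness allowance. The toy input is bounded, with hand-assigned timestamps:

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    with beam.Pipeline() as p:
        (p
         | beam.Create([("user1", 1), ("user2", 1), ("user1", 1)])
         # Assign event-time timestamps (all t=0 here, for simplicity).
         | beam.Map(lambda kv: window.TimestampedValue(kv, 0))
         | beam.WindowInto(
             window.FixedWindows(60),            # 1-minute windows
             trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
             accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
             allowed_lateness=600)               # accept late data for 10 minutes
         | beam.CombinePerKey(sum)
         | beam.Map(print))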

Sources, Sinks, and Schemas

  • Writing and tuning I/O for performance
  • Creating custom sources and sinks with Splittable DoFn (SDF)
  • Using schemas to express structured data and improve performance (schema sketch below)
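
The schema idea can be sketched in a few lines of Python: emitting beam.Row objects gives a PCollection a named, typed schema, which schema-aware transforms such as GroupBy can then address by field name (the fields here are invented for the example):

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.Create([("laptop", 999.0), ("mouse", 25.0), ("laptop", 999.0)])
         # Attach a schema by emitting beam.Row objects with named, typed fields.
         | beam.Map(lambda kv: beam.Row(product=kv[0], price=kv[1]))
         # Schema-aware transforms can then refer to fields by name.
         | beam.GroupBy("product").aggregate_field("price", sum, "total_price")
         | beam.Map(print))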

State, Timers, and Best Practices

  • When and how to use state and timer APIs (see the sketch after this list)
  • Choosing the right type of state for your pipeline
  • Development and design best practices
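
As a minimal sketch of the state and timer APIs, the illustrative DoFn below buffers integers per key in bag state and sets an event-time timer that flushes their sum once the watermark passes the end of the window:

    import apache_beam as beam
    from apache_beam.coders import VarIntCoder
    from apache_beam.transforms.timeutil import TimeDomain
    from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

    class BufferAndFlush(beam.DoFn):
        """Buffers integers per key; a watermark timer flushes their sum."""
        BUFFER = BagStateSpec("buffer", VarIntCoder())
        FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

        def process(self, element,
                    w=beam.DoFn.WindowParam,
                    buffer=beam.DoFn.StateParam(BUFFER),
                    flush=beam.DoFn.TimerParam(FLUSH)):
            _, value = element    # element is a (key, value) pair
            buffer.add(value)
            flush.set(w.end)      # fire when the watermark passes the window end

        @on_timer(FLUSH)
        def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
            yield sum(buffer.read())
            buffer.clear()

State and timers are partitioned per key and window, which is why the input PCollection must consist of key-value pairs.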

Developing with SQL, DataFrames, and Notebooks

  • Using Beam SQL and DataFrames to build pipelines (SQL sketch below)
  • Prototyping in Beam notebooks with Beam magics
  • Launching jobs to Dataflow from notebooks
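
Beam SQL can be sketched briefly from Python. SqlTransform is a cross-language transform backed by the Java SDK, so running this locally also needs a Java runtime for the expansion service; the input PCollection is queryable as the table PCOLLECTION:

    import apache_beam as beam
    from apache_beam.transforms.sql import SqlTransform

    with beam.Pipeline() as p:
        (p
         | beam.Create([
             beam.Row(product="laptop", price=999.0),
             beam.Row(product="mouse", price=25.0),
             beam.Row(product="laptop", price=999.0)])
         # The schema'd input is queryable as the table PCOLLECTION.
         | SqlTransform(
             "SELECT product, SUM(price) AS total "
             "FROM PCOLLECTION GROUP BY product")
         | beam.Map(print))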

Monitoring, Logging, and Troubleshooting

  • Navigating the Dataflow Job Details UI
  • Setting alerts with Cloud Monitoring
  • Troubleshooting with diagnostics widgets and error reports
  • Structured debugging and common failure patterns (logging sketch below)
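
One logging habit pays off immediately when troubleshooting: messages written with Python's standard logging module inside a DoFn surface in Cloud Logging and in the Dataflow Job Details UI. A hypothetical parsing step that logs and drops bad records:

    import logging

    import apache_beam as beam

    class ParseRecord(beam.DoFn):
        """Parses 'name,value' records; logs and drops malformed ones."""
        def process(self, line):
            try:
                name, value = line.split(",")
                yield (name, int(value))
            except ValueError:
                # Shows up in Cloud Logging and the Job Details UI on Dataflow.
                logging.warning("Dropping malformed record: %r", line)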

Performance, Testing, and CI/CD

  • Performance tuning and data shape considerations
  • Testing strategies and automation (unit-test sketch below)
  • Streamlining CI/CD workflows for Dataflow
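
Pipeline logic is unit-testable with Beam's own testing utilities, which is the foundation the CI/CD material builds on. A minimal sketch using TestPipeline and assert_that, runnable under an ordinary test runner such as pytest:

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    def test_count_per_element():
        with TestPipeline() as p:
            counts = (p
                      | beam.Create(["a", "b", "a"])
                      | beam.combiners.Count.PerElement())
            # Verified when the pipeline runs at the end of the with-block.
            assert_that(counts, equal_to([("a", 2), ("b", 1)]))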

Reliability and Flex Templates

  • Designing for reliability in production pipelines (dead-letter sketch below)
  • Using Flex Templates to standardize and reuse code
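
A common reliability pattern in this space is the dead-letter output: instead of letting one bad record fail a production pipeline, route failures to a side output for inspection or replay. A sketch using Beam's tagged outputs, with an invented record format:

    import apache_beam as beam
    from apache_beam.pvalue import TaggedOutput

    class SafeParse(beam.DoFn):
        """Sends unparseable records to a dead-letter output instead of failing."""
        def process(self, line):
            try:
                name, value = line.split(",")
                yield (name, int(value))
            except ValueError:
                yield TaggedOutput("dead_letter", line)

    with beam.Pipeline() as p:
        results = (p
                   | beam.Create(["ok,1", "broken"])
                   | beam.ParDo(SafeParse()).with_outputs(
                       "dead_letter", main="valid"))
        results.valid | "UseGoodRecords" >> beam.Map(print)
        # In production, write the dead-letter output to durable storage.
        results.dead_letter | "QuarantineBad" >> beam.Map(print)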

Summary

  • Course recap and next steps

Class Dates & Times

Class times are listed in Eastern Time.

This is a 3-day class

When        Time             Where   How
09/10/2025  9:00AM - 5:00PM  Online  VILT (virtual instructor-led training)