Course Overview
How do you build data pipelines that scale without locking yourself into one platform? With Apache Beam and Google Cloud Dataflow, you can run serverless data processing at scale—without compromising flexibility or performance.
This 3-day course teaches data engineers and analysts how to use Apache Beam with Dataflow to build resilient, scalable, and portable pipelines for batch and streaming applications. You’ll learn to optimize performance, implement secure deployments, monitor your jobs, and apply best practices across the pipeline lifecycle—from development to CI/CD.
Whether you’re processing terabytes of batch data or building real-time pipelines, this course gives you the tools to simplify operations and build faster, more cost-effective solutions with Google Cloud Dataflow.
Who Should Attend
This course is designed for data engineers, as well as data analysts and data scientists looking to develop hands-on data engineering skills. It is ideal for anyone building batch or streaming pipelines on Google Cloud.
Course Objectives
- Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.
- Summarize the benefits of the Beam Portability Framework and enable it for your Dataflow pipelines.
- Enable Dataflow Shuffle for batch pipelines and Streaming Engine for streaming pipelines to maximize performance.
- Enable Flexible Resource Scheduling for more cost-efficient performance.
- Select the right combination of IAM permissions for your Dataflow job.
- Implement best practices for a secure data processing environment.
- Select and tune the I/O of your choice for your Dataflow pipeline.
- Use schemas to simplify your Beam code and improve the performance of your pipeline.
- Develop a Beam pipeline using SQL and DataFrames.
- Perform monitoring, troubleshooting, testing, and CI/CD on Dataflow pipelines.
Course Outline
Introduction
- Course objectives and overview
- Apache Beam and Dataflow integration
Beam Portability and Compute Options
- Beam Portability Framework and use cases
- Custom containers and cross-language transforms
- Shuffle, Streaming Engine, and Flexible Resource Scheduling (see the options sketch below)
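As a quick orientation, the sketch below shows how these service features are typically switched on from the Python SDK's pipeline options. It is a minimal sketch: the project, region, and bucket names are placeholders, and exact flag names can vary by SDK version and job type, so verify them against the release you run.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming job: Streaming Engine offloads shuffle and state to the Dataflow service.
streaming_opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # hypothetical project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # hypothetical staging bucket
    streaming=True,
    enable_streaming_engine=True,
)

# Batch job: FlexRS trades scheduling delay for cheaper, preemptible-friendly execution.
batch_opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",
)
```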
IAM, Quotas, and Security
- Selecting IAM roles and managing quotas
- Zonal strategies for data processing
- Best practices for secure environments
Beam Concepts and Streaming Foundations
- Apache Beam review: PCollections, PTransforms, DoFn lifecycle
- Windows, watermarks, and triggers for streaming data
- Handling late data and defining trigger types (see the windowing sketch below)
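To make these streaming concepts concrete, here is a minimal Beam Python sketch (the keys, values, and timestamps are invented) that applies fixed windows with a watermark trigger, a late firing, and an allowed-lateness policy:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    _ = (
        p
        | "Create" >> beam.Create([("alice", 3), ("bob", 5), ("alice", 7)])
        | "Stamp" >> beam.Map(
            lambda kv: window.TimestampedValue(kv, 1700000000))   # fake event times
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                      # 60-second fixed windows
            trigger=trigger.AfterWatermark(               # fire when the watermark passes...
                late=trigger.AfterCount(1)),              # ...then once per late element
            allowed_lateness=120,                         # accept data up to 2 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```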
Sources, Sinks, and Schemas
- Writing and tuning I/O for performance
- Creating custom sources and sinks with Splittable DoFn (SDF)
- Using schemas to express structured data and improve performance (see the schema sketch below)
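The schema material can be previewed with a small sketch: a NamedTuple registered with RowCoder gives the PCollection a schema, so downstream transforms can refer to fields by name. The class and field names here are made up for illustration.

```python
import typing
import apache_beam as beam

class Purchase(typing.NamedTuple):
    user: str
    amount: float

# Registering a RowCoder tells Beam to treat Purchase as a schema'd row type.
beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([Purchase("alice", 12.5), Purchase("bob", 3.0),
                       Purchase("alice", 4.5)]).with_output_types(Purchase)
        # Schema-aware GroupBy references fields by name instead of key lambdas.
        | beam.GroupBy("user").aggregate_field("amount", sum, "total_amount")
        | beam.Map(print)
    )
```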
State, Timers, and Best Practices
- When and how to use state and timer APIs (see the sketch after this list)
- Choosing the right type of state for your pipeline
- Development and design best practices
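For a feel of the state and timer APIs, the following sketch buffers values per key and flushes them when an event-time timer fires. The buffering policy, class name, and 60-second delay are assumptions chosen only to illustrate the pattern; apply it to a keyed PCollection, e.g. `keyed | beam.ParDo(BufferPerKey())`.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.utils.timestamp import Duration

class BufferPerKey(beam.DoFn):
    """Buffers integer values per key, emitting the batch when an event-time timer fires."""
    BUFFER = BagStateSpec("buffer", VarIntCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element
        buffer.add(value)
        # Hypothetical policy: flush 60 seconds of event time after this element.
        flush.set(timestamp + Duration(seconds=60))

    @on_timer(FLUSH)
    def on_flush(self, key=beam.DoFn.KeyParam,
                 buffer=beam.DoFn.StateParam(BUFFER)):
        yield key, list(buffer.read())
        buffer.clear()
```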
Developing with SQL, DataFrames, and Notebooks
- Using Beam SQL and DataFrames to build pipelines (see the SQL sketch below)
- Prototyping in Beam notebooks with Beam magics
- Launching jobs to Dataflow from notebooks
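As a taste of Beam SQL, the sketch below queries a schema'd PCollection with SqlTransform. Note that SqlTransform is a cross-language transform, so a Java environment must be available when the pipeline is built; the Order type and query here are placeholders.

```python
import typing
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

class Order(typing.NamedTuple):
    item: str
    amount: float

beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([Order("book", 12.0), Order("pen", 1.5),
                       Order("book", 8.0)]).with_output_types(Order)
        # The main input is exposed to the query as the table PCOLLECTION.
        | SqlTransform(
            "SELECT item, SUM(amount) AS total FROM PCOLLECTION GROUP BY item")
        | beam.Map(print)
    )
```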
Monitoring, Logging, and Troubleshooting
- Navigating the Dataflow Job Details UI
- Setting alerts with Cloud Monitoring
- Troubleshooting with diagnostics widgets and error reports
- Structured debugging and common failure patterns
Performance, Testing, and CI/CD
- Performance tuning and data shape considerations
- Testing strategies and automation (see the unit-test sketch below)
- Streamlining CI/CD workflows for Dataflow
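The testing topic builds on a pattern like the one below: a direct-runner TestPipeline plus assert_that to unit-test a transform in isolation. The transform under test here is a trivial stand-in; in practice you would exercise your own composite transforms.

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_word_lengths():
    # Runs locally on the direct runner, so it slots naturally into a CI job.
    with TestPipeline() as p:
        lengths = (
            p
            | beam.Create(["beam", "dataflow"])
            | beam.Map(len)
        )
        assert_that(lengths, equal_to([4, 8]))
```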
Reliability and Flex Templates
- Designing for reliability in production pipelines
- Using Flex Templates to standardize and reuse code
Summary
- Course recap and next steps