Video description
In Video Editions the narrator reads the book while the content, figures, code listings, diagrams, and text appear on the screen. Like an audiobook that you can also watch as a video.
An Airflow bible. Useful for all kinds of users, from novice to expert.
Rambabu Posa, Sai Aashika Consultancy
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
about the technology
Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.
about the book
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow and shows you how to customize them for your pipeline’s needs.
what's inside
- Build, test, and deploy Airflow pipelines as DAGs
- Automate moving and transforming data
- Analyze historical datasets using backfilling
- Develop custom components
- Set up Airflow in production environments
about the audience
For DevOps engineers, data engineers, machine learning engineers, and sysadmins with intermediate Python skills.
about the author
Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer.
An easy-to-follow exploration of the benefits of orchestrating your data pipeline jobs with Airflow.
Daniel Lamblin, Coupang
The one reference you need to create, author, schedule, and monitor workflows with Apache Airflow. Clear recommendation.
Thorsten Weber, bbv Software Services AG
By far the best resource for Airflow.
Jonathan Wood, LexisNexis
NARRATED BY JULIE BRIERLEY
Table of Contents
Part 1. Getting started
Chapter 1 Meet Apache Airflow
Chapter 1 Pipeline graphs vs. sequential scripts
Chapter 1 Introducing Airflow
Chapter 1 When to use Airflow
Chapter 2 Anatomy of an Airflow DAG
Chapter 2 Running a DAG in Airflow
Chapter 2 Running at regular intervals
Chapter 3 Scheduling in Airflow
Chapter 3 Cron-based intervals
Chapter 3 Processing data incrementally
Chapter 3 Understanding Airflow’s execution dates
Chapter 3 Best practices for designing tasks
Chapter 4 Templating tasks using the Airflow context
Chapter 4 Templating the PythonOperator
Chapter 4 Hooking up other systems
Chapter 5 Defining dependencies between tasks
Chapter 5 Branching
Chapter 5 Conditional tasks
Chapter 5 More about trigger rules
Chapter 5 Sharing data between tasks
Chapter 5 Chaining Python tasks with the Taskflow API
Part 2. Beyond the basics
Chapter 6 Triggering workflows
Chapter 6 Polling custom conditions
Chapter 6 Triggering other DAGs
Chapter 7 Communicating with external systems
Chapter 7 Developing locally with external systems
Chapter 7 Moving data between systems
Chapter 8 Building custom components
Chapter 8 Building a custom hook
Chapter 8 Building a custom operator
Chapter 8 Packaging your components
Chapter 9 Testing
Chapter 9 Setting up a CI/CD pipeline
Chapter 9 Testing with files on disk
Chapter 9 Working with external systems
Chapter 9 Using tests for development
Chapter 10 Running tasks in containers
Chapter 10 Introducing containers
Chapter 10 Containers and Airflow
Chapter 10 Creating container images for tasks
Chapter 10 Running tasks in Kubernetes
Chapter 10 Using the KubernetesPodOperator
Part 3. Airflow in practice
Chapter 11 Best practices
Chapter 11 Manage credentials centrally
Chapter 11 Use factories to generate common patterns
Chapter 11 Designing reproducible tasks
Chapter 11 Handling data efficiently
Chapter 11 Managing your resources
Chapter 12 Operating Airflow in production
Chapter 12 Which executor is right for me?
Chapter 12 A closer look at the scheduler
Chapter 12 Installing each executor
Chapter 12 Setting up the KubernetesExecutor
Chapter 12 Capturing logs of all Airflow processes
Chapter 12 Visualizing and monitoring Airflow metrics
Chapter 12 Creating dashboards with Grafana
Chapter 12 How to get notified of a failing task
Chapter 12 Scalability and performance
Chapter 13 Securing Airflow
Chapter 13 Encrypting data at rest
Chapter 13 Encrypting traffic to the webserver
Chapter 13 Fetching credentials from secret management systems
Chapter 14 Project: Finding the fastest way to get around NYC
Chapter 14 Extracting the data
Chapter 14 Structuring a data pipeline
Part 4. In the clouds
Chapter 15 Airflow in the clouds
Chapter 15 Google Cloud Composer
Chapter 16 Airflow on AWS
Chapter 16 AWS-specific hooks and operators
Chapter 16 Building the DAG
Chapter 17 Airflow on Azure
Chapter 17 Overview
Chapter 18 Airflow in GCP
Chapter 18 Integrating with Google services
Chapter 18 GCP-specific hooks and operators
Chapter 18 Getting data into BigQuery