An Airflow Provider for Telomere

Published:  at  01:49 PM

Reliably reporting Airflow DAG failures was the original motivation for creating Telomere. Here’s an excerpt from our launch blog post:

Our alerting integration inside Airflow itself could fail to deliver if there was a problem with the pipeline code or our Airflow infrastructure. We added an external cron monitoring tool to help, but this split our alerts across multiple systems. Also, the cron monitoring would alert the following day if a pipeline started but died along the way.

Using our new Telomere Airflow Provider, you can painlessly monitor both the health of your DAGs’ schedule (i.e., “did our nightly process run last night?”) and the health of the DAG process itself (i.e., “did our nightly process finish within 4 hours?”). All it takes is a couple of lines of code, plus an Airflow connection configured with your API key:

from airflow import DAG
from telomere_provider.utils import enable_telomere_tracking

# Your existing DAG, scheduled to run every 24 hours with a 4 hour timeout...
dag = DAG("nightly_dag", ...)

# Enable tracking with one line!
enable_telomere_tracking(dag)

This wrapper will add a start node, an end node, and a conditional fail node to the DAG:

[Image: Telomere DAG Nodes]

We found that running these as regular nodes is more reliable than Airflow’s other integration points for DAG lifecycle events and avoids their quirks, likely because regular nodes execute on the workers rather than inside other Airflow services.

When the DAG runs, it creates two lifecycles in Telomere: nightly.schedule for the schedule health and nightly.dag for the DAG process health:

[Image: Telomere Airflow Integration]

The Telomere provider configures the timeouts for these automatically based on Airflow’s configuration: nightly.schedule will time out after 24 hours plus a 5-minute grace period, and nightly.dag will time out after 4 hours. Configure email alerts for your team in Telomere and you’ll know when either timeout fires, or immediately if Airflow fails the DAG, since that is also reported to Telomere. In the screenshot above, you’ll see the nightly.schedule lifecycle running, waiting for the next night’s DAG to kick off on time.
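To make the timeout arithmetic concrete, here is a minimal sketch in plain Python. It only illustrates the calculation described above and is not the provider’s actual code: the helper name derive_timeouts and its parameters are hypothetical, and the 5-minute grace period is simply the value mentioned in this post.

```python
from datetime import timedelta


# Hypothetical helper, not part of telomere_provider: it only illustrates
# how the two lifecycle timeouts relate to the DAG's Airflow settings.
def derive_timeouts(schedule_interval, dagrun_timeout, grace=timedelta(minutes=5)):
    """Return (schedule_timeout, dag_timeout) as timedeltas."""
    # The schedule lifecycle waits one full interval plus a grace period
    # for the next run to start.
    schedule_timeout = schedule_interval + grace
    # The DAG lifecycle is bounded by the DAG run's own timeout.
    dag_timeout = dagrun_timeout
    return schedule_timeout, dag_timeout


schedule_timeout, dag_timeout = derive_timeouts(
    schedule_interval=timedelta(hours=24),
    dagrun_timeout=timedelta(hours=4),
)
print(schedule_timeout)  # 1 day, 0:05:00
print(dag_timeout)       # 4:00:00
```

With the nightly example from this post, that gives 24 hours and 5 minutes for nightly.schedule and 4 hours for nightly.dag.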

The provider also includes tools for tracking individual tasks, so you can do fine-grained monitoring of critical sections of DAGs as needed. This is especially helpful for long-running DAGs, where you might want to check for progress along the way rather than wait for a multi-hour timeout on the main DAG. The provider also lets you configure whether Telomere itself being down should fail your DAG or just log a warning.
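The choice between failing the DAG and logging a warning is a general fail-closed versus fail-open trade-off. The sketch below shows that pattern in plain Python; it is not the provider’s API. The report function, its fail_on_error flag, and the unreachable stand-in are all hypothetical names used for illustration.

```python
import logging

logger = logging.getLogger("telomere_sketch")


# Hypothetical wrapper, not the provider's API: `send` is any callable that
# reports to the monitoring service and raises an exception on failure.
def report(send, fail_on_error=False):
    """Call send(); on error either re-raise or log a warning and continue."""
    try:
        send()
        return True
    except Exception as exc:
        if fail_on_error:
            # Fail-closed: let the error propagate so the calling task fails.
            raise
        # Fail-open: record the problem and let the pipeline continue.
        logger.warning("Telomere report failed, continuing: %s", exc)
        return False


def unreachable():
    # Stand-in for a monitoring call made while the service is down.
    raise ConnectionError("service unavailable")


# Fail-open: the outage is logged and the caller carries on.
result_open = report(unreachable, fail_on_error=False)

# Fail-closed: the outage propagates to the caller.
try:
    report(unreachable, fail_on_error=True)
    raised = False
except ConnectionError:
    raised = True
```

Fail-open keeps a monitoring outage from blocking production data, while fail-closed suits pipelines where an unmonitored run is considered unacceptable.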

Telomere provides a generous free tier that will let you get started monitoring over a dozen daily DAGs like this. Please visit Telomere Airflow Provider on GitHub and reach out at hello@modulecollective.com if you have any questions or feedback!


