Senior Workflow Orchestration Engineer (Airflow & Scheduling Platforms)
About the role
We're seeking a seasoned engineer to design, operate, and scale our workflow orchestration platform, with a primary focus on Apache Airflow. You'll own the Airflow control plane and developer experience end to end, spanning architecture, automation, security, observability, and reliability, while also evaluating and operating complementary schedulers where appropriate. You'll build automation infrastructure and partner with data, trading, and engineering teams to deliver mission-critical pipelines at scale.
What you’ll do
- Architect, deploy, and operate production‑grade Airflow on Kubernetes, including all core components and user application dependencies, with a focus on upgrades, capacity planning, HA, security, and performance tuning
- Operate a multi‑scheduler ecosystem: determine when to use Airflow, distributed compute schedulers, or lightweight task runners based on workload requirements; provide a unified developer experience across schedulers
- Build automation infrastructure: Terraform modules and Helm charts with GitOps‑driven CI/CD for environment provisioning, upgrades, and zero‑downtime rollouts
- Standardize the developer experience: DAG repo templates, shared operator libraries, connection and secrets management, dependency packaging, code ownership, linting, unit testing, and pre‑commit hooks
- Implement comprehensive observability: metrics collection, dashboards, distributed tracing, SLA/latency monitoring, intelligent alerting, and runbook automation
- Enable resilient workflow patterns: build idempotency frameworks, retry/backoff strategies, deferrable operators and sensors, dynamic task mapping, and data‑aware scheduling (see the illustrative sketch after this list)
- Ensure reliability at enterprise scale: architect and tune resource allocation (pools, queues, concurrency limits) to support high‑throughput workloads; optimize large‑scale backfill strategies; develop comprehensive runbooks and lead incident response/postmortems
- Partner with teams across the organization to provide enablement, documentation, and self‑service tooling
- Mentor engineers, contribute to platform roadmap and technical standards, and drive engineering best practices
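For illustration, here is a minimal sketch of the resilient workflow patterns named above, assuming Airflow 2.7+ with the TaskFlow API: data‑aware scheduling via a Dataset, dynamic task mapping, and exponential retry backoff. The dataset URI, partition list, and task bodies are hypothetical placeholders, not our actual pipelines.

```python
from datetime import timedelta

import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical upstream dataset; a producer DAG elsewhere would declare it as an outlet.
orders_dataset = Dataset("s3://example-bucket/orders/")

@dag(
    schedule=[orders_dataset],  # data-aware scheduling: run when the dataset is updated
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "max_retry_delay": timedelta(minutes=30),
    },
)
def orders_pipeline():
    @task
    def list_partitions() -> list[str]:
        # In practice this would query a catalog or manifest; hard-coded for the sketch.
        return ["2024-01-01", "2024-01-02"]

    @task
    def process_partition(partition: str) -> None:
        # Idempotent by design: each mapped task instance owns exactly one partition,
        # so retries and backfills can safely re-run it.
        print(f"processing {partition}")

    # Dynamic task mapping: one task instance per partition discovered at runtime.
    process_partition.expand(partition=list_partitions())

orders_pipeline()
```

Deferrable operators and sensors follow the same DAG structure but hand long waits to the triggerer process, freeing worker slots instead of occupying them while polling.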
Required qualifications
- 5–8+ years building/operating data or platform systems; 3+ years running Airflow in production at scale (hundreds to thousands of DAGs and high task throughput).
- Deep Airflow expertise: DAG design and testing, idempotency, deferrable operators/sensors, dynamic task mapping, task groups, datasets, pools/queues, SLAs, retries/backfills, cross‑DAG dependencies (a sample DAG integrity test follows this list).
- Strong Kubernetes experience running Airflow and supporting services: Helm, autoscaling, node/pod tuning, topology spread, network policies, PDBs, and blue/green or canary strategies.
- Automation‑first mindset: Terraform, Helm, GitOps (Argo CD/Flux), and CI/CD for platform lifecycle; policy‑as‑code (OPA/Gatekeeper/Conftest) for DAG, connection, and secrets changes.
- Proficiency in Python for authoring operators/hooks/utilities; solid Bash; familiarity with Go or Java is a plus.
- Observability and SRE practices: Prometheus/Grafana/StatsD, centralized logging, alert design, capacity/throughput modeling, performance tuning.
- Data platform experience with at least one major cloud (AWS/Azure/GCP) and systems like Snowflake/BigQuery/Redshift, Databricks/Spark, EMR/Dataproc; strong grasp of IAM, VPC networking, and storage (S3/GCS/ADLS).
- Security/compliance: SSO/OIDC, RBAC, secrets management (Vault/Secrets Manager), auditing, least‑privilege connection management, and change control.
- Proven incident leadership, runbook creation, and platform roadmap execution; excellent cross‑functional communication.
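As a concrete example of the DAG design and testing expertise above, a minimal sketch of a DagBag-based integrity test, assuming pytest and a repository-local dags/ directory (both hypothetical): it catches import errors in CI and enforces a simple retry policy across all DAGs.

```python
import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag() -> DagBag:
    # Parse the project's DAG files, skipping the example DAGs that ship with Airflow.
    return DagBag(dag_folder="dags/", include_examples=False)

def test_dags_import_cleanly(dag_bag: DagBag) -> None:
    # Any syntax or import error in a DAG file surfaces here and fails CI.
    assert dag_bag.import_errors == {}

def test_every_dag_sets_retries(dag_bag: DagBag) -> None:
    for dag_id, dag in dag_bag.dags.items():
        retries = (dag.default_args or {}).get("retries", 0)
        assert retries >= 1, f"{dag_id} should configure retries in default_args"
```

Tests like these typically run in pre-commit hooks and CI so DAG authors get feedback before changes reach a shared environment.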
Nice to have
- Experience operating alternative orchestrators (Prefect 2.x, Dagster, Argo Workflows, AWS Step Functions) and leading migrations to/from Airflow.
- OpenLineage/Marquez adoption; Great Expectations or other data quality frameworks; data contracts.
- Cost optimization and capacity planning for schedulers and workers; spot instance strategies.
- Multi‑region HA/DR for Airflow metadata DB; backup/restore and disaster drills.
- Building internal developer platforms/portals (e.g., Backstage) for self‑service pipelines.
- Contributions to Apache Airflow or provider packages; familiarity with recent AIPs/Airflow 2.7+ features.