Lead the end-to-end architecture of a Databricks-centric multi-agent processing engine leveraging Mosaic AI, Model Serving, MLflow, Unity Catalog, Delta Lake, and Feature Store for decoder automation at scale.
Design governed data ingestion, storage, and real-time processing using Delta Lake, Structured Streaming, and Databricks Workflows with enterprise security and full lineage.
Own the model lifecycle with MLflow: experiment tracking, registry/versioning, A/B testing, drift monitoring, and automated retraining pipelines.
Architect low-latency model serving endpoints with auto-scaling and confidence-based routing for sub-second agent decisioning.
Establish robust data governance with Unity Catalog across all environments, covering access control, auditing, and lineage.
Drive performance and cost optimization across compute, storage, and serving workloads.
Define production release strategies (blue-green), monitoring and alerting, operational runbooks, and SLOs for dependable operations.
Partner with engineering, MLOps, and product to deliver human-in-the-loop workflows and dashboards using Databricks SQL and a React frontend.
Lead change management, training, and knowledge transfer while managing a parallel shadow-processing path during ramp-up.
Plan and coordinate phased delivery, success metrics, and risk mitigation across foundation, agent development, automation, and production rollout.
Our requirements
Proven experience architecting solutions on the Databricks Lakehouse using Unity Catalog, Delta Lake, MLflow, Model Serving, Feature Store, AutoML, and Databricks Workflows.
Expertise in real-time/low-latency model serving architectures with auto-scaling, confidence-based routing, and A/B testing.
Strong knowledge of cloud security and governance on Azure or AWS, including Azure AD/AWS IAM, encryption, audit trails, and compliance frameworks.
Hands-on MLOps skills across experiment tracking, model registry/versioning, drift monitoring, automated retraining, and production rollout strategies.
Proficiency in Python and Databricks-native tooling, with practical integration of REST APIs/SDKs and Databricks SQL in analytics products.
Familiarity with React dashboards and human-in-the-loop operational workflows for ML and data quality validation.
Demonstrated ability to optimize performance, reliability, and cost for large-scale analytics/ML platforms with strong observability.
Experience leading multi-phase implementations with clear success metrics, risk management, documentation, and training/change management.
Domain knowledge in telemetry, time-series, or industrial data (aerospace a plus) and prior work with agentic patterns on Mosaic AI.
Databricks certifications and experience in enterprise deployments of the platform are preferred.