We are seeking an advanced MLOps Engineer to design and implement the infrastructure required to host, orchestrate, and manage up to 1,500 ML scoring processes within a new Databricks environment. The role focuses on operationalizing ML scoring pipelines by setting up a scalable, secure, and well-monitored platform on which data science teams can deploy their models.
The successful candidate will create the ML operational backbone that allows data scientists to run, monitor, and manage high-volume ML scoring efficiently.
This is a hybrid role that requires working from the office three days per week.
Responsibilities:
Set up Databricks clusters, jobs, and workflows for large-scale ML scoring use cases.
Use Infrastructure as Code (e.g., Terraform) for reproducibility and governance.
Implement scalable infrastructure capable of running thousands of ML scoring tasks.
Configure job scheduling, parallel execution strategies, and resource optimization.
Integrate monitoring and alerting into the platform using cloud-native tools.
Treat security, compliance, and cost-efficiency as key pillars of the operational setup.
Develop deployment processes for ML models using MLflow on Databricks or an equivalent tool.
Implement version control and tracking for models, scoring code, and configuration files.
Integrate logging, alerting, and dashboards to monitor scoring throughput, latency, and failures.
Establish model performance monitoring hooks for post-scoring analytics.
Work alongside DevOps Engineers to ensure common infrastructure and processes (e.g., shared storage, Delta Lake tables) serve both ML and BI use cases.
Automate provisioning of resources and deployments from CI/CD pipelines.
Utilize Infrastructure as Code (IaC) where feasible for reproducibility.
Expected requirements:
Proven experience with MLOps in production ML environments.
Strong hands-on knowledge of Databricks (MLflow, Jobs, Workflows, Delta Lake).
Experience with large-scale batch job orchestration and distributed computing.
Familiarity with Python for workflow scripting and pipeline integration.
Experience with CI/CD pipelines for ML model deployment (Azure DevOps, GitHub Actions, or similar).
Proficiency with monitoring tools and logging frameworks (e.g., Datadog, Prometheus, Grafana, or built-in cloud monitoring).
Understanding of Infrastructure as Code and cloud environment automation.
Knowledge of model lifecycle management, versioning, and reproducibility.