The Role:
As a Senior MLOps/LLMOps Engineer, you will be at the forefront of building and scaling our AI/ML infrastructure, bridging the gap between cutting-edge large language models and production-ready systems. You will play a pivotal role in designing, deploying, and operating the platforms that power our AI-driven products, working at the intersection of DevOps, MLOps, and emerging LLM technologies.
In this role, you'll architect robust, scalable infrastructure for deploying and monitoring large language models (LLMs) such as GPT- and Claude-family models on Amazon Bedrock and Azure AI Foundry, while ensuring security, observability, and reliability across multi-tenant ML workloads. You will collaborate closely with data scientists, ML engineers, platform teams, and product stakeholders to create seamless, self-serve experiences that accelerate AI innovation across the organization.
This is a hands-on leadership role that blends strategic thinking with deep technical execution. You'll own the end-to-end ML platform lifecycle, from infrastructure provisioning and CI/CD automation to model deployment, monitoring, and cost optimization. As a senior technical leader, you'll champion best practices, mentor team members, and drive a culture of continuous improvement, experimentation, and operational excellence.
Responsibilities:
- Run and evolve our ML/LLM compute infrastructure on Kubernetes/EKS (CPU/GPU) for multi-tenant workloads, ensuring portability across AWS/Azure AI Foundry regions with region-aware scheduling, cross-region data access, and artifact management
- Engage with platform and infrastructure teams to provision and maintain access to cloud environments (AWS, Azure), ensuring seamless integration with existing systems
- Set up and maintain deployment workflows for LLM-powered applications, handling environment-specific configurations across development, staging/UAT, and production
- Build and operate GitOps-native delivery pipelines using GitLab CI, Jenkins, ArgoCD, Helm, and FluxCD to enable fast, safe rollouts and automated rollbacks
- Deploy, scale, and optimize large language models (GPT, Claude, and similar) with deep consideration for prompt engineering, latency/performance tradeoffs, and cost efficiency
- Operate and maintain Argo Workflows as a reliable, self-serve orchestration platform for data preparation, model training, evaluation, and large-scale batch compute
- Implement and evaluate models using AI observability frameworks to track model performance, drift, and quality in production
- Design and maintain robust CI/CD pipelines with isolated development, staging, and production environments to support safe iteration, reproducibility, and full lifecycle observability
- Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, and Helm to automate provisioning, configuration, and scaling of cloud resources
- Manage container orchestration, secrets management (e.g., AWS Secrets Manager), and secure deployment practices across all environments
- Set up and analyze comprehensive observability stacks using Prometheus/Grafana and Splunk to monitor model health, infrastructure performance, and system reliability

Requirements: AWS, DevOps, MLOps, AWS S3, AWS EC2, Amazon EKS, Amazon RDS, PostgreSQL, IAM, AWS Lambda, CloudWatch, Kubernetes, GPU, Autoscaling, Docker, Jenkins, ArgoCD, Python, FastAPI, Django, pandas, NumPy, Machine Learning, scikit-learn, TensorFlow, PyTorch, Data pipelines, Infrastructure as Code, Terraform, CloudFormation, Helm, GitLab CI, Prometheus, Grafana, Splunk, Datadog, Security, Linux