Lead the design and implementation of scalable, secure, and high-performance AI infrastructure supporting large-scale distributed training and inference in both on-prem and cloud environments.
Drive architectural decisions and roadmap planning in collaboration with engineering, data science, and platform teams.
Own deployment pipelines and platform operations for Kubernetes-based environments using Helm and GitLab CI/CD.
Mentor junior engineers and serve as the escalation point for complex infrastructure incidents and troubleshooting.
Enhance observability and reliability of GPU workloads and model-serving environments through custom tooling and automation in Python and Shell.
Integrate compute infrastructure with enterprise identity and policy controls using Azure AD and role-based access management.
Partner with Network, Storage, and Security teams to ensure AI workloads meet compliance, availability, and performance standards.
Lead incident post-mortems, improve operational playbooks, and develop long-term monitoring and reliability strategies.
Support change management, risk documentation, and ServiceNow processes for production infrastructure.
Participate in a rotating on-call schedule, including occasional weekend coverage.
Requirements
Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical discipline.
8+ years of experience in infrastructure engineering, DevOps, or SRE roles, with 3+ years in AI/ML platforms or high-performance compute environments.
Deep hands-on experience with Kubernetes, Docker, Helm, GitLab CI/CD, and GPU-enabled infrastructure.
Strong proficiency in Python and Bash/Shell for infrastructure automation, diagnostics, and operations tooling.
Demonstrated leadership in guiding complex infrastructure projects and mentoring engineers.
Familiarity with Jira, ServiceNow, and enterprise systems for identity (Azure AD), audit, and compliance.
Advanced understanding of networking and storage architectures in the context of AI model training and inference.
Ability to operate in high-stakes environments, balancing delivery speed, risk mitigation, and engineering rigor.
We offer
America’s Most Innovative Companies, Fortune, 2024
World’s Most Admired Companies, Fortune, 2024
Human Rights Campaign Foundation, Corporate Equality Index, 100% score, 2023-2024