Product: Global Platform Engineering.
Your role:
• Supervise a team of Site Reliability Engineers
• Report metrics on application performance and incidents
• Anticipate and respond to infrastructure and application failures
• Build and automate failover and recovery workflows
• Implement an observability and monitoring stack for the infrastructure and application layers
• Improve high availability and scalability for existing solutions
• Manage application downtime by defining and measuring SLAs and error budgets
• Design backup and recovery strategies
Your background:
• You have a degree in Information Technology or a similar field
• You have hands-on experience with the AWS cloud
• You know CI/CD automation tools (Jenkins, GitHub, or similar)
• You know how to automate and script cloud workloads with Infrastructure as Code (IaC) and Configuration as Code (CaC) techniques (Terraform, CloudFormation, Ansible, Helm)
• You know monitoring tools (Datadog, Prometheus, Grafana, Splunk, or similar)