As a Senior Site Reliability Engineer, you’ll be at the heart of our most critical operations, building the fault-tolerant systems that serve our most important clients: Communications Service Providers (CSPs).
We build, test, launch, and operate the complex, high-stakes systems for our global telco customers. Your mission is to ensure the reliability, efficiency, and performance of our core products (like UMP, CEM, BSAP, and DHCP) across both our cloud and complex on-premise deployments.
This is no small task. Our products manage hundreds of millions of devices across 100+ large deployments worldwide, run by industry giants like T-Mobile, Play, and Vodafone, including through our Cloud offering.
As we embrace more cloud-native and Kubernetes-based deployments, we’re facing new architectural challenges. We aren’t just looking for someone to maintain systems; we are looking for an experienced engineer who loves to solve complex operational problems and is passionate about building the automation that will lead us forward.
Requirements
- 5+ years of professional experience in Site Reliability Engineering, DevOps, or a related role such as Systems or Software Engineering.
- Advanced proficiency in a high-level language such as Python or Go, with the ability to design and build complex, maintainable automation services and frameworks.
- Deep expertise with cloud infrastructure (e.g., GCP, AWS, or Azure) and container orchestration (Kubernetes).
- Proven experience with Infrastructure as Code and configuration management (e.g., Terraform, Ansible).
- Expert-level understanding of networking (TCP/IP, OSI) and Unix/Linux systems (Ubuntu, RHEL).
- Expertise in designing, implementing, and managing monitoring and observability tools (e.g., Prometheus, Grafana, Zabbix).
- A strong sense of ownership, and the ability to lead technical discussions and mentor other engineers.
- Proficiency in English (B2+).
A huge plus if you have experience with:
- A formal SRE role in a previous company.
- Database setup and administration (e.g., MongoDB, Redis).
- Performance tuning and debugging of JVM-based applications.
- Building and scaling distributed systems.
Responsibilities
- Design, build, and maintain complex automation services and frameworks that eliminate toil and scale our operations.
- Proactively identify, debug, and resolve complex performance and reliability issues within our core product codebases.
- Communicate directly with technical customer teams to troubleshoot, manage, and resolve complex production issues.
- Lead blameless postmortems and Root Cause Analyses (RCAs) for complex incidents, driving preventative measures.
- Establish and monitor Service Level Indicators (SLIs) to align the team with availability and latency objectives.
- Participate in a 24/7 on-call rotation (with additional pay), responding to and resolving critical system issues.
- Mentor junior and mid-level engineers through code reviews, design discussions, and pair programming.
- Collaborate with development teams on feature design and architecture to ensure reliability, scalability, and operability from the start.
- Set up and configure software, networks, and operating systems across bare metal, VMs, and cloud/Kubernetes infrastructure.
- Drive improvements to our monitoring and observability stack (Prometheus, Grafana, Loki) to provide a comprehensive view of system health.
What we offer
- Freedom and responsibility. Our goal is to inspire people more than manage them. We want our teams to do what is best for our products. This, in turn, generates a sense of responsibility which drives us to do great work.
- Technical challenges: our customers depend on the reliability of our products to generate revenue in their business. The telco industry is ever-growing and needs us to support that growth.
- Open-source contribution opportunities.
- A team of highly skilled and humorous colleagues.
- Access to the best tools and equipment available in the market.
- A MacBook Pro / ThinkPad with 2 monitors.
- Company events and team building activities.
- Multiple career paths and employee development options – we want you to develop into a tech lead in the future, but we’ll support you in getting another dream role in site reliability, management, product development or sales.
- Flexible working hours/remote work when you need it
- Training and conferences
- Multisport card
- Kitchen full of snacks and treats (including Good Lood ice cream)
- Car parking area and bike room
- A relaxed work atmosphere – no dress code, no open space
Come join the best!
Thank you for your interest in the Senior Site Reliability Engineer position.
Haven’t found a perfect match? Send your CV anyway! Email us at jobs@avsystem.com or just click "Apply."