technologies-expected :

technologies-optional :

about-project :

We’re looking for highly motivated, passionate site reliability engineers to join our growing team. At evertz.io, our teams build services used by the biggest names in the broadcast and media industry. Our services are hosted in AWS, with a serverless-first mindset.
In this role, you will work with our talented teams to harden our multi-tenant SaaS platform. Using best-in-class observability tooling, you will debug incidents while identifying and implementing improvements that keep the platform reliable. Your drive to eliminate toil will see you automating processes and building the tools to do so.

responsibilities :

Ensure platform reliability and uptime by monitoring, maintaining, and optimizing multi-tenant SaaS infrastructure.
Investigate and resolve production incidents, perform root cause analysis, and drive long-term fixes to improve system resilience.
Develop tools and automation to streamline daily operational processes and improve efficiency.
Implement and maintain observability solutions (monitoring, logging, alerting) to enable proactive issue detection and quick response.
Define and refine SLOs and SLIs, turning performance metrics into actionable reliability improvements.
Design, maintain, and optimize CI/CD pipelines to ensure fast, consistent, and secure deployments.
Design, deploy, and maintain scalable cloud and application infrastructure.
Apply Infrastructure as Code practices to ensure scalable and reproducible environments.
Enhance system security and compliance, conducting assessments and implementing mitigation strategies aligned with industry standards.
Participate in on-call duty.

requirements-expected :

At least 3 years of hands-on experience managing critical, high-availability production infrastructure, demonstrating success in maintaining reliability and maximizing application uptime.
Proficient in at least one programming language (such as Python, Java, or Rust), with experience designing and building production-quality automation, tools, or software libraries.
At least 3 years working with monitoring, log aggregation, and observability platforms such as Datadog, CloudWatch, Honeycomb, Splunk, or New Relic, using data-driven insights to proactively identify and resolve issues.
Excellent analytical skills with the ability to understand end-to-end use cases, map system flows, debug complex issues, and anticipate potential failure points.
Proven track record of translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you.
At least 3 years of experience with cloud technologies, in particular AWS services and tools such as CloudFormation, Lambda, DynamoDB, SQS, SNS, EC2, S3, AWS CLI, and Boto3.
Solid foundation in Linux systems administration, networking, and security.
Familiarity with using and configuring CI/CD pipelines with tools such as Jenkins and AWS CodePipeline.

benefits :
