We are recruiting a dynamic and experienced Site Reliability Engineering Lead to grow our digital capabilities and to equip engineering teams with the tools and practices to ship features safely, in line with standards, policies, and regulatory requirements.
You will have extensive domain expertise in client- and market-facing applications, market connectivity, and execution platforms, with hands-on experience working with external and market data vendors. Drawing on this experience, you will partner with Business Service Owners to agree and develop SLOs, SLIs, and Error Budgets to assess resilience and drive interventions with engineering teams.
You will drive the strategy for observability, synthetic monitoring, and application performance monitoring with your global lead, and implement this strategy in the Krakow Tech Center.
responsibilities :
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve system reliability.
Collaborate with cross-functional teams to improve system reliability, scalability, and performance.
Automate away toil in incident detection, response, and remediation through innovative AIOps solutions.
Provide specialist technical knowledge and experience, championing DevOps and Agile practices.
Be responsible for defining the organisational structure for service governance, defining shift patterns, and ensuring appropriate availability of front-line teams to provide 24x7 cover.
Be the key point of contact for all relevant stakeholders (CIOs, business stakeholders, Infrastructure Management Head, etc.) to maintain visibility of their service availability, operability metrics, risk appetite, incidents, and control effectiveness. Provide robust challenge to the same audience when risk appetites are threatened, and instigate the incident process when risk appetite is breached.
Be accountable for swift incident response and service restoration, with succinct and timely communication to the key WPB tech and business stakeholders. This includes responding to IMT calls and driving MEC calls, co-ordinating the virtual teams that conduct recovery actions. Work with the support teams and ITSOs for all owned and consumed services to drive service quality and effective SRE practice.
Lead the PIR/MIR with an intent to restore services swiftly. Work with service owners and problem management teams to critically review incidents and agree actions to prevent recurrence. Ensure read-across of post-incident learnings. Drive closure of repeat incidents.
requirements-expected :
Previous hands-on experience in DevOps tooling or Site Reliability Engineering roles.
Experience in operating, deploying, and supporting services on AWS.
Experience in owning and maintaining engineering platforms and tools.
Solid understanding of system architecture, network protocols, and distributed systems.
Experience in crisis management, blameless post-mortems, root cause analysis, and implementing corrective actions to prevent recurrence.
Strong troubleshooting and problem-solving skills, with a proven ability to quickly diagnose complex issues and propose effective solutions.
Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams and communicate technical concepts to both technical and non-technical stakeholders.
offered :
Competitive salary
Annual performance-based bonus
Additional bonuses for recognition awards
Multisport card
Private medical care
Life insurance
One-time reimbursement of home office set-up (up to 800 PLN)
Corporate parties & events
CSR initiatives
Nursery discounts
Financial support for training and education
Social fund
Flexible working hours
Free parking
benefits :
sharing the costs of sports activities
private medical care
sharing the costs of professional training & courses