Solution Architect – E2E Monitoring & Performance Strategy
We are looking for an experienced Solution Architect to lead the design and implementation of a comprehensive end-to-end (E2E) monitoring architecture and performance strategy for the SCAPE platform. This role involves ensuring full alignment with enterprise architecture standards and seamless integration with the Monitoring and Load Testing Centre of Excellence (COE). The platform consists of infrastructure, application, and data layers, requiring a structured, scalable approach to observability, performance engineering, and operational resilience.
Your responsibilities
- Design and document the monitoring and performance architecture for the SCAPE platform across infrastructure, application, and data layers.
- Standardize logging, tracing, alerting, and performance measurement patterns.
- Collaborate with the Monitoring COE to integrate enterprise-grade monitoring using Splunk.
- Build SLA dashboards, anomaly detection models, and root cause analysis mechanisms for tools such as Domino, Bitbucket, Artifactory, and Jira.
- Work closely with the NVS monitoring team to develop PoCs for critical SCAPE components.
- Support the design of final monitoring solutions based on PoC outcomes.
- Define a comprehensive performance strategy covering load testing, baseline metrics, and scalability patterns.
- Design proactive performance tests integrated into CI/CD pipelines.
- Gather monitoring requirements from SCAPE teams and ensure full integration with delivery processes.
- Cooperate with the architecture team to ensure alignment with the SCAPE Blueprint and Domino tooling ecosystem.
- Contribute to disaster recovery and business continuity (DRP/BCP) planning, focusing on monitoring triggers and recovery actions.
- Implement observability features that enhance platform reliability and risk visibility.
Our requirements
- Proven experience in designing monitoring and performance strategies for enterprise platforms.
- Deep knowledge of observability tools, preferably Splunk.
- Experience with CI/CD environments and integrating performance tests into delivery pipelines.
- Strong understanding of distributed systems architecture, logging, and anomaly detection.
- Familiarity with Atlassian tools, artifact repositories, and scientific computing platforms is a plus.
- Excellent collaboration and stakeholder management skills.
- English language proficiency: minimum B2 level.