The ideal candidate will have strong experience with Docker, Kubernetes, and Unix/Linux systems, along with a deep understanding of incident management, production support, and application monitoring. You will collaborate closely with development, operations, and security teams to resolve production issues quickly and efficiently while continuously improving system reliability.
- Production Support & Incident Management:
- Provide production support for mission-critical financial applications, ensuring high availability and performance.
- Lead and coordinate incident management efforts, ensuring incidents are quickly diagnosed, mitigated, and resolved, with a focus on reducing downtime and service interruptions.
- Troubleshoot production issues across applications, infrastructure, and networking, working closely with development and operations teams to implement long-term fixes.
- System Monitoring & Performance Tuning:
- Monitor and optimize the performance, availability, and reliability of systems using modern monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic).
- Implement and manage alerting systems to proactively detect and resolve potential issues before they impact users.
- Optimize and tune the infrastructure and applications to improve performance and reduce system resource usage.
- Infrastructure Automation & DevOps Practices:
- Automate infrastructure deployment, scaling, and management processes using tools such as Docker, Kubernetes, and CI/CD pipelines to ensure continuous integration and delivery.
- Write and maintain infrastructure-as-code (e.g., Terraform, Ansible) to enable efficient deployment and scaling of systems and applications.
- Work with DevOps teams to implement best practices for containerization, orchestration, and automation.
- Collaboration with Development Teams:
- Work closely with development teams to ensure that production systems are scalable, reliable, and secure.
- Participate in the design, implementation, and review of new features or systems with an emphasis on their operational readiness for production.
- Provide feedback on system designs and improvements, helping to bridge the gap between development and operations.
- Disaster Recovery & Business Continuity:
- Collaborate with the team on disaster recovery planning and ensure systems have proper backup, failover, and recovery procedures in place.
- Lead efforts in capacity planning and scaling systems to meet growing traffic and data requirements while ensuring minimal impact on performance.
- Security & Compliance:
- Ensure that all production systems are secure and comply with industry standards and regulations related to data security, privacy, and financial compliance.
- Work with security teams to address vulnerabilities and implement security best practices in application and infrastructure management.
- Continuous Improvement & Documentation:
- Contribute to the continuous improvement of processes, systems, and tools for better performance and reliability.
- Maintain detailed documentation of systems, incidents, operational procedures, and troubleshooting steps to improve knowledge sharing and support scalability.
- Proven experience in site reliability engineering or production support within fintech or a similarly high-demand industry.
- Strong experience with Docker and Kubernetes for container orchestration, scaling, and management.
- Unix/Linux experience (system administration, shell scripting, troubleshooting, performance tuning) is mandatory.
- Hands-on experience with incident management and production support, including using incident response tools (e.g., PagerDuty, Opsgenie) and root cause analysis.
- Solid knowledge of cloud platforms (AWS, GCP, Azure) and experience managing cloud-native applications and infrastructure.
- Experience with application monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic) to ensure system reliability.
- Experience with CI/CD pipelines and infrastructure automation tools (e.g., Terraform, Ansible, Jenkins, GitLab CI).
- Excellent problem-solving and troubleshooting skills, with the ability to diagnose and resolve complex production issues.
- Experience with disaster recovery, backup strategies, and high availability architectures for critical systems.
- Strong communication skills with the ability to work collaboratively across teams, including developers, operations, and business stakeholders.
- Knowledge of financial services regulations, compliance, and security best practices is a plus.
- Experience with monitoring and alerting solutions in high-volume environments (e.g., Prometheus, ELK Stack).
- Familiarity with microservices architectures and an understanding of how to manage and scale large distributed systems.
- Exposure to automated testing and performance benchmarking tools for infrastructure and applications.
- Experience with logging and log management tools (e.g., ELK, Splunk).
- Familiarity with networking concepts and troubleshooting in distributed systems.