As a Senior Site Reliability Engineer (SRE), you’ll combine your passion for software development and reliability engineering, applying engineering discipline to solve operational challenges at scale. You’ll work closely with development teams as a trusted advisor, influencing system design, establishing reliability standards, and driving quality improvements across the platform. Your role will shift between hands-on coding (building tools, automation, and infrastructure) and incident response, performance optimisation, and operational excellence.
What You’ll Do
System Reliability & Performance
- Implement comprehensive monitoring and observability using OpenTelemetry standards
- Identify single points of failure in distributed systems
- Analyse system performance across OS and network layers, identifying resource utilisation patterns and bottlenecks to optimise efficiency
- Define and maintain Service Level Objectives (SLOs) for critical trading services
Technical Leadership
- Partner with development teams on system design, capacity planning, and architectural reviews
- Provide technical guidance and hands-on support to help development teams move their applications from traditional deployment models to containerised infrastructure
- Lead incident response efforts and conduct blameless postmortems
Infrastructure & Messaging
- Optimise message-driven systems by ensuring reliable event streaming and asynchronous communication patterns
- Scale systems through automation and infrastructure-as-code practices
Software Development Fundamentals
- Write clean, maintainable code using established design patterns
- Apply software engineering best practices, including version control, code reviews, and testing strategies
What You’ll Need for This Role
Essential Technical Skills
- Strong Java development experience with a deep understanding of JVM internals and performance tuning
- Hands-on expertise with message brokers (ActiveMQ, Kafka or similar) in production environments
- Proven experience with containerisation and orchestration (experience with Nomad would be an advantage)
- Practical knowledge of OpenTelemetry and distributed tracing concepts
- Solid understanding of reliability patterns (such as circuit breakers) and fault tolerance
Experience Requirements
- Experience in high-throughput, low-latency production environments
- Track record of improving system reliability and performance at scale
- Experience with continuous delivery and DevOps practices
- Strong troubleshooting skills in distributed systems
- Background in financial services or similar mission-critical domains (preferred)
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!
#AlbionarcJobs #FintechJobs #AsiaJobs #MiddleEastCareers #TechTalent #FintechRecruitment #FinanceOpportunities
