Our client is currently looking for a Site Reliability Engineer.
Responsibilities
- Design, deploy, and operate reliable and scalable systems across cloud and Kubernetes environments.
- Automate infrastructure provisioning, deployments, and operational workflows.
- Build and maintain tools for deployment, monitoring, and system operations.
- Monitor system health and performance, and proactively identify areas for improvement.
- Troubleshoot and resolve issues across development, test, and production environments.
- Participate in incident response, root cause analysis, and reliability improvements.
- Collaborate with engineering teams to improve system operability and deployment safety.
- Support and operate large-scale systems, including data-intensive or AI-driven workloads.
Requirements
- 2 to 6 years of experience managing and operating production infrastructure and services in cloud environments such as AWS, Azure, or GCP.
- Strong hands-on experience with Linux systems in production environments.
- Experience working with containerized workloads and Kubernetes in real-world scenarios.
- Working knowledge of Infrastructure as Code tools such as Terraform, Terragrunt, or Crossplane.
- Experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or similar.
- Familiarity with GitOps principles and tools such as Argo CD or Flux.
- Solid understanding of cloud networking concepts, load balancing, and service connectivity.
- Experience with monitoring, logging, and alerting systems such as Prometheus, Grafana, ELK/EFK, Datadog, or equivalent.
- Proficiency in at least one scripting or programming language (e.g., Bash, Python).
- Experience working with relational databases; exposure to NoSQL or data platforms is a plus.
- Experience participating in on-call rotations, responding to production incidents, and performing root cause analysis.
- Understanding of SLIs, SLOs, and error budgets, and how they are used to guide reliability and operational decisions.
- Strong problem-solving skills and the ability to debug complex production issues.
- Good verbal and written communication skills, especially during incidents and technical discussions.
Nice to Have
- Experience operating systems at scale or in high-availability environments.
- Exposure to on-prem or hybrid infrastructure.
- Experience supporting data platforms, analytics, or AI/ML workloads.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!
#AlbionarcJobs #FintechJobs #AsiaJobs #MiddleEastCareers #TechTalent #FintechRecruitment #FinanceOpportunities
