Site Reliability Engineer

  • Date posted
    May 11, 2026
  • Expiration date
    August 11, 2026
  • Application ends
    August 11, 2026

Our client is currently looking for a Site Reliability Engineer.

Responsibilities

  • Design, deploy, and operate reliable and scalable systems across cloud and Kubernetes environments.
  • Automate infrastructure provisioning, deployments, and operational workflows.
  • Build and maintain tools for deployment, monitoring, and system operations.
  • Monitor system health and performance, and proactively identify areas for improvement.
  • Troubleshoot and resolve issues across development, test, and production environments.
  • Participate in incident response, root cause analysis, and reliability improvements.
  • Collaborate with engineering teams to improve system operability and deployment safety.
  • Support and operate large-scale systems, including data-intensive or AI-driven workloads.

Requirements

  • 2 to 6 years of experience managing and operating production infrastructure and services in cloud environments such as AWS, Azure, or GCP.
  • Strong hands-on experience with Linux systems in production environments.
  • Experience working with containerized workloads and Kubernetes in real-world scenarios.
  • Working knowledge of Infrastructure as Code tools such as Terraform, Terragrunt, or Crossplane.
  • Experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or similar.
  • Familiarity with GitOps principles and tools such as Argo CD or Flux.
  • Solid understanding of cloud networking concepts, load balancing, and service connectivity.
  • Experience with monitoring, logging, and alerting systems such as Prometheus, Grafana, ELK/EFK, Datadog, or equivalent.
  • Proficiency in at least one scripting or programming language (e.g., Bash, Python).
  • Experience working with relational databases; exposure to NoSQL or data platforms is a plus.
  • Experience participating in on-call rotations, responding to production incidents, and performing root cause analysis.
  • Understanding of SLIs, SLOs, and error budgets, and how they are used to guide reliability and operational decisions.
  • Strong problem-solving skills and the ability to debug complex production issues.
  • Good verbal and written communication skills, especially during incidents and technical discussions.

Nice to Have

  • Experience operating systems at scale or in high-availability environments.
  • Exposure to on-prem or hybrid infrastructure.
  • Experience supporting data platforms, analytics, or AI/ML workloads.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!

Apply Now
