Site Reliability Engineer

Remote

Date posted: May 11, 2026
Expiration date: August 11, 2026
Application ends: August 11, 2026

Our client is currently looking for a Site Reliability Engineer.

Responsibilities

Design, deploy, and operate reliable and scalable systems across cloud and Kubernetes environments.

Automate infrastructure provisioning, deployments, and operational workflows.

Build and maintain tools for deployment, monitoring, and system operations.

Monitor system health and performance, and proactively identify areas for improvement.

Troubleshoot and resolve issues across development, test, and production environments.

Participate in incident response, root cause analysis, and reliability improvements.

Collaborate with engineering teams to improve system operability and deployment safety.

Support and operate large-scale systems, including data-intensive or AI-driven workloads.

Requirements

2 to 6 years of experience managing and operating production infrastructure and services in cloud environments such as AWS, Azure, or GCP.

Strong hands-on experience with Linux systems in production environments.

Experience working with containerized workloads and Kubernetes in real-world scenarios.

Working knowledge of Infrastructure as Code tools such as Terraform, Terragrunt, or Crossplane.

Experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or similar.

Familiarity with GitOps principles and tools such as Argo CD or Flux.

Solid understanding of cloud networking concepts, load balancing, and service connectivity.

Experience with monitoring, logging, and alerting systems such as Prometheus, Grafana, ELK/EFK, Datadog, or equivalent.

Proficiency in at least one scripting or programming language (e.g., Bash, Python).

Experience working with relational databases; exposure to NoSQL or data platforms is a plus.

Experience participating in on-call rotations, responding to production incidents, and performing root cause analysis.

Understanding of SLIs, SLOs, and error budgets, and how they are used to guide reliability and operational decisions.

Strong problem-solving skills and the ability to debug complex production issues.

Good verbal and written communication skills, especially during incidents and technical discussions.

Nice to Have

Experience operating systems at scale or in high-availability environments.

Exposure to on-prem or hybrid infrastructure.

Experience supporting data platforms, analytics, or AI/ML workloads.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!

Apply Now
