DataOps Engineer (AI Platform Engineer)



  • Date posted: April 21, 2026
  • Application deadline: July 21, 2026

Our client is currently looking for a DataOps Engineer (AI Platform Engineer).

What you’ll actually do

  • Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
  • Configure and manage GPU MIG (Multi-Instance GPU) partitions based on workload requirements and model characteristics.
  • Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
  • Design, deploy, and maintain model serving runtimes, including vLLM, ONNX Runtime, SGLang, NVIDIA Triton, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
  • Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
  • Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
  • Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
  • Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
  • Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
  • Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency (illustrated in the sketch after this list).
  • Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
  • Ensure compliance with security requirements for platform development.
  • Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.
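
For a sense of the observability work described above, here is a minimal, illustrative Python sketch of polling per-GPU utilization from a Prometheus server that scrapes NVIDIA's DCGM exporter. The endpoint URL is a hypothetical placeholder, and the metric name DCGM_FI_DEV_GPU_UTIL assumes a standard DCGM exporter deployment; neither detail comes from this posting.

    # Illustrative sketch only: poll per-GPU utilization from Prometheus.
    # Assumes a DCGM exporter is being scraped; the URL below is hypothetical.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
    QUERY = "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"  # standard DCGM exporter metric

    def gpu_utilization():
        """Return average utilization (%) keyed by GPU index."""
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": QUERY},
            timeout=10,
        )
        resp.raise_for_status()
        payload = resp.json()
        if payload.get("status") != "success":
            raise RuntimeError(f"Prometheus query failed: {payload}")
        # Each result carries its label set and a [timestamp, value] pair.
        return {
            item["metric"].get("gpu", "unknown"): float(item["value"][1])
            for item in payload["data"]["result"]
        }

    if __name__ == "__main__":
        for gpu, util in sorted(gpu_utilization().items()):
            print(f"GPU {gpu}: {util:.1f}% utilized")

In practice the same query would typically feed a dashboard or alerting rule rather than an ad-hoc script; the sketch only shows the shape of the data involved.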

Who We’re Looking For

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
  • 5+ years of experience in infrastructure, platform engineering, or distributed systems, preferably in environments involving machine learning or GPU workloads
  • Strong experience with Kubernetes, including deploying and operating production workloads
  • Experience with Linux-based environments
  • Strong programming skills in Python and/or Go
  • Experience with GPU infrastructure, including the NVIDIA or AMD stack and multi-GPU environments, is considered highly advantageous
  • Understanding of distributed systems and multi-node workloads
  • Experience with model serving and inference systems (e.g. vLLM, ONNX Runtime, SGLang, NVIDIA Triton, KServe)
  • Experience with CI/CD pipelines and automation for deploying services or models
  • Experience with monitoring and observability tools (metrics, tracing, logging)
  • Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking)
  • Good communication and problem-solving skills
  • Ability to use advanced English for a range of work and business purposes
  • Critical thinking and attention to detail
  • Decision-making skills and the ability to adapt to change
  • Ability to write concise and clear documentation
  • Ability to accept constructive criticism and build relationships with the team to achieve common goals

Are you interested in this position? Apply by clicking on the “Apply Now” button below!

#AlbionarcJobs #FintechJobs #AsiaJobs #MiddleEastCareers #TechTalent #FintechRecruitment #FinanceOpportunities

Apply Now
