Our client is currently looking for a DataOps Engineer (AI Platform Engineer)
What you’ll actually do
- Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
- Perform and manage GPU MIG (Multi-Instance GPU) configurations based on workload requirements and model characteristics.
- Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
- Design, deploy, and maintain model serving runtimes, including vLLM, ONNX Runtime, SGLang, NVIDIA Triton Inference Server, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
- Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
- Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow; see the first sketch after this list).
- Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
- Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
- Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
- Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency (see the second sketch after this list).
- Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
- Ensure compliance with security requirements for platform development.
- Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.
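To give a flavor of the model lifecycle work above, here is a minimal sketch of logging and registering a model version with MLflow. The tracking URI, model name, parameter, and metric are hypothetical placeholders, not details from this posting, and the trivial model class only stands in for a real packaged artifact.

```python
# Minimal sketch: log a run and register a model version in the MLflow registry.
# All names and values below are illustrative assumptions.
import mlflow
from mlflow.pyfunc import PythonModel


class EchoModel(PythonModel):
    """Trivial stand-in model; a real pipeline would package an actual artifact."""

    def predict(self, context, model_input):
        return model_input


mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical tracking server

with mlflow.start_run() as run:
    mlflow.log_param("base_model", "example-7b")  # illustrative parameter
    mlflow.log_metric("eval_loss", 0.42)          # illustrative metric
    # Log the model artifact under the "model" path within this run.
    mlflow.pyfunc.log_model(artifact_path="model", python_model=EchoModel())

# Promote the logged artifact to a named, versioned entry in the model registry.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="example-serving-model",
)
```

In a platform setting, a CI/CD pipeline would typically run this kind of registration step after packaging, so downstream serving runtimes can pull models by registry name and version rather than by ad-hoc paths.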
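For the GPU observability work, the sketch below shows one way to expose per-GPU utilization and memory metrics using pynvml and prometheus_client. In practice, tooling such as NVIDIA's DCGM exporter usually covers this; the port, metric names, and scrape interval here are assumptions for illustration only.

```python
# Minimal sketch of a GPU metrics exporter built on pynvml and prometheus_client.
# Metric names, labels, port, and interval are illustrative assumptions.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU core utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])


def collect() -> None:
    # Read utilization and memory usage for each visible GPU via NVML.
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        GPU_UTIL.labels(gpu=str(i)).set(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        GPU_MEM.labels(gpu=str(i)).set(pynvml.nvmlDeviceGetMemoryInfo(handle).used)


if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # arbitrary port for Prometheus to scrape
    while True:
        collect()
        time.sleep(15)
```

Metrics like these are what throughput and latency dashboards, alerting, and the multi-node performance testing described above would build on.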
Who We’re Looking For
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
- 5+ years of experience in infrastructure, platform engineering, or distributed systems, preferably in environments involving machine learning or GPU workloads
- Strong experience with Kubernetes, including deploying and operating production workloads
- Experience with Linux-based environments
- Strong programming skills in Python and/or Go
- Experience with GPU infrastructure, including the NVIDIA or AMD stack and multi-GPU environments, is considered highly advantageous
- Understanding of distributed systems and multi-node workloads
- Experience with model serving and inference systems (e.g. vLLM, ONNX Runtime, SGLang, NVIDIA Triton Inference Server, KServe)
- Experience with CI/CD pipelines and automation for deploying services or models
- Experience with monitoring and observability tools (metrics, tracing, logging)
- Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking)
- Good communication and problem-solving skills
- Ability to use advanced English across a range of work and business contexts
- Critical thinking and attention to detail
- Decision-making skills and the ability to adapt to change
- Ability to write concise and clear documentation
- Ability to take constructive criticism and build relationships within the team to achieve common goals
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!
#AlbionarcJobs #FintechJobs #AsiaJobs #MiddleEastCareers #TechTalent #FintechRecruitment #FinanceOpportunities
