Our client is currently looking for an MLOps Engineer – Generative AI
II. CORE RESPONSIBILITIES :
– Architect self-hosted inference clusters using vLLM, TGI (Text Generation Inference), and TensorRT-LLM on on-premise NVIDIA DGX systems and GPU racks, ensuring sub-100ms latency for 70B+ parameter models.
– Design parallel workflows on AWS SageMaker (Endpoints/Pipelines), Google Vertex AI (Prediction/Training), and Azure ML for elastic training workloads and managed foundation model APIs.
– Implement cloud-agnostic model deployment using Kubernetes (EKS/GKE/AKS) with portability across private data centers and cloud VPCs, ensuring zero vendor lock-in.
– Deploy multi-GPU inference parallelism (tensor + pipeline parallelism) for foundation models using Ray Serve, NVIDIA Triton, and custom FastAPI stacks.
– Optimize inference economics through quantization (AWQ/GPTQ/FP8), KV-cache optimization, and continuous batching – reducing per-token costs by 40%+.
– Build auto-scaling GPU node pools (Karpenter/Cluster Autoscaler) that respond to inference demand spikes within seconds.
– Implement RLHF (Reinforcement Learning from Human Feedback) infrastructure using DeepSpeed, LoRA/QLoRA fine-tuning pipelines, and distributed training orchestration.
– Design evaluation frameworks for LLMs : automated benchmarking (MMLU, HumanEval), A/B testing for model versions, and human-in-the-loop feedback systems.
– Manage vector database infrastructure (Pinecone, Weaviate, Milvus, pgvector) for RAG systems spanning private and cloud environments.
– Build CI/CD for ML using GitOps (ArgoCD/Flux) with model versioning (MLflow/DVC), automated testing for data drift, and canary deployments for model updates.
– Implement feature stores (Feast/Tecton) and experiment tracking (Weights & Biases/MLflow) supporting both cloud and on-premise data lakes.
– Create observability stacks for LLMs : token-level latency tracking, GPU memory saturation alerts, and cost-per-inference dashboards using Prometheus/Grafana/CloudWatch.
– Manage secrets, model encryption at rest (HashiCorp Vault), and network policies (Istio/Linkerd) for multi-tenant model serving.
III. ESSENTIAL QUALIFICATIONS & EXPERIENCE :
Educational Qualifications :
– Bachelor’s degree (B.E./B.Tech) in Computer Science, Engineering, Mathematics, or a related technical field from a recognized university. Graduates from IITs, NITs, BITS, IIITs, or other top-tier engineering institutions preferred.
– Master’s degree (M.Tech/MS) in Machine Learning, Computer Science, Artificial Intelligence, or related field desirable but not mandatory.
– Relevant professional certifications in cloud platforms (AWS/Azure/GCP) and Kubernetes (CKA/CKAD) highly desirable.
Experience Requirements :
– 5 to 9 years of hands-on experience in production ML infrastructure engineering, with at least 2 years dedicated to large-scale model deployment and MLOps.
– Demonstrable track record of deploying and maintaining 70B+ parameter models in production environments is preferred.
– Proven experience managing both on-premise GPU clusters (NVIDIA DGX, A100/H100) and cloud-based ML platforms (AWS SageMaker, Google Vertex AI, or Azure ML).
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!
#AlbionarcJobs #FintechJobs
#AsiaJobs #MiddleEastCareers
#TechTalent #FintechRecruitment
#FinanceOpportunities
