MLOps Engineer - Generative AI

Our client is currently looking for an MLOps Engineer – Generative AI.

II. CORE RESPONSIBILITIES :

– Architect self-hosted inference clusters using vLLM, TGI (Text Generation Inference), and TensorRT-LLM on on-premise NVIDIA DGX systems and GPU racks, ensuring sub-100ms latency for 70B+ parameter models.

– Design parallel workflows on AWS SageMaker (Endpoints/Pipelines), Google Vertex AI (Prediction/Training), and Azure ML for elastic training workloads and managed foundation model APIs.

– Implement cloud-agnostic model deployment using Kubernetes (EKS/GKE/AKS) with portability across private data centers and cloud VPCs, ensuring zero vendor lock-in.

– Deploy multi-GPU inference parallelism (tensor + pipeline parallelism) for foundation models using Ray Serve, NVIDIA Triton, and custom FastAPI stacks.

– Optimize inference economics through quantization (AWQ/GPTQ/FP8), KV-cache optimization, and continuous batching – reducing per-token costs by 40%+.

– Build auto-scaling GPU node pools (Karpenter/Cluster Autoscaler) that respond to inference demand spikes within seconds.

– Implement RLHF (Reinforcement Learning from Human Feedback) infrastructure using DeepSpeed, LoRA/QLoRA fine-tuning pipelines, and distributed training orchestration.

– Design evaluation frameworks for LLMs : automated benchmarking (MMLU, HumanEval), A/B testing for model versions, and human-in-the-loop feedback systems.

– Manage vector database infrastructure (Pinecone, Weaviate, Milvus, pgvector) for RAG systems spanning private and cloud environments.

– Build CI/CD for ML using GitOps (ArgoCD/Flux) with model versioning (MLflow/DVC), automated testing for data drift, and canary deployments for model updates.

– Implement feature stores (Feast/Tecton) and experiment tracking (Weights & Biases/MLflow) supporting both cloud and on-premise data lakes.

– Create observability stacks for LLMs : token-level latency tracking, GPU memory saturation alerts, and cost-per-inference dashboards using Prometheus/Grafana/CloudWatch.

– Manage secrets, model encryption at rest (HashiCorp Vault), and network policies (Istio/Linkerd) for multi-tenant model serving.

III. ESSENTIAL QUALIFICATIONS & EXPERIENCE :

Educational Qualifications :

– Bachelor’s degree (B.E./B.Tech) in Computer Science, Engineering, Mathematics, or a related technical field from a recognized university. Graduates from IITs, NITs, BITS, IIITs, or other top-tier engineering institutions preferred.

– Master’s degree (M.Tech/MS) in Machine Learning, Computer Science, Artificial Intelligence, or a related field is desirable but not mandatory.

– Relevant professional certifications in cloud platforms (AWS/Azure/GCP) and Kubernetes (CKA/CKAD) highly desirable.

Experience Requirements :

– 5 to 9 years of hands-on experience in production ML infrastructure engineering, with at least 2 years dedicated to large-scale model deployment and MLOps.

– A demonstrable track record of deploying and maintaining 70B+ parameter models in production environments is preferred.

– Proven experience managing both on-premise GPU clusters (NVIDIA DGX, A100/H100) and cloud-based ML platforms (AWS SageMaker, Google Vertex AI, or Azure ML).

