Responsibilities:
Operate and manage Kubernetes or OpenShift clusters for multi-node orchestration
Deploy and manage LLMs and other AI models for inference using Triton Inference Server or custom endpoints
Automate CI/CD pipelines for model packaging, serving, retraining, and rollback using GitLab CI or ArgoCD
Set up model and infrastructure monitoring systems (Prometheus, Grafana, NVIDIA DCGM)
Implement model drift detection, performance alerting, and inference logging
Manage model checkpoints, reproducibility controls, and rollback strategies
Track deployed model versions using MLflow or an equivalent registry tool
Implement secure access controls for model endpoints and data artifacts
Collaborate with AI / Data Engineers to integrate and deploy fine-tuned models and datasets
Ensure high availability, performance, and observability of all AI services in production
Requirements:
3 years' experience in DevOps, MLOps, or AI/ML infrastructure roles
10 years' overall experience with solution operations
Proven experience with Kubernetes or OpenShift in production environments; certification preferred
Familiarity with deploying and scaling PyTorch or TensorFlow models for inference
Experience with CI/CD automation tools on OpenShift / Kubernetes
Hands-on experience with model registry systems (e.g., MLflow, Kubeflow)
Experience with monitoring tools (e.g., Prometheus, Grafana) and GPU workload optimization
Strong scripting skills (Python, Bash) and Linux system administration knowledge
Full Time