DevOps IT Infrastructure Engineer
Job Summary
Our client is a young high-tech company incorporated in the heart of one of the world's fastest-growing tech hubs: Dubai, UAE.
As the exclusive software partner to one of the world's largest ODMs in the networking equipment space, they develop the Network Operating Systems that power critical data centre and telecom routing & switching infrastructure. Building on this foundation, they have recently launched an AI division focused on designing their own chips to accelerate inference and training workloads.
What sets them apart is their unique position at the centre of a historic development: their ODM partner is establishing the first networking equipment factory of its kind in the GCC region, and they are the software engine driving this groundbreaking initiative.
They are not just building technology; they are building a true networking vendor that serves regional interests while meeting the growing demand for networking equipment across the MENA region and beyond.
Their long-term vision extends beyond products to people: creating a thriving ecosystem for embedded systems and ASIC design talent that will produce generations of world-class professionals, establishing the region as a global centre of excellence for Enterprise Compute innovation.
As a rapidly growing company at the forefront of AI hardware innovation, they are constantly seeking talented and motivated individuals to join their team. They offer a dynamic and challenging work environment with opportunities to make a significant impact on the future of AI technology.
Your Mission
Own the end-to-end design and operation of our on-premise infrastructure for AI and enterprise workloads: built as code, automated, observable, and secure. You will architect and run Kubernetes clusters for training/inference, manage servers, networks, and core services, and enable developers with reliable CI/CD and platform tooling. This is where minutes matter: time-to-recovery and cost-per-job directly impact AI velocity at scale.
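To make the cost-per-job lever concrete, below is a minimal sketch in Python, with purely hypothetical rates and an assumed overhead fraction (none of these figures come from the posting), of how a fully loaded per-job cost could be estimated:

```python
# Illustrative only: all rates and the overhead fraction below are
# hypothetical placeholders, not actual figures from the company.

def cost_per_job(gpu_hours: float, gpu_hour_rate: float,
                 overhead_rate: float = 0.15) -> float:
    """Estimate the fully loaded cost of one training/inference job.

    gpu_hours     -- GPU-hours the job consumed
    gpu_hour_rate -- amortized on-prem cost per GPU-hour (capex, power, cooling)
    overhead_rate -- fractional uplift for storage, network, and ops toil
    """
    return gpu_hours * gpu_hour_rate * (1 + overhead_rate)

if __name__ == "__main__":
    # A hypothetical 8-GPU job running for 6 hours at $2.10 per GPU-hour.
    print(f"cost per job: ${cost_per_job(8 * 6, 2.10):.2f}")
```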
Responsibilities
Design and operate on-prem infrastructure as code: author reusable Terraform/Ansible/Helm modules; build GitOps workflows (e.g. Argo CD) for repeatable, audited changes across environments.
Build and run Kubernetes for AI: configure multi-tenant GPU clusters (MIG, GPUDirect RDMA, NVIDIA device plugins/DCGM), scheduling/quotas, HPA/Cluster Autoscaler (where applicable), and workload isolation (see the sketch after this list).
Administer servers, networks, and core services: OS lifecycle (Linux), identity/SSO (Keycloak/LDAP), secrets (Vault), DNS/DHCP/NTP, artifact registries, and internal package mirrors.
Provide storage for AI pipelines: integrate and operate high-bandwidth/low-latency storage; tune for dataset staging and checkpointing patterns.
Enable CI/CD: partner with developers to design fast, reproducible pipelines (GitLab CI/GitHub Actions), caching and runners on GPU/CPU nodes, and artifact provenance (SBOM, SLSA).
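As referenced in the Kubernetes bullet above, here is a minimal sketch of a single-GPU pod manifest, requesting the extended resource that the NVIDIA device plugin exposes; the namespace, labels, and container image are hypothetical placeholders:

```python
import json

# A minimal sketch of a Kubernetes pod manifest requesting one GPU via
# the NVIDIA device plugin's extended resource (nvidia.com/gpu).
# The namespace, labels, and image below are hypothetical placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "train-smoke-test",
        "namespace": "ml-tenant-a",  # per-tenant namespace isolation
        "labels": {"team": "ml-tenant-a"},
    },
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # example image
            "command": ["nvidia-smi"],
            "resources": {
                # Extended resource exposed by the NVIDIA device plugin;
                # with MIG enabled, a profile-specific resource name
                # (e.g. nvidia.com/mig-1g.10gb) would be used instead.
                "limits": {"nvidia.com/gpu": 1},
            },
        }],
    },
}

# kubectl accepts JSON as well as YAML: kubectl apply -f pod.json
print(json.dumps(pod, indent=2))
```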
You'll Collaborate With
Platform and ML engineers running training/inference at scale, silicon and systems teams integrating hardware in the lab, security engineers safeguarding credentials and the supply chain, application developers delivering services via CI/CD, and site ops supporting data centre deployments; together, we turn infrastructure into a product that accelerates the business.
Minimum Qualifications
5 years in DevOps/SRE/Platform Engineering with hands-on ownership of on-prem environments.
Proven experience operating Kubernetes in production (multi-tenant RBAC, networking/CNI, storage, ingress, monitoring).
Proficiency with IaC and automation (Terraform, Ansible, Helm; GitOps with Argo CD/Flux).
Strong Linux administration, scripting (Bash/Python), and troubleshooting across the stack (compute, network, storage).
CI/CD expertise (GitLab CI/GitHub Actions), container build security (SBOM, image signing), and artifact management (a reproducibility sketch follows this list).
Solid networking fundamentals (L2/L3 routing, BGP, VLANs, EVPN/VXLAN, load balancing, TLS/mTLS).
Experience implementing observability (Prometheus/Grafana, logs, tracing) and running incident response.
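As a small illustration of the reproducibility theme in the CI/CD bullet above, this sketch derives a deterministic cache key from a content hash of a dependency lockfile (the requirements.lock file name is a hypothetical example), so runners rebuild only when dependencies actually change:

```python
import hashlib
import pathlib

def cache_key(lockfile: str = "requirements.lock") -> str:
    """Derive a CI cache key from the lockfile's content hash."""
    digest = hashlib.sha256(pathlib.Path(lockfile).read_bytes()).hexdigest()
    # A truncated digest keeps keys readable; collisions stay negligible.
    return f"deps-{digest[:16]}"

if __name__ == "__main__":
    print(cache_key())
```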
Preferred (Nice-to-Haves)
GPU cluster operations for AI (NVIDIA drivers/operator, DCGM, MIG, GPUDirect RDMA, Slurm integration).
Storage for data-intensive workloads (Ceph, parallel filesystems, NVMe-oF) and performance tuning.
Secrets/identity platforms (Vault, Keycloak/LDAP/SSO), policy-as-code (OPA/Gatekeeper, Kyverno).
Security/compliance practices (CIS benchmarks, SLSA, supply-chain scanning) and zero-trust networking.
Data centre experience (rack/stack, power/cooling basics) and remote site rollout automation.
Familiarity with configuration management for network devices and API-driven switches/routers.
What Success Looks Like
Reproducible environments by default: any engineer can spin up an identical dev/test stack (K8s namespace, storage, secrets, runners) from Git in under 30 minutes, with audit trails for every change.
Solid CI/CD for AI workflows: model/build/test pipelines are deterministic and cache-efficient; median pipeline time down 30–50%, with artifact provenance (SBOM, signatures) and traceable datasets/checkpoints.
Predictable GPU orchestration: fair-share scheduling, quotas, and isolation (MIG/namespace policies) keep queues short; cluster utilization increases >20% without starving latency-sensitive jobs.
Lab-to-cluster continuity: hardware bring-up images, drivers, and firmware are versioned and promoted through the same pipelines; new boards/nodes join clusters with push-button automation.
Actionable observability: dashboards and alerts reflect SLOs meaningful to researchers (throughput, time-to-first-token, I/O wait, GPU memory pressure); MTTR <30 minutes for priority services (a measurement sketch follows this list).
Cost & toil reduction: infra tasks automated to eliminate recurring manual work; fewer custom one-offs, more reusable modules; quarterly infra spend per GPU-hour trends down.
Clear docs & self-service: engineers rely on concise runbooks and service catalogs; >80% of routine requests resolved via self-service workflows rather than ad-hoc ops support.
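As a companion to the MTTR target above, here is a minimal sketch, with hypothetical incident records and service names, of how MTTR could be measured and checked against the 30-minute goal:

```python
from datetime import datetime, timedelta

# Hypothetical incident records; real data would come from the
# incident-tracking system.
incidents = [
    {"service": "inference-gateway",
     "opened":   datetime(2024, 5, 1, 9, 12),
     "resolved": datetime(2024, 5, 1, 9, 31)},
    {"service": "gpu-scheduler",
     "opened":   datetime(2024, 5, 3, 14, 2),
     "resolved": datetime(2024, 5, 3, 14, 40)},
]

# Mean time to recovery: average of (resolved - opened) across incidents.
durations = [i["resolved"] - i["opened"] for i in incidents]
mttr = sum(durations, timedelta()) / len(durations)

# The success criterion above targets MTTR under 30 minutes.
print(f"MTTR: {mttr}", "OK" if mttr <= timedelta(minutes=30) else "over target")
```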
Please note that the client can obtain work visas for Dubai.
Please ignore the salary level; there is flexibility depending on the person's profile.
Kindly complete the attached questionnaire to complete your application. Experience:
IC
About the Company
The Complete Global Talent Acquisition Partner, MBR offers solutions from Executive Search through to highly cost-effective Partnership Resourcing models.