
Engineering Manager, HPC Kubernetes Platform

Dallas, TX

Posted: 03/27/2026 Employment Type: Direct Hire Job Number: 28014

Job Description

Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)


Type: Direct Hire

  • Competitive base salary + performance bonus
  • 100% company-paid benefits

Overview
We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.

This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.

You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.

This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.

Key Responsibilities
Leadership & Team Development
  • Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
  • Foster a culture of ownership, reliability, and continuous improvement
  • Drive alignment across platform, infrastructure, and product teams

Platform Architecture – GPUaaS / CaaS
  • Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
  • Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
  • Define service models for GPU consumption, including workload orchestration, tenancy, and quota management

GPU Platform & Workload Optimization
  • Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
  • Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
  • Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)

Automation, SRE & Platform Operations
  • Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
  • Implement SRE principles across observability, reliability, and incident response
  • Build automated workflows for cluster provisioning, scaling, and lifecycle management

Performance, Reliability & Capacity Planning
  • Own platform performance across thousands of GPU/CPU nodes
  • Define and track KPIs for utilization, latency, throughput, and system health
  • Lead capacity planning aligned with rapid AI compute demand growth

Cross-Functional & Ecosystem Collaboration
  • Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
  • Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
  • Align platform architecture with evolving AI infrastructure and GPUaaS service offerings

Required Experience
  • 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
  • Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
  • Deep understanding of GPU scheduling, workload orchestration, and resource isolation
  • Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
  • Experience managing large-scale, distributed compute environments
  • Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
  • Excellent leadership and communication skills

Preferred Experience
  • Experience with NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
  • Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
  • Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
  • Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
  • Experience operating in hyperscale or AI-first infrastructure environments

Why This Role
  • Direct ownership of a GPUaaS / CaaS platform at scale
  • Work at the forefront of AI infrastructure and high-performance compute
  • Opportunity to define how GPU compute is delivered as a service in next-generation environments
  • High visibility role with impact across platform, product, and customer experience
Apply Online

