Engineering Manager, HPC Kubernetes Platform
Posted: 03/27/2026
Employment Type: Direct Hire
Job Number: 28014
Job Description
Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)
Type: Direct Hire
- Competitive base salary + performance bonus
- 100% company-paid benefits
Overview
We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.
This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.
You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.
This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.
Key Responsibilities
Leadership & Team Development
- Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
- Foster a culture of ownership, reliability, and continuous improvement
- Drive alignment across platform, infrastructure, and product teams
Platform Architecture – GPUaaS / CaaS
- Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
- Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
- Define service models for GPU consumption, including workload orchestration, tenancy, and quota management
GPU Platform & Workload Optimization
- Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
- Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
- Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)
Automation, SRE & Platform Operations
- Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
- Implement SRE principles across observability, reliability, and incident response
- Build automated workflows for cluster provisioning, scaling, and lifecycle management
Performance, Reliability & Capacity Planning
- Own platform performance across thousands of GPU/CPU nodes
- Define and track KPIs for utilization, latency, throughput, and system health
- Lead capacity planning aligned with rapid AI compute demand growth
Cross-Functional & Ecosystem Collaboration
- Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
- Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
- Align platform architecture with evolving AI infrastructure and GPUaaS service offerings
Required Experience
- 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
- Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
- Deep understanding of GPU scheduling, workload orchestration, and resource isolation
- Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
- Experience managing large-scale, distributed compute environments
- Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
- Excellent leadership and communication skills
Preferred Experience
- Experience with NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
- Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
- Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
- Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
- Experience operating in hyperscale or AI-first infrastructure environments
Why This Role
- Direct ownership of a GPUaaS / CaaS platform at scale
- Work at the forefront of AI infrastructure and high-performance compute
- Opportunity to define how GPU compute is delivered as a service in next-generation environments
- High visibility role with impact across platform, product, and customer experience
About Dallas, TX
Unlock your potential in the vibrant job market of the Dallas-Fort Worth metroplex! This bustling region in the great state of Texas boasts a perfect blend of southern charm and big-city opportunities. Dive into a dynamic career scene with access to renowned landmarks like the Dallas Arboretum and Botanical Garden, exquisite cuisine from Tex-Mex to BBQ, and cultural hotspots such as the Dallas Museum of Art and the AT&T Performing Arts Center. Cheer for the Dallas Cowboys at the AT&T Stadium or enjoy the outdoors at White Rock Lake. Discover why Dallas is the ultimate destination for growth, opportunity, and a fulfilling career journey. Explore our job listings today and embark on a new chapter in this captivating city!
