Sorry, this position is no longer available. You are still welcome to submit your application, as we may have other positions that would be a good fit for you. Alternatively, you may want to apply to one of the following related jobs:

Compute Platform Engineer

Dallas, TX

Posted: 03/26/2026
Job Number: 27985
Pay Rate: $ - $230000

Job Description


Location: Dallas, TX (Hybrid), relocation available

Type: Direct Hire

•Competitive base salary + performance bonus
•100% company-paid benefits
Overview
We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of large-scale, high-performance compute infrastructure supporting critical research and production workloads.

This role is responsible for maintaining and troubleshooting CPU and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure, operations teams, and hardware vendors to support a stable and highly available compute ecosystem.

The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.
Key Responsibilities
Compute Infrastructure Engineering
•Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
•Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
•Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments
Hardware Reliability & Lifecycle Management
•Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
•Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, and BMC components
•Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
•Analyze hardware lifecycle processes and recommend improvements for optimization and efficiency
Automation & Platform Operations
•Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
•Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
•Recommend and implement tooling and process improvements to enhance platform operations
Vendor & Cross-Functional Collaboration
•Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
•Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
•Support integration of hardware improvements across the broader environment
Monitoring, Performance & Security
•Monitor hardware performance and identify opportunities for optimization
•Implement best practices for platform security and system hardening
•Ensure adherence to operational standards and data center processes
Technical Leadership
•Act as a subject matter expert for compute infrastructure and hardware-related issues
•Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence
Required Experience
•3+ years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
•Strong experience with HPE server platforms such as ProLiant and Apollo
•Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
•Solid understanding of server architecture including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
•Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
•Experience with Linux in high-performance or latency-sensitive environments
•Familiarity with core networking concepts including DNS, DHCP, VLANs, switching, and routing
•Experience working within data center environments and operational processes
Technical Skills
•Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
•Exposure to Infrastructure-as-Code (IaC) practices
•Working knowledge of Kubernetes and/or OpenStack (preferred)
•Strong problem-solving and analytical skills with the ability to operate in complex environments
Preferred Experience
•Experience supporting AI platforms or next-generation GPU architectures
•Exposure to large-scale distributed compute environments
•Experience working in mission-critical or high-availability infrastructure environments

About Dallas, TX

Unlock your potential in the vibrant job market of the Dallas-Fort Worth metroplex! This bustling region in the great state of Texas boasts a perfect blend of southern charm and big-city opportunities. Dive into a dynamic career scene with access to renowned landmarks like the Dallas Arboretum and Botanical Garden, exquisite cuisine from Tex-Mex to BBQ, and cultural hotspots such as the Dallas Museum of Art and the AT&T Performing Arts Center. Cheer for the Dallas Cowboys at AT&T Stadium or enjoy the outdoors at White Rock Lake. Discover why Dallas is the ultimate destination for growth, opportunity, and a fulfilling career journey. Explore our job listings today and embark on a new chapter in this captivating city!