Search Open Jobs
All GTN W2 consultants get full benefits. Learn more.
So sorry, this position is no longer available.
Please go ahead and submit your application. We may have other positions that would be the perfect fit for you.
Alternatively, you may want to apply to one of the following related jobs:
HPC Performance and Validation Engineer
Posted: 03/27/2026
Employment Type:
Direct Hire
Job Number: 28022
Pay Rate: $ - $
Remote Friendly?:
Job Description
HPC Performance and Validation Engineer
Location: Dallas, TX
Overview
This organization is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance high-performance computing (HPC) and cloud infrastructure that supports clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
As an HPC Validation and Performance Engineer, you will take ownership of the validation and optimization of HPC CPU and GPU calc farms. This critical role will involve developing a validation and performance baselining framework, which ensures system readiness for AI/ML and HPC workloads across multiple architectures. Your role will be essential in providing continuous performance benchmarking, real-time observability, and long-term strategic readiness. You will drive the implementation of advanced tooling and frameworks, maintaining an infrastructure that is crucial to cutting-edge research efforts. You will be accountable for providing data driven performance metrics to support architectural design choices as the organization continues to globally scale its datacenter footprint.
We are looking for someone with deep technical expertise in compute, storage or networking optimizations and performance engineering who can develop solutions that scale with growing infrastructure. This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to stay at the forefront of innovation.
Key Responsibilities
Validation & Performance Framework
- Architect and implement a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment.
- Define methodologies to continually assess performance and optimising infrastructure across AI/ML workloads.
- Develop and execute comprehensive performance testing using industry and customer specific benchmarks, ensuring optimal performance across HPC compute, storage, and networking.
- Define and implement best practice for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge.
Benchmarking & Research
- Contribute to research reports that will describe the discoveries of the benchmarking, evaluating the complete hardware performance and efficiency.
- Lead efforts to debug, identify, and then resolve bottlenecks in system performance.
- Staying informed on industry trends and advancements to ensure long-term strategic alignment.
Tooling & Automation
- Build robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes, and CI/CD pipelines to streamline continuous validation and benchmarking processes.
- Implement monitoring solutions using Prometheus, Grafana, and other modern monitoring technologies to track performance metrics and real-time health of the cluster.
Collaboration
- Work cross-functionally with engineering, infrastructure, and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands.
Required Experience
- Bachelor's degree or equivalent experience.
- Accelerator performance experience, including profiling and tuning with large-scale GPU clusters.
- In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf, and DCGM tools for GPU and DPUs.
- Networking & Storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf, or equivalent across InfiniBand/RoCE network implementations.
- System benchmarking experience across Linux and familiarity with the Phronix suite or equivalent.
- Experience with HPC workloads across distributed global locations, bringing data driven performance data to compliment key architectural decisions.
- Strong proficiency in developing automation tools and micro benchmarking frameworks for validation using Python, Go, and Kubernetes in a Ubuntu Linux environment.
- Expertise with key monitoring platforms including OTEL, Prometheus, ELK, and Grafana and in definition and implementing the overall observability strategy for HPC validation and performance monitoring.
- A deep understanding of emerging technologies, architectures, and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan.
- Proven ability to lead complex technical projects, influence decisions, and engage with stakeholders across technical and research teams.
Share This Job:
Related Jobs:
Login to save this search and get notified of similar positions.About Dallas, TX
Unlock your potential in the vibrant job market of the Dallas-Fort Worth metroplex! This bustling region in the great state of Texas boasts a perfect blend of southern charm and big-city opportunities. Dive into a dynamic career scene with access to renowned landmarks like the Dallas Arboretum and Botanical Garden, exquisite cuisine from Tex-Mex to BBQ, and cultural hotspots such as the Dallas Museum of Art and the AT&T Performing Arts Center. Cheer for the Dallas Cowboys at the AT&T Stadium or enjoy the outdoors at White Rock Lake. Discover why Dallas is the ultimate destination for growth, opportunity, and a fulfilling career journey. Explore our job listings today and embark on a new chapter in this captivating city!
