Senior SRE Engineer (Kubernetes)
McKinney, TX US
Job Description
Senior Site Reliability Engineer – Kubernetes
Employment Type: Full-Time, Onsite
Location: McKinney, TX
Role Overview
We’re seeking a seasoned Site Reliability Engineer with deep expertise in Kubernetes to lead the design, deployment, and ongoing operations of a scalable cloud-native platform. This role focuses on two key areas: ensuring the overall stability and observability of the system, and architecting a flexible traffic management layer within a multi-tenant SaaS environment.
Key Responsibilities
Cloud Platform Reliability & Scaling
- Operate and maintain highly available, multi-region Kubernetes environments powering real-time applications.
- Ensure 99.99%+ uptime across complex, latency-sensitive workloads.
- Design comprehensive observability solutions across infrastructure, networking, and application layers.
- Seamlessly integrate Kubernetes with underlying infrastructure built on OpenStack technologies.
- Evolve platform architecture to support growing demand while maintaining robust performance and security.
- Take primary ownership of service availability and performance targets.
- Develop fault-tolerant Kubernetes deployments spanning multiple availability zones and regions.
- Lead proactive scaling and capacity planning initiatives.
- Establish strong incident response procedures, including automation and root cause analysis workflows.
- Implement structured change control processes to minimize update-related disruptions.
- Conduct resilience testing and disaster recovery simulations regularly.
- Build and maintain a metrics-driven observability stack for system, network, and app-level insights.
- Set up alerting systems using golden signals and establish automated remediation paths.
- Maintain dashboards and tracing tools to surface issues quickly and inform root cause analysis.
- Define and enforce logging strategies and retention practices.
- Manage the full lifecycle of Kubernetes clusters: provisioning, upgrades, migrations, autoscaling, and hardening.
- Champion GitOps practices using tools such as ArgoCD, Flux, Helm, and Terraform.
- Lead incident investigations and platform stability efforts.
- Optimize resource usage and implement cost governance strategies within Kubernetes environments.
- Design a Kubernetes-native control system to enforce tenant-level policies around bandwidth and session limits.
- Use CRDs and custom controllers to support dynamic, usage-based enforcement logic.
- Extend Kubernetes policies for global fairness and tenant isolation.
- Leverage service mesh tools for routing, security, and traffic observability.
- Deploy and operate telemetry pipelines using agents such as Cilium Hubble, WireGuard exporters, and Prometheus.
- Feed flow data into OpenTelemetry systems and visualize metrics via Grafana.
- Create event-triggered remediation workflows using real-time metrics.
- Conduct chaos engineering to test system behavior under stress.
- Shape cloud-native infrastructure strategy and contribute to architectural decisions.
- Mentor peers and drive best practices in Kubernetes and SRE disciplines.
- Collaborate cross-functionally with product, security, and engineering teams.
- Contribute knowledge to the broader tech community through documentation, talks, or open-source engagement.
Required Skills & Experience
- Proven expertise managing high-availability Kubernetes platforms across multiple regions.
- Strong background in observability tools (Prometheus, Grafana, OpenTelemetry, etc.).
- Demonstrated ability to maintain uptime targets of 99.9% or higher for critical systems.
- Deep understanding of Kubernetes internals, including CRDs, controllers, and operator patterns.
- Hands-on experience integrating Kubernetes with OpenStack components (Nova, Neutron, Ceph).
- Knowledge of CNI technologies, ideally Cilium, and software-defined network enforcement.
- Linux networking fundamentals: tc, nftables, conntrack, iptables, WireGuard.
- Proficiency in Go, with scripting skills in Python or Bash.
- Experience with GitOps and infrastructure-as-code tools.
- Familiarity with overlay networking protocols and secure network architectures.
Preferred Qualifications
- Familiarity with service mesh technologies (Istio, Linkerd).
- Exposure to multi-cluster management tools (e.g., Cluster API, Fleet, Rancher).
- Experience running Kubernetes-based edge computing environments.
- Experience building platform abstractions for internal development teams.
- Implementation of chaos testing frameworks like Litmus or Chaos Mesh.
- Background in network function virtualization or SDN.
- Experience managing stateful workloads (e.g., databases, queues) within Kubernetes.
- Contributions to open-source or active engagement in tech communities.
What We Value
- Strategic mindset with ability to execute at scale.
- High sense of ownership and accountability.
- Clear, calm communicator, especially under pressure.
- Deep curiosity about systems reliability and performance.
- Effective collaborator across technical and non-technical teams.
- Strong troubleshooting skills, especially in complex networking scenarios.
