Freenome's infrastructure team builds and maintains secure, scalable infrastructure across cloud environments to support clinical lab systems, scientific computing pipelines, and regulated production workloads, enabling rapid iteration without compromising reliability.
Requirements
- Experience with NVIDIA DGX systems and the NVIDIA software stack
- Production experience with Kubernetes in cloud environments (e.g., AKS, GKE, or EKS)
- Proficiency with Terraform, Pulumi, or similar IaC tools
- Experience with CI/CD, including deployment automation and release strategies
- Familiarity with cloud IAM, networking, and security best practices
- Strong troubleshooting and root cause analysis skills in distributed systems
- Demonstrated ability to work autonomously and own technical outcomes
Responsibilities
- Design and implement cloud infrastructure components using Pulumi (Python)
- Manage and maintain Kubernetes clusters (AKS, GKE), including node pools, internal load balancers (ILBs), and autoscaling configurations
- Define observability patterns and implement metrics, dashboards, and alerts in support of production reliability
- Contribute to our CI/CD platforms, including build pipeline improvements, deployment strategies, and release automation
- Participate in incident response and postmortem analysis for infrastructure-related outages or events
- Lead technical implementation of projects or initiatives within your scope
- Perform thorough, constructive code reviews and help level up peers through pairing and design discussions
Other
- This role is hybrid and involves managing our onsite NVIDIA infrastructure.
- Partner with team members, TPMs, and security stakeholders to deliver infrastructure that meets compliance and reliability requirements
- Communicate clearly with cross-functional teams and represent infrastructure in collaborative settings
- Model Freenome’s values and principles in your work and interactions
- Promote a collaborative, respectful engineering culture with clear communication and inclusive practices