HPC System Software Engineer

Lawrence Berkeley National Laboratory

$156,864 - $218,364

Oct 25, 2025

San Francisco Bay Area, CA, US

Lawrence Berkeley National Laboratory is looking for an engineer to architect, develop, deploy, and support the software that forms the backbone of NERSC's world-class supercomputing infrastructure, specifically by engineering robust, scalable, dynamic, and automated solutions for high-performance computing (HPC) system management and large-scale monitoring.

Requirements

Minimum of 4 years of experience with systems programming in Linux environments or management of large-scale Linux-based systems in a high-performance computing, cloud computing, or hyper-scale environment.
Experience with some or all of our key technologies: containers (such as Docker or Kubernetes), configuration management (such as Ansible or Puppet), monitoring and observability (such as VictoriaMetrics, Prometheus, or Nagios), virtualization (such as Proxmox or Harvester), git-based CI/CD pipelines (such as GitLab runners or GitHub Actions), continuous delivery tools (such as Argo CD or Flux), modern programming languages (such as Go or Rust), and complex scripting (such as Python 3 or Bash).
Familiarity with provisioning tools (such as Chef, Foreman, or Terraform).
Working knowledge of software engineering best practices for performance and security.
Strong Linux systems programming skills and knowledge of Linux system internals.
Demonstrated experience in resolving complex issues in creative and effective ways.
Demonstrated experience in working on and resolving significant and unique issues where analysis of situations or data requires an evaluation of intangibles.

Responsibilities

Develop and maintain software for automated provisioning, configuration management, and orchestration across thousands of servers, with a focus on the OpenCHAMI system management software stack.
Contribute to the development and operation of NERSC's large-scale data center monitoring framework.
Analyze system telemetry and logs to debug complex, system-wide issues and identify performance bottlenecks.
Develop and maintain plugins for the Slurm workload manager.
Identify and automate operational tasks and system management processes to improve the efficiency, reliability, and scalability of HPC systems.
Participate in the full lifecycle of HPC systems, including installation, configuration, testing, operation, and maintenance.
Design major software components for system management and monitoring, creating long-term roadmaps to ensure scalability, reliability, and future-readiness.

Other

Join a collaborative environment, working with engineers at NERSC, other national laboratories, leading HPC vendors, and vibrant open-source communities.
Contribute to a shared on-call rotation to provide 24x7 support for critical HPC systems and infrastructure.
Take ownership of new technical assignments, determine appropriate methods and procedures, and coordinate the activities of other personnel on smaller projects or focused technical efforts.
Excellent oral and written communication skills.
Demonstrated ability to work effectively as part of a cross-disciplinary team.