Leidos Public Health and Human Services Operation is seeking a Senior HPC Linux System Administrator to manage a high-performance computing (HPC) infrastructure for public health researchers and scientists, ensuring secure and always-on infrastructure services for data analysis and scientific research.
Requirements
- Extensive experience (7+ years) in designing and operating HPC infrastructure.
- Mastery of Linux systems and administration, including troubleshooting, security, performance monitoring, and various distributions (e.g., Red Hat, Ubuntu) to support scientific computing.
- Proficiency in working with network devices, including routers, switches, gateways, and hubs.
- Experience with infrastructure deliverables, continuous diagnostics and mitigation, threat mitigation and incident response, security architecture support, critical infrastructure protection, patch management, vulnerability management, risk management, information assurance, and Security Assessment and Authorization (SA&A) documentation.
- Experience managing virtual machine (VM) infrastructure.
- Proven experience with HPC clusters, job schedulers (Slurm), and high-speed networking (10/40/100Gb).
- Proficiency in Bash and Python scripting for automation is essential. Experience with cloud technologies (hybrid-cloud integration) and container environments (e.g., Docker, Singularity, Kubernetes).
Responsibilities
- Deploy, administer, and monitor HPC clusters.
- Manage multiple petabytes of data using Pure Storage flash storage and AWS S3 Glacier.
- Install, maintain, and upgrade scientific software, libraries, and batch schedulers such as GridEngine and Slurm.
- Manage the VMware vSphere Foundation for virtual server provisioning, deployment, and configuration, as well as hardware and software implementation and maintenance.
- Perform system monitoring, routine and ad hoc security patch management, troubleshooting, and performance tuning.
- Lead automation efforts to streamline system management tasks using scripting languages (Bash, Python) and configuration management tools (Puppet, Ansible).
- Lead the technical design, integration, and optimization of on-site HPC and cloud resources.
Other
- Must be located in the Atlanta, GA area for partial onsite work.
- Must be a US Citizen with the ability to obtain a Public Trust Clearance.
- Strong problem-solving and communication skills are critical for collaborating with customers, bioinformatics developers, and researchers, and for leading a team.
- Proven leadership in planning and coordinating infrastructure support activities, and in leading and mentoring system administrators.
- Experience working with a team to introduce and integrate new technologies and processes into existing production environments.