NAVSEA 03S is seeking a Site Reliability Engineer for the Navy Maintenance and Modernization Enterprise Solution (NMMES). The engineer will ensure the reliability, performance, and scalability of the IT systems that support naval ship and submarine maintenance operations for more than 45,000 users worldwide. The role involves bridging the gap between development and operations, modernizing infrastructure, and implementing robust monitoring and incident response processes.
Requirements
- Strong knowledge of Linux/Unix systems administration
- Experience with configuration management tools (e.g., Ansible, Puppet, Chef)
- Proficiency in scripting languages (e.g., Python, Bash)
- Familiarity with containerization technologies (e.g., Docker, Kubernetes)
- Experience with cloud platforms (e.g., AWS, Azure, GCP)
- Knowledge of monitoring and logging tools (e.g., Prometheus, ELK stack)
- Experience with high-availability and disaster recovery strategies
Responsibilities
- Design, implement, and maintain scalable and reliable infrastructure for NMMES applications and services
- Develop and implement automation solutions for deployment, scaling, and management of NMMES systems
- Monitor system performance, availability, and capacity, and proactively address potential issues
- Implement and maintain robust logging, monitoring, and alerting systems
- Participate in on-call rotations to provide 24/7 support for critical NMMES systems
- Collaborate with development teams to improve application performance and reliability
- Conduct post-incident reviews and implement improvements to prevent future incidents
Other
- Must be a U.S. citizen with an active Secret clearance
- SAFe Agilist (SA) certification or higher
- At least 10 years of experience in systems engineering, DevOps, or site reliability engineering
- Bachelor's degree in Computer Science, Information Systems, or related field
- Experience working with DoD/Navy programs or similar complex government IT systems