Walmart is seeking a Principal Engineer (Ceph Storage) to architect, operate, and troubleshoot multi-petabyte-scale Ceph clusters in mission-critical environments, ensuring reliable, secure, high-performance storage for business operations, customer platforms, and innovation workloads.
Requirements
- 10+ years of hands-on experience with Ceph, including architecture, operations, and large-scale production support.
- Strong expertise in Linux systems: kernel tuning, cgroups, systemd, process/thread debugging.
- Strong expertise in networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, jumbo frames.
- Strong expertise in storage internals: LVM, OSD design, BlueStore, RocksDB tuning, journaling, caching layers.
- Strong expertise in performance tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF.
- Strong expertise in debugging: core dump analysis, kernel crash dumps (kdump), system call tracing.
- Proficiency in Python and Shell scripting for automation and tooling.
Responsibilities
- Architect, deploy, and manage large-scale Ceph clusters across multiple production sites.
- Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
- Define strategies for upgrades, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
- Own end-to-end lifecycle management of storage clusters, including OS/Kernel tuning, firmware upgrades, and hardware integration.
- Identify, diagnose, and resolve performance bottlenecks across the Ceph/scale-out storage solution, Linux kernel, networking, and hardware layers.
- Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
- Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
Other
- 10+ years of deep technical experience in distributed storage systems.
- This role requires a technical leader and subject matter expert (SME) who can architect resilient storage platforms, resolve production incidents under pressure, and drive innovation in private cloud storage at scale.
- Act as the technical SME for storage within the organization, mentoring junior engineers.
- Collaborate with cross-functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
- Partner with hardware and software stakeholders and the Ceph community to drive adoption of best practices and contribute to open-source improvements.