(USA) Principal, Software Engineer

Walmart

$110,000 - $286,000
Nov 19, 2025
Sunnyvale, CA, US

Walmart is seeking a Principal Engineer (Ceph Storage) to architect, operate, and troubleshoot multi-petabyte-scale Ceph clusters in mission-critical environments, ensuring reliable, secure, high-performance storage for business operations, customer platforms, and innovation workloads.

Requirements

  • 10+ years hands-on experience with Ceph, including architecture, operations, and large-scale production support.
  • Strong expertise in Linux systems: kernel tuning, cgroups, systemd, process/thread debugging.
  • Strong expertise in networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, jumbo frames.
  • Strong expertise in storage internals: LVM, OSD design, BlueStore, RocksDB tuning, journaling, caching layers.
  • Strong expertise in performance tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF.
  • Strong expertise in debugging: core dump analysis, kernel crash dumps (kdump), system call tracing.
  • Proficiency in Python and Shell scripting for automation and tooling.

Responsibilities

  • Architect, deploy, and manage large-scale clusters across multiple production sites.
  • Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
  • Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
  • Own end-to-end lifecycle management of storage clusters, including OS/kernel tuning, firmware upgrades, and hardware integration.
  • Identify, diagnose, and resolve performance bottlenecks across Ceph and other scale-out storage solutions, the Linux kernel, networking, and hardware layers.
  • Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
  • Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.

Other

  • 10+ years of deep technical experience in distributed storage systems.
  • This role requires a technical leader and subject matter expert (SME) who can architect resilient storage platforms, resolve production incidents under pressure, and drive innovation in private cloud storage at scale.
  • Act as technical SME for Storage within the organization, mentoring junior engineers.
  • Collaborate with cross-functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
  • Partner with hardware and software stakeholders and the Ceph community to drive adoption of best practices and contribute to open-source improvements.