Staff Software Engineer - Machine Learning Operations

W.W. Grainger

$121,500 - $202,500
Aug 28, 2025
Chicago, IL, USA

Grainger is hiring a Staff Software Engineer to build and maintain the core infrastructure components and tooling that enable self-service development and deployment of machine learning applications, with the aim of accelerating business outcomes through AI-driven features.

Requirements

  • Experience with Python, Golang, or a similar language preferred.
  • Strong working knowledge of cloud-based services as well as their capabilities and usage; AWS preferred.
  • Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm).
  • Deep expertise with GitOps practices and tools (Argo CD app‑of‑apps, RBAC, sync policies) as well as policy‑as‑code (OPA/Kyverno) for safe rollouts.
  • Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK).
  • Deep, hands‑on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns).
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).

Responsibilities

  • Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
  • Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
  • Architect multi-cluster and multi-region topologies for ML workloads (e.g., high availability (HA), disaster recovery (DR), failover/federation, blue/green) and lead progressive delivery patterns (canary, auto-rollback) in CI/CD.
  • Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices.
  • Define org‑wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
  • Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
  • Institute compatibility and deprecation/versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.

Other

  • Hybrid work location.
  • Bachelor’s degree and 7+ years of relevant work experience, or equivalent staff-level impact in platform/infrastructure roles.
  • Experience leading org-wide platform initiatives (e.g., multi‑cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers.
  • Ability to work collaboratively and empathetically in a team environment.
  • Studies show people are hesitant to apply if they don’t meet all requirements listed in a job posting. If you feel you don’t have all the desired experience, but it otherwise aligns with your background and you’re excited about this role, we encourage you to apply.