Grainger is looking to build and maintain core infrastructure components and tooling that enable self-service development and deployment of machine learning applications, aiming to accelerate business outcomes through AI-driven features.
Requirements
- Experience with Python, Golang, or a similar language preferred.
- Strong working knowledge of cloud services, their capabilities, and common usage patterns; AWS preferred.
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm).
- Deep expertise with GitOps practices and tools (Argo CD app‑of‑apps, RBAC, sync policies) and with policy‑as‑code (OPA/Kyverno) for safe rollouts.
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK).
- Deep, hands‑on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns).
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
Responsibilities
- Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
- Architect multi‑cluster and multi‑region topologies for ML workloads, including high availability (HA), disaster recovery (DR), failover/federation, and blue/green deployments, and lead progressive‑delivery patterns (canary releases, automatic rollback) in CI/CD.
- Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor platform users in software development best practices.
- Define org‑wide observability standards (log/metric/trace schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana plus Splunk/Datadog).
- Collaborate with the SRE team to define and drive SRE standards for ML systems: set and review SLOs and error budgets, partner on org-wide reliability scorecards and improvement plans, and scale blameless root-cause analysis (RCA) practices.
- Institute compatibility, deprecation, and versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
Other
- Hybrid work location type.
- Bachelor’s degree and 7+ years’ relevant work experience, or equivalent staff-level impact in platform/infrastructure roles.
- Experience leading org-wide platform initiatives (e.g., multi‑cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers.
- Ability to work collaboratively and empathetically in a team environment.
- Studies show people are hesitant to apply if they don’t meet all requirements listed in a job posting. If you feel you don’t have all the desired experience, but it otherwise aligns with your background and you’re excited about this role, we encourage you to apply.