Grainger is looking to build and maintain core infrastructure components and tooling that enable self-service development and deployment of machine learning applications, aiming to accelerate business outcomes through AI-driven features.
Requirements
- Experience with Python, Golang, or a similar language preferred.
- Strong working knowledge of cloud services, their capabilities, and common usage patterns; AWS preferred.
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm).
- Deep expertise with GitOps practices and tools (Argo CD app‑of‑apps, RBAC, sync policies) and with policy‑as‑code (OPA/Kyverno) for safe rollouts.
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK).
- Deep, hands‑on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns).
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
Responsibilities
- Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
- Architect multi‑cluster and multi‑region topologies for ML workloads, including high availability (HA), disaster recovery (DR), failover/federation, and blue/green deployments, and lead progressive‑delivery patterns (canary releases, automatic rollback) in CI/CD.
- Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor platform users in software development best practices.
- Define org‑wide observability standards (log/metric/trace schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana plus Splunk/Datadog).
- Collaborate with the SRE team to define and drive SRE standards for ML systems: set and review SLOs and error budgets, partner on org-wide reliability scorecards and improvement plans, and scale blameless root-cause analysis (RCA) practices.
- Institute compatibility, deprecation, and versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
Other
- Hybrid work location type.
- Bachelor’s degree and 7+ years’ relevant work experience, or equivalent staff-level impact in platform/infrastructure roles.
- Experience leading org-wide platform initiatives (e.g., multi‑cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers.
- Ability to work collaboratively and empathetically in a team environment.
- Studies show people are hesitant to apply if they don’t meet all requirements listed in a job posting. If you feel you don’t have all the desired experience, but it otherwise aligns with your background and you’re excited about this role, we encourage you to apply.