Grainger is building core components of a scalable, self-service machine learning platform that powers customer-facing applications. The platform enables machine learning scientists and engineers to continuously develop, deploy, monitor, and refine ML models, and improves the ML software development process.
Requirements
- Track record building and operating production-grade, cloud-deployed systems (AWS preferred) with strong software engineering fundamentals (Python/Go or similar).
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments using DevOps or GitOps best practices (e.g., Terraform/Helm + GitHub Actions/ArgoCD).
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, DataDog, ELK).
- Familiarity with containerization and with container management and orchestration technologies (e.g., Docker, Kubernetes).
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
- Working knowledge of the machine learning lifecycle and hands-on experience with machine learning systems and their associated frameworks/tools, particularly for monitoring and observability.
- Experience with big data technologies, distributed computing frameworks, and/or streaming data processing tools (e.g., Spark, Kafka, Presto, Flink).
Responsibilities
- Build self-service and automated components of the machine learning platform to enable the development, deployment, scaling, and monitoring of machine learning models.
- Ship production platform components end-to-end across multiple modules; own reliability, performance, security, and cost from design through operation.
- Design Helm releases and author GitOps objects (ArgoCD Applications/Projects) with RBAC/sync policies; keep deployments predictable and auditable.
- Collaborate with machine learning, network, security, infrastructure, and platform engineers to ensure performant access to data, compute, and networked services.
- Ensure a rigorous deployment process using DevOps standards and mentor users in software development best practices.
- Partner with teams across the business to drive broader adoption of ML, enabling teams to improve the pace and quality of ML system development.
- Build and maintain core infrastructure components (e.g., Kubernetes clusters) and tooling that enable self-service development and deployment of a variety of applications using GitOps practices.
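
As a sketch of the GitOps objects referenced above, an ArgoCD Application deploying a Helm release with an automated sync policy might look like the following. All names, repository URLs, and namespaces here are illustrative assumptions, not Grainger specifics:

```yaml
# Hypothetical ArgoCD Application for a Helm-based platform component.
# Names, repo URLs, and namespaces are placeholders for illustration only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving            # hypothetical platform component
  namespace: argocd
spec:
  project: ml-platform           # ArgoCD Project scoping repos/clusters (an RBAC boundary)
  source:
    repoURL: https://github.com/example-org/ml-platform-charts  # placeholder repo
    targetRevision: main
    path: charts/model-serving
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving
  syncPolicy:
    automated:
      prune: true                # delete cluster resources removed from Git
      selfHeal: true             # revert out-of-band drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```

Automated sync with `prune` and `selfHeal` keeps the cluster converging to what Git declares, which is what makes deployments predictable and auditable.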
Other
- Ability to work collaboratively in a team environment.
- We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex (including pregnancy), national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or expression, protected veteran status or any other protected characteristic under federal, state, or local law.
- We are proud to be an equal opportunity workplace.
- We are committed to fostering an inclusive, accessible work environment that is both welcoming and supportive.
- We are committed to providing reasonable accommodations to individuals with disabilities during the application and hiring process and throughout the course of one’s employment.