ServiceNow's PLATO organization is looking to build and evolve an AI platform, and to partner with teams to build products and end-to-end AI-powered work experiences. It also aims to research, experiment with, and de-risk AI technologies that unlock new work experiences in the future.
Requirements
- Proficient in prompt engineering and developing LLM-based features.
- Experience with methods of training and fine-tuning large language models, such as distillation, supervised fine-tuning, and policy optimization.
- 8+ years of experience with infrastructure and platform operations, deployments, SRE, and DevOps, with a continued focus on improving platform health.
- 6+ years of experience operating highly-available distributed workloads on Kubernetes following a DevOps approach.
- 6+ years of development experience with Python, Go, Java, or similar languages.
- Experience with DevOps tooling (e.g., Helm, Ansible, Kubernetes, Prometheus, Splunk, GitLab CI).
- Strong working experience operating distributed systems built on Linux and J2EE.
Responsibilities
- Contribute to the design, development and implementation of infrastructure, platform, deployment and observability features that power AI workloads.
- Collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.
- Contribute to the continuous improvement of the SRE practice by turning operational use cases into requirements for software tooling.
- Contribute to the execution of deployment and support activities for AI/ML developers.
- Build high-quality, clean, scalable, and reusable code by enforcing best practices around software engineering architecture and processes (code reviews, unit testing, etc.).
- Work with the product owners to understand detailed requirements, and own your code from design through implementation, test automation, and delivery of a high-quality product to our users.
- Experience with operating LLMs on NVIDIA GPUs.
Other
- This role requires you to be in our Santa Clara office for two days per week.
- Be a mentor for colleagues and help promote knowledge-sharing.
- Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- Experience in using AI productivity tools such as Cursor or Windsurf.
- Ability to drive outcomes in projects with material technical risk.