DigitalOcean is building the next generation of agentic applications on the GradientAI platform, where multi-agent systems of LLM-powered agents collaborate, make decisions, and adapt at scale. The company needs someone to design robust, scalable, and safe agent workflows that let developers build sophisticated AI-driven systems with confidence.
Requirements
- Proven experience in software development at scale, with strong foundations in distributed systems, system design, and cloud-native engineering.
- Hands-on experience in shipping AI/ML systems into production.
- Experience driving observability, guardrails, and evaluation best practices for multi-agent workflows, ensuring visibility, safety, and continuous improvement.
- Strong software engineering background and deep expertise in generative AI, multi-agent system design, guardrails, monitoring, and evaluation methodologies.
- Experience with scalable orchestration patterns (sequential, router, parallel, map-reduce).
- Knowledge of cloud-native engineering and distributed systems.
- Experience with AI/ML systems and multi-agent systems.
Responsibilities
- Architect and deliver production-grade agentic systems: multi-agent orchestration, workflow management, state/memory handling, and runtime governance.
- Design and orchestrate modular, LLM-powered agents (e.g., Planner, Tool Executor, QA, Validator) using scalable orchestration patterns (sequential, router, parallel, map-reduce), with clear handoff protocols, shared memory, and structured communication.
- Define and enforce guardrails and governance: prompt sanitization, access control, audit trails, threat modeling, and strategies for injection defense, hallucination control, misuse prevention, and compliance.
- Establish evaluation and monitoring methods for multi-agent systems covering accuracy, safety, cost, and latency, using observability practices (logs, telemetry, tracing, capturing intermediate outputs) and feedback loops to continuously refine performance.
- Build fine-tuning and deployment pipelines: supervised fine-tuning, inference optimization, post-deployment updates, and scaling hardened systems with retries, error handling, and fairness checks.
- Rapidly define and deliver MCPs (Minimum Capable Products): identify minimal agent roles and orchestration logic, validate quickly, and expand iteratively into robust multi-agent applications.
- Integrate seamlessly with the GradientAI platform, ensuring agents leverage DigitalOcean (DO) services (inference, KBs, Functions, storage, networking) for scale, reliability, and cost-efficiency.
Other
- 5+ years of relevant industry experience in software engineering and deploying agentic AI systems in production within high-growth environments.
- Ability to balance engineering trade-offs (reliability, latency, cost) with business outcomes.
- Ability to work remotely.
- Must be willing to participate in and support operational excellence.
- Must be able to ship product features independently, from planning through launch and maintenance.