DataRobot aims to ensure that its AI platform applications deliver high accuracy, low latency, and robust scalability while remaining reliable, cost-effective, and maintainable.
Requirements
- 7+ years of backend engineering experience building scalable, high-performance distributed systems / services.
- Strong experience with performance optimization: e.g. profiling, latency tuning, concurrency, caching strategies.
- Deep experience with autoscaling, resource management, load balancing, throughput/latency SLAs.
- Solid programming skills in one or more backend languages (e.g. Python, Java, Go, C++, or equivalent).
- Strong understanding of observability and monitoring: metrics, tracing, logging, and service instrumentation.
- Experience operating across multiple cloud providers (AWS, GCP, Azure) and/or hybrid environments.
- Experience with Docker and building containerized applications.
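The caching-strategies requirement above can be illustrated with a minimal sketch: a TTL-based memoization decorator. This is an in-process toy (the `ttl_cache` and `expensive_lookup` names are hypothetical, not part of DataRobot's stack); a production service would typically use a shared store with a proper eviction policy.

```python
import functools
import time

def ttl_cache(ttl_seconds=60.0, maxsize=128):
    """Memoize a function's results, expiring entries after ttl_seconds.

    A minimal illustration of a read-through cache; real services would
    usually back this with a shared store (e.g. Redis) and a tuned
    eviction policy rather than a plain dict.
    """
    def decorator(fn):
        cache = {}  # key -> (expires_at, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]          # fresh cache hit: skip the slow call
            value = fn(*args)
            if len(cache) >= maxsize:  # crude eviction: drop the oldest entry
                cache.pop(next(iter(cache)))
            cache[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=0.5)
def expensive_lookup(key):
    time.sleep(0.05)  # stand-in for a slow downstream call
    return key.upper()
```

The second call within the TTL window returns from the cache without paying the downstream latency, which is the latency-tuning effect the requirement is getting at.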
Responsibilities
- Architect, build, and operate backend services that scale to large workloads, high concurrency, and low-latency requirements.
- Design and implement autoscaling strategies (horizontal/vertical), dynamic resource allocation, and load balancing to ensure responsive, cost-efficient service.
- Improve end-to-end request pipelines, optimizing for latency, throughput, reliability, and correctness.
- Instrument, monitor, and profile systems in production; identify bottlenecks, troubleshoot performance issues, and proactively tune services.
- Collaborate with ML/AI teams to ensure model-serving pipelines uphold accuracy, consistency, and performance under load.
- Drive best practices in systems reliability, observability, error handling, capacity planning, resilience, and failover.
- Contribute to defining architecture, coding standards, performance benchmarks, and technical roadmap items related to scalability and performance.
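The instrument-and-profile responsibility above can be sketched with a tiny in-process metrics registry: a decorator that records per-call latency and a nearest-rank p99. The names (`LATENCIES`, `timed`, `p99`, `handle_request`) are hypothetical stand-ins for a real metrics client such as Prometheus or StatsD.

```python
import functools
import time
from collections import defaultdict

# In-process metrics registry: metric name -> observed latencies (seconds).
# A stand-in for a real metrics client; shown only to illustrate the idea.
LATENCIES = defaultdict(list)

def timed(metric_name):
    """Record the wall-clock latency of each call under metric_name."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                LATENCIES[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator

def p99(metric_name):
    """Rough p99 over recorded samples (nearest-rank); None if no samples."""
    samples = sorted(LATENCIES[metric_name])
    if not samples:
        return None
    return samples[min(len(samples) - 1, int(0.99 * len(samples)))]

@timed("handler.latency")
def handle_request(payload):
    # Stand-in request handler; the decorator records its latency.
    return {"ok": True, "size": len(payload)}
```

Percentile latencies like this p99 are exactly the numbers a throughput/latency SLA is written against, which is why instrumentation sits next to capacity planning in the list above.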
Other
- Mentor and coach other engineers; provide technical leadership and influence across teams.
- Solve ambiguous challenges and help set technical direction, balancing performance, accuracy, and cost.
- Experience with AI/ML model deployment, serving, inference, and production integration.
- Experience with generative AI workloads: serving LLMs, embeddings, and related systems.
- Exposure to on-prem delivery models or regulated environments.
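One common technique behind the model-serving and LLM-inference experience listed above is micro-batching: grouping concurrent requests so the model runs on batches, trading a small queuing delay for much higher throughput. The sketch below is a simplified, assumed design (`MicroBatcher` and `fake_model` are hypothetical), not any specific serving framework.

```python
import queue
import threading

class MicroBatcher:
    """Group concurrent single-item requests into batches for a model
    that is cheaper per item when invoked on a batch (e.g. an embedding
    model). A toy sketch: real servers add backpressure and timeouts.
    """
    def __init__(self, infer_batch, max_batch=8, max_wait_s=0.01):
        self._infer_batch = infer_batch  # callable: list[in] -> list[out]
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Blocking call: returns the model output for one item."""
        slot = {"input": item, "done": threading.Event(), "output": None}
        self._queue.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self._queue.get()]  # block until the first item arrives
            # Collect more items until the batch is full or max_wait elapses.
            while len(batch) < self._max_batch:
                try:
                    batch.append(self._queue.get(timeout=self._max_wait_s))
                except queue.Empty:
                    break
            outputs = self._infer_batch([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()

def fake_model(inputs):
    # Stand-in for batched model inference.
    return [x * 2 for x in inputs]
```

The `max_wait_s` knob is the latency/throughput trade-off in miniature: waiting longer fills bigger batches but adds tail latency, which ties back to the SLA and tuning work described in the Responsibilities section.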