Datadog's Observability Data Platform (ODP) needs to evolve to handle the demands of AI agents by scaling with data growth, exposing new query mechanisms, rethinking the storage, transformation, and serving of telemetry, and enforcing security and reliability guardrails. The team's focus is to build an intelligent control plane for production systems where AI agents can safely and effectively take action in live environments.
Requirements
- You have a strong software engineering foundation, ideally in C++, Rust, Go, or Python, and are comfortable writing performant, maintainable code
- You have deep expertise in at least one of the following areas: query optimization, data center scheduling, compiler design, reinforcement learning, or distributed systems design
- You have experience applying search, planning, or learning techniques to solve real-world optimization problems
- You are excited by systems that learn, adapt, and improve over time using feedback from runtime metrics and human-defined objectives
- You are hypothesis-driven and enjoy designing experiments and evaluation loops, whether through simulations, benchmarks, or live systems
- You have 8+ years of experience in systems engineering, database internals, or infrastructure research, including hands-on experience in a production environment
Responsibilities
- Design and prototype intelligent systems for AI-native observability, including cost-aware agent orchestration, adaptive query execution, and self-optimizing system components.
- Apply reinforcement learning, search, or hybrid approaches to infrastructure-level decision-making, such as autoscaling, scheduling, or load shaping.
- Collaborate with AI researchers and platform engineers to design experimentation loops and verifiers that guide LLM outputs using runtime metrics and formal models.
- Explore emerging paradigms like AI compilers, "programming after code," and runtime-aware prompt engineering to inform Datadog's infrastructure and product design.
- Help define the direction of BitsEvolve, Datadog's optimization agent that uses LLMs and evolutionary search to discover code improvements, optimize GPU kernels, and tune configurations to improve performance.
- Partner with product teams and platform stakeholders to ensure scientific advances translate into measurable improvements in cost, performance, and observability depth.
Other
- You have a BS/MS/PhD in a scientific field or equivalent experience
- You thrive in ambiguity, enjoy reading papers and building prototypes, and want to help shape the future of infrastructure in the AI era
- You enjoy collaborating across research, engineering, and product to bring scientific insights to practical outcomes
- At Datadog, we value our office culture: the relationships and collaboration it builds and the creativity it brings to the table. We operate as a hybrid workplace so that our Datadogs can create the work-life harmony that best fits them.