Datadog is looking to take on high-risk, high-reward projects grounded in real-world challenges in cloud observability and security, turning research ideas into working systems.
Requirements
- Strong software engineering skills with experience in domains such as observability, SRE, or security
- Depth in distributed computing and ML systems for training and inference at scale; experience with Ray, Slurm, or similar frameworks is a plus
- Proficient in Python, familiar with a systems language (e.g., Rust, C++, or Go), and comfortable with modern cloud and data infrastructure
- Practical experience implementing and operating ML training and inference systems (e.g., PyTorch or JAX), including containerization, orchestration, and GPU acceleration
- Familiarity with efficient training, fine-tuning, and inference techniques for large foundation models
- Experience with GPU programming and optimization, including CUDA
- Experience writing production data pipelines and applications
Responsibilities
- Build and operate datasets, training and evaluation pipelines, benchmarks, and internal tooling
- Implement models, run experiments at scale, and profile for reliability, performance, and cost
- Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery
- Make the research stack observable, reproducible, and easier to use
- Establish rigorous automated benchmarks and regression tests for forecasting, anomaly detection, multi-modal analysis, agents, and code repair tasks
- Collaborate with Research Scientists, Product, and Engineering to integrate advanced AI capabilities into Datadog's product ecosystem and to harden prototypes into reliable services
- Contribute high-quality code, documentation, and open-source artifacts that enable the community and internal teams to reproduce, extend, and evaluate results
Other
- Bachelor's, Master's, or Ph.D. degree in a relevant field
- Ability to explain design and performance trade-offs clearly to both technical and non-technical audiences
- Strong interest in open science and open-source contributions, including establishing rigorous benchmarks and sharing artifacts with the community
- Ability to work in a collaborative environment and communicate effectively with colleagues
- Passion for pushing the boundaries of AI while maintaining a strong focus on customer impact, scalability, and responsible deployment of new technologies