Cartesia aims to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. The current limitation is that even the best models cannot continuously process and reason over a year-long stream of audio, video, and text, especially on-device. This role focuses on tackling challenging problems in audio perception to advance this mission.
Requirements
- Deep expertise in ASR, audio understanding, language modeling, or generative modeling more broadly.
- Experience with large-scale training, GPU/TPU acceleration, and model optimization.
Responsibilities
- Architect and develop novel, large-scale models for complex audio understanding tasks, including multi-speaker ASR, diarization, and non-speech audio classification, and deploy them to production at scale.
- Pioneer research in areas like self-supervised learning for audio, few-shot learning, and robust audio-visual perception.
- Set new standards for how we evaluate and benchmark our audio understanding systems.
- Build large-scale pre-training and fine-tuning datasets for audio understanding.
Other
- Strong applied mindset: able to balance scientific novelty with product impact.
- We’re an in-person team based in San Francisco.
- Relocation and immigration support are available.