Cartesia is building the next generation of AI: ubiquitous, interactive, and able to continuously process and reason over large streams of audio, video, and text, even on-device. This role focuses on developing next-generation speech models for tasks such as multi-lingual text-to-speech (TTS), voice conversion, music generation, and sound-effect synthesis, with an emphasis on near-zero latency and precise creative control.
Requirements
- Proven experience in developing and training novel generative models, preferably for audio or speech.
- Clear understanding of the architectural trade-offs between model quality, inference speed, and memory footprint.
- Hands-on experience with model conditioning and control mechanisms.
Responsibilities
- Develop and optimize speech and audio models for production.
- Work with engineering to ship and scale your models across our target platforms: cloud, on-premise, and on-device.
- Develop model architectures and inference strategies specifically for low-latency, real-time performance on consumer hardware.
- Implement and refine mechanisms for fine-grained controllability, allowing for the manipulation of attributes like speaker identity, emotion, prosody, and acoustic style.
- Pioneer research on new architectures for generative modeling.
Other
- We’re an in-person team based out of San Francisco.
- We offer relocation and immigration support.