Rad AI is hiring a Senior Machine Learning Engineer to build and maintain the infrastructure behind its AI research and products. The role focuses on accelerating language model R&D and serving those models to radiologists to improve clinical outcomes.
Requirements
- 5+ years of industry experience in ML Engineering in cloud-native environments
- In-depth knowledge of Python and JavaScript/TypeScript (preferred), or other modern languages used in the ML domain
- Strong experience with infrastructure and DevOps tools such as Kubernetes, Docker, and Ansible
- Experience in distributed systems, storage systems, and databases
- Strong knowledge of cloud computing platforms such as AWS (preferred), GCP, and Azure
- Experience with infrastructure-as-code tools such as Terraform (preferred), Pulumi, and CloudFormation
- Experience with monitoring, tracing, and logging tools such as CloudWatch, New Relic, and Grafana
Responsibilities
- Design, implement, and maintain the infrastructure that supports our machine learning applications, services, and workflows
- Build, maintain, and improve our ML platform that supports continuous integration, continuous delivery, and continuous training for our machine learning models
- Develop full-stack, cloud-native services and serverless architectures to build scalable and resilient systems
- Plan, design, and develop components in the data pipeline to enable various machine learning models in production
- Write code that meets our internal standards for security, style, maintainability, and best practices for a high-scale, HIPAA-compliant web environment
- Design, deploy, and maintain the full ML platform stack including monitoring and observability, data analytics, backend integration with customer-facing products, and the full model R&D lifecycle
- Work with Product Management, Research, and Engineering to iterate on new features and address inefficiencies across our AI/ML infrastructure
Other
- Excellent communication skills, with a strong sense of ownership and a systematic approach to problem-solving
- Proven ability to lead active incidents, address their root causes, and establish systems to prevent recurrence through blameless postmortems
- Experience working at a fast-growing startup
- Experience in a HIPAA-compliant environment
- Location flexibility (remote-first company!)