Red Cell Partners is building and investing in technology-led companies. Trase Systems, a Red Cell company, aims to empower enterprise leaders to harness the full potential of AI without added complexity or risk by providing an end-to-end solution for deploying, managing, and optimizing AI. The Principal MLOps Engineer will advance Trase's ML systems, focusing on model training, pipeline development, and LLM fine-tuning.
Requirements
- Expertise in designing and operating scalable, production-grade ML systems on AWS, GCP, or Azure.
- Mastery of Docker and Kubernetes for managing production ML workloads.
- Proven experience managing complex infrastructure as code (IaC) with tools like Terraform.
- Deep experience architecting CI/CD/CT pipelines for complex ML workflows (e.g., GitHub Actions, Jenkins).
- Strong Python programming skills for infrastructure automation, tooling, and services.
- Experience architecting solutions across the full ML lifecycle, from experiment tracking to advanced deployment patterns and monitoring.
- Familiarity with modern MLOps tools like MLflow, Kubeflow, SageMaker, or Vertex AI.
Responsibilities
- Own the technical vision, strategy, and end-to-end architecture for Trase’s MLOps platform, ensuring scalability, reliability, security, and cost-efficiency.
- Architect and build a sophisticated CI/CD/CT ecosystem to automate the entire ML lifecycle, from data validation to production monitoring.
- Lead the design of scalable and resilient ML infrastructure using IaC (Terraform) and container orchestration (Kubernetes) on a major cloud platform.
- Establish MLOps best practices, including frameworks for version control, experiment tracking, model governance, and responsible AI.
- Implement a robust monitoring and alerting framework to track model performance, detect drift, and ensure the reliability of production ML services.
- Define patterns for operating large-scale LLMs and multi-modal AI in production with efficiency and compliance.
- Solve highly ambiguous, large-scale ML deployment challenges where no precedent exists, defining best practices for the organization.
Other
- 10+ years in software/infrastructure engineering, with 5+ years in a senior/lead MLOps, ML Infrastructure, or Platform role.
- Exceptional communication skills, with the ability to articulate complex architectural strategy to stakeholders at all levels.
- Serve as the organization's thought leader on MLOps, mentoring engineers and driving cross-functional alignment on platform strategy and best practices.
- Define the multi-year roadmap for Trase’s MLOps ecosystem in alignment with business and product strategy.
- Some travel is required.