84.51° is looking to create, deploy, and maintain computationally efficient proprietary SLM, LLM, and embedding model implementations, along with the serving infrastructure and end-to-end solutions that support them. The role focuses on model serving and operations within the company's foundation models team, requiring expertise in distributed systems, model serving architectures, GPU cluster management, and MLOps best practices for enterprise workloads and large-scale model deployments.
Requirements
- 5+ years of experience developing cloud-based software solutions, with an understanding of designing for scalability, performance, and reliability in distributed systems
- 2+ years of hands-on experience with foundation models (LLMs, SLMs, embedding models) in production environments; 2+ years of experience in model serving and inference optimization preferred
- Deep knowledge of foundation model serving frameworks, particularly Triton Inference Server and vLLM (a minimal vLLM sketch follows this list)
- Working experience with PyTorch models and inference optimization (quantization, pruning, ONNX export, TensorRT); see the quantization and export sketch after this list
- Knowledge of distributed GPU computing, CUDA programming, and GPU memory optimization techniques
- Hands-on experience with GCP and Azure cloud platforms, including GPU instances, managed services, and networking
- Kubernetes and Docker experience with a focus on GPU workloads and model serving deployments
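To make the vLLM requirement concrete, here is a minimal sketch of offline batched generation with vLLM's Python API. The model ID and prompt are placeholders, and a production deployment would more likely run vLLM's OpenAI-compatible server behind a load balancer; this only illustrates the core interface.

```python
# Minimal vLLM sketch: offline batched generation.
# The model ID is a placeholder, not a statement of 84.51°'s stack.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Summarize why KV-cache paging improves GPU memory utilization."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```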
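Similarly, a hedged sketch of two of the inference-optimization techniques named above, using a toy module in place of a real model; the layer shapes and output file name are illustrative only.

```python
# Sketch: dynamic quantization of a PyTorch module and ONNX export.
# The toy model stands in for a real embedding or language model head.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 128)).eval()

# Dynamic quantization: weights stored as int8, activations quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# ONNX export of the FP32 model, e.g. as input to engines such as TensorRT.
dummy = torch.randn(1, 768)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["embedding"],
    dynamic_axes={"input": {0: "batch"}},
)
```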
Responsibilities
- Lead large-scale foundation model projects that can span months, focusing on model serving, inference optimization, and production deployment
- Leverage known patterns, frameworks, and tools for automating and deploying foundation model serving solutions using Triton, vLLM, and other inference engines
- Develop new tools, processes, and operational capabilities to monitor and analyze foundation model performance, latency, throughput, and resource utilization (see the instrumentation sketch after this list)
- Work with researchers and ML engineers to optimize and scale foundation model serving using best practices in distributed systems, GPU orchestration, and MLOps
- Abstract foundation model serving solutions as robust APIs, microservices, or components that can be reused across the business with high availability and low latency (see the microservice sketch after this list)
- Build, steward, and maintain production-grade foundation model serving infrastructure (robust, reliable, maintainable, observable, scalable, performant) to manage and serve LLMs, SLMs, and embedding models at scale
- Research state-of-the-art foundation model serving technologies, inference optimization techniques, and distributed GPU architectures to identify new opportunities for implementation across the enterprise
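As a hypothetical illustration of the latency and throughput analysis mentioned above (not 84.51°'s actual tooling), a standard-library-only instrumentation sketch:

```python
# Hypothetical sketch: collect per-request latencies and report
# p50/p99 latency plus throughput over an observation window.
import time
import statistics
from contextlib import contextmanager

latencies_ms: list[float] = []

@contextmanager
def record_latency():
    # Time the wrapped inference call and store the latency in ms.
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000)

def report(window_s: float) -> None:
    # quantiles(n=100) yields 99 cut points: index 49 is p50, 98 is p99.
    qs = statistics.quantiles(latencies_ms, n=100)
    print(f"p50={qs[49]:.1f}ms p99={qs[98]:.1f}ms "
          f"throughput={len(latencies_ms) / window_s:.1f} req/s")
```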
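And a hedged sketch of wrapping an embedding model as a reusable microservice; FastAPI and sentence-transformers are illustrative choices rather than a statement of the team's actual stack, and the model name is a placeholder:

```python
# Sketch: expose an embedding model behind a simple HTTP endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest) -> dict:
    # Encode a batch of texts and return plain lists for JSON serialization.
    vectors = model.encode(req.texts).tolist()
    return {"embeddings": vectors}
```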
Other
- Bachelor's degree or higher in Machine Learning, Computer Science, Computer Engineering, Applied Statistics, or related field
- Foster a collaborative and innovative team environment, encouraging professional growth and development among junior team members in foundation model technologies
- Understand business requirements and trade off latency, cost, throughput, and model accuracy to maximize value and translate research into production-ready serving solutions
- Conduct code reviews, infrastructure reviews, and production readiness assessments for foundation model deployments
- Apply appropriate documentation, version control, and infrastructure-as-code practices, and communicate effectively across internal channels
- Make time-sensitive decisions and solve urgent production issues in foundation model serving environments without escalation
- Excellent communication skills, particularly on technical topics related to distributed systems and model serving architectures