Job Board

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1,000+ jobs and find postings that closely match your resume


Software Engineer - Model APIs

Baseten

Salary not specified
Oct 11, 2025
San Francisco, CA, US • New York, NY, US

Baseten powers inference for AI companies. The Model Performance team is responsible for ensuring that models on its platform are fast, reliable, and cost-efficient, with a specific focus on Model APIs, the hosted API endpoints.

Requirements

  • 3+ years experience building and operating distributed systems or large‑scale APIs.
  • Proven track record of owning low‑latency, reliable backend services (rate‑limiting, auth, quotas, metering, migrations).
  • Infra instincts with performance sensibilities: profiling, tracing, capacity planning, and SLO management.
  • Comfortable debugging complex systems, from runtime internals to GPU execution traces.
  • Experience with LLM inference runtimes (vLLM, SGLang, TensorRT-LLM, TGI) or contributions to open-source inference engines.
  • Knowledge of Kubernetes, service meshes, API gateways, or distributed scheduling.
  • Background in developer‑facing infrastructure or open‑source APIs.
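
The rate-limiting and quota work mentioned in the requirements above can be illustrated with a short sketch. The `TokenBucket` class below is a hypothetical example of the token-bucket algorithm commonly used for API rate limiting; it is not Baseten's implementation, and all names are illustrative.

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative, not Baseten's code).

    Allows bursts up to `capacity` tokens and refills at `rate` tokens/second.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request fits the budget."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, a bucket with `rate=5, capacity=10` admits an initial burst of ten requests and then rejects further calls until tokens refill. Production systems typically layer per-tenant buckets, distributed state, and metering on top of this core idea.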

Responsibilities

  • Design, build, and operate the Model APIs surface with a focus on advanced inference capabilities: structured outputs (JSON mode, grammar-constrained generation), tool/function calling, and multi-modal serving.
  • Profile and optimize TensorRT-LLM kernels: analyze CUDA kernel performance, implement custom CUDA operators, tune memory-allocation patterns for maximum throughput, and optimize communication patterns across multi-GPU setups.
  • Productionize performance improvements across runtimes (e.g. TensorRT, TensorRT-LLM) with a deep understanding of their internals: speculative decoding, guided generation for structured outputs, quantization, batching, KV-cache reuse, and custom scheduling and routing algorithms for high-performance serving.
  • Build comprehensive benchmarking frameworks that measure real-world performance across different model architectures, batch sizes, sequence lengths, and hardware configurations.
  • Instrument deep observability (metrics, traces, logs) and build repeatable benchmarks to measure speed, reliability, and quality.
  • Implement platform fundamentals: API versioning, validation, usage metering, quotas, and authentication.
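
The benchmarking responsibility above can be sketched as a small harness that measures latency and throughput across batch sizes. Everything here is an illustrative assumption: `fake_generate` is a stand-in for a real model call, and none of these names come from actual Baseten tooling.

```python
import statistics
import time


def fake_generate(batch_size: int) -> None:
    """Stand-in for a model forward pass; the sleep is a placeholder, not real latency."""
    time.sleep(0.001 * batch_size)


def benchmark(fn, batch_sizes, iters: int = 5) -> dict:
    """Time `fn(batch_size)` over several iterations and report simple statistics.

    Returns, per batch size, the median latency, worst-case latency, and an
    estimated throughput (requests per second at the median latency).
    """
    report = {}
    for bs in batch_sizes:
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            fn(bs)
            samples.append(time.perf_counter() - start)
        p50 = statistics.median(samples)
        report[bs] = {
            "p50_ms": p50 * 1e3,
            "max_ms": max(samples) * 1e3,
            "throughput_rps": bs / p50,
        }
    return report
```

A real framework along these lines would sweep model architectures, sequence lengths, and hardware configurations as well, and record tail percentiles (p95/p99) rather than only the median.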

Other

  • Strong written communication; able to produce clear design docs and collaborate across functions.
  • We value infra‑leaning generalists who bring strong engineering fundamentals and curiosity.
  • ML experience is a plus, but not required.
  • This is a unique opportunity to be part of a rapidly growing startup in one of the most exciting engineering fields of our era.
  • An inclusive and supportive work culture that fosters learning and growth.