

Member of Technical Staff (Research Engineer - LLM Systems & Performance)

Contextual AI

$170,000 - $200,000
Dec 8, 2025
Mountain View, CA, US

Contextual AI is revolutionizing how AI agents work by solving AI's most critical challenge: context. The right context at the right time unlocks the accuracy and production scale that enterprises leveraging AI require. Our enterprise AI development platform sits at the intersection of breakthrough AI research and practical developer needs: an end-to-end platform that lets AI developers accurately ingest and query documents from enterprise data sources and embed retrieval results directly into their business workflows.

Requirements

  • Strong programming skills in Python.
  • Experience with at least one major ML framework: PyTorch or JAX.
  • Solid understanding of GPU computing fundamentals (threads/warps/blocks, memory hierarchy, bandwidth vs compute, etc.).
  • Familiarity with distributed training or inference concepts (e.g., model parallelism, collective communication, disaggregated serving, KV caching).
  • Interest in performance engineering: profiling, kernel fusion, memory layout, and end-to-end system efficiency.
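The "bandwidth vs compute" fundamental above can be sketched with a toy roofline check. This is a minimal illustration, not part of the posting: the peak throughput and bandwidth figures below are made-up placeholder numbers, and `arithmetic_intensity` / `is_compute_bound` are hypothetical helper names.

```python
# Toy roofline check: is a GEMM compute-bound or bandwidth-bound?
# Peak numbers are illustrative placeholders, not a specific GPU's specs.

PEAK_FLOPS = 312e12   # placeholder peak FP16 throughput (FLOP/s)
PEAK_BW = 2.0e12      # placeholder HBM bandwidth (bytes/s)


def arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul.

    Counts 2*m*n*k FLOPs against reading A and B and writing C once.
    """
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def is_compute_bound(m: int, n: int, k: int) -> bool:
    # Ridge point of the roofline: intensity needed to saturate compute.
    ridge = PEAK_FLOPS / PEAK_BW
    return arithmetic_intensity(m, n, k) >= ridge


# A large square GEMM sits right of the ridge (compute-bound); a skinny
# m=1 GEMM, as in a single-token decode step, is bandwidth-bound.
```

The m=1 case is why LLM decode throughput is usually limited by memory bandwidth rather than FLOPs.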

Responsibilities

  • Implement and improve components of our SFT and RL training pipelines (e.g., Verl, SkyRL), including data loading, training loops, logging, and evaluation.
  • Contribute to LLM inference infrastructure (e.g., vLLM, SGLang), including batching, KV-cache management, scheduling, and serving optimizations.
  • Profile and optimize end-to-end performance (throughput, latency, compute/memory/bandwidth), using tools like Nsight and profilers to identify and fix bottlenecks.
  • Work with distributed training and inference setups using NCCL, NVLink, and data/tensor/pipeline/expert/context parallelism on multi-GPU clusters.
  • Help experiment with and productionize quantization (e.g., INT8, FP8, FP4, mixed-precision) for both training and inference.
  • Write and optimize GPU kernels in CUDA or Triton, and leverage techniques such as FlashAttention and Tensor Cores where appropriate.
  • Collaborate with researchers to take ideas from paper → prototype → scaled experiments → production.
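As a flavor of the quantization work above, here is a minimal sketch of symmetric per-tensor INT8 round-to-nearest quantization in plain Python. It is illustrative only: real training/inference pipelines use library kernels, and `quantize_int8` / `dequantize` are hypothetical names, not an API from the posting.

```python
# Symmetric per-tensor INT8 quantization (round-to-nearest), the basic
# scheme behind the INT8 path; a didactic sketch, not a production kernel.

def quantize_int8(values):
    """Map floats to int8 codes with a single scale; returns (codes, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    # One scale for the whole tensor, chosen so amax maps to +/-127.
    codes = [max(-128, min(127, round(v * 127.0 / amax))) for v in values]
    return codes, amax / 127.0


def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


vals = [0.5, -1.0, 0.25]
q, scale = quantize_int8(vals)
approx = dequantize(q, scale)  # close to vals, within one quantization step
```

The round trip recovers each value to within half a quantization step (`scale / 2`), which is the error budget per-tensor schemes trade against finer-grained (per-channel or block-wise) scaling.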

Other

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field (or equivalent practical experience).
  • Ability to work in a fast-paced environment, communicate clearly, and collaborate closely with other engineers and researchers.