Qualcomm is investing in Deep Learning and developing hardware and software solutions for Inference Acceleration to play a central role in the evolution of Cloud AI.
Requirements
- Hands-on experience in one or more of the following LLM serving/orchestration packages: Triton Inference Server, vLLM, SGLang, Ollama, llm-d, KServe, LMCache, Mooncake.
- Deep understanding of foundational LLMs, VLMs, SLMs, and transformer-based architectures.
- Strong experience developing language models using PyTorch.
- Strong computer science fundamentals - algorithms, data structures, parallel and distributed programming.
- Understanding of computer architecture, ML accelerators, in-memory processing, and distributed systems.
- Strong Python development skills for large-scale projects and a passion for software engineering.
- Experience in analyzing, profiling, and optimizing deep learning workloads.
Responsibilities
- Build a scalable LLM inference platform using advanced inference techniques (e.g. disaggregated serving and KV-cache management, advanced parallelism, speculative algorithms, model optimization, specialized kernels).
- Contribute to the development of LLM serving packages (e.g. vLLM, SGLang, TGI, Triton Inference Server, Dynamo, llm-d).
- Work closely with customers to drive solutions by collaborating with internal compiler, firmware and platform teams.
- Work at the forefront of GenAI by understanding advanced algorithms (e.g. attention mechanisms, MoEs) and numerics to identify new optimization opportunities.
- Drive efficient serving through smart autoscaling, load balancing, and routing.
- Engage with open-source serving communities to evolve the framework.
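To give a flavor of the serving techniques named above, here is a minimal, illustrative sketch of KV-cache-aware routing: incoming prompts are dispatched to the replica whose cached prefixes overlap most with the request, falling back to the least-loaded replica on ties. All names here are hypothetical; production routers in systems like llm-d or SGLang operate on token IDs and are far more sophisticated.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical serving replica tracking cached prefixes and load."""
    name: str
    cached_prefixes: list[str] = field(default_factory=list)
    active_requests: int = 0


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two strings (stand-in for token IDs)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: str, replicas: list[Replica]) -> Replica:
    """KV-cache-aware routing sketch: prefer the replica with the longest
    cached-prefix overlap; break ties by choosing the lowest current load."""
    def score(r: Replica) -> tuple[int, int]:
        best = max(
            (shared_prefix_len(prompt, p) for p in r.cached_prefixes),
            default=0,
        )
        return (best, -r.active_requests)

    chosen = max(replicas, key=score)
    chosen.active_requests += 1
    chosen.cached_prefixes.append(prompt)  # its KV cache now covers this prompt
    return chosen
```

A request sharing a long system preamble with a previously served prompt lands on the replica that already holds that prefix in its KV cache, avoiding prefill recomputation; unrelated requests spread by load.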
Other
- Excellent communication and problem-solving skills, with the ability to thrive in a fast-paced and collaborative environment.
- MS in Computer Science, Machine Learning, Computer Engineering or Electrical Engineering.
- Open-source contributions to any GenAI package.
- Experience architecting and developing large-scale distributed systems.
- High-level kernel design experience (PyTorch, CUDA, Triton).