HP IQ aims to make AI models, particularly large language and multimodal models, fast, accurate, and efficient for on-device inference in resource-constrained environments, enabling intelligent decision-making systems across HP's product portfolio.
Requirements
- Proficiency in Python and the ML framework ecosystem (Hugging Face, PyTorch).
- Strong understanding of transformer architectures, attention mechanisms, and PEFT techniques.
- Experience with on-device inference optimization (OpenVINO, ONNX, QNN).
- Familiarity with orchestration/planning architectures and techniques for AI assistants.
- Track record of delivering production-ready ML solutions in latency-sensitive environments.
- Experience with multi-agent systems or AI assistant orchestration.
- Familiarity with advanced inference optimization techniques such as KV cache paging and flash attention.
Responsibilities
- Fine-tune large language models, multimodal models, and task-specific models for orchestration, planning, and other workflows as defined.
- Design and run experiments to improve task accuracy, robustness, and generalization.
- Explore and apply methods such as full fine-tuning, LoRA, QLoRA, and other forms of parameter-efficient fine-tuning.
- Employ advanced techniques such as QAT, DPO, and GRPO to further improve model quality.
- Prune, quantize, and compress models (e.g., INT8, INT4, mixed-precision) for CPU, GPU, NPU, and edge accelerators.
- Optimize models for low-latency inference using frameworks such as OpenVINO, ONNX Runtime, and QNN.
- Build robust data pipelines for domain-specific datasets, including synthetic data generation and annotation.
Other
- 7+ years of experience in applied machine learning, including at least 3 years in LLM fine-tuning.
- Flexible Work Environment
- Forward-Thinking Culture
- Equal Opportunity Employer (EEO) Statement