DAT's Convoy Platform Science team is seeking a Principal ML Platform Engineer to scale and evolve Convoy's most critical Data and ML Platform capabilities, increasing our ability to experiment, learn, and adapt in real time across marketplace, fraud detection, and pricing systems.
Requirements
- 8-12+ years of experience in ML engineering, data infrastructure, platform engineering, or closely related production engineering roles.
- Deep hands-on experience with real-time ML platforms, including feature stores, stream processing, low-latency data services, and online inference systems.
- Strong proficiency in Python, with the ability to work across non-Python stacks including TypeScript/Node, gRPC services, and Kubernetes-based microservice ecosystems.
- Expertise in modern data and ML infrastructure, including Kafka, Kubernetes, Postgres-like OLTP systems, cloud platforms, and production observability tooling.
- Experience building and operating robust data and ML pipelines (both batch and streaming), ideally in high-scale environments such as marketplaces, fraud detection systems, pricing, personalization, or real-time decision platforms.
- Strong DevOps and MLOps fundamentals, including CI/CD, containerization, infrastructure-as-code (Terraform/Helm), automated monitoring, and cloud cost and performance optimization.
- Collaborative platform mindset, with a track record of partnering with scientists and product engineers to co-design durable service patterns for model serving, deployment, monitoring, and API design.
Responsibilities
- Deliver lower-latency data to models, unlocking online learning, adaptive policies, and improved real-time decision-making for Convoy's auction mechanism, fraud detection apparatus, and carrier engagement campaigns.
- Evolve our ML platform to support generative AI, including the orchestration, retrieval, standardized service patterns, and scalable model serving needed for foundation model applications in document digitization and voice-based features.
- Enable faster, safer experimentation through robust causal inference tooling, richer randomized experimentation, and reliable evaluation infrastructure, helping us learn more about the unique spatio-temporal dynamics of a trucking marketplace.
- Define and implement durable service architectures, build the real-time systems that power ML in production, and partner closely with scientists to accelerate iteration and innovation.
- Drive the evolution of Convoy's experimentation and model-evaluation foundations, enabling rigorous causal measurement, reliable online experimentation, scalable model iteration, and adaptive learning systems.
- Harden our evaluation infrastructure, including offline/online pipelines, drift detection mechanisms, and structured feedback loops that ensure reliable model behavior over time.
- Implement orchestration layers that combine inference, retrieval, business logic, guardrails, and human-in-the-loop flows into reliable, auditable multi-step AI agents.
Other
- 8-12+ years of experience in a related field.
- Ability to operate at Principal scope, setting technical direction, identifying and retiring platform risk, mentoring engineers, and delivering solutions whose impact scales across teams and the broader organization.
- Medical, Dental, Vision, Life, and AD&D insurance
- Parental Leave
- Up to 20 days of paid time off starting in year one
- 401(k) matching (immediately vested)