Algolia is building the next generation of AI-powered search products, aiming to make AI explainable and to help customers make data-driven decisions. The company seeks to enable Data Scientists to move faster, and customers to receive smarter search & discovery experiences, by turning prototypes into robust, scalable, and observable AI services.
Requirements
- Strong coding skills in Python (preferred) and at least one statically typed language (Go preferred).
- Hands-on expertise with containerization (Docker), orchestration (Kubernetes/EKS/GKE/AKS), and cloud platforms (AWS, GCP, or Azure).
- Proven record of building CI/CD pipelines and automated testing frameworks for data or ML workloads.
- Deep understanding of REST/gRPC APIs, message queues (Kafka, Kinesis, Pub/Sub), and stream/batch data processing frameworks (Spark, Flink, Beam).
- Experience implementing monitoring, alerting, and logging for mission-critical services.
- Familiarity with common ML lifecycle tools (MLflow, Kubeflow, SageMaker, Vertex AI, Feature Store, etc.).
- Working knowledge of ML concepts such as feature engineering, model evaluation, A/B testing, and drift detection (a brief illustrative sketch follows this list).
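As a purely illustrative sketch of the drift-detection concept mentioned above (not Algolia's implementation), a Population Stability Index check on a single numeric feature could look like the following; the equal-width bucketing and the 0.2 alert threshold are common rules of thumb and are assumptions here:

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
    """PSI between a reference (training-time) sample and a live sample of one numeric feature."""
    # Equal-width buckets spanning both samples (an assumption; quantile buckets are also common).
    lo = min(expected.min(), actual.min())
    hi = max(expected.max(), actual.max())
    edges = np.linspace(lo, hi, buckets + 1)

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip to avoid log(0) on empty buckets.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    reference = rng.normal(0.0, 1.0, 10_000)  # feature values seen at training time
    live = rng.normal(0.3, 1.1, 10_000)       # shifted production values
    psi = population_stability_index(reference, live)
    # A commonly quoted (assumed) rule of thumb: PSI > 0.2 signals meaningful drift.
    print(f"PSI = {psi:.3f} -> {'drift suspected' if psi > 0.2 else 'stable'}")
```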
Responsibilities
- Productionization & Packaging: Convert notebooks and research code into production-ready Python and Go microservices, libraries, or Kubeflow pipelines; design reproducible build pipelines (Docker, Conda, Poetry) and manage artifacts in centralized registries (a brief illustrative sketch follows this list).
- Scalable Deployment: Orchestrate real-time and batch inference workloads on Kubernetes, AWS/GCP managed services, or similar platforms, ensuring low latency and high throughput; implement blue-green/canary rollouts, automatic rollback, and model-versioning strategies (SageMaker, Vertex AI, KServe, MLflow, BentoML, etc.).
- MLOps & CI/CD: Build and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Argo) covering unit, integration, data-quality, and performance tests; automate feature store updates, model retraining triggers, and scheduled batch jobs using Airflow, Dagster, or similar orchestration tools.
- Observability & Reliability: Define and monitor SLIs/SLOs for model latency, throughput, accuracy, drift, and cost; integrate logging, tracing, and metrics (e.g., Datadog) and establish alerting and on-call practices.
- Data & Feature Engineering: Collaborate with data engineers to build scalable pipelines that ingest clickstream logs, catalog metadata, images, and user signals; implement real-time and offline feature extraction, validation, and lineage tracking.
- Performance & Cost Optimization: Profile models and services; leverage hardware acceleration (GPU, TPU), inference libraries (ONNX, OpenVINO), and caching strategies (Redis, Faiss) to meet aggressive latency targets; right-size clusters and workloads to balance performance with cloud spend.
- Governance & Compliance: Embed security, privacy, and responsible-AI checks in pipelines; manage secrets, IAM roles, and data-access controls via Terraform or CloudFormation; ensure auditability and reproducibility through comprehensive documentation and artifact tracking.
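To make the Productionization & Packaging and Scalable Deployment bullets concrete, here is a minimal, hypothetical sketch of the kind of Python inference microservice this role produces. The endpoint names, the MODEL_VERSION environment variable, and the scoring logic are illustrative assumptions, not a description of Algolia's services:

```python
# Minimal sketch of a containerizable inference microservice (illustrative only).
# Assumes FastAPI + uvicorn; the model here is a stand-in for a real artifact
# pulled from a registry (MLflow, SageMaker, etc.).
import os

from fastapi import FastAPI
from pydantic import BaseModel

MODEL_VERSION = os.getenv("MODEL_VERSION", "0.0.1")  # assumed versioning convention

app = FastAPI(title="search-ranker", version=MODEL_VERSION)


class Query(BaseModel):
    user_id: str
    query_text: str


class Prediction(BaseModel):
    score: float
    model_version: str


def score(query: Query) -> float:
    # Placeholder for real model inference (e.g., an ONNX session or a
    # feature-store lookup followed by a model call).
    return min(1.0, len(query.query_text) / 100.0)


@app.get("/healthz")
def healthz() -> dict:
    # Liveness/readiness probe target for Kubernetes.
    return {"status": "ok", "model_version": MODEL_VERSION}


@app.post("/predict", response_model=Prediction)
def predict(query: Query) -> Prediction:
    return Prediction(score=score(query), model_version=MODEL_VERSION)
```

Pairing a health probe with an explicit model_version on every response is one way to keep canary analysis and rollback decisions traceable during blue-green or canary rollouts.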
Other
- Spend 1-2 days per week in a local coworking space to collaborate with your teammates in person.
- 5+ years of experience in software engineering with 2+ years focused on deploying ML/AI systems at scale.
- GRIT - Problem-solving ability and perseverance in an ever-changing and growing environment.
- TRUST - Willingness to trust our co-workers and to take ownership.
- CANDOR - Ability to receive and give constructive feedback.