Mirage is looking for a Software Engineer to build and scale the data systems that power our machine learning products. The role centers on data engineering and ML infrastructure: operating large-scale streaming pipelines and ensuring feature data is reliable, discoverable, and performant.
Requirements
- 4+ years building distributed data systems, feature platforms, or ML infrastructure at scale.
- Strong experience with streaming and batch pipelines (e.g. Pub/Sub, Kafka, Dataflow, Beam, Flink, Spark).
- Deep knowledge of cloud-native data stores (e.g. Bigtable, BigQuery, DynamoDB, Snowflake) and schema/versioning best practices.
- Proficiency in Python and experience building developer-facing libraries or SDKs.
- Experience with Kubernetes, containerized data infrastructure, and workflow orchestration tools (e.g. Airflow, Temporal).
- Familiarity with ML workflows and feature store design — enough to partner closely with ML teams.
- Bonus: Experience working with video, audio, or other unstructured media data in a production environment.
Responsibilities
- Design and scale feature pipelines: Build distributed data processing systems for feature extraction, orchestration, and serving — including real-time streaming, batch ingestion, and CDC workflows.
- Implement feature extraction: Design and build reliable, reusable feature pipelines for ML models, ensuring features are accurate, scalable, and production-ready through well-designed SDKs and orchestration tools.
- Build and evolve storage infrastructure: Manage multi-tier data systems (e.g. Bigtable for online features/state, BigQuery for analytics and offline training), including schema evolution, versioning, and compatibility.
- Own orchestration and reliability: Lead workflow orchestration design (e.g. Pub/Sub, Busboy, Airflow/Temporal), monitoring, and alerting to ensure reliability at 100M+ video scale.
- Collaborate with ML teams: Partner with ML engineers on feature availability, dataset curation, and streaming pipelines for training and inference.
- Optimize for performance and cost: Tune GPU utilization, resource allocation, and data processing efficiency to maximize system throughput and minimize cost.
- Enable analytics and insights: Support downstream analytics and data science workflows by ensuring data accessibility, discoverability, and performance at scale.
Other
- All of our roles require you to work in person at our NYC HQ (located in Union Square)
- We do not work with third-party recruiting agencies; please do not contact us
- Comprehensive medical, dental, and vision plans
- 401(k) with employer match
- Generous PTO policy