Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1000+ jobs and finds postings that perfectly matches your resume

Data Engineer, Platform

Basis Research Institute

Salary not specified

Nov 23, 2025

New York, NY, US

Basis is looking to build trustworthy data pipelines with comprehensive provenance and quality gates, curate documented datasets for training and evaluation, and ensure data infrastructure scales reliably to understand and build intelligence, and advance society's ability to solve intractable problems.

Requirements

Building data pipelines for model training or evaluation at scale
Developing feature stores or data platforms serving multiple teams
Creating data quality frameworks and implementing governance systems
Designing data architectures that enabled new ML capabilities
Possess strong proficiency in data technologies including SQL (expert level), Python for data processing, distributed computing frameworks (Spark, Dask), and workflow orchestration tools (Airflow, Dagster, Prefect).
Have experience with cloud data platforms including data warehouses (Snowflake, BigQuery, Redshift), data lakes, object storage (S3), and streaming systems (Kafka, Kinesis, Flink) for both batch and real-time processing.
Understand ML data requirements including feature engineering, training/validation/test splits, data versioning, experiment reproducibility, and the specific data needs of different model types and training procedures.

Responsibilities

Design and build data pipelines for training and evaluation across Basis research projects and platform offerings, ensuring reliability, performance, and scalability.
Implement data quality frameworks including validation rules, quality gates, anomaly detection, and monitoring that catch data issues before they impact research or production systems.
Develop and maintain feature stores or equivalent systems that enable consistent feature access across training and serving environments, preventing train-serve skew.
Ensure data provenance and lineage tracking so researchers and engineers can understand data origins, transformations applied, and dependencies, enabling reproducible experiments and debugging.
Curate documented datasets for model training and evaluation, including dataset versioning, comprehensive documentation, quality metrics, and metadata that enables appropriate usage.
Coordinate cross-project data initiatives to prevent duplicate data work, facilitate shared datasets, and ensure consistent data practices across Basis as the organization scales.
Optimize data infrastructure for scale as compute grows, including cost optimization, performance tuning, caching strategies, and efficient data access patterns.

Other

We are looking for people who are technically excellent and treat data quality as a first-class concern.
The ideal Data Engineer has experience with ML data pipelines, understands the full lifecycle from raw data through model training and evaluation, and brings rigor to data provenance, lineage tracking, and quality assurance.
You combine software engineering discipline with deep understanding of data systems and ML requirements.
This role is embedded across Platform and Research teams, working on infrastructure that supports both commercial offerings and internal research.
We seek individuals who aspire to do rigorous, high-quality, robust data engineering, but are not afraid to iterate, learn from real usage, and explore different approaches to achieve excellence.