Mithrl is building the world's first commercially available AI Co-Scientist to empower life science teams to go from messy biological data to novel insights in minutes, accelerating the discovery of new drugs and therapies.
Requirements
- 5+ years of experience in data engineering / data wrangling with real-world tabular or semi-structured data.
- Strong fluency in Python and data-processing tools (Pandas, Polars, PyArrow, or similar).
- Extensive experience dealing with messy Excel / CSV / spreadsheet-style data — inconsistent headers, multiple sheets, mixed formats, free-text fields — and normalizing it into clean structures.
- Comfort designing and maintaining robust ETL/ELT pipelines, ideally for scientific or lab-derived data.
- Ability to combine classical data engineering with LLM-powered data normalization / metadata extraction / cleaning.
- Experience with workflow orchestration tools (e.g. Nextflow, Prefect, Airflow, Dagster), or building pipeline abstractions.
- Experience with cloud infrastructure and data storage (AWS S3, data lakes/warehouses, database schemas) to support multi-tenant ingestion.
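The spreadsheet-normalization work described above can be sketched in a few lines of pandas. This is a minimal, illustrative example — the column names, header-cleaning rules, and formats are hypothetical, not a prescribed Mithrl schema:

```python
import io
import re

import pandas as pd

# Hypothetical messy upload: whitespace in headers, unit suffixes,
# thousands separators, N/A placeholders, mixed date formats.
RAW_CSV = """\
Sample ID , Conc. (uM),Measured On
S-001,  1.5 , 2024-01-03
S-002,N/A,03/01/2024
S-003,"2,000",2024-01-05
"""

def normalize_header(name: str) -> str:
    """Lowercase, drop punctuation, snake_case: 'Conc. (uM)' -> 'conc_um'."""
    name = re.sub(r"[^\w\s]", " ", name.strip().lower())
    return re.sub(r"\s+", "_", name.strip())

def clean(raw: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
    df.columns = [normalize_header(c) for c in df.columns]
    # Coerce mixed-format numerics: strip whitespace and thousands
    # separators; anything unparseable (e.g. the N/A row) becomes NaN.
    df["conc_um"] = pd.to_numeric(
        df["conc_um"].astype(str).str.strip().str.replace(",", ""),
        errors="coerce",
    )
    # Parse per-row mixed date formats into one canonical dtype.
    df["measured_on"] = pd.to_datetime(df["measured_on"], format="mixed")
    return df
```

In practice each of these steps (header harmonization, type coercion, date parsing) grows into its own configurable stage of the ingestion pipeline rather than one function.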
Responsibilities
- Build and own an AI-powered ingestion & normalization pipeline to import data from a wide variety of sources — unprocessed Excel/CSV uploads, lab and instrument exports, as well as processed data from internal pipelines.
- Develop robust schema mapping, coercion, and conversion logic (think: units normalization, metadata standardization, variable-name harmonization, vendor-instrument quirks, plate-reader formats, reference-genome or annotation updates, batch-effect correction, etc.).
- Use LLM-driven and classical data-engineering tools to structure “semi-structured” or messy tabular data — extracting metadata, inferring column roles/types, cleaning free-text headers, fixing inconsistencies, and preparing final clean datasets.
- Ensure all transformations that should only happen once (normalization, coercion, batch-correction) execute during ingestion — so downstream analytics and the AI “Co-Scientist” always work with clean, canonical data.
- Build validation, verification, and quality-control layers to catch ambiguous, inconsistent, or corrupt data before it enters the platform.
- Collaborate with product teams, data science / bioinformatics colleagues, and infrastructure engineers to define and enforce data standards, and ensure pipeline outputs integrate cleanly into downstream analysis and storage systems.
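One way to picture the validation and quality-control layer described above is as a pre-ingestion check that accumulates every problem into a report, rather than failing on the first one. A minimal sketch — the required columns and rules here are assumptions for illustration:

```python
from dataclasses import dataclass, field

import pandas as pd

# Illustrative canonical schema; real pipelines would load this
# from versioned data-standard definitions.
REQUIRED_COLUMNS = {"sample_id", "conc_um"}

@dataclass
class ValidationReport:
    issues: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.issues

def validate(df: pd.DataFrame) -> ValidationReport:
    """Collect all problems found, so an uploader sees one complete report."""
    report = ValidationReport()
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        report.issues.append(f"missing required columns: {sorted(missing)}")
        return report  # remaining checks depend on these columns
    dupes = df.loc[df["sample_id"].duplicated(), "sample_id"].tolist()
    if dupes:
        report.issues.append(f"duplicate sample ids: {dupes}")
    negative = df.loc[df["conc_um"] < 0, "sample_id"].tolist()
    if negative:
        report.issues.append(f"negative concentrations for: {negative}")
    return report
```

Gating ingestion on a report like this is what keeps ambiguous or corrupt uploads from ever reaching the canonical store.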
Other
- Strong desire and ability to own the ingestion & normalization layer end-to-end — from raw upload → final clean dataset — with an eye for maintainability, reproducibility, and scalability.
- Good communication skills; able to collaborate across teams (product, bioinformatics, infra) and translate real-world messy data problems into robust engineering solutions.
- Mission-driven impact: you’ll be the gatekeeper of data quality — ensuring that all scientific data entering Mithrl becomes clean, consistent, and analysis-ready.
- High ownership & autonomy: this role is yours to shape. You decide how ingestion works, define the standards, build the pipelines.
- Location: Beautiful SF office with a high-energy, in-person culture.