Mithrl is building the world's first commercially available AI Co-Scientist to empower life science teams to go from messy biological data to novel insights in minutes, accelerating the discovery of new drugs and therapies.
Requirements
- 5+ years of experience in data engineering / data wrangling with real-world tabular or semi-structured data.
- Strong fluency in Python and data-processing tools (Pandas, Polars, PyArrow, or similar).
- Extensive experience dealing with messy Excel / CSV / spreadsheet-style data — inconsistent headers, multiple sheets, mixed formats, free-text fields — and normalizing it into clean structures.
- Comfort designing and maintaining robust ETL/ELT pipelines, ideally for scientific or lab-derived data.
- Ability to combine classical data engineering with LLM-powered data normalization / metadata extraction / cleaning.
- Experience with workflow orchestration tools (e.g. Nextflow, Prefect, Airflow, Dagster), or building pipeline abstractions.
- Experience with cloud infrastructure and data storage (AWS S3, data lakes/warehouses, database schemas) to support multi-tenant ingestion.
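The spreadsheet-normalization work described above can be sketched in a few lines of pandas. This is a minimal, illustrative example — the column names, header-cleaning rules, and formats are hypothetical, not a prescribed Mithrl schema:

```python
import io
import re

import pandas as pd

# Hypothetical messy upload: whitespace in headers, unit suffixes,
# thousands separators, N/A placeholders, mixed date formats.
RAW_CSV = """\
Sample ID , Conc. (uM),Measured On
S-001,  1.5 , 2024-01-03
S-002,N/A,03/01/2024
S-003,"2,000",2024-01-05
"""

def normalize_header(name: str) -> str:
    """Lowercase, drop punctuation, snake_case: 'Conc. (uM)' -> 'conc_um'."""
    name = re.sub(r"[^\w\s]", " ", name.strip().lower())
    return re.sub(r"\s+", "_", name.strip())

def clean(raw: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
    df.columns = [normalize_header(c) for c in df.columns]
    # Coerce mixed-format numerics: strip whitespace and thousands
    # separators; anything unparseable (e.g. the N/A row) becomes NaN.
    df["conc_um"] = pd.to_numeric(
        df["conc_um"].astype(str).str.strip().str.replace(",", ""),
        errors="coerce",
    )
    # Parse per-row mixed date formats into one canonical dtype.
    df["measured_on"] = pd.to_datetime(df["measured_on"], format="mixed")
    return df
```

In practice each of these steps (header harmonization, type coercion, date parsing) grows into its own configurable stage of the ingestion pipeline rather than one function.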
Responsibilities
- Build and own an AI-powered ingestion & normalization pipeline to import data from a wide variety of sources — unprocessed Excel/CSV uploads, lab and instrument exports, as well as processed data from internal pipelines.
- Develop robust schema mapping, coercion, and conversion logic (think: units normalization, metadata standardization, variable-name harmonization, vendor-instrument quirks, plate-reader formats, reference-genome or annotation updates, batch-effect correction, etc.).
- Use LLM-driven and classical data-engineering tools to structure “semi-structured” or messy tabular data — extracting metadata, inferring column roles/types, cleaning free-text headers, fixing inconsistencies, and preparing final clean datasets.
- Ensure all transformations that should only happen once (normalization, coercion, batch-correction) execute during ingestion — so downstream analytics and the AI “Co-Scientist” always work with clean, canonical data.
- Build validation, verification, and quality-control layers to catch ambiguous, inconsistent, or corrupt data before it enters the platform.
- Collaborate with product teams, data science / bioinformatics colleagues, and infrastructure engineers to define and enforce data standards, and ensure pipeline outputs integrate cleanly into downstream analysis and storage systems.
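One way to picture the validation and quality-control layer described above is as a pre-ingestion check that accumulates every problem into a report, rather than failing on the first one. A minimal sketch — the required columns and rules here are assumptions for illustration:

```python
from dataclasses import dataclass, field

import pandas as pd

# Illustrative canonical schema; real pipelines would load this
# from versioned data-standard definitions.
REQUIRED_COLUMNS = {"sample_id", "conc_um"}

@dataclass
class ValidationReport:
    issues: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.issues

def validate(df: pd.DataFrame) -> ValidationReport:
    """Collect all problems found, so an uploader sees one complete report."""
    report = ValidationReport()
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        report.issues.append(f"missing required columns: {sorted(missing)}")
        return report  # remaining checks depend on these columns
    dupes = df.loc[df["sample_id"].duplicated(), "sample_id"].tolist()
    if dupes:
        report.issues.append(f"duplicate sample ids: {dupes}")
    negative = df.loc[df["conc_um"] < 0, "sample_id"].tolist()
    if negative:
        report.issues.append(f"negative concentrations for: {negative}")
    return report
```

Gating ingestion on a report like this is what keeps ambiguous or corrupt uploads from ever reaching the canonical store.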
Other
- Strong desire and ability to own the ingestion & normalization layer end-to-end — from raw upload → final clean dataset — with an eye for maintainability, reproducibility, and scalability.
- Good communication skills; able to collaborate across teams (product, bioinformatics, infra) and translate real-world messy data problems into robust engineering solutions.
- Mission-driven impact: you’ll be the gatekeeper of data quality — ensuring that all scientific data entering Mithrl becomes clean, consistent, and analysis-ready.
- High ownership & autonomy: this role is yours to shape. You decide how ingestion works, define the standards, build the pipelines.
- Location: Beautiful SF office with a high-energy, in-person culture.