Elicit is an AI research assistant that uses language models to help professional researchers and high-stakes decision makers break down hard questions, gather evidence from scientific and academic sources, and reason through uncertainty. Elicit aims to radically increase the amount of good reasoning in the world by building a scalable ML system based on human-understandable task decompositions.
Requirements
- Strong proficiency in Python (5+ years experience)
- Experience architecting and optimizing large data pipelines, ideally with Spark (see the PySpark sketch after this list)
- Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches
- Experience with columnar data storage formats like Parquet
- Experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)
- Hands-on experience with industry-standard tools like Airflow, dbt, or Hadoop
- Hands-on experience with standard paradigms like data lake, data warehouse, or lakehouse
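To give a flavor of how several of these fit together, here is a minimal PySpark sketch: reading a Parquet corpus and ranking rows per group with a window function. This is an illustration, not our actual codebase; the path and column names (papers.parquet, journal, published_at) are hypothetical.

```python
# Minimal sketch: read a Parquet corpus and apply a window function in PySpark.
# The path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("papers-example").getOrCreate()

# Parquet is a columnar format, so reads can prune to the columns they need.
papers = spark.read.parquet("s3://example-bucket/papers.parquet")

# Rank papers within each journal by recency (partitioning + window function),
# then keep the five most recent per journal.
w = Window.partitionBy("journal").orderBy(F.desc("published_at"))
latest = (
    papers
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 5)
)

latest.show()
```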
Responsibilities
- Build a complete corpus of academic papers and clinical trials, available as soon as they're published, combining different data sources and ingestion methods (a minimal ingestion sketch follows this list).
- Figure out how best to ingest massive amounts of heterogeneous data so that LLMs can use it.
- Integrate with our customers' custom data providers so that they can create task-specific workflows over them.
- Architect and implement robust, scalable solutions to handle our growing data needs while maintaining high performance and data quality.
- Build and optimize our academic research paper pipeline
- Expand the datasets Elicit works over
- Prepare and manage data for our ML systems
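As a sketch of what one slice of this pipeline work could look like, here is a minimal Airflow DAG (assuming Airflow 2.4+) that fetches newly published papers and loads them into a corpus store on a daily schedule. The task functions and DAG id are hypothetical placeholders, not a description of our actual stack.

```python
# Minimal sketch of a daily paper-ingestion DAG (assumes Airflow 2.4+).
# fetch_new_papers and load_into_corpus are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_new_papers(**context):
    # Hypothetical: pull newly published papers/trials from an upstream source.
    ...

def load_into_corpus(**context):
    # Hypothetical: normalize records and write them into the corpus store.
    ...

with DAG(
    dag_id="ingest_new_papers",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_new_papers", python_callable=fetch_new_papers)
    load = PythonOperator(task_id="load_into_corpus", python_callable=load_into_corpus)
    fetch >> load  # fetch runs first, then the load step
```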
Other
- 5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
- You have created and owned a data platform at a rapidly growing startup: gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
- Strong opinions, weakly held, about approaches to data quality management
- Creative and user-centric problem-solving
- You should be excited to play a key role in shipping new features to users—not just building out a data platform!
- Spend about 1 week out of every 6 working in person with teammates
- Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events