The Allen Institute for AI (Ai2) is hiring a Data Engineer to help integrate a large U.S. patent corpus into the Semantic Scholar platform. This NSF-funded role focuses on high-impact data engineering: linking patent and academic research data, resolving citations, disambiguating inventors and authors, applying topic models, and extending data products and APIs.
Requirements
- Strong Python engineering skills, especially for building and maintaining data pipelines
- Experience with SQL and schema design in production settings (PostgreSQL preferred)
- Familiarity with common ML workflows (training classifiers, tuning models, and deploying for inference), particularly for large-scale or ambiguous structured datasets
- Comfortable working with structured datasets (XML/JSON/Parquet) and writing ETL code
- Experience with workflow orchestration tools (Airflow or similar), cloud infrastructure (e.g., AWS, including S3), and containerization (e.g., Docker)
- Experience with author disambiguation, entity resolution, or record linkage problems
- Experience applying vector-based similarity or topic modeling techniques to real-world corpora at scale
Responsibilities
- Build scalable data pipelines (Airflow) for citation resolution and corpus integration
- Develop and deploy lightweight ML models for inventor disambiguation and author linking
- Train or adapt a topic model to classify patents using titles, abstracts, claims, and specifications
- Extend REST APIs to expose linked metadata and topic classifications
- Contribute to dashboards and tools for evaluating data quality and model precision
- Collaborate with Ai2 engineers to ensure maintainability, test coverage, and robust deployment
- Produce reliable, well-documented code and contribute technical designs that support long-term maintainability
Other
- The person in this role is welcome to work remotely from any state in the US.
- This is a fixed-term position scheduled for 2 years, with the possibility of renewal.
- Strong communication skills and a strong sense of ownership for results.
- Must be able to remain in a stationary position for long periods of time.
- Must be able to communicate information and ideas so that others will understand.