Stanford University is seeking an ML Data Engineer to meet the need for programmatic curation, cleaning, and generation of healthcare data. The role focuses on developing and maintaining automated, ML-accelerated pipelines that ensure high-quality data for machine learning applications in a complex healthcare environment.
Requirements
- 3+ years of experience in software development and data engineering with a strong focus on data cleaning, transformation, and creation.
- Proficiency in Python and experience with data processing libraries (e.g., Pandas, Polars, NumPy).
- Hands-on experience in building and maintaining automated data pipelines for large-scale data processing.
- Familiarity with machine learning frameworks (e.g., PyTorch, JAX, scikit-learn) as applied to data quality and augmentation tasks.
- Expertise in working with healthcare data, including familiarity with the OMOP Common Data Model (OMOP CDM).
- Strong experience in a Linux environment and comfort with UNIX command-line tools.
- Experience with relational, NoSQL, or NewSQL database systems and with modeling both structured and unstructured data.
Responsibilities
- Design, implement, and maintain robust pipelines for the programmatic cleaning, transformation, and curation of healthcare data.
- Develop automated processes to curate and validate data, ensuring accuracy and compliance with healthcare standards (e.g., OMOP CDM, FHIR).
- Leverage core machine learning techniques to generate datasets, clean existing health records, join heterogeneous data sources, and enhance data quality for model training.
- Implement innovative solutions to detect and correct data inconsistencies and anomalies in large-scale healthcare datasets.
- Work extensively with patient-level health data, ensuring that data handling practices adhere to industry regulations and ethical standards.
- Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources, enhancing interoperability and scalability.
- Continuously monitor, troubleshoot, and optimize data workflows to support dynamic research and operational needs.
Other
- Work closely with scientific staff, IT professionals, and project managers to understand their data requirements for existing and future projects involving Big Data.
- Contribute to the development of guidelines, standards, and processes that ensure the quality, integrity, and security of systems and data at a level appropriate to risk.
- Participate in or contribute to setting strategy and standards through data architecture and implementation, leveraging Big Data and analytics tools and technologies.
- Work with IT and data owners to understand the types of data collected in various databases and data warehouses.
- Proven ability to work collaboratively in multidisciplinary teams and communicate technical concepts effectively.