Axle is seeking a Data Engineer to design and automate data pipelines for clinical and research datasets at the National Institutes of Health (NIH), supporting the National Center for Advancing Translational Sciences (NCATS). The role ensures timely, reliable data delivery for downstream analysis, harmonizes data across systems, and meets healthcare and research compliance requirements.
Requirements
- Proven ability to design, build, and maintain scalable data pipelines and automate ETL processes.
- Hands-on experience with clinical or research data, and familiarity with healthcare data standards and Common Data Models (e.g., CDISC, OMOP).
- Familiarity with big data frameworks like Apache Spark or Hadoop.
- Strong skills in Python, SQL, and shell scripting (e.g., Bash).
- Experience using Docker to containerize data workflows for reproducibility and scalability.
- Proficiency with version control systems like Git and continuous integration practices.
- Experience with workflow management systems such as Snakemake, Nextflow, or similar tools.
Responsibilities
- Build and maintain scalable and efficient data pipelines for clinical and research datasets.
- Automate the extraction, transformation, and loading (ETL) processes to ensure timely and reliable data delivery, while optimizing workflows for downstream analysis.
- Ingest large-scale datasets from diverse clinical and research sources.
- Collaborate with data science teams to harmonize data across systems.
- Implement best practices for cleaning and standardizing data to enable consistent analytics.
- Ensure datasets meet healthcare and research compliance requirements by aligning data with established Common Data Models such as CDISC and OMOP.
- Develop, optimize, and automate workflows using tools like Snakemake or Nextflow.
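To illustrate the kind of ETL automation and data standardization described above, here is a minimal Python sketch (illustrative only; the dataset fields, value mappings, and in-memory "warehouse" are hypothetical, not an actual NCATS pipeline):

```python
# Minimal, illustrative ETL sketch: extract raw records, standardize
# field names and values, and load them into a target store.
# All field names and mappings below are hypothetical examples.
import csv
import io

RAW_CSV = """subject_id,SEX,enroll_date
001,M,2023-01-15
002,f,2023-02-20
003,Male,2023-03-05
"""

# Hypothetical standardization map for a sex/gender field.
SEX_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def extract(text: str) -> list[dict]:
    """Read raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list[dict]) -> list[dict]:
    """Lowercase column names and map sex codes to a standard vocabulary."""
    out = []
    for row in rows:
        clean = {k.lower(): v.strip() for k, v in row.items()}
        clean["sex"] = SEX_MAP.get(clean["sex"].lower(), "UNK")
        out.append(clean)
    return out

def load(rows: list[dict], warehouse: list) -> int:
    """Append standardized rows to the target store; return count loaded."""
    warehouse.extend(rows)
    return len(rows)

if __name__ == "__main__":
    warehouse: list[dict] = []
    n = load(transform(extract(RAW_CSV)), warehouse)
    print(n)  # 3
```

In practice, each step would be a rule in a workflow manager such as Snakemake or Nextflow, with the standardization targeting an established Common Data Model (e.g., OMOP) rather than an ad-hoc map.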
Other
- Bachelor's degree in Computer Science, Data Engineering, Bioinformatics, or a related field, with 5+ years of relevant experience; or a Master's degree with 2-3 years of experience.
- Work closely with multidisciplinary teams including data scientists, biostatisticians, and software engineers to align data infrastructure with project needs.
- Maintain comprehensive documentation of pipeline architectures and workflow logic to ensure clarity, transparency, and reproducibility.
- Experience with cloud platforms (e.g., AWS, GCP, Azure) for large-scale data processing.