The company aims to improve mental healthcare by developing AI-enabled experiences that enhance the human connection in therapy, making therapy more transparent, personalized, and accessible. The Data Engineer will support this mission by building and maintaining the data pipelines used to train machine learning models and AI tools that improve patient outcomes.
Requirements
- 8+ years of data pipeline development, specifically building and maintaining scalable ETL/ELT pipelines for ML/AI training workflows, using tools such as AWS Glue, dbt, Dagster, Spark, or Ray for distributed processing of large-scale structured and unstructured data from data lakes (a brief pipeline sketch follows this list)
- Strong proficiency in Spark, Python, and SQL for feature engineering, data transformation, and ensuring high-quality, versioned datasets suitable for model training and inference
- 8+ years of cloud infrastructure and data warehousing experience, including 4+ years focused on AWS, with proficiency in services such as Redshift, S3, Glue, IAM, EMR, and SageMaker for supporting ML/AI pipelines
- Experience optimizing data warehouses (e.g., Redshift, Snowflake, BigQuery) and managing data lakes (e.g., S3, GCS, Azure Blob) for large-scale, versioned ML training datasets, with a focus on partitioning, access controls, and integration with distributed processing frameworks like Spark
- Experience implementing scalable data validation, quality checks, and error-handling mechanisms tailored for ML/AI pipelines, including bias detection, anomaly identification, and dataset-integrity checks that ensure high-fidelity training data (see the validation sketch after this list)
- Experience with data security measures such as encryption, role-based access control, and data masking
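To make the day-to-day concrete, here is a minimal sketch of the kind of ETL step the role involves: reading raw records from a data lake, deriving a feature, and writing a partitioned, versioned training dataset back to S3. The bucket names, paths, column names, and path-based versioning scheme are all hypothetical.

```python
# Minimal PySpark ETL sketch; all paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("training-data-etl").getOrCreate()

# Read raw session records from the data lake (hypothetical path).
raw = spark.read.parquet("s3://example-data-lake/raw/sessions/")

features = (
    raw
    .filter(F.col("session_length_sec") > 0)          # drop malformed rows
    .withColumn("event_date", F.to_date("event_ts"))  # derive the partition column
    .withColumn("log_session_length",
                F.log1p("session_length_sec"))        # example engineered feature
)

# Partitioned, path-versioned output so training jobs can prune by date
# and pin an exact dataset version.
(features.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-ml-datasets/features/v1/"))
```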
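And a sketch of the validation and masking work mentioned above: simple integrity and distribution checks that fail the run before bad data reaches training, plus hashing of direct identifiers. The thresholds, column names, and SHA-256 masking choice are assumptions for illustration, not a prescribed implementation.

```python
# Illustrative quality gate and PII masking; thresholds and columns are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("training-data-checks").getOrCreate()
df = spark.read.parquet("s3://example-ml-datasets/features/v1/")  # hypothetical path

# Integrity checks: fail the pipeline run rather than train on bad data.
total = df.count()
null_labels = df.filter(F.col("label").isNull()).count()
assert total > 0, "dataset is empty"
assert null_labels / total < 0.01, f"too many null labels: {null_labels}/{total}"

# Crude distribution/bias check: no single cohort should dominate the dataset.
max_share = (df.groupBy("cohort").count()
               .agg(F.max(F.col("count") / F.lit(total)))
               .first()[0])
assert max_share < 0.8, "one cohort dominates the training data"

# Mask direct identifiers before data leaves the pipeline environment.
masked = df.withColumn("patient_id",
                       F.sha2(F.col("patient_id").cast("string"), 256))
masked.write.mode("overwrite").parquet("s3://example-ml-datasets/features/v1-masked/")
```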
Responsibilities
- Build and maintain the data pipelines that pull information from our central storage system to train machine learning models and AI tools, which includes designing reliable data flows, testing for accuracy, and solving unexpected challenges
- Develop scalable ETL/ELT pipelines for ML/AI training workflows, using tools such as AWS Glue, dbt, Dagster, Spark, or Ray to process large-scale structured and unstructured data from the data lake
- Implement scalable data validation, quality checks, and error handling for ML/AI pipelines, including bias detection and anomaly identification, to ensure high-fidelity training data (see the validation sketch under Requirements)
- Optimize data pipelines and queries and manage large datasets for efficiency and scalability (a short optimization sketch follows this list)
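A short sketch of what that optimization work can look like in Spark: filtering on the partition column so only the relevant files are read, and broadcasting a small lookup table to avoid a shuffle. The paths, table names, and date cutoff are assumptions.

```python
# Optimization sketch: partition pruning plus a broadcast join; paths assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-optimization").getOrCreate()

# Filtering on the partition column (event_date) lets Spark read only the
# matching S3 prefixes instead of scanning the whole dataset.
recent = (spark.read.parquet("s3://example-ml-datasets/features/v1/")
               .filter(F.col("event_date") >= "2024-01-01"))

# Broadcasting the small dimension table ships it to every executor and
# avoids shuffling the large fact table across the cluster.
clinics = spark.read.parquet("s3://example-data-lake/dim/clinics/")
joined = recent.join(F.broadcast(clinics), on="clinic_id", how="left")

joined.explain()  # inspect the physical plan to confirm pruning and the broadcast
```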
Other
- Strong ability to work cross-functionally with data analysts, data scientists, and stakeholders
- Effective communication skills to explain technical concepts to non-technical audiences
- Adaptability to thrive in a fast-paced startup environment
- 100% remote work environment (US-based only)
- Working hours that support a healthy work-life balance