The company is seeking a Machine Learning Data Engineer to design, build, and maintain scalable data pipelines that ingest, transform, and load data from various sources into cloud-based systems. The goal is data that is accurate, enriched, reliable, and readily available for analytics and model training.
Requirements
- Strong software engineering skills, including proficiency in Python
- Experience with data processing tools and formats such as Apache Parquet, WebDataset, TorchData, pandas, shell scripting, Protobuf, and TFRecord
- Knowledge of data warehouse architectures and cloud-based systems (e.g., AWS S3)
- Familiarity with natural language processing (NLP) and machine learning (ML) concepts and frameworks (e.g., PyTorch)
- Experience with data curation and enrichment techniques, particularly for large-scale text, image, and video data
- Proficiency in SQL
Responsibilities
- Design and Build Data Pipelines: Create efficient, reliable, streamable, and scalable data pipelines using industry-standard tools and techniques, such as TorchData, WebDataset, Apache Parquet, Python, and SQL.
- Data Ingestion: Develop strategies for ingesting data from data providers, ensuring data quality and consistency.
- Data Pre-processing: Implement parallel pre-processing to clean, transform, de-duplicate, combine, and normalize data.
- Data Curation and Enrichment: Curate, augment, and enrich existing datasets to improve data quality and provide valuable insights to stakeholders.
- Synthetic Data Generation: Collaborate with synthetic data teams to generate data and incorporate it into existing pipelines.
- Collaboration with ML Teams: Work closely with ML scientists, engineers, and product teams to understand data requirements and coordinate data delivery.
- Monitoring, Maintenance & Updating: Monitor data pipelines for performance, errors, and bottlenecks, and implement regular maintenance and updates.
Other
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- At least 3 years of experience as a Software Engineer or Data Engineer.
- Excellent communication and collaboration skills.
- Master's degree in Data Science or a related field (preferred).
- Location: Seattle, WA (in-person 3 days per week, remote 2 days per week).