Eduworks is seeking a Data Engineer to support its research and development on a government-funded autonomous vehicle (AV) project by designing and maintaining scalable video data pipelines, preparing annotated training corpora, and generating adversarial scenarios.
Requirements
- Strong programming experience in Python, with proficiency in data libraries (Pandas, PySpark, Dask).
- Experience in multimodal or video dataset preparation, including alignment of video-text pairs and large-scale video or image dataset processing pipelines.
- Experience contributing to training datasets for LLMs or multimodal LLMs.
- Experience implementing ETL pipelines with schema validation, logging, and quality checks.
- Knowledge of Docker containerization.
- Familiarity with AV datasets (e.g., BDD, nuScenes, Waymo) and annotation schemas.
- Experience using AV driving simulators (e.g., CARLA).
Responsibilities
- Design, implement, and optimize data ingestion pipelines for large-scale AV datasets such as BDD100K, BDD-X, nuScenes, and Waymo Open.
- Standardize, preprocess, and normalize raw video streams (e.g., frame decoding, resolution/frame-rate harmonization, perspective correction).
- Develop ETL pipelines to validate schema conformity, synchronize annotations, and compute cryptographic hashes for source authenticity.
- Generate synthetic adversarial data using the CARLA and CHALLENGER simulators as well as diffusion-based video models.
- Implement semi-supervised annotation workflows combining auto-labeling tools (e.g., YOLOv8, DETR) with human-in-the-loop quality control.
- Develop tools to manage multimodal datasets (video, annotations, metadata, hashes) and package them into efficient formats such as Parquet for distributed training.
- Work with ML teams to generate datasets for instruction tuning by pairing manipulated and clean sequences with interpretive rationales.
Other
- 2 to 5 years of Data Engineering experience
- Bachelor’s or Master’s degree in Computer Science or a related field