Job Board

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1,000+ jobs and find postings that perfectly match your resume.


AI/ML Data Engineer

College Board

$137,000 - $148,000
Nov 11, 2025
Remote, US

College Board's Aquifer team designs, builds, and operates the data and ML infrastructure that powers personalized student experiences at scale, transforming how students connect with colleges.

Requirements

  • 4+ years in data engineering (or 3+ with substantial ML productionization), with strong Python and distributed compute (Spark/Glue/Dask) skills.
  • Proven experience shipping ML data systems (training/eval datasets, feature or embedding pipelines, artifact/version management, experiment tracking).
  • MLOps/LLMOps: orchestration (Airflow/Step Functions), containerization (Docker), and deployment (SageMaker/EKS/ECS); CI/CD for data & models.
  • Expert SQL and data modeling for lakehouse/warehouse (Redshift/Athena/Iceberg), with performance tuning for large datasets.
  • Data quality & contracts (Great Expectations/Deequ), lineage/metadata (OpenLineage/DataHub/Amundsen), and drift/skew monitoring.
  • Cloud experience, preferably with AWS services such as S3, Glue, Lambda, Athena, Bedrock, OpenSearch, API Gateway, DynamoDB, SageMaker, Step Functions, Redshift, and Kinesis; experience with BI tools such as Tableau, QuickSight, or Looker for real-time analytics and dashboards.
  • RAG & vector search experience (OpenSearch KNN/pgvector/FAISS) and prompt/eval frameworks.

Responsibilities

  • Design, build, and own batch and streaming ETL (e.g., Kinesis/Kafka → Spark/Glue → Step Functions/Airflow) for training, evaluation, and inference use cases.
  • Stand up and maintain offline/online feature stores and embedding pipelines (e.g., S3/Parquet/Iceberg + vector index) with reproducible backfills.
  • Implement data contracts & validation (e.g., Great Expectations/Deequ), schema evolution, and metadata/lineage capture (e.g., OpenLineage/DataHub/Amundsen).
  • Optimize lakehouse/warehouse layouts and partitioning (e.g., Redshift/Athena/Iceberg) for scalable ML and analytics.
  • Productionize training and evaluation datasets with versioning (e.g., DVC/LakeFS) and experiment tracking (e.g., MLflow).
  • Build RAG foundations: document ingestion, chunking, embeddings, retrieval indexing, and quality evaluation (precision@k, faithfulness, latency, and cost).
  • Define SLOs and instrument observability across data and model services (freshness, drift/skew, lineage, cost, and performance).
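
For reference, the precision@k retrieval metric named above is conventionally the fraction of the top-k retrieved documents that are relevant. A minimal sketch (the function name and signature here are illustrative, not part of any College Board codebase):

```python
def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that appear in the relevant set."""
    if k <= 0:
        return 0.0
    relevant_set = set(relevant)
    top_k = retrieved[:k]
    # Count hits among the first k results; divide by k (the standard convention).
    return sum(1 for doc in top_k if doc in relevant_set) / k
```

For example, if a retriever returns ["a", "b", "c", "d"] and the relevant documents are ["a", "c"], then precision@4 is 0.5.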

Other

  • This is a fully remote role that requires working EST hours.
  • A passion for expanding educational and career opportunities and for mission-driven work.
  • Authorization to work in the United States for any employer.
  • Curiosity and enthusiasm for emerging technologies, with a willingness to experiment with and adopt new AI-driven solutions, and comfort learning and applying new digital tools independently and proactively.
  • Excellent communication, collaboration, and documentation habits.