College Board's BigFuture Division is building data and analytics services that personalize higher-education recruitment and student engagement. This role designs, builds, and operates the data and ML infrastructure behind personalized student experiences at scale.
Requirements
- 4+ years in data engineering (or 3+ with substantial ML productionization), with strong Python and distributed compute (Spark/Glue/Dask) skills.
- Proven experience shipping ML data systems (training/eval datasets, feature or embedding pipelines, artifact/version management, experiment tracking).
- MLOps/LLMOps: orchestration (Airflow/Step Functions), containerization (Docker), and deployment (SageMaker/EKS/ECS); CI/CD for data & models.
- Expert SQL and data modeling for lakehouse/warehouse (Redshift/Athena/Iceberg), with performance tuning for large datasets; a partition-layout sketch follows this list.
- Data quality & contracts (Great Expectations/Deequ), lineage/metadata (OpenLineage/DataHub/Amundsen), and drift/skew monitoring.
- Cloud experience, preferably with AWS services such as S3, Glue, Lambda, Athena, Bedrock, OpenSearch, API Gateway, DynamoDB, SageMaker, Step Functions, Redshift, and Kinesis.
- RAG & vector search experience (OpenSearch KNN/pgvector/FAISS) and prompt/eval frameworks.
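
To make the Spark and lakehouse expectations concrete, here is a minimal PySpark sketch of a batch job that deduplicates raw events and writes date-partitioned Parquet to S3 so that Athena/Redshift Spectrum queries can prune partitions. The bucket paths, column names, and schema are hypothetical.

```python
# Minimal PySpark batch ETL sketch: dedupe raw JSON events and write
# date-partitioned Parquet. Bucket paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-batch-etl").getOrCreate()

events = (
    spark.read.json("s3://example-raw/events/")       # hypothetical source
    .withColumn("event_date", F.to_date("event_ts"))  # derive the partition key
    .dropDuplicates(["event_id"])                     # keeps reruns idempotent
)

(
    events
    .repartition("event_date")                        # one file set per partition
    .write.mode("overwrite")
    .partitionBy("event_date")                        # enables partition pruning downstream
    .parquet("s3://example-curated/events/")          # hypothetical sink
)
```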
Responsibilities
- Design, build, and own batch and streaming ETL (e.g., Kinesis/Kafka → Spark/Glue → Step Functions/Airflow) for training, evaluation, and inference use cases.
- Stand up and maintain offline/online feature stores and embedding pipelines (e.g., S3/Parquet/Iceberg + vector index) with reproducible backfills.
- Implement data contracts & validation (e.g., Great Expectations/Deequ), schema evolution, and metadata/lineage capture (e.g., OpenLineage/DataHub/Amundsen); a contract-validation sketch follows this list.
- Optimize lakehouse/warehouse layouts and partitioning (e.g., Redshift/Athena/Iceberg) for scalable ML and analytics.
- Productionize training and evaluation datasets with versioning (e.g., DVC/LakeFS) and experiment tracking (e.g., MLflow); a tracking sketch follows this list.
- Build RAG foundations: document ingestion, chunking, embeddings, retrieval indexing, and quality evaluation (precision@k, faithfulness, latency, and cost); ingestion and evaluation sketches follow this list.
- Collaborate with data scientists to ship models to serving (e.g., SageMaker/EKS/ECS), automate feature backfills, and capture inference data for continuous improvement.
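
To make the data-contract responsibility concrete, here is a minimal sketch using the classic (pre-1.0) Great Expectations PandasDataset API; the columns and thresholds are hypothetical, and newer GX releases use a different entry point.

```python
# Minimal data-contract sketch using the classic (pre-1.0) Great Expectations
# PandasDataset API. Column names and thresholds are hypothetical.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "gpa": [3.1, 3.9, 2.7],
})

batch = ge.from_pandas(df)
batch.expect_column_values_to_not_be_null("student_id")
batch.expect_column_values_to_be_unique("student_id")
batch.expect_column_values_to_be_between("gpa", min_value=0.0, max_value=4.0)

result = batch.validate()
if not result.success:
    raise ValueError(f"Data contract violated: {result}")  # fail the pipeline run
```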
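For dataset versioning and experiment tracking, a minimal MLflow sketch; the experiment name, parameter, and metric are placeholders, and the data-version tag would come from whatever versioning tool (e.g., DVC/LakeFS) is in use.

```python
# Minimal MLflow experiment-tracking sketch. Experiment name, params, and
# metrics are placeholders; the dataset tag would come from DVC/LakeFS.
import mlflow

mlflow.set_experiment("bigfuture-recsys")           # hypothetical experiment name

with mlflow.start_run():
    mlflow.set_tag("training_data_version", "v42")  # e.g., a DVC/LakeFS revision
    mlflow.log_param("embedding_dim", 256)
    mlflow.log_metric("precision_at_10", 0.31)
```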
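For RAG ingestion, a minimal sketch of fixed-size chunking plus embeddings from Amazon Bedrock. It assumes the Titan text-embedding request/response shape and model ID, which vary by model version and region; production pipelines typically chunk on semantic boundaries with overlap.

```python
# Minimal RAG-ingestion sketch: fixed-window chunking plus embeddings from
# Amazon Bedrock. Assumes the Titan text-embedding request/response shape
# ({"inputText": ...} -> {"embedding": [...]}); model ID may differ by version.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-window chunking; real pipelines usually split on
    # semantic boundaries (headings, sentences) instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",   # assumed model ID
        contentType="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

# Hypothetical usage: embed every chunk of a local document.
vectors = [(c, embed(c)) for c in chunk(open("doc.txt").read())]
```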
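And for retrieval quality evaluation, a minimal pure-Python precision@k sketch (the document IDs are made up).

```python
# Minimal precision@k sketch for retrieval evaluation: the fraction of the
# top-k retrieved documents that are relevant. Doc IDs are made up.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Example: 2 of the top 3 retrieved docs are relevant -> 0.666...
print(precision_at_k(["d1", "d7", "d3", "d9"], {"d1", "d3"}, k=3))
```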
Other
- This is a fully remote role that requires working EST hours.
- Authorization to work in the United States for any employer.
- Curiosity and enthusiasm for emerging technologies: willingness to experiment with and adopt new AI-driven solutions, and comfort learning and applying new digital tools independently and proactively.
- Clear and concise communication skills, written and verbal.
- A learner's mindset and a commitment to growth: welcoming diverse perspectives, giving and receiving timely, respectful feedback, and continuously improving through iterative learning and user input.