Sanity.io is tackling the challenge of enabling machines to truly understand and use human-created content, building systems that structure and enrich large volumes of information for AI agents and LLMs.
Requirements
- 5+ years of data engineering experience, with at least 2 years focused on AI/ML data pipelines or supporting machine learning workloads.
- Strong proficiency in Python and SQL.
- Strong experience with distributed data processing frameworks like Apache Spark, Dask, or Ray.
- Proficiency with GCP and its data services.
- Experience with real-time data streaming technologies such as Kafka, Redpanda, or NATS.
- Familiarity with vector databases (e.g., Milvus, Elasticsearch, Vespa) and their role in AI applications.
- Experience with data modeling, schema design, and working with both relational and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra).
Responsibilities
- Design, build, and optimize scalable data pipelines for AI and ML workloads, handling large volumes of structured and unstructured content data.
- Architect data processing systems that transform, enrich, and prepare content for LLM consumption, with a focus on latency optimization and cost efficiency.
- Build ETL/ELT workflows that extract, transform, and load data from diverse sources to support real-time and batch AI operations.
- Implement data quality monitoring and observability systems to ensure pipeline reliability and data accuracy for AI models.
- Collaborate with engineers and product teams to understand data requirements and design optimal data architectures that support AI features.
- Optimize data storage strategies across data lakes, warehouses, and vector databases to balance performance, cost, and scalability.
- Build automated data validation and testing frameworks to maintain data integrity throughout the pipeline.
Other
- Based in the San Francisco Bay Area and able to work at least 2 days per week in our San Francisco office.
- Strong focus on performance optimization, cost management, and building systems that scale efficiently.
- Ability to write clean, well-documented, maintainable code with proper testing practices.
- Excellent problem-solving skills and a data-driven approach to decision making.
- Strong communication skills and ability to collaborate effectively with cross-functional teams.