Monstro is looking to build and operate pipelines that turn real-world financial information into reliable, queryable data to support retrieval, knowledge graphs, agents, analytics, and machine learning.
Requirements
- Strong Python and SQL.
- Hands-on document parsing and ETL across PDFs, HTML, JSON, and XML.
- Experience operating vector databases such as pgvector, Pinecone, or Weaviate, including managing multiple collections.
- Building and scheduling ingestion via APIs, web downloads, and cron or an orchestrator, plus experience with cloud storage and queues.
- Understanding of embeddings, chunking strategies, metadata design, and retrieval evaluation.
- Solid data modeling, schema design, indexing, and performance tuning across storage types.
- History of implementing data quality checks, observability, and access controls for sensitive data.
Responsibilities
- Build and own scalable pipelines that parse and normalize unstructured sources for retrieval, knowledge graphs, and agents.
- Conceive and implement novel approaches for processing thousands of types of unstructured documents with accuracy and consistency.
- Process semi-structured sources into consistent, validated schemas.
- Transform structured datasets for analytics, features, and retrieval workloads.
- Create, version, and maintain multiple collections in a vector database.
- Design and implement robust multi-modal document processing systems that handle heterogeneous file formats (PDFs, images, HTML, XML) with automatic schema inference, content-extraction validation, and graceful degradation for malformed inputs, maintaining a 99.9% pipeline uptime SLA.
- Own ingestion from APIs, file drops, partner feeds, and scheduled jobs with monitoring, retries, and alerting.
Other
- Minimum 2 years in a dedicated Data Engineering role at an AI-native startup, or 4+ years of experience in traditional Data Engineering, with ~8+ years of experience in Tech overall.
- Ownership mindset, clear written communication, and effective collaboration with product and engineering.
- Proven ownership of end-to-end pipelines (ingestion → transformation → serving), including scalable sourcing processes, ETL pipelines, and serving services.
- Experience owning and operating infrastructure in production environments.
- Track record of delivering high-consistency systems for mission-critical data pipelines.