INKHUB is ingesting 10 million raw PDFs to build the internet’s richest catalog of marketing-grade B2B content: tagged, summarized, and searchable by topic, company, or intent. The Applied ML Engineer will own the semantic ingestion pipeline, taking raw PDFs through to tagged, summarized, and embedded assets.
Requirements
- Python, PyTorch, and sentence-transformers; experience with OpenAI APIs or similar pretrained LLMs
- FastAPI, Milvus or pgvector, PyPDF/Tika, Airflow or Lambda for orchestration
- Docker, GPU scheduling, Athena/Redshift SQL
- You’ve built ML pipelines that touched real users, not just notebooks
- You’ve worked on semantic search, embeddings, or large-scale tagging
- You’ve wrestled with unstructured data and love turning chaos into clarity
Responsibilities
- Own the ETL pipeline from raw PDFs (S3-ingested) to structured resources
- Finalize our summarization + classification flow using open-source models with GPT-4o fallback
- Apply filtering logic (≤3 years old, ≤100 pages, etc.) to enforce resource quality
- Map each asset into the topic taxonomy (~9,000 topics, targeting 10+ resources per topic)
- Generate dense embeddings using sentence-transformers
- Load and query embeddings using Milvus or pgvector
- Implement “freshness” logic to identify and index only new or updated content based on file diffing, crawl timestamp, or document hash
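To make the summarization + classification flow concrete, here is a minimal sketch of the open-source-first, GPT-4o-fallback pattern described above. The function and callable names are hypothetical, not INKHUB's actual code; the summarizers are passed in as plain callables so the fallback logic stays model-agnostic.

```python
from typing import Callable

def summarize_with_fallback(
    text: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Try the open-source summarizer first; on failure, fall back
    (e.g. to a GPT-4o call). Both arguments are text -> summary callables."""
    try:
        return primary(text)
    except Exception:
        # Primary model failed (OOM, timeout, malformed output, ...):
        # route the document to the fallback model instead of dropping it.
        return fallback(text)
```

In practice the fallback would also be the place to record metrics on how often the open-source model fails, since that ratio drives the GPT-4o cost of the pipeline.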
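The quality-filtering step above (≤3 years old, ≤100 pages) could be expressed as a simple gate over document metadata. This is an illustrative sketch, not INKHUB's implementation; the function name and the 365-day year approximation are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Thresholds taken from the posting: ≤3 years old, ≤100 pages.
MAX_AGE = timedelta(days=3 * 365)  # approximate a year as 365 days
MAX_PAGES = 100

def passes_quality_filter(
    published: datetime,
    page_count: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a document meets the freshness and length thresholds."""
    now = now or datetime.utcnow()
    return (now - published) <= MAX_AGE and page_count <= MAX_PAGES
```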
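The document-hash variant of the freshness logic could look like the sketch below: hash each incoming PDF's bytes and re-index only keys whose hash differs from what was last indexed. Function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    """Stable fingerprint of a document's raw bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def select_new_or_updated(
    incoming: dict[str, bytes],
    indexed_hashes: dict[str, str],
) -> list[str]:
    """Return keys (e.g. S3 object keys) that are new or whose content
    changed since they were last indexed."""
    return [
        key
        for key, blob in incoming.items()
        if indexed_hashes.get(key) != content_hash(blob)
    ]
```

The same shape works for the crawl-timestamp variant: swap the hash comparison for a `last_crawled > last_indexed` check per key.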
Other
- You like working fast, iterating with feedback, and tracking metrics that matter