INKHUB is ingesting 10 million raw PDFs to build the internet’s richest catalog of marketing-grade B2B content: tagged, summarized, and searchable by topic, company, or intent. The applied ML engineer will own the semantic ingestion pipeline, from raw PDFs to tagged, summarized, and embedded assets.
Requirements
- Python and PyTorch; experience with sentence-transformers, OpenAI APIs, or similar pretrained LLMs
- FastAPI, Milvus or pgvector, PyPDF/Tika, Airflow or Lambda for orchestration
- Docker, GPU scheduling, Athena/Redshift SQL
- You’ve built ML pipelines that touched real users, not just notebooks
- You’ve worked on semantic search, embeddings, or large-scale tagging
- You’ve wrestled with unstructured data and love turning chaos into clarity
Responsibilities
- Own the ETL pipeline from raw PDFs (S3-ingested) to structured resources
- Finalize our summarization + classification flow using open-source models with GPT-4o fallback
- Apply filtering logic (≤3 years old, ≤100 pages, etc.) to enforce resource quality
- Map each asset to our topic taxonomy (10+ assets per topic across ~9,000 topics)
- Generate dense embeddings using sentence-transformers
- Load and query embeddings using Milvus or pgvector
- Implement “freshness” logic to identify and index only new or updated content based on file diffing, crawl timestamp, or document hash
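The open-source-first summarization flow with a GPT-4o fallback mentioned above can be sketched roughly as follows. The function and stub models here are hypothetical placeholders, not INKHUB's actual interfaces:

```python
def summarize_with_fallback(text, primary, fallback):
    """Try the open-source summarizer first; use the hosted fallback
    if it raises or returns an empty summary."""
    try:
        summary = primary(text)
        if summary:
            return summary, "open-source"
    except Exception:
        pass  # any primary-model failure routes to the fallback
    return fallback(text), "fallback"

# Stubs standing in for a local model and a GPT-4o API call.
def flaky_local(text):
    raise RuntimeError("GPU OOM")

def hosted(text):
    return text[:20] + "..."

print(summarize_with_fallback("A long B2B whitepaper about intent data.",
                              flaky_local, hosted))
```

In practice the fallback would be rate-limited and logged, since the hosted model is the expensive path.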
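The quality-filter cutoffs (≤3 years old, ≤100 pages) amount to a simple predicate applied before a PDF enters the pipeline; a minimal sketch, with names of our own invention:

```python
from datetime import date

MAX_AGE_DAYS = 3 * 365   # "no older than 3 years"
MAX_PAGES = 100          # "no longer than 100 pages"

def passes_quality_filter(published: date, page_count: int, today: date) -> bool:
    """True only if the document meets both the age and length cutoffs."""
    return (today - published).days <= MAX_AGE_DAYS and page_count <= MAX_PAGES

print(passes_quality_filter(date(2024, 6, 1), 40, today=date(2025, 6, 1)))   # True
print(passes_quality_filter(date(2019, 6, 1), 40, today=date(2025, 6, 1)))   # False: older than 3 years
print(passes_quality_filter(date(2024, 6, 1), 250, today=date(2025, 6, 1)))  # False: too many pages
```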
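At query time, Milvus or pgvector ranks stored embeddings by similarity to a query vector; the toy sketch below shows the underlying cosine-similarity ranking in plain Python (the vector store does this at scale with approximate indexes):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """index maps doc_id -> embedding; return the k closest doc_ids."""
    ranked = sorted(index, key=lambda d: cosine_similarity(query, index[d]),
                    reverse=True)
    return ranked[:k]

index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], index))  # ['a', 'b']
```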
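The hash-based variant of the freshness logic can be sketched with a content digest: two byte-identical files collide, and any edit produces a new hash, so only unseen hashes get re-indexed. Function names here are illustrative:

```python
import hashlib

def document_hash(pdf_bytes: bytes) -> str:
    """SHA-256 content hash used to detect new or updated documents."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def select_fresh(batch, indexed_hashes):
    """Given {filename: bytes}, return filenames whose content
    has not been indexed before."""
    return [name for name, blob in batch.items()
            if document_hash(blob) not in indexed_hashes]

indexed = {document_hash(b"2023 report")}
batch = {"old.pdf": b"2023 report", "new.pdf": b"2025 whitepaper"}
print(select_fresh(batch, indexed))  # ['new.pdf']
```

File diffing or crawl timestamps, the other two signals named above, would feed the same "skip if already indexed" decision.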
Other
- We do not employ machine learning technologies when reviewing applications, as we believe every human deserves attention from another human.
- We do not think machines can evaluate your application the way our seasoned recruiting professionals can; every person is unique.
- We promise to give your candidacy a fair and detailed assessment.
- We do not conduct interviews via text message, Telegram, etc., and we never hire anyone without meeting them face-to-face (or via Zoom).
- You will be invited to a live meeting or Zoom call, where you will meet our INFUSE team.