The company needs to reliably fetch, extract, and normalize data from across the web and APIs, and to build robust search/indexing pipelines on top of that data.
Requirements
- Hands-on experience with agentic architectures (tool calling, structured outputs/JSON, planning/execution loops) and prompt engineering.
- Deep knowledge of OpenSearch/Elasticsearch: index design, analyzers, ingestion pipelines, snapshots, rolling upgrades, and zero-downtime reindexing/data migrations.
- Proven web scraping expertise: CAPTCHA handling, session/auth flows, proxy rotation, stealth techniques, and awareness of legal/ethical constraints.
- AWS + Docker in production (at least two of: ECS/EKS, Lambda, SQS/SNS, Batch, Step Functions, CloudWatch).
- Experience building high-throughput data/IO pipelines with concurrency (asyncio/multiprocessing), resilient retries, and rate-limit-aware scheduling.
- Experience integrating diverse external APIs (auth patterns, pagination, webhooks) and designing stable interfaces and backfills.
- Strong data wrangling skills with pandas or equivalent; comfort with large CSV/Parquet workflows and memory/performance tuning.
Responsibilities
- Design and ship agentic systems (tool calling, multi-agent workflows, structured outputs) that reliably fetch, extract, and normalize data across the web and APIs.
- Build and operate search/indexing pipelines on OpenSearch/Elasticsearch (schema design, analyzers, reindex/data migration strategies, relevance tuning).
- Own robust web scraping: directory crawling, CAPTCHA handling, headless browsers, rotating proxies, anti-bot evasion, and backoff/retry policies.
- Develop backend services in Python + FastAPI with clean contracts and strong observability.
- Scale workloads on AWS + Docker (batch/queue workers, autoscaling, fault tolerance, cost control).
- Parallelize external API requests safely (rate limits, idempotency, circuit breakers, retries, dedupe).
- Integrate third-party APIs for enrichment and search; model and cache responses; manage schema evolution.
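
For a sense of what "parallelize external API requests safely" can look like in practice, here is a minimal sketch, not a description of the company's actual code. It assumes Python's asyncio and uses hypothetical names throughout: `TransientError` stands in for a 429/5xx response, and `make_flaky_fetch` is a fake API client used only for demonstration. The sketch caps concurrency with a semaphore, dedupes keys before fetching, and retries transient failures with exponential backoff plus jitter.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical error type standing in for a 429 or 5xx response."""


async def call_with_retries(fn, *, attempts=4, base_delay=0.01):
    """Await fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return await fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))


async def fetch_all(keys, fetch_one, *, max_concurrency=5):
    """Fetch each distinct key once, with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)
    unique_keys = list(dict.fromkeys(keys))  # dedupe while preserving order

    async def worker(key):
        async with sem:
            return key, await call_with_retries(lambda: fetch_one(key))

    pairs = await asyncio.gather(*(worker(k) for k in unique_keys))
    return dict(pairs)


def make_flaky_fetch(failures_per_key=2):
    """Build a fake API call (assumption, demo only) that fails N times per key."""
    calls = {}

    async def fetch_one(key):
        calls[key] = calls.get(key, 0) + 1
        if calls[key] <= failures_per_key:
            raise TransientError(key)
        return key.upper()

    return fetch_one, calls
```

Calling `asyncio.run(fetch_all(["a", "b", "a"], fetch_one, max_concurrency=2))` with a `fetch_one` built by `make_flaky_fetch()` fetches `"a"` and `"b"` once each, after retrying past the simulated failures. A production version would likely add a circuit breaker and idempotency keys, which this sketch omits.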
Other
- Excellent ownership, product sense, and pragmatic debugging.
- Familiarity with Stripe (subscriptions, metered billing, webhooks) and basic front-end changes (React/TypeScript or similar).
- Security & compliance basics for data handling and PII.
- CI/CD, infrastructure as code (Terraform), and cost/perf observability.
- Entity resolution/record linkage at scale (probabilistic matching, blocking, deduping).
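
As an illustration of the blocking and deduping mentioned above, here is a minimal sketch using only the Python standard library. The record fields (`name`, `zip`), the blocking rule, and the 0.85 threshold are all assumptions made for the example. It also swaps in a plain string-similarity score from `difflib` rather than a true probabilistic matching model, which a real system at scale would more likely use.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase, drop commas/periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def blocking_key(record):
    # Hypothetical blocking rule: first 3 chars of the normalized name + postcode.
    return (normalize(record["name"])[:3], record["zip"])


def candidate_pairs(records):
    """Yield index pairs that share a block, avoiding a full all-pairs comparison."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose normalized names are at least `threshold` similar."""
    matches = []
    for i, j in candidate_pairs(records):
        a = normalize(records[i]["name"])
        b = normalize(records[j]["name"])
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            matches.append((i, j))
    return matches


records = [
    {"name": "Acme Corp.", "zip": "10001"},
    {"name": "ACME Corp", "zip": "10001"},
    {"name": "Acme Corp", "zip": "94105"},
    {"name": "Zenith Labs", "zip": "10001"},
]
duplicates = find_duplicates(records)
```

Only the first two records land in the same block and compare as equal after normalization, so they are the one pair flagged. The third record has a different postcode, so blocking never compares it; that is the usual trade-off of blocking, which gives up some recall in exchange for far fewer comparisons.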