Blue Coding is looking to hire a Senior Data Engineer to design and build a next-generation data platform for one of its clients, focused on ingesting, transforming, and governing document-centric datasets that power analytics, dashboards, and AI model training.
Requirements
- 6–10+ years of data engineering experience, including 3+ years building production workloads on AWS; expert-level Python and SQL, plus strong Spark (Glue/EMR/Databricks).
- Proven experience designing and operating data lakes/warehouses at scale, including file formats (Parquet/Delta/Iceberg/Hudi), partitioning, and performance/cost tuning.
- Hands-on document ETL: OCR pipelines, text/metadata extraction, schema design, and incremental processing for millions of files.
- Solid orchestration and DevOps chops: Airflow/MWAA or Step Functions, Docker, Terraform/CDK, and CI/CD best practices.
- Data governance mindset: lineage, quality frameworks, IAM least privilege, KMS, VPC endpoints/private networking, secrets management, and compliance awareness (e.g., SOC 2/ISO 27001).
- Practical ML enablement: crafting reproducible, versioned datasets, plus experience with embeddings/feature pipelines and at least one vector-store pattern (e.g., OpenSearch, pgvector).
Responsibilities
- Design and build an AWS-first data platform: stand up an S3-based (or equivalent) data lake, Glue Data Catalog/Lake Formation, and a performant warehouse layer (Redshift/Snowflake/Athena) using medallion (bronze/silver/gold) patterns.
- Implement a robust ETL/ELT solution for document data, including OCR (Amazon Textract), text parsing, metadata enrichment, schema inference, incremental loads, partitioning, and optimization for large-scale semi-structured/unstructured files (a minimal ingestion sketch follows this list).
- Make data AI-ready: create curated, versioned training datasets, embeddings/feature pipelines, and ML-friendly exports for SageMaker/Bedrock or downstream services, so a future AI developer can plug in models easily.
- Orchestrate and productionize pipelines with Airflow/MWAA or Step Functions/Lambda (see the orchestration sketch after this list); containerize where necessary (ECS/EKS) and deploy with Terraform or AWS CDK, along with CI/CD (CodePipeline/GitHub Actions).
- Establish data quality, lineage, and governance: Great Expectations/Deequ checks, OpenLineage/Marquez lineage, fine-grained permissions with Lake Formation, and ongoing cost/performance monitoring.
- Partner with Analytics/BI to provision trusted marts powering dashboards (QuickSight/Power BI/Tableau) and self-serve queries.
- Manage your own delivery process, including backlog grooming, sprint planning, estimates, stand-ups, reviews, and retrospectives.
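For illustration only, a minimal sketch of the bronze-layer ingestion step referenced above, assuming Amazon Textract's synchronous OCR API and pandas with pyarrow/s3fs; bucket names, prefixes, and helper names are hypothetical placeholders, not a prescribed design:

```python
# Minimal sketch: OCR a single-page document with Amazon Textract and land the
# extracted text as Parquet in the bronze layer of an S3 data lake.
# Bucket names and prefixes are illustrative placeholders; writing Parquet to
# s3:// paths assumes pyarrow and s3fs are installed.
import boto3
import pandas as pd

textract = boto3.client("textract")

RAW_BUCKET = "client-docs-raw"    # placeholder: landing zone for source documents
LAKE_BUCKET = "client-data-lake"  # placeholder: medallion-layered data lake

def ocr_document(key: str) -> str:
    """Run synchronous Textract OCR on a single-page document stored in S3."""
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": RAW_BUCKET, "Name": key}}
    )
    # Keep only LINE blocks and join them into one text body.
    return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

def land_bronze(key: str) -> None:
    """Write OCR output plus minimal metadata to the bronze layer as Parquet."""
    record = {"source_key": key, "text": ocr_document(key)}
    pd.DataFrame([record]).to_parquet(
        f"s3://{LAKE_BUCKET}/bronze/documents/{key}.parquet", index=False
    )
```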
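And a minimal orchestration sketch, assuming Airflow 2.4+ (e.g., on MWAA); the DAG id, schedule, and task callables are hypothetical placeholders that wire the ingestion step above to a downstream quality gate:

```python
# Minimal Airflow sketch: a daily DAG that runs document ingestion and then a
# data-quality gate. DAG id, schedule, and callables are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_documents():
    # Placeholder: list newly arrived S3 keys and call land_bronze() for each.
    ...

def run_quality_checks():
    # Placeholder: run Great Expectations / Deequ suites against the silver layer.
    ...

with DAG(
    dag_id="document_platform_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_documents", python_callable=ingest_documents)
    checks = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
    ingest >> checks
```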
Other
- Fluency in both Spanish and English is a must.
- This position is open exclusively to candidates based in LATAM countries.
- Excellent stakeholder communication and leadership: comfortable being the first and only data engineer, translating client needs into clear sprint goals, and later mentoring/partnering with an AI developer as the team grows.
- 100% Remote