Re:Build Manufacturing is looking to operationalize and expand its enterprise Data Lake by implementing efficient data ingestion strategies, integrating diverse data sources, and ensuring data is structured for accessibility and analysis.
Requirements
- 3+ years of proven experience building production-grade data systems, with a strong understanding of cloud-based data lake and data warehouse architectures.
- Demonstrated expertise in designing and operating data pipelines (batch, streaming, CDC), including schema evolution, backfills, and performance tuning.
- Hands-on proficiency with Python and SQL, including experience with distributed processing frameworks (e.g., Apache Spark) and CI/CD for data workflows (a brief illustrative sketch follows this list).
- Proven ability to design and implement ETL/ELT workflows and data modeling techniques (e.g., star schemas, wide tables, semantic models).
- Proficiency with cloud data platforms and services such as AWS, Databricks, and Snowflake, with a focus on scalability and reliability.
- Familiarity with open table formats (e.g., Iceberg, Delta, Hudi) and business intelligence data modeling.
- Understanding of data governance, lineage, and data quality frameworks to ensure reliability, accuracy, and compliance.
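To give candidates a concrete feel for the stack, here is a minimal sketch of the kind of batch pipeline work this role involves. It is illustrative only, not part of the formal requirements; the bucket path, table names, and columns are hypothetical.

```python
# Illustrative sketch only. Paths, table names, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("erp_orders_batch").getOrCreate()

# Ingest raw ERP order extracts landed in the lake's raw zone.
raw = spark.read.json("s3://datalake/raw/erp/orders/")

# Light conformance: cast types, derive a partition date,
# de-duplicate on the business key, and stamp load metadata.
orders = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .dropDuplicates(["order_id"])
       .withColumn("_ingested_at", F.current_timestamp())
)

# Persist to the curated zone as a partitioned table for downstream modeling.
(orders.write.mode("overwrite")
       .partitionBy("order_date")
       .saveAsTable("curated.erp_orders"))
```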
Responsibilities
- Co-design data interfaces and pipelines in close collaboration with software engineers and technical leads, ensuring alignment with application domain models and product roadmaps.
- Build and operate batch, streaming, and change data capture (CDC) pipelines from diverse sources (ERP, CRM, accounting, knowledge repositories, and other enterprise systems) into the data lake.
- Model curated data within the lake into data warehouse structures (e.g., star schemas, wide tables, semantic layers) optimized for business intelligence (BI), ad-hoc analytics, and key performance indicator (KPI) reporting.
- Publish certified datasets and policy-aware retrieval assets (tables, document embeddings, vector indexes) to enable analytics, AI, and retrieval-augmented generation (RAG) use cases.
- Establish robust data observability and quality checks to ensure reliability and consistency (see the illustrative sketch after this list).
- Apply governance, security, and compliance controls across the data lake and warehouse — including role-based access, encryption, auditing, and data retention — in alignment with applicable regulations.
- Operate the platform reliably by orchestrating jobs, monitoring pipelines, and continuously tuning cost and performance.
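As a flavor of the observability and quality work described above, the sketch below shows simple job-level data-quality gates in PySpark. It is illustrative only; the table name, column names, and checks are hypothetical.

```python
# Illustrative sketch only. Table, columns, and checks are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
df = spark.table("curated.erp_orders")

checks = {
    # Primary key must be non-null and unique.
    "pk_not_null": df.filter(F.col("order_id").isNull()).count() == 0,
    "pk_unique": df.count() == df.select("order_id").distinct().count(),
    # Amounts should never be negative in this hypothetical domain.
    "amount_non_negative": df.filter(F.col("amount") < 0).count() == 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    # Fail the orchestrated job so bad data never reaches BI consumers.
    raise ValueError(f"Data quality checks failed: {failed}")
```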
Other
- Location: Remote, anywhere in the USA.
- Hours: Because the role is remote and the team is based in Los Angeles, CA, we require this hire to work either MST or PST hours.
- Travel Required: Quarterly travel for onsite meetings in Los Angeles, CA, plus occasional travel to other company locations as needed.
- Bachelor's degree (BA/BS) in Computer Science, Data Science, Mathematics, Analytics, or a related quantitative field (or equivalent experience).
- Fluency in written and spoken English.