Build and maintain large-scale data processing pipelines and ML workflows, productize models, and deploy them into production environments.
Requirements
- 5+ years of professional Python development experience, with strong object-oriented programming and software engineering fundamentals.
- Hands-on experience with PyTorch for model training and inference.
- Deep understanding of Apache Spark for distributed data processing, ideally via PySpark (Scala experience is a plus).
- Strong experience with Apache Airflow for workflow orchestration in production environments.
- Proficiency in SQL and experience working with both relational and NoSQL databases.
- Experience with Docker, Kubernetes, and cloud platforms (AWS/GCP/Azure).
- Familiarity with data versioning and ML model lifecycle management (MLflow or similar).
Responsibilities
- Build and maintain large-scale data processing pipelines using Apache Spark for batch and streaming data.
- Design and implement ML training and inference workflows using PyTorch and integrate them into production systems.
- Develop and orchestrate ETL and ML pipelines with Apache Airflow, ensuring reliability, scalability, and observability (see the illustrative sketch after this list).
- Optimize performance of data pipelines and ML model training on distributed clusters.
- Collaborate with Data Scientists and ML Engineers to productize models and deploy them into production environments.
- Implement best practices for code quality, CI/CD, unit testing, and monitoring.
- Ensure data quality, integrity, and security across all pipelines.
Other
- Onsite role in Charlotte, NC
- Dental insurance
- Health insurance
- Paid time off