The Wikimedia Foundation is looking to unify data systems across the organization to deliver scalable solutions that support internal and external platform users and the open knowledge movement.
Requirements
- Expertise in tools such as Airflow, Kafka, Spark, and Hive.
- Advanced proficiency in Python and Java/Scala, with deep knowledge of at least one of these languages and its ecosystem.
- Advanced working knowledge of SQL and experience with various database/query dialects (e.g., MariaDB, HiveQL, Cassandra CQL, Spark SQL, Presto).
- Familiarity with additional technologies such as Flink, Iceberg, Druid, Presto, Cassandra, Kubernetes, and Docker.
- Expertise in AI development tooling and AI applications in data engineering and analytics.
Responsibilities
- Designing and Building Data Pipelines: Develop scalable, robust infrastructure and processes using tools such as Airflow, Spark, and Kafka.
- Monitoring and Alerting for Data Quality: Implement systems to detect and address potential data issues promptly.
- Supporting Data Governance and Lineage: Assist in designing and implementing solutions to track and manage data across pipelines.
- Data Platform Development: Contribute to the design and improvement of the shared data platform, enabling critical use cases such as product analytics, bot detection, and image classification.
- Enhancing Operational Excellence: Identify and implement improvements in system reliability, maintainability, and performance.
Other
- 5+ years of data engineering experience, with a significant portion focused on on-premises systems (e.g., Hadoop, HDFS).
- Practical knowledge of engineering best practices with a strong emphasis on system robustness and maintainability.
- Hands-on experience troubleshooting performance and scaling issues in systems and pipelines.
- Demonstrated consistency of tenure with previous employers (e.g., an average of 2+ years per company, ideally including longer engagements).
- Strong communication and collaboration skills to interact effectively within and across teams.
- Ability to produce clear, well-documented technical designs and articulate ideas to both technical and non-technical stakeholders.