TikTok's Compute Platform SRE team is newly established and needs talented individuals to shape its future by ensuring the reliability of all major data warehouse products, services, and query engines.
Requirements
- In-depth understanding of Linux, computer networking, and databases.
- Proficient in common SRE/DevOps open-source toolsets, system monitoring tools, and container orchestration platforms like Kubernetes.
- Experience or familiarity with open-source or commercial technologies such as ClickHouse, Hadoop, Doris, Spark, Presto, and Kubernetes.
- Strong coding skills in at least one scripting or programming language, such as Python, Shell, Java, or Go.
- Excellent problem-solving skills and the ability to think critically under pressure.
Responsibilities
- Ensure the reliability of all of TikTok's major data warehouse products, services, and query engines, such as ClickHouse, Spark, Presto, and Doris.
- Uphold Service Level Agreements (SLAs): Ensure that all service level objectives and agreements for ByteDance's Data Platform services are met.
- Respond promptly to any system outages or issues.
- Continuous Performance Optimization: Analyze service performance and reliability patterns to identify potential performance bottlenecks.
- Implement proactive measures to prevent service disruptions.
- Work with development teams to optimize application performance, ensuring that services run efficiently and that resources are utilized effectively.
- Incident Management: Lead efforts to troubleshoot and resolve service incidents, and conduct postmortems afterward.
Other
- Currently pursuing an undergraduate or master's degree in Software Development, Computer Science, Computer Engineering, or a related technical discipline.
- Able to commit to working for 12 weeks during Summer 2026.
- Strong customer-first mindset.
- Strong sense of ownership and a collaborative working style.
- Graduating in December 2026 or later, with the intent to return to your degree program after completing the internship.