This Site Reliability Engineering (SRE) role at TikTok addresses one business problem: developing and running a massively distributed AI/ML recommendation system for the United States and the rest of the world while ensuring high availability, scalability, and fault tolerance.
Requirements
- Expertise in analyzing and troubleshooting Linux-based distributed systems.
- Experience programming with at least one commonly used language (C, C++, Python, Go).
- Strong understanding of data structures and algorithms.
- Competent knowledge of relational database systems.
- Ability to design and maintain large-scale systems.
- Strong understanding of code optimization and routine task automation.
- Proficiency in at least one machine learning framework: TensorFlow, PyTorch, MXNet, or PaddlePaddle.
Responsibilities
- Design, build, and maintain highly available, scalable, and fault-tolerant systems.
- Monitor and analyze system performance, identifying and resolving issues before they affect users.
- Develop and maintain automated monitoring, alerting, and incident response systems.
- Collaborate closely with software engineering teams to ensure that applications are designed with reliability, scalability, and performance in mind.
- Implement and maintain security best practices and ensure compliance with regulatory requirements.
- Participate in on-call rotations and respond to issues and incidents both during and outside normal business hours.
- Conduct root cause analysis of incidents, hold post-mortem reviews with stakeholders, and implement preventative measures to reduce the risk of similar incidents recurring.
Other
- Bachelor's/Master's degree in Computer Science, Computer Engineering, or equivalent years of experience in an SRE or software engineering role.
- Hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department.
- Ability to interact and occasionally have unsupervised contact with internal/external clients and/or colleagues.
- Ability to appropriately handle and manage confidential information, including proprietary and trade secret information, and access to information technology systems.
- Ability to exercise sound judgment.