TikTok is looking to ensure the reliability, fault tolerance, scalability, and cost-effectiveness of the large-scale data platforms and infrastructure that support the TikTok app.
Requirements
- Experience writing code in Java, Scala, Go, Python, or a similar language.
- Strong scripting skills (e.g., Bash or another shell) for automation tasks.
- Experience with algorithms, data structures, complexity analysis, and software design, with a solid understanding of how to build scalable and efficient systems.
- Basic SQL: Strong understanding of traditional relational databases such as MySQL or PostgreSQL.
- Systems and Infrastructure: Knowledge of Linux/Unix systems, as most infrastructure is based on Linux.
- Observability: Hands-on experience with tools such as Prometheus, Grafana, and OpenTSDB for monitoring, logging, and real-time performance tracking.
- CI/CD Tools: Familiarity with Continuous Integration/Continuous Deployment pipelines and tools (e.g., Jenkins, GitLab CI).
Responsibilities
- End-to-End Service Lifecycle Management: Participate in and continuously improve the full lifecycle of services, from initial design and development to deployment, ongoing operation, and iterative optimization.
- Ensure Reliability and Scalability: Maintain highly reliable, fault-tolerant, and scalable systems that are both cost-effective and efficient, ensuring data, services, and infrastructure meet business needs.
- Performance Troubleshooting: Diagnose and resolve performance issues, including slow queries, resource contention, and bottlenecks across distributed storage engines and services.
- Cluster Scaling and Data Growth: Plan and implement strategies for scaling clusters effectively to accommodate increasing data volume while optimizing performance and cost-efficiency.
- Documentation and Incident Response: Develop and maintain clear runbooks and Standard Operating Procedures (SOPs), and lead sustainable, blameless incident response practices, using post-incident analysis to drive continuous improvement.
- Big Data System Design: Architect and implement robust, scalable, and extensible big data systems that support the core business and products, ensuring seamless data flow and system integration.
- On-Call Rotation: Participate in on-call rotations for production incidents so that critical issues are addressed swiftly, including troubleshooting and resolving problems outside of regular business hours as needed.
Other
- Bachelor's degree in Computer Science, a related technical field involving software or systems engineering, or equivalent practical experience.
- Candidates for this position must be legally authorized to work in the United States.
- This position is not eligible for visa sponsorship or support.
- Hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department.
- Ability to work with and support systems designed to protect sensitive data and information.