Ensure the data, services, and infrastructure of one of the largest data platforms in the world, which directly supports the TikTok app, are reliable, fault-tolerant, efficiently scalable, and cost-effective.
Requirements
- Experience writing code in Java, Scala, Go, Python, or a similar language. Strong scripting skills (e.g., Bash or another shell) for automation tasks.
- Experience with algorithms, data structures, complexity analysis, and software design: Solid understanding of how to build scalable and efficient systems.
- Basic SQL (MySQL, PostgreSQL, or similar): Strong understanding of traditional relational databases. Ability to write queries, perform joins, use aggregate functions, and optimize basic SQL queries.
- Systems and Infrastructure: Knowledge of Linux/Unix systems, as most infrastructure is based on Linux. Familiarity with system internals, networking, and resource management (memory, CPU, storage).
- Hands-on experience with observability tools such as Prometheus, Grafana, and OpenTSDB for monitoring, logging, and real-time performance tracking.
- CI/CD Tools: Familiarity with Continuous Integration/Continuous Deployment pipelines and tools (e.g., Jenkins, GitLab CI).
- Hands-on experience with distributed processing frameworks like Apache Spark and Apache Flink: Expertise in using these frameworks for large-scale data processing and stream processing.
Responsibilities
- End-to-End Service Lifecycle Management: Participate in and continuously improve the full lifecycle of services, from initial design and development to deployment, ongoing operation, and iterative optimization.
- Ensure Reliability and Scalability: Maintain highly reliable, fault-tolerant, and scalable systems that are both cost-effective and efficient, ensuring data, services, and infrastructure meet business needs.
- Performance Troubleshooting: Diagnose and resolve performance issues, including slow queries, resource contention, and bottlenecks across distributed storage engines and services.
- Cluster Scaling and Data Growth: Plan and implement strategies for scaling clusters effectively to accommodate increasing data volume while optimizing performance and cost-efficiency.
- Documentation and Incident Response: Develop and maintain clear runbooks and Standard Operating Procedures (SOPs), and lead sustainable, blameless incident response practices with post-incident analysis to drive continuous improvement.
- Big Data System Design: Architect and implement robust, scalable, and extensible big data systems that support the core business and products, ensuring seamless data flow and system integration.
- On-Call Rotation: Participate in on-call rotations for production incidents, ensuring critical issues are addressed swiftly, with availability to troubleshoot and resolve problems outside of regular business hours as needed.
Other
- To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- As a condition of employment, all successful candidates must be able to establish authorization to work in the United States.
- For this position, the Company does not provide sponsorship for any immigration-related benefits.
- This role requires the ability to work with and support systems designed to protect sensitive data and information. As such, this role will be subject to strict national security-related screening.