
Data Ingestion Site Reliability Engineer - Data Platform

TikTok

$118,657 - $259,200
Sep 20, 2025
New York, NY, USA

TikTok is hiring an engineer to keep the large-scale data platforms and infrastructure behind the TikTok app reliable, fault-tolerant, scalable, and cost-effective.

Requirements

  • Experience writing code in Java, Scala, Go, Python, or a similar language.
  • Strong scripting skills (e.g., Bash or other shell scripting) for automation tasks.
  • Experience with algorithms, data structures, complexity analysis, and software design, with a solid understanding of how to build scalable and efficient systems.
  • Basic SQL: Strong understanding of traditional relational databases such as MySQL, PostgreSQL, or similar.
  • Systems and Infrastructure: Knowledge of Linux/Unix systems, as most infrastructure is based on Linux.
  • Hands-on experience with observability tools such as Prometheus, Grafana, and OpenTSDB for monitoring, logging, and real-time performance tracking.
  • CI/CD Tools: Familiarity with Continuous Integration/Continuous Deployment pipelines and tools (e.g., Jenkins, GitLab CI).

Responsibilities

  • End-to-End Service Lifecycle Management: Participate in and continuously improve the full lifecycle of services, from initial design and development to deployment, ongoing operation, and iterative optimization.
  • Ensure Reliability and Scalability: Maintain highly reliable, fault-tolerant, and scalable systems that are both cost-effective and efficient, ensuring data, services, and infrastructure meet business needs.
  • Performance Troubleshooting: Diagnose and resolve performance issues, including slow queries, resource contention, and bottlenecks across distributed storage engines and services.
  • Cluster Scaling and Data Growth: Plan and implement strategies for scaling clusters effectively to accommodate increasing data volume while optimizing performance and cost-efficiency.
  • Documentation and Incident Response: Develop and maintain clear runbooks and Standard Operating Procedures (SOPs), and lead sustainable, blameless incident response practices with post-incident analysis to drive continuous improvement.
  • Big Data System Design: Architect and implement robust, scalable, and extensible big data systems that support the core business and products, ensuring seamless data flow and system integration.
  • On-Call Rotation: Participate in on-call rotations for production incidents, ensuring critical issues are addressed swiftly, with availability to troubleshoot and resolve problems outside of regular business hours as needed.

Other

  • Bachelor's degree in Computer Science, a related technical field involving software or systems engineering, or equivalent practical experience.
  • Candidates for this position must be legally authorized to work in the United States.
  • This position is not eligible for visa sponsorship or support.
  • Hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department.
  • Ability to work with and support systems designed to protect sensitive data and information.