Data Ingestion Site Reliability Engineer - Data Platform

TikTok

$118,657 - $259,200

Sep 20, 2025

New York, NY, USA

TikTok is seeking an engineer to ensure the reliability, fault tolerance, scalability, and cost-effectiveness of the large-scale data platforms and infrastructure that support the TikTok app.

Requirements

Experience writing code in Java, Scala, Go, Python, or a similar language.
Strong scripting skills (e.g., Bash and Shell) for automation tasks.
Experience with algorithms, data structures, complexity analysis, and software design: Solid understanding of how to build scalable and efficient systems.
Basic SQL (MySQL, PostgreSQL, or similar): Strong understanding of traditional relational databases like MySQL or PostgreSQL.
Systems and Infrastructure: Knowledge of Linux/Unix systems, as most infrastructure is based on Linux.
Hands-on experience with observability tools such as Prometheus, Grafana, and OpenTSDB for monitoring, logging, and real-time performance tracking.
CI/CD Tools: Familiarity with Continuous Integration/Continuous Deployment pipelines and tools (e.g., Jenkins, GitLab CI).

Responsibilities

End-to-End Service Lifecycle Management: Participate in and continuously improve the full lifecycle of services, from initial design and development to deployment, ongoing operation, and iterative optimization.
Ensure Reliability and Scalability: Maintain highly reliable, fault-tolerant, and scalable systems that are both cost-effective and efficient, ensuring data, services, and infrastructure meet business needs.
Performance Troubleshooting: Diagnose and resolve performance issues, including slow queries, resource contention, and bottlenecks across distributed storage engines and services.
Cluster Scaling and Data Growth: Plan and implement strategies for scaling clusters effectively to accommodate increasing data volume while optimizing performance and cost-efficiency.
Documentation and Incident Response: Develop and maintain clear runbooks and Standard Operating Procedures (SOPs), and lead sustainable, blameless incident response practices with post-incident analysis to drive continuous improvement.
Big Data System Design: Architect and implement robust, scalable, and extensible big data systems that support the core business and products, ensuring seamless data flow and system integration.
On-Call Rotation: Participate in on-call rotations for production incidents, ensuring critical issues are addressed swiftly, with availability to troubleshoot and resolve problems outside of regular business hours as needed.

Other

Bachelor's degree in Computer Science, a related technical field involving software or systems engineering, or equivalent practical experience.
Candidates for this position must be legally authorized to work in the United States.
This position is not eligible for visa sponsorship or support.
Hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department.
Ability to work with and support systems designed to protect sensitive data and information.