The Frontline Engineering team at TCGplayer (part of eBay) plays a pivotal role in ensuring the reliability, availability, and seamless performance of our platform, which serves millions of buyers and sellers globally within the $25B collectible hobbyist space. In this role, you'll be the first line of defense for incident response and problem management, with a direct impact on customer trust and satisfaction.
Requirements
- Direct experience as an incident commander, including managing live incident calls, coordinating triage efforts, and driving communications during high-pressure situations.
- Hands-on operational experience with AWS in a production environment, specifically executing runbooks, restarting EC2 instances, checking alarms, and pulling logs from CloudWatch.
- Proficiency with Kubernetes, including troubleshooting containerized workloads, understanding pod health, managing deployments, and interacting directly with Kubernetes clusters.
- Experience with scripting (Python, PowerShell, or Bash) to automate operational tasks or assist in incident resolution workflows.
Responsibilities
- Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents to minimize downtime and business disruption.
- Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services (EC2, CloudWatch, IAM) and Kubernetes clusters (pods, deployments, scaling).
- Collaborate closely with engineering teams post-incident, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.
- Drive operational excellence by measuring and analyzing critical metrics (e.g., MTTR, SLA adherence) to identify improvement opportunities and implement impactful solutions.
- Continuously refine and update operational runbooks and procedures, ensuring alignment with evolving technologies and business needs.
- Proactively contribute to long-term strategic initiatives to improve incident management practices.
Other
- This position is fully remote, with a preference for candidates who work Eastern Standard Time (EST) or Central Standard Time (CST) hours.
- Participation in an on-call rotation and occasional off-hours support for incidents is required.
- A bachelor’s degree in a technical field, or equivalent experience (5+ years) in system administration, infrastructure engineering, or related roles. Relevant certifications are a plus.
- Strong communication skills with the ability to clearly articulate technical details and strategies to both technical and non-technical stakeholders.
- Excellent problem-solving skills, with the ability to stay composed and decisive under pressure during high-impact incidents.