Upbound is redefining how modern infrastructure is built by leading the shift toward agentic infrastructure. They are looking to build and operate Upbound Spaces, their multi-control-plane management software, scaling the platform to reliably support thousands of control planes and extending enterprise control plane management and operations.
Requirements
- Have experience operating production cloud services at scale: monitoring, alerting, incident response, post-mortems, and continuous improvement of service reliability.
- Have strong debugging skills across distributed systems, including experience with observability tools (Prometheus, Grafana, OpenTelemetry, distributed tracing) and techniques for diagnosing issues in production environments.
- Have experience building and operating controllers that interact with the Kubernetes API server, including troubleshooting reconciliation loops, managing API rate limits, and optimizing controller performance.
- Are comfortable working directly with customers to understand, reproduce, and resolve complex technical issues in their environments.
- Can build and maintain operational tooling for debugging customer environments, analyzing control plane health, and automating incident response.
Responsibilities
- Actively build and operate Upbound Spaces in production, troubleshooting and resolving issues across multi-tenant SaaS environments, as well as contributing to Upbound's open-source projects, including Crossplane.
- Take ownership of building features in high demand by Upbound's customers and deliver new functionality that will delight and amaze our users.
- Investigate and debug complex issues in customer environments, including multi-control-plane scenarios, resource reconciliation problems, and performance bottlenecks.
- Communicate through thoughtful and thorough design documents for new initiatives and detailed post-incident reviews that drive system improvements.
- Support the full project lifecycle for highly scalable and reliable services running in a cloud environment – discovery, analysis, architecture, design, review, documentation, building, migration, automation, deployment, production-readiness, and ongoing operational support.
- Write and maintain Go code that interfaces with the Kubernetes API, such as operators, controllers, and add-ons, with a focus on observability, debuggability, and operational excellence.
- Deploy, manage, and troubleshoot our Kubernetes services in production, using metrics, logs, and traces to identify and resolve issues quickly.
Other
- Take responsibility and ownership for solving problems even if they are outside your lane, especially during incidents affecting customer workloads.
- Demonstrate excellence in your work, constantly trying to improve your skills and the operational posture of the systems you build.
- Have empathy for customers and keep them in mind as you build solutions, understanding that reliability and debuggability are features.