Ridgeline is working to scale reliability across its cloud-native platform by improving systems such as Health Manager, Incident Command, and its observability infrastructure, while also driving FinOps tooling and AI-assisted automation to reduce operational burden and surface critical insights.
Requirements
- 10+ years in a software engineering role or similar function, with experience operating large-scale, mission-critical systems
- Proficiency in one or more of: Kotlin, Java, JavaScript, Python
- Experience with observability platforms (e.g., Datadog, Prometheus) and monitoring best practices
- Strong familiarity with infrastructure-as-code tools (e.g., Terraform, CDKTF) and CI/CD systems
- Experience leading or participating in incident response and service ownership
- Experience deploying, monitoring, and maintaining multi-tenant architectures
- Familiarity with AI-assisted tooling or workflows is a plus, but not required
Responsibilities
- Build and evolve systems like Health Manager, Incident Command, and observability platforms that support zero-downtime deployments and operational readiness
- Partner with development and infrastructure teams to embed reliability into services and processes
- Participate in the SRE on-call rotation and lead incident response as needed
- Design metrics, tooling, and workflows that enable zero-downtime deployments, fast detection, and proactive issue resolution
- Develop and maintain FinOps tooling to drive cost visibility, usage transparency, and financially informed engineering decisions
- Lead incident triage and retrospectives with a blameless, data-driven approach
- Define observability signals that make system health visible, actionable, and reliable
Other
- You must be authorized to work in the United States without the need for employer sponsorship
- Foster an outcomes-focused team culture through honest communication, clarity, and accountability
- Think creatively, own problems, seek solutions, and communicate clearly along the way
- Contribute to a collaborative environment rooted in learning, teaching, and transparency
- Ability to work effectively across teams and communicate technical concepts with clarity
- Strong written and verbal communication skills, especially in facilitating incident response and working sessions with service teams
- Comfortable navigating ambiguity and working toward measurable outcomes
- Proven ability to balance individual contribution with cross-functional impact
- Experience or interest in FinOps, cost-aware system design, or cloud usage optimization is a plus
- Willingness to learn about cutting-edge technologies while cultivating expertise in a business domain or problem space
- An aptitude for problem solving
- Ability to communicate effectively
- A systems thinker who brings clarity and direction to complex, ambiguous environments
- A strong communicator who can model transparency, collaboration, and constructive disagreement
- An engineer who delivers—not just ideas, but real improvements that teams rely on
- Passionate about outcomes, not just effort—you prioritize what matters and follow through
- Committed to enabling others by reducing friction, building shared tooling, and simplifying operations
- Comfortable offering candid feedback and engaging in disagreement with respect and clarity—then committing fully once a decision is made, aligning with the team to drive results
- And finally—you have a serious interest in having fun at work