Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

The Cerebras AI cluster management software is a strategic product initiative intended to bring Cerebras's high-performance AI benefits to on-premises customers and sovereign/neo clouds. It is designed to simplify deployment and maintenance for platform operators, making it easier to manage complex AI infrastructure at scale.
Requirements
- 5+ years of product management experience, preferably in infrastructure observability and security domains.
- Expert knowledge of security tools such as IAM, IdP, SIEM, and key management systems.
- Expert knowledge of observability solutions such as Prometheus, Grafana, log management systems, and observability management systems.
- Familiarity with cluster orchestration tools and concepts (e.g., Kubernetes).
- Strong ability to think at the API and platform layers, designing solutions for operator workflows.
- Technical background (e.g., computer science, engineering) or the ability to engage deeply with engineering teams.
- Experience in enterprise software, cloud infrastructure, or AI/ML platforms.
Responsibilities
- Define and deliver a world-class cluster management experience with a focus on observability, management, monitoring, and security.
- Collaborate with engineering to design reliable, scalable solutions and APIs tailored to cluster operator workflows.
- Develop a deep understanding of cluster operator needs through user and market research.
- Communicate product updates and roadmap progress clearly to internal and external stakeholders.
Other
- Excellent communication and collaboration skills, with the ability to work effectively across diverse teams.
- Proven ability to excel in a fast-paced, dynamic environment.
- Understanding of security and authentication principles in software systems.
- Familiarity with monitoring, telemetry, and fault tolerance in distributed systems.