Design, develop, and implement AI-driven solutions that improve the reliability, performance, and efficiency of critical IT and business systems, using the core AI platform to build sophisticated AIOps capabilities.
Requirements
- 10+ years of experience in software engineering, reliability engineering, or IT operations, including at least 5 years leading the design and implementation of AIOps solutions at scale.
- Proven expertise in applying machine learning algorithms and data analysis techniques to solve complex IT operational challenges.
- Strong hands-on experience in building and maintaining scalable data pipelines and workflows for efficient data collection, processing, and analysis from diverse IT sources.
- Proficiency in programming languages such as Python, Go, Java, or Scala.
- Extensive experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with data processing frameworks (e.g., Apache Kafka, Apache Spark) and IT monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk).
- Deep understanding of distributed systems architecture, microservices, and their operational challenges.
Responsibilities
- Design, develop, and implement advanced AIOps solutions, leveraging machine learning algorithms and data analytics to automate and enhance IT operations.
- Lead the implementation of AI/ML models for proactive anomaly detection, root cause analysis, and predictive insights into system health and performance across applications and infrastructure at enterprise scale.
- Drive the automation of routine operational tasks, incident response, and remediation workflows using AI-driven agents and orchestration tools, minimizing manual intervention and improving operational efficiency.
- Collaborate with observability teams to ensure the efficient collection, processing, and transformation of high-volume, cross-domain data from diverse sources (events, logs, metrics, tickets, monitoring tools) into actionable intelligence for the AIOps platform.
- Integrate AIOps insights with existing incident management systems, providing real-time intelligence to rapidly identify, diagnose, and resolve IT issues, enabling proactive resolution and reducing mean time to recovery (MTTR).
- Utilize AI insights to continuously monitor, analyze, and fine-tune IT systems for peak operational efficiency, capacity planning, and resource optimization.
- Provide technical leadership and mentorship to other engineers, promoting architectural excellence, innovation, and best practices in AIOps development and operations.
Other
- Partner with data scientists, ML engineers, software engineers, SREs, and IT operations teams to integrate AI/ML agents into the platform and ensure AIOps solutions align with business needs and deliver measurable ROI.
- Actively research and evaluate emerging AIOps technologies, including generative AI, LLMs, AI-driven ChatOps, and advanced retrieval-augmented generation (RAG) techniques, bringing promising innovations into production through proofs of concept (POCs) and long-term architectural evolution.
- Demonstrated ability to translate business requirements and operational pain points into technical specifications and deliver robust AIOps solutions.
- Excellent problem-solving skills and the ability to troubleshoot complex platform-related issues.
- Strong communication and interpersonal skills, with a track record of influencing technical and cross-functional stakeholders.
- Master's degree or Ph.D. in Computer Science, Machine Learning, or a related technical field, or equivalent military experience, required.
- Experience with agentic systems and AI agents for automation.
- Experience with DevOps practices and CI/CD pipelines in an AIOps context.
- Prior experience in cybersecurity operations or building AIOps solutions for security threat detection and response.