Analog Devices (ADI) aims to deliver a world-class AI/ML developer experience for its software engineers and data scientists by establishing a global XOps team. This role designs and optimizes complete systems, resolves technical issues, and leads development of major ML/AI operational features that improve the developer experience across infrastructure, pipelines, deployment, monitoring, governance, and cost/risk optimization.
Requirements
- Expert in infrastructure-as-code and GitOps practices, with demonstrable skills in Terraform, AWS CDK (Python), Argo CD and/or other IaC and CI/CD systems.
- Hands-on experience managing Kubernetes clusters (for ML workloads) and designing/implementing ML workflow orchestration solutions and data pipelines (e.g., Argo, Kubeflow, Airflow).
- Solid understanding of foundation models (LLMs) and their applications in enterprise ML/AI solutions.
- Strong background in AWS DevOps practices and cloud architecture, including AWS services such as Bedrock, SageMaker, EC2, S3, RDS, Lambda, and managed MLflow. Hands-on design and implementation of microservices architectures, APIs, and database management (SQL/NoSQL).
- Proven track record of monitoring and optimizing cloud/ML infrastructure for scalability and cost-efficiency.
- Deep understanding of the Data Science Lifecycle (DSLC) and proven ability to shepherd data science or ML/AI projects from inception through production within a platform architecture.
- Expertise in feature stores, model registries, and model governance and compliance frameworks specific to ML/AI (e.g., explainability, audit trails).
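To illustrate the governance expectations above (model registries, audit trails), here is a minimal sketch of an append-only registry entry in plain Python. The class, field names, and promotion stages are hypothetical illustrations, not an ADI, MLflow, or SageMaker API:

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    """One registry entry; every state change is appended to an audit trail."""
    name: str
    version: str
    stage: str = "staging"  # illustrative lifecycle: staging -> production -> archived
    audit_trail: list = field(default_factory=list)

    def _log(self, actor: str, action: str) -> None:
        # Append-only audit entry: who did what, and when (UTC).
        self.audit_trail.append({
            "actor": actor,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def promote(self, actor: str, stage: str) -> None:
        self._log(actor, f"promote {self.stage} -> {stage}")
        self.stage = stage

    def fingerprint(self) -> str:
        # Stable hash of the record (minus the trail) for tamper checks.
        payload = {k: v for k, v in asdict(self).items() if k != "audit_trail"}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

record = ModelRecord(name="demand-forecast", version="1.4.0")
record.promote(actor="alice", stage="production")
```

The design point this sketch makes is that governance artifacts (who promoted what, and when) should be recorded as immutable data alongside the model, so compliance reviews and rollbacks do not depend on tribal knowledge.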
Responsibilities
- Design and implement resilient cloud-based ML/AI operational capabilities that advance our system attributes: learnability, flexibility, extensibility, interoperability, and scalability.
- Architect and implement scalable AWS ML/AI cloud infrastructure to support end-to-end lifecycle of models, agents, and services.
- Establish governance frameworks for ML/AI infrastructure management (e.g., provisioning, monitoring, drift detection, lifecycle management) and ensure compliance with industry-standard processes.
- Define and ensure principled validation pathways (testing, QA, evaluation) for early-stage GenAI/LLM/Agent-based proofs of concept across the organization.
- Lead and provide guidance on Kubernetes (k8s) cluster management for ML workflows, including choosing/implementing workflow orchestration solutions (e.g., Argo vs Kubeflow) and data-pipeline creation, management, and governance using tools such as Airflow.
- Design and develop infrastructure-as-code (IaC) using AWS CDK (Python) and/or Terraform, combined with GitOps, to automate infrastructure deployment and management.
- Monitor, analyze, and optimize cloud infrastructure and ML/AI model workloads for scalability, cost-efficiency, reliability, and performance.
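As one concrete instance of the monitoring and drift-detection responsibilities above, here is a minimal sketch of a Population Stability Index (PSI) check comparing a training-time feature distribution against live traffic. The bin edges, sample data, and the common 0.2 alert threshold are illustrative assumptions, not a prescribed ADI implementation:

```python
import math
from bisect import bisect_right

def histogram(values, edges):
    """Fraction of values per bin defined by sorted interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    # Small floor avoids log(0) / division by zero for empty bins.
    return [max(c / total, 1e-6) for c in counts]

def psi(expected, observed, edges):
    """Population Stability Index; > 0.2 is a common 'significant drift' rule of thumb."""
    e, o = histogram(expected, edges), histogram(observed, edges)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))

baseline = [0.1 * i for i in range(100)]        # training-time distribution
shifted = [0.1 * i + 4.0 for i in range(100)]   # live traffic, shifted upward
edges = [2.5, 5.0, 7.5]                         # illustrative interior bin edges

score = psi(baseline, shifted, edges)           # well above 0.2 for this shifted sample
```

In practice a check like this would run on a schedule (e.g., as an Airflow or Argo task) and page the owning team when the score crosses the agreed threshold, rather than being invoked by hand.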
Other
- Foster and contribute to a culture of operational excellence: high-performance, mission-focused, interdisciplinary collaboration, trust, and shared growth.
- Drive proactive capability and process enhancements that create enduring value, compounding analytic returns, and operational maturity for the ML/AI platform.
- Excellent verbal and written communication skills — able to report findings, document designs, articulate trade-offs and influence cross-functional stakeholders.
- Demonstrated ability to manage large-scale, complex projects across an organization, and lead development of major features with broad impact.
- Customer-obsessed mindset and a passion for building products that solve real-world problems, combined with strong organization, diligence, and the ability to juggle multiple initiatives and deadlines.