EPAM Georgia is seeking a Generative AI Operations (GenAI Ops) Engineer to build, deploy, and maintain the operational infrastructure for cutting-edge generative AI models and services. The role ensures that these models and services are scalable, reliable, and efficient across major cloud platforms.
Requirements
- Proven experience with cloud services from major providers like AWS, Google Cloud, or Azure
- Strong experience building and managing CI/CD pipelines using tools like Jenkins, GitLab CI, or cloud-native services
- Proficiency in at least one scripting language (e.g., Python, Bash)
- Hands-on experience with Infrastructure as Code (IaC) tools such as AWS CDK, CloudFormation, or Terraform
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Experience with cloud-native GenAI services like Amazon Bedrock, Azure AI Foundry, or Google Vertex AI
- Familiarity with the architecture and operational challenges of Large Language Models (LLMs)
Responsibilities
- Build and Manage CI/CD Pipelines: Design, implement, and maintain robust, automated CI/CD pipelines for training, evaluating, and deploying large language models (LLMs) and AI agents
- Orchestrate Agentic AI Workflows: Design, deploy, and manage sophisticated multi-agent systems. Ensure seamless Agent-to-Agent (A2A) communication and collaboration between specialized agents to automate complex business processes
- Manage Tool Integration: Implement and manage secure, scalable integrations between AI agents and external tools/APIs, leveraging open standards like the Model Context Protocol (MCP) to ensure interoperability
- Leverage AI-Powered Development: Utilize AI-powered development tools to accelerate the entire software development lifecycle, from writing infrastructure code and tests to troubleshooting operational issues in cloud environments
- Infrastructure as Code (IaC): Utilize cloud-native IaC services or cloud-agnostic tools like Terraform to define and manage the infrastructure required for GenAI workloads
- Model Monitoring and Observability: Implement comprehensive monitoring and logging solutions to track model and agent performance, resource utilization, and system health. For agentic systems, this includes tracing each agent's actions and logging the multi-step conversational flow
- Scalability and Performance Optimization: Design and implement scalable architectures for model serving and inference. Continuously optimize the performance and cost-effectiveness of our GenAI services
Other
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
- 3+ years of experience in a DevOps, SRE, or MLOps role with a focus on cloud infrastructure
- English communication skills at B2 level or higher
- Strong problem-solving skills and the ability to work effectively in a fast-paced, collaborative environment
- Participation in the Employee Stock Purchase Plan