Enhance the ATP Cloud team by developing and optimizing backend services with a focus on scalability, reliability, and performance, using Google Cloud Platform (GCP) and exploring new AI infrastructure
Requirements
- Strong proficiency in Go or Python
- Solid knowledge of Linux with deep hands-on experience in Kubernetes and Docker
- Expert knowledge of Google Cloud Platform (GCP) and its suite of managed services (e.g., GKE, Pub/Sub, BigQuery, Dataflow)
- Solid knowledge of web servers/proxies such as Envoy, NGINX or HAProxy
- Comprehensive experience with SQL and NoSQL database technologies such as MySQL, PostgreSQL and Redis
- Hands-on experience with AI infrastructure and model serving frameworks on CPU/GPU, such as Ray Serve, vLLM, NVIDIA Triton Inference Server and TorchServe
- Familiarity with GPU/CPU resource management, autoscaling and performance tuning for inference workloads
Responsibilities
- Collaborate with product managers, cybersecurity researchers, AI application researchers and infrastructure software engineers
- Explore and integrate new AI technologies and infrastructures to advance our model serving capabilities
- Lead the design and implementation of standard workflows for performance monitoring and testing
- Work with PLM on new feature requirements
- Collaborate with cross-functional teams to address complex technical challenges and drive innovation
- Ensure the adoption of best practices in code quality, scalability and system design among team members
Other
- BS/MS in Computer Science or Computer Engineering, or equivalent military experience, required
- Exceptional problem-solving skills with the ability to operate in a fast-paced environment
- Excellent communication skills and strong teamwork