At U.S. Bank, we are committed to leveraging industry-leading technology to enhance our financial services. Our goal is to empower customers and businesses with advanced tools for smarter financial decision-making and to support community growth through innovative solutions.
Requirements
- Proficiency in Linux, clustering, and distributed systems.
- Expertise in GPU monitoring, GPU scheduling, CUDA, algorithmic optimization, and parallel computing.
- Proficiency in languages and tools such as shell, Ansible, C/C++, Golang, Java, and Python for automating workflows, deployments, and monitoring.
- Deep understanding of deep learning, computer vision, LLMs, vector databases, and AI platforms (e.g., PyTorch, Hugging Face).
- Experience in creating and maintaining documentation for system configurations, operational procedures, and troubleshooting knowledge bases.
- Hands-on experience with Hadoop, Hive, and Spark, and with migrating big data workloads to Azure cloud services.
- Experience in cloud-native architectures (Azure/AWS), Kubernetes, Docker, and containerized ML workflows.
Responsibilities
- Design and implement scalable, high-performance architectures for machine learning, data science, and AI workflows.
- Develop programs to automate workflows, deployments, and monitoring for improved operational efficiency.
- Optimize GPU utilization and performance using CUDA and algorithmic optimization to accelerate machine learning workloads.
- Diagnose and resolve issues related to Linux servers, networks, GPUs, cluster health, job failures, and performance bottlenecks to ensure system reliability and efficiency.
- Drive performance improvements across ML pipelines, inference engines, model serving frameworks (e.g., Triton, vLLM), and data processing layers.
- Develop and maintain reliable AI workflows with robust failover mechanisms to ensure uninterrupted operations in production environments.
- Implement enterprise-grade IAM authentication and authorization with Azure SSO, OIDC, Kerberos, and Active Directory for ML APIs, model endpoints, and data platforms.
Other
- Advanced degree in Computer Science, Engineering, or related field.
- 10+ years of hands-on experience in systems engineering, ML infrastructure, and performance optimization.
- Strong problem-solving skills and the ability to diagnose and resolve system failures and performance bottlenecks.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Proven track record delivering enterprise-scale AI product portfolios (preferably in top-tier financial services), including successful AI transformations in highly regulated environments.