Lilly is looking to advance its pipeline by designing critical algorithms and workflows that expedite the creation of transformative therapies, developing large-scale pre-trained models in a decentralized, privacy-preserving manner.
Requirements
- Experience in developing statistical and machine learning models for complex endpoints.
Responsibilities
- Design and develop novel deep learning architectures (e.g., Transformer, Graph Neural Network-based) for large-scale, federated pre-training on unlabeled or partially labeled data distributed across multiple sources.
- Implement and advance state-of-the-art semi-supervised and self-supervised learning algorithms (e.g., contrastive learning, masked autoencoding) tailored to the unique constraints of federated learning, such as communication bottlenecks and data heterogeneity (a minimal masked-autoencoding sketch follows this list).
- Develop and implement robust, communication-efficient federated aggregation strategies (e.g., FedAvg, FedProx, SCAFFOLD) that remain stable for large, complex models and can handle non-IID (not independently and identically distributed) data (see the FedAvg sketch after this list).
- Create efficient and effective protocols for fine-tuning and adapting the pre-trained federated foundation models for a wide range of specific downstream tasks, ensuring knowledge transfer while maintaining privacy.
- Collaborate with data engineering teams to establish pipelines for accessing and simulating distributed datasets, and develop high-fidelity simulation environments to test, debug, and benchmark federated pre-training strategies before real-world deployment (an illustrative non-IID partitioning sketch follows this list).
- Profile, analyze, and optimize the computational performance (e.g., memory, latency, communication cost) of federated training and inference to ensure scalability to a large number of clients and massive datasets.
- Author high-impact research papers for publication in top-tier machine learning conferences (e.g., NeurIPS, ICML, ICLR) and relevant scientific journals.
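For illustration only, the sketch below shows one way a masked-autoencoding pre-training step can look: a fraction of input features is hidden, an encoder-decoder reconstructs the full input, and the loss is scored only on the masked positions. The module, dimensions, and masking ratio are hypothetical and not taken from any Lilly pipeline.

```python
# Hedged sketch of a masked-autoencoding pre-training step (assumed setup).
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x, mask):
        # Zero out masked features before encoding; reconstruct the full input.
        return self.decoder(self.encoder(x * (~mask).float()))

def masked_reconstruction_loss(model, x, mask_ratio=0.4):
    mask = torch.rand_like(x) < mask_ratio   # True where a feature is hidden
    recon = model(x, mask)
    # Score the reconstruction only on the masked (hidden) positions.
    return ((recon - x)[mask] ** 2).mean()

model = TinyMaskedAutoencoder()
x = torch.randn(16, 32)                      # a batch of unlabeled examples
loss = masked_reconstruction_loss(model, x)
loss.backward()
```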
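As a hedged reference point for the aggregation work described above, the sketch below implements the core FedAvg step: a sample-size-weighted average of locally trained parameters. The `client_updates` structure and the toy clients are assumptions for illustration, not a prescribed interface.

```python
# Minimal FedAvg aggregation sketch (illustrative only).
import numpy as np

def fedavg_aggregate(client_updates):
    """client_updates: list of (params_dict, num_examples) tuples,
    where params_dict maps parameter names to numpy arrays."""
    total_examples = sum(n for _, n in client_updates)
    aggregated = {}
    for name in client_updates[0][0]:
        # Weight each client's parameters by its share of the training data.
        aggregated[name] = sum(
            params[name] * (n / total_examples) for params, n in client_updates
        )
    return aggregated

# Example: three simulated clients with non-IID sample counts.
rng = np.random.default_rng(0)
clients = [({"w": rng.normal(size=4), "b": rng.normal(size=1)}, n)
           for n in (100, 30, 500)]
global_params = fedavg_aggregate(clients)
```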
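Finally, a common way to simulate data heterogeneity when benchmarking federated strategies is to carve one pooled dataset into skewed per-client shards with a Dirichlet prior over labels. The sketch below is a minimal, assumed setup (the function name, client count, and alpha are illustrative), not an actual simulation environment.

```python
# Illustrative non-IID client partition for federated simulation (assumed setup).
import numpy as np

def dirichlet_partition(labels, num_clients=5, alpha=0.3, seed=0):
    """Return one index array per simulated client.
    Smaller alpha -> more skewed (more heterogeneous) label distributions."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, shard in enumerate(np.split(cls_idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return [np.array(idx) for idx in client_indices]

# Example: 1,000 samples across 4 classes, split across 5 simulated clients.
labels = np.random.default_rng(1).integers(0, 4, size=1000)
shards = dirichlet_partition(labels)
print([len(s) for s in shards])
```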
Other
- Plays an essential leadership role, responsible for identifying, assessing, and implementing cutting-edge algorithmic solutions that leverage diverse datasets while ensuring data privacy and security for our partners.
- PhD in a data science field such as Biostatistics, Statistics, Machine Learning, Computational Biology, Computational Chemistry, Physics, Applied Mathematics, or a related field from an accredited college or university.
- Minimum of 2 years of experience in the biopharmaceutical industry or related fields, with demonstrated expertise in drug discovery and early development.
- Exceptional interpersonal and communication skills, with a keen ability to understand, empathize with, and navigate complex relationships and dynamics.
- Highly self-motivated and organized.