Date
Oct 4, 2024, 12:30 pm – 1:30 pm
Location
Bendheim House 103

Details

Event Description

Abstract: Language models (LMs) trained on large-scale internet data form the backbone of modern AI systems. While scaling pretraining data has been the primary focus of model development, this approach alone does not address several critical issues. For example, LMs often suffer from hallucinations, are difficult to update with new knowledge, and pose significant copyright and privacy risks. In this talk, I will explore factors important for building trustworthy LMs beyond data scaling. First, I will discuss how to detect the pretraining data of black-box LMs, taking steps toward data transparency. Next, I will present retrieval-augmented LMs, a family of models that maintain access to data during inference to enhance reliability. Finally, I will examine the impact of data ordering in pretraining datasets and its implications for model performance.

Bio: Weijia Shi is a Ph.D. student at the University of Washington. Her research focuses on LM pretraining and retrieval-augmented models. She also studies multimodal reasoning and investigates copyright and privacy risks associated with LMs. She won an Outstanding Paper Award at ACL 2024 and was named a machine learning rising star in 2023. She has co-organized multiple workshops, including Long-Context Foundation Models at ICML 2024 and Knowledge Augmented Methods for NLP at ACL 2024 and KDD 2023.