Tutorial on AI Alignment
Agenda
8:30 - 9:00am | Registration & Light Breakfast
9:00 - 10:30am | Session I: Safety Vulnerabilities of Current Frontier Models
10:30 - 10:45am | Break
10:45am - 12:15pm | Session II: Methodologies for AI Alignment
This tutorial is fully self-contained. All necessary background on alignment and jailbreaking will be covered, and no prior knowledge of language models is assumed. Anyone interested in AI alignment should feel free to attend. Please register HERE to attend in person, as space is limited.
Session I: Safety Vulnerabilities of Current Frontier Models
In recent years, large language models have been used to solve a multitude of natural language tasks. In the first part of the tutorial, we start by giving a brief overview of the history of language modeling and the fundamental techniques that led to the development of the modern language models behind Claude, Gemini, GPT, and Llama. We then dive into the safety failure modes of the current frontier models. Specifically, we will explain that, despite efforts to align large language models (LLMs) with human intentions, popular LLMs are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. We review the current state of the jailbreaking literature, including new questions about robust generalization, discussions of open-box and black-box attacks on LLMs, defenses against jailbreaking attacks, and a new leaderboard to evaluate the robust generalization of production LLMs.
The first session focuses mostly on the safety vulnerabilities of frontier LLMs. In the second session, we turn to the current methodologies that aim to mitigate these vulnerabilities and, more generally, to align language models with human standards.
Session II: Methodologies for AI Alignment
The second part of the tutorial focuses on AI alignment techniques and is structured in three segments. In the first segment, we examine black-box techniques for aligning models towards various goals (e.g., safety), such as controlled decoding and the best-of-N algorithm. In the second segment, we also consider efficiency, examining information-theoretic techniques designed to improve inference latency, such as model compression and speculative decoding. If time permits, in the final segment we discuss inference-aware alignment, a framework for aligning models to work better with inference-time compute algorithms.
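As a small illustration of the flavor of black-box alignment techniques covered in this session, the sketch below shows the best-of-N idea: draw N candidate responses from a model and keep the one a reward model scores highest. The sampler and reward function here are hypothetical toy stand-ins, not part of the tutorial materials; in practice they would call an LLM and a trained reward model.

```python
import random

# Hypothetical stand-in for an LLM sampler (illustration only).
def sample_response(prompt: str, rng: random.Random) -> str:
    drafts = [f"{prompt} -> draft {i}" for i in range(10)]
    return rng.choice(drafts)

# Hypothetical stand-in for a reward model: scores the trailing draft index.
def reward(prompt: str, response: str) -> float:
    return float(response[-1])

def best_of_n(prompt: str, n: int = 8, seed: int = 0) -> str:
    """Best-of-N: sample n candidates, return the highest-reward one."""
    rng = random.Random(seed)
    candidates = [sample_response(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda r: reward(prompt, r))
```

Because it only needs samples and scores, best-of-N treats the model entirely as a black box, which is why it appears alongside controlled decoding in this segment.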
Date & Time
Thursday, March 6, 2025
8:30am - 12:15pm
Venue
Halıcıoğlu Data Science Institute Room 123
University of California, San Diego
3234 Matthews Lane
La Jolla, CA 92093
Zoom: https://ucsd.zoom.us/j/8829143368
Registration
Registration is free but required for in-person attendance as space is limited. Register HERE.
Contact
Contact tilos@ucsd.edu with questions.
Presenters
Ahmad Beirami is a research scientist at Google DeepMind, leading new research initiatives on post-training within the Gen AI unit. At Google Research, he led a research team on building safe, helpful, and scalable generative language models. At Meta AI, he led research to power the next generation of virtual digital assistants with AR/VR capabilities through robust generative language modeling. At Electronic Arts, he led the AI agent research program for automated playtesting of video games and cooperative reinforcement learning. Before moving to industry, he held a joint postdoctoral fellow position at Harvard & MIT, focused on problems at the intersection of core machine learning and information theory. He is the recipient of the 2015 Sigma Xi Best PhD Thesis Award from Georgia Tech.
Hamed Hassani is an Associate Professor in the Department of Electrical and Systems Engineering at the University of Pennsylvania, and a member of the TILOS Foundations team. He holds secondary appointments in the Department of Computer and Information Science and the Department of Statistics and Data Science at the Wharton School. Before joining Penn, Hamed was a research fellow in the Foundations of Machine Learning program at the Simons Institute for the Theory of Computing at UC Berkeley. Prior to that, he was a postdoctoral scholar and lecturer at the Institute for Machine Learning at ETH Zürich. Hamed earned his Ph.D. in Computer and Communication Sciences from EPFL. His research interests span machine learning, optimization, information theory, and their applications in real-world systems.