Tutorial on AI Alignment
Agenda
8:30 - 9:00am | Registration & Light Breakfast
9:00 - 10:30am | Session I: Safety Vulnerabilities of Current Frontier Models
10:30 - 10:45am | Break
10:45am - 12:15pm | Session II: Methodologies for AI Alignment
This tutorial is fully self-contained. All necessary background on alignment and jailbreaking will be covered, and no prior knowledge of language models is assumed. Anyone interested in AI alignment should feel free to attend. Please register HERE to attend in person, as space is limited.
Session I: Safety Vulnerabilities of Current Frontier Models
In recent years, large language models have been used to solve a multitude of natural language tasks. In the first part of the tutorial, we start by giving a brief overview of the history of language modeling and the fundamental techniques that led to the development of the modern language models behind Claude, Gemini, GPT, and Llama. We then dive into the safety failure modes of the current frontier models. Specifically, we will explain that, despite efforts to align large language models (LLMs) with human intentions, popular LLMs are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. We review the current state of the jailbreaking literature, including new questions about robust generalization, discussions of open-box and black-box attacks on LLMs, defenses against jailbreaking attacks, and a new leaderboard to evaluate the robust generalization of production LLMs.
The first session focuses mostly on the safety vulnerabilities of frontier LLMs. In the second session, we turn to the current methodologies that aim to mitigate these vulnerabilities and, more generally, to align language models with human standards.
Session II: Methodologies for AI Alignment
The second part of the tutorial focuses on AI alignment techniques and is structured in three segments. In the first segment, we examine black-box techniques for aligning models towards various goals (e.g., safety), such as controlled decoding and the best-of-N algorithm. In the second segment, we also consider efficiency, examining information-theoretic techniques designed to improve inference latency, such as model compression and speculative decoding. If time permits, in the final segment we discuss inference-aware alignment, a framework for aligning models to work better with inference-time compute algorithms.
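As a small illustration of the flavor of black-box alignment techniques covered in this session, the sketch below shows the best-of-N idea: draw N candidate responses from a model and keep the one a reward model scores highest. The sampler and reward function here are hypothetical toy stand-ins, not part of the tutorial materials; in practice they would call an LLM and a trained reward model.

```python
import random

# Hypothetical stand-in for an LLM sampler (illustration only).
def sample_response(prompt: str, rng: random.Random) -> str:
    drafts = [f"{prompt} -> draft {i}" for i in range(10)]
    return rng.choice(drafts)

# Hypothetical stand-in for a reward model: scores the trailing draft index.
def reward(prompt: str, response: str) -> float:
    return float(response[-1])

def best_of_n(prompt: str, n: int = 8, seed: int = 0) -> str:
    """Best-of-N: sample n candidates, return the highest-reward one."""
    rng = random.Random(seed)
    candidates = [sample_response(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda r: reward(prompt, r))
```

Because it only needs samples and scores, best-of-N treats the model entirely as a black box, which is why it appears alongside controlled decoding in this segment.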
Date & Time
Thursday, March 6, 2025
8:30am - 12:15pm
Venue
Halıcıoğlu Data Science Institute Room 123
University of California, San Diego
3234 Matthews Lane
La Jolla, CA 92093
Zoom: https://ucsd.zoom.us/j/8829143368
Registration
Registration is free but required for in-person attendance as space is limited. Register HERE.
Contact
Contact tilos@ucsd.edu with questions.
Presenters
Ahmad Beirami is a research scientist at Google DeepMind, leading new research initiatives on post-training within the Gen AI unit. At Google Research, he led a research team on building safe, helpful, and scalable generative language models. At Meta AI, he led research to power the next generation of virtual digital assistants with AR/VR capabilities through robust generative language modeling. At Electronic Arts, he led the AI agent research program for automated playtesting of video games and cooperative reinforcement learning. Before moving to industry, he held a joint postdoctoral fellow position at Harvard & MIT, focused on problems at the intersection of core machine learning and information theory. He is the recipient of the 2015 Sigma Xi Best PhD Thesis Award from Georgia Tech.
Hamed Hassani is an Associate Professor in the Department of Electrical and Systems Engineering at the University of Pennsylvania, and a member of the TILOS Foundations team. He holds secondary appointments in the Department of Computer and Information Science and the Department of Statistics and Data Science at the Wharton School. Before joining Penn, Hamed was a research fellow in the Foundations of Machine Learning program at the Simons Institute for the Theory of Computing at UC Berkeley. Prior to that, he was a postdoctoral scholar and lecturer at the Institute for Machine Learning at ETH Zürich. Hamed earned his Ph.D. in Computer and Communication Sciences from EPFL. His research interests span machine learning, optimization, information theory, and their applications in real-world systems.