Tutorial on AI Alignment (part 1 of 2): Safety Vulnerabilities of Current Frontier Models

Ahmad Beirami, Google DeepMind
Hamed Hassani, University of Pennsylvania

In recent years, large language models (LLMs) have been used to solve a multitude of natural language tasks. In the first part of the tutorial, we start with a brief overview of the history of language modeling and the fundamental techniques that led to the modern language models behind Claude, Gemini, GPT, and Llama. We then dive into the safety failure modes of current frontier models. Specifically, we will explain that, despite efforts to align LLMs with human intentions, popular LLMs remain susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. We review the current state of the jailbreaking literature, including new questions about robust generalization, white-box and black-box attacks on LLMs, defenses against jailbreaking attacks, and a new leaderboard for evaluating the robust generalization of production LLMs.

The first session focuses mostly on the safety vulnerabilities of frontier LLMs. The second session covers current methodologies that aim to mitigate these vulnerabilities and, more generally, to align language models with human standards.

