TILOS Seminar: Optimal Quantization for LLMs and Matrix Multiplication

11am PDT | Friday, May 23, 2025

Yury Polyanskiy, MIT

 
Abstract: The main building block of large language models is matrix multiplication, which is often bottlenecked by the speed of loading these matrices from memory. A number of recent quantization algorithms (SmoothQuant, GPTQ, QuIP, SpinQuant, etc.) address this issue by storing matrices in lower precision. We derive the optimal asymptotic information-theoretic tradeoff between the accuracy of the matrix product and the compression rate (number of bits per matrix entry). We also show that a non-asymptotic version of our construction (based on nested Gosset lattices and Conway-Sloane decoding), which we call NestQuant, reduces perplexity deterioration almost three-fold compared to state-of-the-art algorithms (as measured on Llama-2 and Llama-3 models with 8B to 70B parameters). Based on joint work with Or Ordentlich (HUJI), Eitan Porat, and Semyon Savkin (MIT EECS).
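To make the rate-accuracy tradeoff described above concrete, the following minimal Python sketch quantizes two matrices to b bits per entry and measures the resulting error in their product. It uses plain round-to-nearest uniform quantization purely for illustration; the matrix sizes, bit widths, and quantization scheme are assumptions of this sketch, not the NestQuant lattice construction presented in the talk.

# Toy illustration of the bits-per-entry vs. product-accuracy tradeoff.
# Assumes NumPy; not the NestQuant algorithm.
import numpy as np

def quantize(W, bits):
    """Round-to-nearest uniform quantization of W to 2**bits levels."""
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((W - lo) / scale)   # integer codes in [0, 2**bits - 1]
    return codes * scale + lo            # dequantized approximation of W

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
exact = A @ B

for bits in (2, 4, 8):
    approx = quantize(A, bits) @ quantize(B, bits)
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"{bits} bits/entry: relative product error {rel_err:.4f}")

As expected, the relative error of the matrix product shrinks as the number of bits per entry grows; the talk characterizes the optimal form of this tradeoff and a practical construction approaching it.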


Yury Polyanskiy is the Cutten Professor of Electrical Engineering and Computer Science, a member of IDSS and LIDS at MIT, and an IEEE Fellow (2024). He received the M.S. degree in applied mathematics and physics from the Moscow Institute of Physics and Technology in 2005 and the Ph.D. degree in electrical engineering from Princeton University in 2010. His research interests span information theory, machine learning, and statistics. Dr. Polyanskiy won the 2020 IEEE Information Theory Society James Massey Award, the 2013 NSF CAREER Award, and the 2011 IEEE Information Theory Society Paper Award.

Local Time

  • Timezone: America/New_York
  • Date: 23 May 2025
  • Time: 14:00 - 15:00

Location

HDSI 123 and Virtual
3234 Matthews Ln, La Jolla, CA 92093

Organizer

TILOS

Speaker

Yury Polyanskiy, MIT