Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural network operations that are currently accelerated by GPU chips. The findings, detailed in a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.
Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. That ability momentarily made Nvidia the most valuable company in the world last week; the company currently holds an estimated 98 percent market share for data center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.
In the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.
The paper doesn’t provide power estimates for conventional LLMs, but a post from UC Santa Cruz estimates about 700 watts for a conventional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically run an LLM entirely in only 13 watts on an FPGA (with no GPU), that would be a roughly 38-fold decrease in power usage relative to that 500-watt figure.
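As a rough check on that arithmetic, the short Python sketch below divides each wattage quoted above by the 13-watt FPGA figure. The labels and the `fold_reduction` helper are illustrative names, not anything from the paper, and the inputs are the article's round numbers rather than measurements.

```python
# Wattage figures quoted in the article; the dictionary keys are
# illustrative labels, not terms from the paper.
FPGA_WATTS = 13.0

BASELINES_WATTS = {
    "UC Santa Cruz estimate": 700.0,
    "500-watt power supply": 500.0,
    "RTX 3060 peak draw": 200.0,
}


def fold_reduction(baseline_watts: float, target_watts: float) -> float:
    """Return how many times smaller target_watts is than baseline_watts."""
    return baseline_watts / target_watts


ratios = {
    label: fold_reduction(watts, FPGA_WATTS)
    for label, watts in BASELINES_WATTS.items()
}

for label, ratio in ratios.items():
    print(f"{label}: about {ratio:.1f}x")
# UC Santa Cruz estimate: about 53.8x
# 500-watt power supply: about 38.5x
# RTX 3060 peak draw: about 15.4x
```

Of the three baselines, only the 500-watt figure reproduces the 38-fold number.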
The technique has not yet been peer-reviewed, but the researchers (Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian) claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.
Doing away with matrix math
In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. The limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.
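To see why ternary weights matter here, note that when every weight is -1, 0, or +1, a matrix-vector product can be computed with additions and subtractions alone. The Python sketch below shows that general idea. It is not the paper's implementation, it does not cover the attention mechanism, and the function names `ternary_matvec` and `dense_matvec` are hypothetical.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix by a vector without multiplications.

    Assumes every entry of `weights` is -1, 0, or +1 (illustrative only).
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, value in zip(row, x):
            if w == 1:
                acc += value   # a +1 weight adds the input
            elif w == -1:
                acc -= value   # a -1 weight subtracts the input
            # a 0 weight contributes nothing, so the input is skipped
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference result using ordinary multiply-and-add."""
    return [sum(w * value for w, value in zip(row, x)) for row in weights]


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

result = ternary_matvec(W, x)
assert result == dense_matvec(W, x)
print(result)  # [2.0, 0.0]
```

The two functions return the same values, but the ternary version never multiplies.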