SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

arXiv · June 5, 2023 · Significant research

Summary

The paper introduces the Sparse-Quantized Representation (SpQR), a compression format and quantization technique for large language models (LLMs). SpQR identifies outlier weights, which cause particularly large quantization errors, and stores them in higher precision while compressing all remaining weights to 3-4 bits. The method keeps relative accuracy loss, measured in perplexity, below 1% for LLaMA and Falcon LLMs and allows a 33B-parameter LLM to run on a single 24 GB consumer GPU. Why it matters: Near-lossless compression makes large models usable on memory-constrained hardware and speeds up inference without significant accuracy degradation.
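
To make the idea in the summary concrete, here is a minimal NumPy sketch of outlier-aware quantization: a small fraction of weights is kept in 16-bit precision in a sparse list, and the rest is rounded onto a low-bit grid. This is not the SpQR algorithm. It assumes outliers can be picked by raw magnitude and uses a single min-max grid for the whole matrix, both simplifications chosen for illustration. The function names (`sparse_quantize`, `dequantize`, `weight_footprint_gb`) and the `outlier_fraction` parameter are hypothetical.

```python
import numpy as np


def sparse_quantize(weights, bits=3, outlier_fraction=0.01):
    """Quantize a 2-D weight matrix to `bits` bits, keeping a few outliers aside.

    Illustrative only: outliers are chosen by magnitude here, which is an
    assumption of this sketch and not the selection rule used by SpQR.
    """
    w = np.asarray(weights, dtype=np.float32)

    # Assumption: the top `outlier_fraction` of entries by magnitude are outliers.
    k = max(1, int(round(outlier_fraction * w.size)))
    threshold = np.partition(np.abs(w).ravel(), -k)[-k]
    outlier_mask = np.abs(w) >= threshold

    # Fit a min-max quantization grid to the non-outlier weights only.
    dense = w[~outlier_mask]
    if dense.size == 0:  # degenerate case, e.g. all weights identical
        dense = w.ravel()
    lo, hi = float(dense.min()), float(dense.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((w - lo) / scale), 0, levels).astype(np.uint8)

    # Store outliers sparsely as (row, col, value) in higher precision.
    rows, cols = np.nonzero(outlier_mask)
    outliers = (rows, cols, w[rows, cols].astype(np.float16))
    return codes, scale, lo, outliers


def dequantize(codes, scale, lo, outliers):
    """Rebuild an approximate weight matrix from codes plus sparse outliers."""
    w_hat = codes.astype(np.float32) * scale + lo
    rows, cols, values = outliers
    w_hat[rows, cols] = values.astype(np.float32)  # overwrite with exact-ish values
    return w_hat


def weight_footprint_gb(num_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring scales, outliers, activations."""
    return num_params * bits_per_weight / 8 / 1e9
```

Because the grid is fitted after the outliers are set aside, its step size is governed by the bulk of the weights instead of by a few extreme values, which is the intuition behind storing outliers separately. As a rough check on the memory claim, `weight_footprint_gb(33e9, 16)` gives 66.0 and `weight_footprint_gb(33e9, 4)` gives 16.5, so weights at about 4 bits fit within 24 GB while 16-bit weights do not. The real format also spends bits on outliers and quantization metadata, which this estimate leaves out.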

Keywords

LLM compression · quantization · SpQR · inference · GPU


Related

Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

arXiv · Jul 25

This paper benchmarks multilingual and monolingual LLMs across Arabic, English, and Indic languages and examines the effects of compression techniques such as pruning and quantization. Multilingual models outperform their language-specific counterparts, demonstrating cross-lingual transfer. Quantization maintains accuracy while improving efficiency, but aggressive pruning degrades performance, particularly in larger models. Why it matters: The findings point to strategies for scalable and fair multilingual NLP that address hallucination and generalization errors in low-resource languages.
