University of Twente Student Theses


Evaluating Full INT8 Quantization and Inference techniques for Causal Language Model

Srinivasa Kumar, Prawin Kumar (2025) Evaluating Full INT8 Quantization and Inference techniques for Causal Language Model.

PDF (3MB)
Abstract: Deploying large language models (LLMs) on embedded devices presents significant challenges due to their substantial storage and computational requirements. Integer quantization has emerged as a critical technique for addressing these constraints, with recent research focusing on reducing model complexity while maintaining inference performance. This work investigates a static quantization approach for transformer-based LLMs, targeting full integer (INT8) representation across both linear and non-linear layers. Unlike existing approaches that primarily quantize linear layers, our methodology develops a holistic quantization strategy that systematically addresses precision limitations across the transformer architecture, mapping both full-precision weights and activations to 8-bit integer representations. On this basis, we present a novel approach to evaluating different transformer quantization techniques on causal language models and understanding their impact on performance. The research employs both out-of-the-box quantization techniques from model optimization toolkits and quantization approximations explored in the literature, and evaluates model performance on standard NLP benchmark tasks, with particular emphasis on text generation capabilities as measured by perplexity. The proposed approach provides a detailed comparative analysis of static integer quantization techniques, exploring the trade-offs between model compression and inference accuracy. Through experimental validation, we illustrate the feasibility of deploying sophisticated small language models on resource-constrained platforms, offering insights into advanced quantization methodologies for embedded AI systems.
Our experimental results show that applying INT8 quantization approximations to causal language models significantly degrades their performance: we observed an average drop of approximately 15 percentage points across most NLP benchmark tasks. More critically, the models' autoregressive capabilities were severely compromised, with perplexity scores on the order of 8e4. These results indicate that post-training quantization fundamentally disrupts the models' ability to generate coherent token sequences.
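The static INT8 mapping of full-precision values described in the abstract can be sketched as follows. This is an illustrative affine-quantization implementation in NumPy, not the toolkit code evaluated in the thesis; the function names and the symmetric/asymmetric split are assumptions for exposition.

```python
import numpy as np

def quantize_int8(x, symmetric=True):
    """Static affine quantization of a float tensor to INT8.

    Maps full-precision values to 8-bit integers via a scale factor
    (and a zero-point in the asymmetric case), so that
    x ~ (q - zero_point) * scale.
    """
    x = np.asarray(x, dtype=np.float32)
    if symmetric:
        # Symmetric range [-max|x|, +max|x|] maps onto [-127, 127].
        scale = float(np.max(np.abs(x))) / 127.0
        zero_point = 0
    else:
        # Asymmetric range [x_min, x_max] maps onto [-128, 127].
        x_min, x_max = float(x.min()), float(x.max())
        scale = (x_max - x_min) / 255.0
        zero_point = int(round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from INT8 values."""
    return (q.astype(np.float32) - zero_point) * scale
```

In a static (post-training) scheme such as the one investigated here, the scale and zero-point are fixed from calibration data rather than recomputed per input; the rounding error this introduces in every layer is one source of the accuracy degradation the thesis measures.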
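The perplexity metric used to assess autoregressive quality is the exponential of the average negative log-likelihood the model assigns to the reference tokens. A minimal sketch (the function name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity over a token sequence.

    token_log_probs: natural-log probabilities the model assigned to
    each ground-truth token. Perplexity = exp(mean negative log-likelihood);
    lower is better, and values near the vocabulary size (or the 8e4-order
    scores reported above) indicate near-random next-token prediction.
    """
    n = len(token_log_probs)
    nll = -sum(token_log_probs) / n
    return math.exp(nll)
```

For example, a model that always spreads probability uniformly over four candidate tokens assigns each token log(1/4) and scores a perplexity of exactly 4.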
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Embedded Systems MSc (60331)
Link to this item: https://purl.utwente.nl/essays/104898