| Title | Adjacent layer graft |
| Author | Lugun, Abhinav Asheesh |
| Call Number | AIT RSPR no.CS-24-01 |
| Subject(s) | Computer network architectures; Computer science |
| Note | A research study submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science |
| Publisher | Asian Institute of Technology |
| Abstract | Transformer-based models achieve state-of-the-art results in Natural Language Processing (NLP). However, their over-parameterization has led to a focus on model compression for efficient deployment. Previous work has compressed models by reducing depth (fewer layers), width (fewer attention heads and smaller intermediate dimensions in feed-forward networks), or both. Recent studies show that adjacent layers produce highly similar representations, indicating minimal change as the input propagates through the encoder. Even after width-wise compression, the model still displays highly similar adjacent-layer representations. Prior research demonstrates that layers can be removed with minimal performance impact, but this burdens the remaining layers with compensating for the lost encoding knowledge, causing performance degradation as more layers are removed. Given the high similarity of adjacent-layer representations, this work investigates grafting important attention heads and intermediate dimensions from two neighboring encoder layers into one layer as a potential approach to efficient layer reduction. We hypothesize that better preservation of relevant weights can achieve an optimal performance trade-off as layers are reduced. We explore three grafting strategies to determine the optimal training method for grafting model layers, focusing on a model fine-tuned on a task-specific dataset. Using BERT and the GLUE benchmark as a case study, our findings indicate that combining grafting with knowledge distillation can halve the number of encoder layers with only minor performance loss. Furthermore, additional layers can be grafted and removed with negligible performance impact. This study provides insights into more efficient layer reduction when integrated with other methods, offering significant reductions in model size while optimizing performance and inference speed. |
| Year | 2024 |
| Type | Research Study Project Report (RSPR) |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Computer Science (CS) |
| Chairperson(s) | Chaklam Silpasuwanchai |
| Examination Committee(s) | Chantri Polprasert; Mongkol Ekpanyapong |
| Scholarship Donor(s) | AIT Scholarships |
| Degree | Research Studies Project Report (M. Eng.) - Asian Institute of Technology, 2024 |