AIT Asian Institute of Technology

Adjacent layer graft

Author: Lugun, Abhinav Asheesh
Call Number: AIT RSPR no. CS-24-01
Subject(s): Computer network architectures; Computer science
Note: A research study submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science
Publisher: Asian Institute of Technology
Abstract: Transformer-based models achieve state-of-the-art results in Natural Language Processing (NLP). However, their over-parameterization has led to a focus on model compression for efficient deployment. Previous works have compressed models either by reducing depth (fewer layers), width (fewer attention heads and intermediate dimensions in feed-forward networks), or both. Recent studies show that adjacent layers produce highly similar representations, indicating minimal changes as the input propagates through the encoder. Even after width-wise compression, the model still displays highly similar adjacent-layer representations. Previous research demonstrates that layers can be removed with minimal performance impact, but this places a burden on the remaining layers to compensate for the lost encoding knowledge, causing performance degradation as more layers are removed. Given the high similarity in adjacent-layer representations, this work investigates grafting important attention heads and intermediate dimensions from two neighboring encoder layers into one layer as a potential approach to efficient layer reduction. We hypothesize that better preservation of relevant weights can achieve an optimal performance trade-off as layers are reduced. We explore three grafting strategies to determine the optimal training method for grafting model layers, focusing on a model fine-tuned on a task-specific dataset. Using BERT and the GLUE benchmark as a case study, our findings indicate that combining grafting with knowledge distillation can halve the number of encoder layers with only minor performance loss. Furthermore, additional layers can be grafted and removed with negligible performance impact. This study provides insights into more efficient layer reduction when integrated with other methods, offering significant reductions in model size while optimizing performance and inference speed.
Year: 2024
Type: Research Study Project Report (RSPR)
School: School of Engineering and Technology
Department: Department of Information and Communications Technologies (DICT)
Academic Program/FoS: Computer Science (CS)
Chairperson(s): Chaklam Silpasuwanchai
Examination Committee(s): Chantri Polprasert; Mongkol Ekpanyapong
Scholarship Donor(s): AIT Scholarships
Degree: Research Studies Project Report (M. Eng.) - Asian Institute of Technology, 2024
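The grafting idea described in the abstract, pooling the attention heads of two adjacent encoder layers and keeping only the most important ones in a single merged layer, can be sketched as a toy selection over per-head weight slices. This is a minimal illustration of the general head-selection step, not the report's actual procedure: the function name `graft_heads`, the random importance scores, and the tensor shapes are assumptions for the example.

```python
import numpy as np

def graft_heads(W1, W2, scores1, scores2, keep):
    """Graft the `keep` most important attention heads from two adjacent
    layers into a single layer.

    W1, W2: per-head weight slices of shape (num_heads, head_dim, hidden)
            for the two neighboring layers.
    scores1, scores2: per-head importance scores (e.g., from a
            gradient-based head-pruning proxy).
    """
    heads = np.concatenate([W1, W2], axis=0)    # pool heads from both layers
    scores = np.concatenate([scores1, scores2])
    top = np.argsort(scores)[::-1][:keep]       # most important heads overall
    return heads[np.sort(top)]                  # grafted layer, original order kept

# Toy usage: graft two 4-head layers into one 4-head layer.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8, 32))
W2 = rng.standard_normal((4, 8, 32))
grafted = graft_heads(W1, W2, rng.random(4), rng.random(4), keep=4)
print(grafted.shape)  # (4, 8, 32)
```

In a full pipeline the same selection would be applied to the query, key, value, and output projections (and analogously to feed-forward intermediate dimensions), after which the grafted layer replaces the two originals and the model is recovered with knowledge distillation, as the abstract describes.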

