| Title | Adjacent layer graft |
| Author | Lugun, Abhinav Asheesh |
| Call Number | AIT RSPR no.CS-24-01 |
| Subject(s) | Computer network architectures; Computer science |
| Note | A research study submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science |
| Publisher | Asian Institute of Technology |
| Abstract | Transformer-based models achieve state-of-the-art results in Natural Language Processing (NLP). However, their over-parameterization has led to a focus on model compression for efficient deployment. Previous work has compressed models by reducing depth (fewer layers), width (fewer attention heads and smaller intermediate dimensions in feed-forward networks), or both. Recent studies show that adjacent layers produce highly similar representations, indicating minimal change as the input propagates through the encoder. Even after width-wise compression, the model still displays highly similar adjacent-layer representations. Prior research demonstrates that layers can be removed with minimal performance impact, but this burdens the remaining layers with compensating for the lost encoding knowledge, causing performance degradation as more layers are removed. Given the high similarity of adjacent-layer representations, this work investigates grafting important attention heads and intermediate dimensions from two neighboring encoder layers into one layer as a potential approach to efficient layer reduction. We hypothesize that better preservation of relevant weights can achieve an optimal performance trade-off as layers are reduced. We explore three grafting strategies to determine the optimal training method for grafting model layers, focusing on a model fine-tuned on a task-specific dataset. Using BERT and the GLUE benchmark as a case study, our findings indicate that combining grafting with knowledge distillation can halve the number of encoder layers with only minor performance loss. Furthermore, additional layers can be grafted and removed with negligible performance impact. This study provides insights into more efficient layer reduction when integrated with other methods, offering significant reductions in model size while optimizing performance and inference speed. |
| Year | 2024 |
| Type | Research Study Project Report (RSPR) |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Computer Science (CS) |
| Chairperson(s) | Chaklam Silpasuwanchai |
| Examination Committee(s) | Chantri Polprasert; Mongkol Ekpanyapong |
| Scholarship Donor(s) | AIT Scholarships |
| Degree | Research Studies Project Report (M. Eng.) - Asian Institute of Technology, 2024 |