| Title | Stabilizing actor-critic learning in POMDP-based disaster relief logistics under stochastic constraints |
| Author | Khadka, Binit |
| Call Number | AIT RSPR no.IM-25-01 |
| Subject(s) | Disaster relief; Logistics; Reinforcement learning |
| Note | A research study submitted in partial fulfillment of the requirements for the degree of Master of Science in Information Management |
| Publisher | Asian Institute of Technology |
| Abstract | This research presents a stabilized Actor-Critic reinforcement learning framework designed to improve dynamic decision-making in humanitarian logistics under uncertainty. The work focuses on optimizing resource distribution across multiple districts and resource types during disaster response, where shifting demand, unpredictable access, and supply constraints necessitate flexible, forward-looking strategies. To support this, we developed a high-fidelity simulation environment, MultiDistrictDisasterEnv, which models 12 districts in Nepal and 12 provinces in Thailand. The simulation incorporates stochastic dynamics such as fluctuating severity levels, disruptions in road and air connectivity, and variable resource availability, enabling thorough testing and comparison of policy approaches under realistic disaster conditions. The proposed approach enhances the Proximal Policy Optimization (PPO) algorithm with several novel components: curriculum-based scheduling of stochasticity to ease the model into uncertain settings, autoregressive action masking to guide feasible multi-dimensional dispatch decisions, and belief-aware sampling that adapts to partial observability. Additionally, FiLM (Feature-wise Linear Modulation) layers enable the policy to respond dynamically to evolving contextual signals. To address instability in learning, we apply advantage normalization using the Mean Absolute Deviation (MAD), which reduces variance in the critic and supports consistent policy improvement. Our experimental analysis shows that the PPO agent performs effectively in both deterministic and stochastic scenarios, surpassing classical baselines such as Greedy heuristics and Rolling Linear Programming in adaptability and quality of decision-making.
A comprehensive ablation study highlights the importance of belief integration and autoregressive action design in achieving stable learning and reliable credit assignment. Overall, our results demonstrate that a well-regularized Actor-Critic model can maintain a strong balance across key response objectives (efficiency, fairness, and timeliness) even under high uncertainty. This study contributes both a reproducible testbed and a robust learning architecture for advancing intelligent disaster logistics systems. |
| Year | 2025 |
| Type | Research Study Project Report (RSPR) |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Information Management (IM) |
| Chairperson(s) | Chaklam Silpasuwanchai |
| Examination Committee(s) | Chantri Polprasert; Mongkol Ekpanyapong |
| Scholarship Donor(s) | AIT Scholarship |
| Degree | Research studies project report (M. Sc.) - Asian Institute of Technology, 2025 |
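The abstract's MAD-based advantage normalization replaces the usual standard-deviation scaling with the mean absolute deviation, a more outlier-robust scale. A minimal sketch of the idea follows; it is an illustration under assumed conventions, not the thesis code, and the function name is hypothetical:

```python
import numpy as np

def mad_normalize_advantages(advantages, eps=1e-8):
    """Center advantages and rescale by the mean absolute deviation (MAD)
    rather than the standard deviation, reducing sensitivity to outliers."""
    adv = np.asarray(advantages, dtype=np.float64)
    center = adv.mean()
    mad = np.abs(adv - center).mean()  # mean absolute deviation from the mean
    return (adv - center) / (mad + eps)
```

Because MAD grows linearly rather than quadratically with extreme values, a single very large advantage (e.g. from a rare high-reward dispatch) distorts the scale far less than it would under standard-deviation normalization.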
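The FiLM (Feature-wise Linear Modulation) layers mentioned in the abstract condition a feature vector on a context signal by predicting a per-feature scale and shift. A minimal numpy sketch, assuming a simple linear mapping from context to the modulation parameters (all names and shapes are illustrative, not taken from the thesis):

```python
import numpy as np

def film_modulate(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: the context vector produces a per-feature scale (gamma) and
    shift (beta), which are applied elementwise to the feature vector."""
    gamma = context @ w_gamma + b_gamma  # per-feature scale from context
    beta = context @ w_beta + b_beta     # per-feature shift from context
    return gamma * features + beta
```

In a policy network, `context` would carry evolving signals such as district severity or connectivity status, letting the same feature extractor behave differently as conditions change.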
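Autoregressive action masking, as described in the abstract, selects the multi-dimensional dispatch action one component at a time, masking choices that would violate a shared constraint given what has already been chosen. The sketch below uses a greedy pick over masked logits purely for illustration (the thesis presumably samples from a masked distribution; names here are hypothetical):

```python
import numpy as np

def masked_autoregressive_dispatch(logits_per_district, remaining_stock):
    """Pick a dispatch amount per district in sequence; at each step,
    amounts exceeding the stock still available are masked out."""
    actions = []
    for logits in logits_per_district:
        amounts = np.arange(len(logits))           # action k = send k units
        feasible = amounts <= remaining_stock      # mask infeasible amounts
        masked = np.where(feasible, logits, -np.inf)
        a = int(np.argmax(masked))                 # greedy pick for the sketch
        actions.append(a)
        remaining_stock -= a                       # later districts see less
    return actions
```

Conditioning each district's choice on the stock consumed by earlier choices guarantees the joint action is feasible by construction, rather than penalizing infeasible dispatches after the fact.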