| Title | Stabilizing actor-critic learning in POMDP-based disaster relief logistics under stochastic constraints |
| Author | Khadka, Binit |
| Call Number | AIT RSPR no.IM-25-01 |
| Subject(s) | Disaster relief; Logistics; Reinforcement learning |
| Note | A research study submitted in partial fulfillment of the requirements for the degree of Master of Science in Information Management |
| Publisher | Asian Institute of Technology |
| Abstract | This research presents a stabilized Actor-Critic reinforcement learning framework designed to improve dynamic decision-making in humanitarian logistics under uncertainty. The work focuses on optimizing resource distribution across multiple districts and resource types during disaster response, where shifting demand, unpredictable access, and supply constraints necessitate flexible, forward-looking strategies. To support this, we developed a high-fidelity simulation environment, MultiDistrictDisasterEnv, which models 12 districts in Nepal and 12 provinces in Thailand. The simulation incorporates stochastic dynamics such as fluctuating severity levels, disruptions in road and air connectivity, and variable resource availability, enabling thorough testing and comparison of policy approaches under realistic disaster conditions. The proposed approach enhances the Proximal Policy Optimization (PPO) algorithm with several novel components: curriculum-based scheduling of stochasticity to ease the model into uncertain settings, autoregressive action masking to guide feasible multi-dimensional dispatch decisions, and belief-aware sampling that adapts to partial observability. Additionally, FiLM (Feature-wise Linear Modulation) layers enable the policy to respond dynamically to evolving contextual signals. To address instability in learning, we apply advantage normalization using the Mean Absolute Deviation (MAD), which reduces variance in the critic and supports consistent policy improvement. Our experimental analysis shows that the PPO agent performs effectively in both deterministic and stochastic scenarios, surpassing classical baselines such as Greedy heuristics and Rolling Linear Programming in adaptability and quality of decision-making.
A comprehensive ablation study highlights the importance of belief integration and autoregressive action design in achieving stable learning and reliable credit assignment. Overall, our results demonstrate that a well-regularized Actor-Critic model can maintain a strong balance across key response objectives (efficiency, fairness, and timeliness) even under high uncertainty. This study contributes both a reproducible testbed and a robust learning architecture for advancing intelligent disaster logistics systems. |
| Year | 2025 |
| Type | Research Study Project Report (RSPR) |
| School | School of Engineering and Technology |
| Department | Department of Information and Communications Technologies (DICT) |
| Academic Program/FoS | Information Management (IM) |
| Chairperson(s) | Chaklam Silpasuwanchai |
| Examination Committee(s) | Chantri Polprasert; Mongkol Ekpanyapong |
| Scholarship Donor(s) | AIT Scholarship |
| Degree | Research studies project report (M. Sc.) - Asian Institute of Technology, 2025 |
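The abstract's MAD-based advantage normalization replaces the usual standard-deviation scaling with the mean absolute deviation, a more outlier-robust scale. A minimal sketch of the idea follows; it is an illustration under assumed conventions, not the thesis code, and the function name is hypothetical:

```python
import numpy as np

def mad_normalize_advantages(advantages, eps=1e-8):
    """Center advantages and rescale by the mean absolute deviation (MAD)
    rather than the standard deviation, reducing sensitivity to outliers."""
    adv = np.asarray(advantages, dtype=np.float64)
    center = adv.mean()
    mad = np.abs(adv - center).mean()  # mean absolute deviation from the mean
    return (adv - center) / (mad + eps)
```

Because MAD grows linearly rather than quadratically with extreme values, a single very large advantage (e.g. from a rare high-reward dispatch) distorts the scale far less than it would under standard-deviation normalization.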
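The FiLM (Feature-wise Linear Modulation) layers mentioned in the abstract condition a feature vector on a context signal by predicting a per-feature scale and shift. A minimal numpy sketch, assuming a simple linear mapping from context to the modulation parameters (all names and shapes are illustrative, not taken from the thesis):

```python
import numpy as np

def film_modulate(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: the context vector produces a per-feature scale (gamma) and
    shift (beta), which are applied elementwise to the feature vector."""
    gamma = context @ w_gamma + b_gamma  # per-feature scale from context
    beta = context @ w_beta + b_beta     # per-feature shift from context
    return gamma * features + beta
```

In a policy network, `context` would carry evolving signals such as district severity or connectivity status, letting the same feature extractor behave differently as conditions change.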
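Autoregressive action masking, as described in the abstract, selects the multi-dimensional dispatch action one component at a time, masking choices that would violate a shared constraint given what has already been chosen. The sketch below uses a greedy pick over masked logits purely for illustration (the thesis presumably samples from a masked distribution; names here are hypothetical):

```python
import numpy as np

def masked_autoregressive_dispatch(logits_per_district, remaining_stock):
    """Pick a dispatch amount per district in sequence; at each step,
    amounts exceeding the stock still available are masked out."""
    actions = []
    for logits in logits_per_district:
        amounts = np.arange(len(logits))           # action k = send k units
        feasible = amounts <= remaining_stock      # mask infeasible amounts
        masked = np.where(feasible, logits, -np.inf)
        a = int(np.argmax(masked))                 # greedy pick for the sketch
        actions.append(a)
        remaining_stock -= a                       # later districts see less
    return actions
```

Conditioning each district's choice on the stock consumed by earlier choices guarantees the joint action is feasible by construction, rather than penalizing infeasible dispatches after the fact.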