Multimodal Learning in AIoT Systems: Sensor Fusion and Vision-Based Intelligence
Abstract
This study evaluates the effectiveness of multimodal learning in Artificial Intelligence of Things (AIoT) systems, focusing on the integration of sensor fusion and computer vision for classification tasks. A systematic review and meta-analysis were conducted on studies published between 2020 and 2025. Thirteen studies met the inclusion criteria; however, only six provided comparable quantitative data, owing to inconsistent baseline reporting and evaluation practices. The results indicate that, where comparable evaluations are available, multimodal approaches generally improve accuracy over unimodal baselines, with an average increase of 8.88% (95% CI: 5.33%–12.44%, p < 0.001). High heterogeneity was observed, influenced by domain, sensor configuration, and model architecture. These findings suggest that multimodal effectiveness is conditional, depending on modality complementarity, fusion strategy, and system-level constraints.
Copyright (c) 2025 Agnes Prima Wulanjari, Ria Dymyati, Indar Bismoko, Nuryake Fajaryati, Pipit Utami

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
