Attention-Driven Image Captioning for Mobile Accessibility of the Visually Impaired
Santi D.
International Journal of Interactive Mobile Technologies
Abstract
In a world increasingly reliant on visual information, individuals with visual impairments face significant challenges in understanding their environment. This paper introduces an attention-based image captioning model to improve accessibility for visually impaired users. The model integrates ResNet-152 for visual feature extraction, long short-term memory (LSTM) for text generation, and an attention mechanism to produce contextual image descriptions. Images captured on a mobile device are processed by the model; the resulting description is then translated into Bahasa Indonesia and converted to speech in real time using text-to-speech technology. The system achieves an average inference time of 2.99 seconds per image, enabling real-time use. The model is evaluated on the Flickr dataset and on new datasets covering a variety of environments and object interactions. Experimental results show strong performance on the Flickr dataset (bilingual evaluation understudy (BLEU)-1: 0.59; metric for evaluation of translation with explicit ordering (METEOR): 0.25). Performance on the real-world datasets is slightly lower, indicating challenges in generalizing to scenarios with occluded objects and inconsistent text. Future research will focus on scaling up real-world datasets, adversarial training, and integrating the system into devices such as smart glasses or canes for wider accessibility.
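The core of such a pipeline is the attention step: at each decoding step, the LSTM's hidden state scores the CNN's region features, and a weighted sum of those features (the context vector) conditions the next word. The sketch below is an illustrative, dependency-free version using simple dot-product scoring; the paper's exact attention formulation (e.g. additive vs. multiplicative scoring, dimensions, learned projections) is not specified in this abstract, so all names and shapes here are assumptions.

```python
import math

def attend(features, hidden):
    """One attention step over CNN region features (illustrative sketch).

    features: list of region feature vectors (e.g. from a ResNet grid)
    hidden:   decoder (LSTM) hidden state vector of the same dimension
    Returns (weights, context): softmax attention weights over regions
    and the weighted sum of region features.
    """
    # Score each region by its alignment with the decoder state.
    scores = [sum(f_d * h_d for f_d, h_d in zip(f, hidden)) for f in features]

    # Numerically stable softmax over the region scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of region features.
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context
```

In a real captioner these lists would be tensors, the scoring function would include learned parameters, and the context vector would be concatenated with the word embedding as LSTM input at each step; the sketch only shows the data flow.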
Access to Document
DOI: 10.3991/ijim.v19i09.53441
Other files and links
- Link to publication in Scopus
- Open Access Version Available