Image Caption Generation Through the Integration of CNN-Based Residual Network Architectures and LSTM
Santi D.
Proceedings International Conference on Informatics and Computational Sciences
Abstract
Image captioning connects the understanding of images with that of language and has implications for image retrieval and visual information processing. Recent advances in the field combine convolutional neural networks (CNNs) with recurrent neural networks (RNNs), in particular long short-term memory (LSTM) networks, to address the difficulty of using deep learning to produce concise descriptions that capture object-concept relationships. This study focuses on CNNs with residual network architectures (ResNet-50, ResNet-101, ResNet-152) tested on the Flickr 8k dataset. The aim is to understand how network depth affects the model's grasp of visual structure and context, and in turn the quality of the generated descriptions. Our methodology involves several stages: image feature extraction, text preprocessing, model optimization, and evaluation using metrics such as BLEU scores. Experimental results demonstrate the effectiveness of our approach, with the ResNet-101 model achieving the highest BLEU score among the tested models. This work contributes to ongoing efforts to bridge the gap between visual data understanding and natural language generation, offering promising prospects for more natural and accurate image captioning systems.
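The abstract names BLEU as the evaluation metric but does not say which variant, tokenization, or smoothing was used. As an illustration only, here is a minimal sketch of unsmoothed sentence-level BLEU with uniform n-gram weights, written in plain Python. The function names `ngrams` and `sentence_bleu`, the whitespace tokenization, and the example captions are assumptions made for this sketch. They are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(references, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    Assumption: inputs are already tokenized lists of strings.
    Returns 0.0 if any n-gram order has no match (no smoothing).
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions, for illustration only.
refs = [
    "a dog runs across the grass".split(),
    "a brown dog is running on the grass".split(),
]
cand = "a dog runs on the grass".split()
print(sentence_bleu(refs, cand, max_n=2))
```

In this example all six candidate unigrams and four of the five candidate bigrams appear in a reference, and no brevity penalty applies, so BLEU-2 is the square root of 0.8, about 0.894. Results on a full dataset are usually reported with a corpus-level BLEU from an established library rather than an average of sentence scores, so this sketch should not be expected to reproduce the paper's numbers.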