Universitas Hasanuddin
Research output: Contribution to journal, Article, peer-review

Image Caption Generation Through the Integration of CNN-Based Residual Network Architectures and LSTM

Santi D.

Proceedings International Conference on Informatics and Computational Sciences

Published: 2024
Citations: 10

Abstract

Image captioning connects visual understanding with language and has implications for image retrieval and the use of visual information. Recent advances in the field use convolutional neural networks (CNNs) and recurrent neural networks (RNNs), in particular long short-term memory (LSTM) networks, to address the difficulty of generating concise descriptions with deep learning and of highlighting object-concept relationships. This study focuses on CNNs with residual network architectures (ResNet-50, ResNet-101, ResNet-152) tested on the Flickr 8k dataset. The aim is to understand how network depth affects the model's grasp of visual structure and context, and thereby the quality of the generated descriptions. Our methodology involves several stages: image feature extraction, text preprocessing, and model optimization and evaluation using metrics such as BLEU. Experimental results demonstrate the effectiveness of our approach, with the ResNet-101 model achieving the highest BLEU score among the tested models. This work contributes to ongoing efforts to bridge the gap between visual data understanding and natural language generation, offering promising prospects for more natural and accurate image captioning systems.
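
The abstract names BLEU as the evaluation metric but does not say which variant or tooling was used. As an illustration only, the following is a minimal sketch of sentence-level BLEU with uniform n-gram weights and no smoothing, written in plain Python. The function names `ngrams` and `bleu`, the `max_n` parameter, and the example captions are assumptions made for this sketch and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (illustrative)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)


# Hypothetical tokenized captions, not drawn from the Flickr 8k dataset.
references = [["a", "dog", "runs", "on", "the", "grass"]]
generated = ["a", "dog", "runs"]
score = bleu(generated, references, max_n=2)
```

In this sketch a caption that matches a reference exactly scores 1.0, while a correct but shorter caption is penalized only through the brevity penalty. Published results usually rely on corpus-level BLEU from an established toolkit, so scores from this function are not directly comparable to those reported in the paper.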


Fingerprint

Computer science (Sciences)
Residual (Sciences)
Artificial intelligence (Sciences)
Residual neural network (Sciences)
Computer vision (Sciences)
Convolutional neural network (Sciences)
Computer graphics (images) (Sciences)
Algorithm (Sciences)