Enhancing Image Captioning Accuracy Through Attention-Based CNN–LSTM Architecture

13 Nov

Author: Kosisochukwu Henry Ukpabi

Abstract: This research presents a deep learning model that automatically describes images in natural language. The core of the system is a hybrid encoder-decoder architecture that fuses visual feature extraction (via a pretrained CNN) with sequential text generation (via an LSTM). Crucially, the system incorporates a Bahdanau attention mechanism so that generated captions remain accurate and contextually focused on the most relevant parts of the image. The model was trained and evaluated on the established Microsoft COCO and Flickr30k datasets, using standard preprocessing methods and optimization techniques such as transfer learning, teacher forcing, dropout regularization, and early stopping. Quantitative evaluation with BLEU, METEOR, ROUGE-L, CIDEr, and SPICE indicates robust performance and close alignment with human-generated captions, notably a BLEU-4 score of 0.30 and a CIDEr score of 0.95 on the COCO dataset. Qualitative evaluation using attention heatmaps further demonstrates that the model concentrates on pertinent image regions when predicting each word, improving interpretability and contextual relevance. Although the system produces accurate and fluent captions, the results also highlight opportunities for future improvement, such as increasing linguistic diversity and fine-tuning for specific domains. This study contributes to the growing field of vision-language understanding and points to promising applications in assistive technologies, automated content creation, and intelligent image indexing systems.

DOI: http://doi.org/10.5281/zenodo.17605715
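As an illustration of the attention mechanism described above, the sketch below shows a Bahdanau-style (additive) attention layer of the kind typically placed between a CNN encoder's spatial feature map and an LSTM decoder's hidden state. This is a minimal PyTorch sketch, not the paper's implementation; the class name, layer dimensions, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch): Bahdanau-style additive attention for
# CNN-LSTM captioning. Names and dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super().__init__()
        self.encoder_proj = nn.Linear(encoder_dim, attention_dim)  # project CNN region features
        self.decoder_proj = nn.Linear(decoder_dim, attention_dim)  # project LSTM hidden state
        self.score = nn.Linear(attention_dim, 1)                   # scalar alignment score per region

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out:    (batch, num_regions, encoder_dim) spatial features from the CNN
        # decoder_hidden: (batch, decoder_dim) current LSTM hidden state
        att_enc = self.encoder_proj(encoder_out)                        # (batch, regions, att_dim)
        att_dec = self.decoder_proj(decoder_hidden).unsqueeze(1)        # (batch, 1, att_dim)
        scores = self.score(torch.tanh(att_enc + att_dec)).squeeze(-1)  # (batch, regions)
        alpha = torch.softmax(scores, dim=1)                            # attention weights over regions
        context = (encoder_out * alpha.unsqueeze(-1)).sum(dim=1)        # weighted context vector
        return context, alpha

# Toy usage: 49 regions (e.g. a 7x7 feature map) with 2048-d features.
attn = BahdanauAttention(encoder_dim=2048, decoder_dim=512, attention_dim=256)
features = torch.randn(4, 49, 2048)   # batch of 4 images
hidden = torch.randn(4, 512)          # decoder hidden state at one time step
context, alpha = attn(features, hidden)
```

At each decoding step, the context vector is typically combined with the current word embedding before being fed to the LSTM, and the attention weights (alpha) can be reshaped to the spatial grid to visualize attention heatmaps of the kind referred to in the abstract.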