Research Article Open Access

Automated Medical Image Captioning with Soft Attention-Based LSTM Model Utilizing YOLOv4 Algorithm

Paspula Ravinder1 and Saravanan Srinivasan1
  • 1 Department of Computer Science and Engineering, School of Computing, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, Tamil Nadu, India


Medical image captioning is a prominent research field. Interpreting and captioning medical images can be time-consuming and costly, and often requires expert support. The growing volume of medical images makes it challenging for radiologists to handle their workload alone. Automating medical image captioning can reduce this cost and time while helping radiologists improve the reliability and accuracy of the generated captions. It also gives new, less experienced radiologists an opportunity to benefit from automated support. Despite previous efforts to automate medical image captioning, some issues remain unresolved, including overly detailed captions, difficulty identifying abnormal regions in complex images, and the low accuracy and reliability of some generated captions. To tackle these challenges, we propose a new deep learning model specifically tailored to captioning medical images. Our model aims to extract features from images and generate meaningful sentences related to the identified defects with high accuracy. The approach uses a multi-model neural network that closely mimics the human visual system and automatically learns to describe the content of images. Our proposed method consists of two stages. In the first stage, the information extraction phase, we employ the YOLOv4 model to efficiently extract medical image features, which are then transformed into a feature vector. This phase focuses primarily on visual recognition using deep neural network techniques. The generated features are then fed into the second stage, caption generation, where the model produces grammatically correct natural language sentences describing the extracted features.
The caption generation stage incorporates two sub-models: an object detection and localization model, which extracts information about the objects present in the image and their spatial relationships, and a deep Recurrent Neural Network (RNN) that uses Long Short-Term Memory (LSTM) units, enhanced by an attention mechanism, to generate sentences. This attention mechanism enables each word of the description to be aligned with different objects in the input image during generation. We evaluated the proposed model on the PEIR dataset using several performance metrics, including ROUGE-L, METEOR, and BLEU. The model obtained a BLEU score of 81.78% and a METEOR score of 78.56%. These results indicate that our model surpasses established benchmark models in caption generation for medical images. The model was implemented in Python. We compared its performance with recent existing models, demonstrating its superiority. The high BLEU and METEOR scores highlight the effectiveness of the proposed model in producing precise and contextually rich descriptions for medical images. Overall, this model offers a promising way to automate medical image captioning, addressing the challenges radiologists face in managing their workload and improving the precision and dependability of generated descriptions.
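The soft attention step described above, in which each generated word is aligned with different image regions, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the scoring function, so a plain dot product between region features and the decoder state is assumed here, and the function names and toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, hidden_state):
    """Compute attention weights and a context vector for one decoding step.

    region_features: list of feature vectors, one per image region
                     (for example, regions produced by an object detector).
    hidden_state:    previous decoder (LSTM) state, assumed here to have
                     the same dimensionality as each region feature.
    The dot-product score is an illustrative assumption; learned scoring
    networks are a common alternative.
    """
    scores = [
        sum(f * h for f, h in zip(feat, hidden_state))
        for feat in region_features
    ]
    weights = softmax(scores)  # one weight per region, summing to 1
    dim = len(region_features[0])
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example with made-up numbers: three regions, 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = soft_attention(regions, hidden)
```

In a full captioning model, the context vector would be fed to the LSTM together with the previous word embedding at every step, so the weights change from word to word.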

Journal of Computer Science
Volume 20 No. 1, 2024, 52-68


Submitted On: 2 May 2023 Published On: 16 December 2023

How to Cite: Ravinder, P. & Srinivasan, S. (2024). Automated Medical Image Captioning with Soft Attention-Based LSTM Model Utilizing YOLOv4 Algorithm. Journal of Computer Science, 20(1), 52-68.




  • Automatic Medical Image Captioning
  • Deep Learning
  • Wiener Filtering
  • Color Channel
  • YOLOv4
  • Hyper-Parameter Tuning
  • Hybrid Attention
  • Long Short-Term Memory