Automatic Audio and Image Caption Generation with Deep Learning
Abstract
This work presents a novel approach to image caption generation tailored specifically for visually impaired individuals. The goal is to generate a descriptive caption for a given photograph or image. The proposed system uses computer vision to identify the content of the image and generates a relevant caption by combining two deep learning models, a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. The caption is then automatically converted into spoken audio through text-to-speech conversion, a Natural Language Processing (NLP) step, making visual content accessible to individuals with visual impairments.
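The caption generation step can be illustrated with a minimal sketch. The abstract does not state how words are chosen at each step, so greedy decoding (taking the decoder's single best word each time) is an assumption here. The names `greedy_caption`, `toy_next_word`, and the `<start>`/`<end>` tokens are hypothetical, and the toy decoder simply replays a fixed sentence in place of a trained LSTM.

```python
from typing import Callable, List, Sequence

# Hypothetical sentence boundary tokens; the abstract does not name any.
START, END = "<start>", "<end>"


def greedy_caption(
    features: Sequence[float],
    next_word: Callable[[Sequence[float], List[str]], str],
    max_len: int = 20,
) -> str:
    """Build a caption one word at a time until END or max_len words."""
    words: List[str] = [START]
    for _ in range(max_len):
        # In a real system, next_word would be an LSTM conditioned on
        # CNN image features and on the words generated so far.
        word = next_word(features, words)
        if word == END:
            break
        words.append(word)
    return " ".join(words[1:])  # drop the START token


def toy_next_word(features: Sequence[float], words: List[str]) -> str:
    """Toy stand-in for a trained decoder: ignores the image features
    and replays a fixed sentence (assumption, for illustration only)."""
    script = ["a", "dog", "runs", END]
    return script[len(words) - 1]


# The feature vector is a placeholder for CNN output.
caption = greedy_caption([0.1, 0.2], toy_next_word)
print(caption)
```

In a full pipeline the resulting string would then be passed to a text-to-speech engine to produce the audio output; that step is omitted here because the abstract does not identify the engine used.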
Copyright (c) 2024 K Lavanya, B Jayamala, C Jeyasri, A Sakthivel

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.