Automatic Audio and Image Caption Generation with Deep Learning

K Lavanya; B Jayamala; C Jeyasri; A Sakthivel

doi:10.34293/sijash.v11iS3-July.7916

K Lavanya, Assistant Professor, Department of AI & DS, Arjun College of Technology
B Jayamala, Department of AI & DS, Arjun College of Technology
C Jeyasri, Department of AI & DS, Arjun College of Technology
A Sakthivel, Department of AI & DS, Arjun College of Technology

Keywords: Image Description, Audio Conversion, Visually Impaired, Computer Vision, Descriptive Captions, Natural Language Processing

Abstract

This paper presents an approach to image caption generation designed for visually impaired individuals. Given a photograph or other image, the proposed system uses computer vision to identify the image content and generate a relevant, descriptive textual caption. Captions are produced by a deep learning model that combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network. The caption is then converted into spoken audio through an integrated text-to-speech step based on Natural Language Processing (NLP), giving individuals with visual impairments access to visual content.
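
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common arrangement in which the CNN acts as an image encoder and the LSTM as a word-by-word decoder, and it uses greedy decoding, which the abstract does not specify. All names (`greedy_caption`, `describe_image`, `toy_encode`, `toy_next_word`) are hypothetical, and the toy functions are trivial stand-ins for trained models and a real text-to-speech engine.

```python
from typing import Callable, List, Sequence

START, END = "<start>", "<end>"  # assumed sentence-boundary tokens

Features = List[float]
Image = Sequence[Sequence[float]]


def greedy_caption(
    features: Features,
    next_word: Callable[[Features, List[str]], str],
    max_len: int = 20,
) -> str:
    """Build a caption one word at a time until END or max_len words."""
    words = [START]
    for _ in range(max_len):
        # In a real system this call would be one LSTM decoding step.
        word = next_word(features, words)
        if word == END:
            break
        words.append(word)
    return " ".join(words[1:])  # drop the START token


def describe_image(
    image: Image,
    encode: Callable[[Image], Features],
    next_word: Callable[[Features, List[str]], str],
    speak: Callable[[str], None],
) -> str:
    """Image -> features (CNN stage) -> caption (LSTM stage) -> audio (TTS stage)."""
    features = encode(image)
    caption = greedy_caption(features, next_word)
    speak(caption)
    return caption


# --- Toy stand-ins, for illustration only ---------------------------------

def toy_encode(image: Image) -> Features:
    """Stand-in for a CNN encoder: the mean pixel intensity as one feature."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def toy_next_word(features: Features, words: List[str]) -> str:
    """Stand-in for an LSTM decoder: follows a fixed script chosen by brightness."""
    adjective = "bright" if features[0] > 0.5 else "dark"
    script = ["a", adjective, "scene", END]
    return script[len(words) - 1]


spoken: List[str] = []  # stand-in for audio output; collects what would be spoken
caption = describe_image(
    [[0.9, 0.8], [0.7, 1.0]], toy_encode, toy_next_word, spoken.append
)
print(caption)  # a bright scene
```

Passing each stage in as a callable keeps the sketch independent of any particular deep learning or speech library; in practice `encode` and `next_word` would wrap trained CNN and LSTM models, and `speak` would call a text-to-speech engine.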