The Automated Audio Captioning (AAC) task centers on generating natural language descriptions from audio inputs. Because the input (audio) and the output (text) belong to different modalities, AAC systems typically rely on an audio encoder to extract relevant information from the sound, represented as feature vectors, which a decoder then uses to generate the text description.
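The encoder-decoder split described above can be illustrated with a minimal, untrained sketch. Everything in it is an assumption made for illustration: the toy vocabulary `VOCAB`, the function names `encode` and `decode`, the frame length, the feature dimension, and the random weights. It shows only the data flow (waveform to feature vectors to tokens), not any particular AAC system, and its captions are meaningless because nothing is trained.

```python
import math
import random

# Hypothetical toy vocabulary; real systems use thousands of (sub)word tokens.
VOCAB = ["<bos>", "<eos>", "a", "dog", "barks", "rain", "falls"]


def encode(waveform, frame_len=4, dim=3, seed=0):
    """Toy audio encoder: cut the waveform into non-overlapping frames and
    project each frame to a `dim`-dimensional feature vector."""
    rng = random.Random(seed)
    # Random, untrained projection matrix of shape (dim, frame_len).
    proj = [[rng.uniform(-1, 1) for _ in range(frame_len)] for _ in range(dim)]
    frames = [
        waveform[i:i + frame_len]
        for i in range(0, len(waveform) - frame_len + 1, frame_len)
    ]
    return [
        [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in proj]
        for frame in frames
    ]


def decode(features, max_len=5, seed=1):
    """Toy decoder: greedily pick one token per step, conditioned on the
    mean-pooled audio features and on the previously emitted token."""
    rng = random.Random(seed)
    dim = len(features[0])
    # Summarize the audio as a single context vector (mean over frames).
    context = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    # Random, untrained weights: audio-to-token and previous-token-to-token.
    w_ctx = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in VOCAB]
    w_prev = [[rng.uniform(-1, 1) for _ in VOCAB] for _ in VOCAB]

    bos, eos = VOCAB.index("<bos>"), VOCAB.index("<eos>")
    tokens, prev = [], bos
    for _ in range(max_len):
        scores = [
            sum(w * c for w, c in zip(w_ctx[v], context)) + w_prev[v][prev]
            for v in range(len(VOCAB))
        ]
        scores[bos] = float("-inf")  # the start token is never generated
        prev = max(range(len(VOCAB)), key=scores.__getitem__)
        if prev == eos:
            break
        tokens.append(VOCAB[prev])
    return tokens


if __name__ == "__main__":
    audio = [0.1 * i for i in range(16)]  # stand-in for real audio samples
    feats = encode(audio)
    print(len(feats), "feature vectors ->", " ".join(decode(feats)))
```

In a trained system the random projections would be replaced by learned networks, but the interface stays the same: the decoder never sees the raw audio, only the feature vectors produced by the encoder.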