"Play" sound based on visual input

SonicVisionLM - "Play" sound based on Visual Input

SonicVisionLM is a recently proposed system, still under active development, that aims to bridge the gap between vision and sound. It leverages large language models (LLMs) to "play" sound that matches visual input.

SonicVisionLM generates sound effects for videos from text prompts, providing an efficient way to add Foley effects to video.

Here's a breakdown of the key aspects of SonicVisionLM:


  • It uses a vision-language model (VLM), a kind of LLM trained on paired data of images and their textual descriptions.

  • This VLM understands the relationships between visual elements and their corresponding language.

  • SonicVisionLM takes this understanding and extends it to sound generation.


  • When provided with an image or video clip, SonicVisionLM analyzes the visual content and generates a sonic description based on its interpretation.

  • This description can be in the form of musical notes, sound effects, or even spoken words.
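The breakdown above describes a two-stage pipeline: a VLM first turns visual input into a textual sound description, and an audio model then renders that description as sound. The sketch below illustrates this flow in Python; both model stages are stubbed out, and all function and type names here are hypothetical, not the actual SonicVisionLM API.

```python
# Illustrative two-stage pipeline in the style described above:
# stage 1 (VLM): visual input -> timed textual sound descriptions
# stage 2 (audio model): each description -> a rendered sound clip
# Both stages are stubs; a real system would call trained models here.

from dataclasses import dataclass


@dataclass
class SoundEvent:
    description: str   # textual sound description, e.g. "dog barking"
    start_s: float     # onset time within the clip, in seconds
    duration_s: float


def describe_visuals(frames: list[str]) -> list[SoundEvent]:
    """Stub for the VLM stage: map video frames to timed sound descriptions."""
    # A real system would run a vision-language model over the frames.
    return [SoundEvent(description=f"sound of {f}", start_s=i * 1.0, duration_s=1.0)
            for i, f in enumerate(frames)]


def synthesize_audio(event: SoundEvent) -> bytes:
    """Stub for the text-to-audio stage: render one description to audio."""
    # A real system would call a text-conditioned audio generation model.
    return f"<audio:{event.description}>".encode()


def video_to_foley(frames: list[str]) -> list[tuple[SoundEvent, bytes]]:
    """End-to-end sketch: visuals -> text descriptions -> sound clips."""
    return [(ev, synthesize_audio(ev)) for ev in describe_visuals(frames)]


clips = video_to_foley(["door slamming", "footsteps on gravel"])
for event, audio in clips:
    print(f"{event.start_s:>4.1f}s  {event.description}")
```

Separating description from synthesis is what lets the language model do the "understanding" while a dedicated audio model handles the actual sound generation.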

Potential applications:

  • SonicVisionLM has the potential to be used in various creative and artistic domains, such as:

    • Generating soundtracks for movies or video games based on the visuals.

    • Creating immersive audio experiences for virtual reality applications.

    • Assisting visually impaired individuals in "hearing" the world around them.

Current stage:

  • It's important to note that SonicVisionLM is still under research and development.

  • While the concept is promising, the results are still at an early stage and are unlikely to be perfect yet.

  • The paper describing the system was first published in January 2024.

Further resources:

  • You can find more information about SonicVisionLM in the research paper titled "SonicVisionLM: Playing Sound with Vision Language Models", available on arXiv.
