Procedurally generated AI compound media for expanding audial creations, broadening immersion and perception experience
Abstract
Recently, the world has been gaining increasingly broad access to ever more advanced artificial intelligence tools. This phenomenon has not bypassed the worlds of sound and visual art, and both can benefit in ways yet unexplored, drawing them closer to one another. Recent breakthroughs open the possibility of using AI-driven tools to create generative art and of employing it as a component of other multimedia. The aim of this paper is to present an original concept of using AI to create visual companion material for an existing audio source. This broadens accessibility by appealing to additional human senses, expanding the source media beyond its initial form. The research applies a novel method of enhancing fundamental material, consisting of a source layer in text (script) or audio form and a sound layer (the audio play), by adding an extra layer of multimedia experience: a procedurally generated visual one. A set of AI-generated images forms a story-telling animation, offering a new way to immerse oneself in the sound experience and to focus on the initial audio material. The main idea of the paper is a pipeline, a blueprint for procedural image generation in which the source context (audio or text) is transformed into text prompts, together with tools that automate the process through a set of programmed code instructions. This process allows the creation of coherent and cohesive (to a certain extent) visual cues accompanying the audio experience, elevating it to a multimodal piece of art. With today's technologies, creators can procedurally enhance audio forms by providing them with visual context. The paper discusses the current possibilities, use cases, limitations, and biases of the presented tools and solutions.
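To make the blueprint concrete, the sketch below shows one possible realization of such a pipeline in Python: the audio play is transcribed with an open speech-to-text model, each transcript segment is turned into a text-to-image prompt, a latent diffusion model renders one image per segment, and the images are assembled into a slideshow synchronized with the original audio. The model choices (Whisper, Stable Diffusion via diffusers), the moviepy 1.x API, the prompt template, and all file names are illustrative assumptions, not the exact tools or code used in this work.

```python
import numpy as np
import torch
import whisper                                   # openai-whisper: speech-to-text
from diffusers import StableDiffusionPipeline    # latent diffusion text-to-image
from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips  # moviepy 1.x API

AUDIO_FILE = "audio_play.mp3"   # hypothetical input: the source audio play
STYLE = "digital painting, muted colours, cinematic lighting"  # shared suffix keeps images stylistically coherent


def transcribe_segments(path: str):
    """Transcribe the audio play and return Whisper's timed segments."""
    model = whisper.load_model("base")
    result = model.transcribe(path)
    return result["segments"]  # each segment carries "start", "end" and "text"


def segment_to_prompt(text: str) -> str:
    """Turn a transcript fragment into a text-to-image prompt.
    A plain template is used here; an LLM could rewrite the fragment instead."""
    return f"Illustration of: {text.strip()}, {STYLE}"


def generate_images(prompts, seed: int = 42):
    """Render one image per prompt; a fixed seed and shared style improve visual coherence."""
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    generator = torch.Generator(device=str(pipe.device)).manual_seed(seed)
    return [pipe(p, generator=generator).images[0] for p in prompts]


def assemble_video(segments, images, audio_path: str, out_path: str = "story.mp4"):
    """Show each image for the duration of its transcript segment, under the original audio."""
    clips = [
        ImageClip(np.array(img)).set_duration(seg["end"] - seg["start"])
        for seg, img in zip(segments, images)
    ]
    video = concatenate_videoclips(clips, method="compose").set_audio(AudioFileClip(audio_path))
    video.write_videofile(out_path, fps=24)


if __name__ == "__main__":
    segments = transcribe_segments(AUDIO_FILE)
    prompts = [segment_to_prompt(s["text"]) for s in segments]
    images = generate_images(prompts)
    assemble_video(segments, images, AUDIO_FILE)
```

Any stage of such a sketch can be swapped out: a large language model can condense several transcript segments into a single scene description before prompting, and an image-to-video model can replace the static slideshow with animated transitions.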