IS THIS THE REAL LIFE? INVESTIGATING THE CREDIBILITY OF SYNTHESIZED FACES AND VOICES CREATED BY AMATEURS USING ARTIFICIAL INTELLIGENCE TOOLS
DOI: https://doi.org/10.69609/1516-2893.2024.v30.n1.a3846
Keywords: Deep Fake, Automatic Generation, Movies
Abstract
The widespread availability and accessibility of artificial intelligence (AI) tools have enabled experts to create content that fools many viewers and requires close scrutiny to be distinguished from reality; it is unclear, however, whether amateurs given the same tools can create synthesized faces and voices with similar ease. The ability to create this kind of content could be life-changing for smaller movie makers. It is therefore important to understand how amateurs can be supported and guided in creating such media, and how believable their results are. This paper proposes a framework that amateurs can use to create entirely artificial content and investigates the credibility of synthesized faces and voices created by amateurs using AI tools. Specifically, we explore whether an entirely AI-generated piece of media, encompassing both visual and audio components, can be convincingly created by non-experts. To this end, we conducted a series of experiments in which participants were asked to evaluate the credibility of synthesized media produced by amateurs. We analyzed the responses and evaluated the extent to which the synthesized media could pass as authentic. Our findings suggest that, while AI-generated media created by amateurs may appear visually convincing, the audio component still lacks naturalness and authenticity. We also found that participants' perceptions of credibility were influenced by their prior knowledge of AI-generated media and their familiarity with the source material. Finally, our findings suggest that although AI-generated media has the potential to be highly convincing, current AI tools and techniques, when used by amateurs without artistic intervention, are still far from perfectly emulating human behavior and speech.
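To make the kind of amateur pipeline under study concrete, the following is a minimal Python sketch. It is illustrative only and is not the framework proposed in the paper: it assumes that thispersondoesnotexist.com still serves a random StyleGAN-generated face per request, and that the Coqui TTS package ("pip install TTS") exposes the TTS(...).tts_to_file(...) API with the tts_models/en/ljspeech/tacotron2-DDC model shown here.

```python
# Minimal sketch of an amateur "synthetic face + synthetic voice" pipeline.
# Assumptions (illustrative, not from the paper): thispersondoesnotexist.com
# serves one random StyleGAN-generated face per request, and the Coqui TTS
# package provides the TTS(...).tts_to_file(...) API used below.
import requests
from TTS.api import TTS

# Step 1: obtain a synthetic face produced by a pretrained StyleGAN model.
response = requests.get("https://thispersondoesnotexist.com", timeout=30)
response.raise_for_status()
with open("synthetic_face.jpg", "wb") as f:
    f.write(response.content)

# Step 2: synthesize a spoken line with an off-the-shelf neural TTS model.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(
    text="Hello, I am an entirely synthetic character.",
    file_path="synthetic_voice.wav",
)
```

An amateur would then typically animate and lip-sync the still face to the audio with a motion-transfer tool; whether the combined visual and audio result passes as authentic is precisely what the paper's experiments measure.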
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Submission of original manuscripts to this journal implies the transfer, by the authors, of the rights of print and digital publication. Copyright for published articles remains with the author, with the journal holding the right of first publication. Authors may only reuse the same results in other publications by clearly indicating this journal as the venue of the original publication.