As the games industry moves ever closer to Hollywood-quality production values, expectations of every component of a game rise with it. One critical component of modern games is dialogue and voice acting. Players expect audio and voice acting on a par with their favourite movies; however, they don’t always appreciate the difference in length between a movie and a AAA video game. Interactive games take far longer to complete than a film, and therefore require many times more lines of dialogue (Sweet, 2014). For example, Star Wars: The Old Republic received the Guinness World Record for the “Largest Entertainment Voice Over Project Ever”, with over 200,000 lines in English (Senior, 2012). Furthermore, games need multiple versions of many voice lines to accommodate player choice and play style, for example in combat (Schmidt, 2019). The amount of content in these modern games is then multiplied by the number of languages supported. Pre-production also has to be considered: there are many steps to creating dialogue for games (Bridgett, 2009), each of which adds cost. The sheer amount of content and time involved therefore leaves us with two problems: data management and money. What options are available to solve them?
One solution which we’ll be focusing on in this piece is the use of synthesis, specifically vocal synthesis. Vocal or speech synthesis is the process of using synthesis techniques to mimic human speech patterns (Carbonneau, 2021).
Vocal synthesis was first realised in the Voder, an invention of Bell Labs in the 1930s (Dudley, 1940). From there the technology developed into formant synthesis and articulatory synthesis, both of which can be heard in toys and games old and new. One of the first vocal synthesis toys on the market was the Speak & Spell. Utilising vocoders and formant synthesis [see Video 1, 1:45], it builds a fixed set of phonemes, using at least two oscillators to create consonant and vowel sounds. Formants are the building blocks of phonemes, the single units of sound that our mouths can make (McCarty, 2003). Fig. 1 displays the frequency ranges and combinations that create certain vowel sounds. Formant synthesis is just one piece of the puzzle. Research at Haskins Laboratories went on to build a model of the vocal tract in order to work out how we shape sound into articulate speech (Rubin and Baer, 1981). The Pink Trombone is a great tool for exploring the work done at Haskins Laboratories [see Link 1]. With modern AI and the processing power of today’s computers, we can combine phonemes and articulations fluidly, on the fly, to create believable human speech (Carbonneau, 2021).
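As a rough illustration of the formant idea described above, the following Python sketch feeds one harmonically rich oscillator through a pair of resonators tuned to a vowel’s first two formants. It is a minimal source-filter sketch under stated assumptions, not a description of how the Speak & Spell or any shipped product works: the formant values are approximate textbook figures for an adult male voice rather than values read from Fig. 1, and the names (VOWEL_FORMANTS, sawtooth, resonate, synth_vowel, level_at), the 120 Hz pitch and the 90 Hz bandwidth are all hypothetical choices made for the example.

```python
import math

SAMPLE_RATE = 16000  # Hz

# Approximate textbook (F1, F2) values in Hz for an adult male speaker.
# Assumed for illustration; they are not read off Fig. 1.
VOWEL_FORMANTS = {
    "i": (270.0, 2290.0),  # as in "beet"
    "a": (730.0, 1090.0),  # as in "father"
    "u": (300.0, 870.0),   # as in "boot"
}


def sawtooth(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Naive sawtooth oscillator in [-1, 1): a harmonically rich source."""
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0
            for i in range(n_samples)]


def resonate(signal, centre_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonator that emphasises energy around centre_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * centre_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (1.0 - r) * x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synth_vowel(vowel, pitch_hz=120.0, duration_s=0.5, bandwidth_hz=90.0):
    """Sum one resonator per formant, all fed by the same sawtooth source."""
    n = round(duration_s * SAMPLE_RATE)
    source = sawtooth(pitch_hz, n)
    bands = [resonate(source, f, bandwidth_hz) for f in VOWEL_FORMANTS[vowel]]
    return [sum(samples) for samples in zip(*bands)]


def level_at(signal, freq_hz, sample_rate=SAMPLE_RATE):
    """Magnitude of a single DFT bin, used to inspect the spectrum."""
    re = sum(s * math.cos(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)


if __name__ == "__main__":
    for vowel in VOWEL_FORMANTS:
        samples = synth_vowel(vowel)
        print(vowel, [round(level_at(samples, f), 4) for f in (240, 720, 2400)])
```

Changing only the two formant frequencies is enough to move the output from one vowel colour to another, which is the core of the technique. Writing the samples out with Python’s standard wave module would make the difference audible, although the result is buzzy and far from natural speech.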
Before speech synthesis, older games used repeating filtered noises to suggest the pattern of speech [see Video 2]. Formant synthesis features prominently in modern games like Bastion (2011) or Celeste (2018). In Celeste, for example, each character’s dialogue voice was created by applying parametric EQ to oscillators, giving every voice its own timbre (Regamey, 2021) [see Link 2]. While most of these voices are a far cry from human speech, they still carry an expressiveness that text-based conversation alone does not. Additionally, there are points in the game where the synthesis verges on articulatory synthesis (Rubin and Baer, 1981) to create emphasis in dialogue [see Video 2].
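To make the idea of parametric EQ on an oscillator more concrete, here is a minimal Python sketch that runs a sawtooth oscillator through one or more peaking EQ bands. It assumes the widely used Audio EQ Cookbook coefficients for a peaking biquad filter. It is not Celeste’s actual implementation: the voice presets ("bright" and "dark"), their pitches and band settings, and the function names are all hypothetical and exist only to show how different EQ curves give the same oscillator a different timbre.

```python
import math

SAMPLE_RATE = 16000  # Hz


def sawtooth(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Naive sawtooth oscillator in [-1, 1)."""
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0
            for i in range(n_samples)]


def peaking_eq(signal, centre_hz, gain_db, q, sample_rate=SAMPLE_RATE):
    """One parametric EQ band (peaking biquad, Audio EQ Cookbook form)."""
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = 1.0 + alpha * a, -2.0 * math.cos(w0), 1.0 - alpha * a
    a0, a1, a2 = 1.0 + alpha / a, -2.0 * math.cos(w0), 1.0 - alpha / a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x1, x2 = x, x1
        y1, y2 = y, y1
    return out


# Hypothetical presets: (oscillator pitch in Hz, [(centre Hz, gain dB, Q), ...]).
VOICES = {
    "bright": (330.0, [(2500.0, 12.0, 2.0), (400.0, -6.0, 1.0)]),
    "dark": (110.0, [(500.0, 9.0, 2.0), (3000.0, -12.0, 1.0)]),
}


def voice_blip(name, duration_s=0.1):
    """Render one short syllable-like blip for the named voice preset."""
    pitch_hz, bands = VOICES[name]
    signal = sawtooth(pitch_hz, round(duration_s * SAMPLE_RATE))
    for centre_hz, gain_db, q in bands:
        signal = peaking_eq(signal, centre_hz, gain_db, q)
    return signal


if __name__ == "__main__":
    for name in VOICES:
        blip = voice_blip(name)
        print(name, len(blip), round(max(abs(s) for s in blip), 3))
```

In a dialogue system of this kind, one such blip could be triggered per syllable or per character of text, with pitch and EQ settings varied per speaker. That mapping is an assumption of this sketch rather than something the source describes.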
These examples clearly do not replace voice acting and dialogue like for like, yet the developers have, each in their own way, leaned into the aesthetic to suit their artistic style. On the more realistic side of vocal synthesis, Marc-André Carbonneau’s presentation on speech synthesis at this year’s GDC illuminates more ways we can use modern speech synthesis. One example he gives is the pre-production of cutscenes in games like Watch Dogs: Legion or Assassin’s Creed’s Discovery Tour of ancient Greece. Hiring voice actors for multiple sessions is expensive, and creating a cutscene can take much longer than the time needed to voice it, so Ubisoft has used voice synthesis to demo how characters would speak to each other while a cutscene is in development [see Video 4, 16:37]. The same approach could be applied not just to pre-production, but also to localisation and translation, character sound variation, and so on.
Of course, voice synthesis is not perfect. As Celeste clearly shows, formant synthesis alone does not produce a realistic voice. The AI speech synthesis in the Ubisoft demo also lacked the human quality and cadence to which we are acutely attuned. In conclusion, voice synthesis is not yet powerful enough to replace voice acting outright, but it can reduce the data and cost that voice acting involves. Game developers have grown wise to this and lean into the limitation rather than ignore it, which creates quirky and unique sonic styles for their games. Perhaps, as the technology progresses, voice and speech synthesis could one day fully replace voice acting.
Bridgett, R. (2009) A Holistic Approach to Game Dialogue Production. [Blog] 29 October 2009. Available from: https://www.gamedeveloper.com/design/a-holistic-approach-to-game-dialogue-production [Accessed 22 October 2021]
Carbonneau, M. (2021) Speech Synthesis in the Context of Video Gaming [Online Video]. Available from: https://www.gdcvault.com/play/1027229/Speech-Synthesis-in-the-Context [Accessed 26 October 2021]
Dudley, H. (1940) The Carrier Nature of Speech. The Bell System Technical Journal, 19, (4) October, p. 495.
Kazuo, A.S. (2021) Celeste Sound Designer Discusses Game’s Unique Dialogue System. [Blog] 17 July 2021. Available from: https://gamerant.com/celeste-game-dialogue-audio-explained/ [Accessed 23 October 2021]
McCarty, J. (2003) Formant Analysis. [Blog] 2003. Available from: https://ccrma.stanford.edu/~jmccarty/formant.htm [Accessed 26 October 2021]
McGee, M. (2017) How to design character voices for games || Waveform [Online Video]. 19 March 2017. Available from: https://www.youtube.com/watch?v=ekfsPO9FUgU&t=315s [Accessed 26 October 2021]
Phototristan. (2020) Retro Tech & Kraftwerk Influence – Vintage Texas Instruments Speak & Spell Demo [Online Video]. 2 August 2020. Available from: https://www.youtube.com/watch?v=og9NbWhn76E&t=105s [Accessed 26 October 2021]
Regamey, K. @regameyk (2021) Celeste’s dialogue design is the #1 most-asked-about topic when it comes to the game’s sound. I figured I’d share some of how we went about creating it. 17 July. [Online] Available from: https://twitter.com/regameyk/status/1416483583053602816 [Accessed 26 October 2021]
Rubin, P. and Baer, T. (1981) An articulatory synthesizer for perceptual research. The Journal of the Acoustical Society of America, 70, (2) August, pp. 321–326.
Schmidt, B. (2019) Dialogue for video games: 11 things you should know. [Blog] 11 September 2019. Available from: https://www.gamesoundcon.com/post/2019/09/09/dialogue-for-video-games-11-things-you-should-know [Accessed 22 October 2021]
Senior, T. (2012) Star Wars: The Old Republic scoops Guinness World Record for voice acting. [Blog] 6 January 2012. Available from: https://www.pcgamer.com/star-wars-the-old-republic-scoops-guinness-world-record-for-voice-acting/ [Accessed 25 October 2021]
Sweet, M. (2014) Writing Interactive Music for Video Games. Crawfordsville, Indiana: Pearson Education, pp.16-17.
TeDeMos. (2018) Celeste All Dialogues/Full Story Prologue + Chapter 1 Forsaken City [Online Video]. 24 February 2018. Available from: https://www.youtube.com/watch?v=TZpQH8kSWNU&t=305s [Accessed 26 October 2021]
Trammell, A. (2013) Video Gaming and the Sonic Feedback of Surveillance: Bastion and The Stanley Parable, March 25. [Online] Available from: https://soundstudiesblog.com/2013/03/25/surviellance-immersion-and-the-male-voice-in-bastion-and-the-stanley-parable/ [Accessed 26 October 2021]