【Gamystery】EP21: The Future of Sound: How Does AI Voice Generation Create New Value for the Gaming Industry?
- 彥澤 廖
- Feb 3
- 7 min read
Updated: Feb 5
Background
Recently, I've been developing a plugin designed to improve the integration of AI voice generation platforms into game engines. It is scheduled to be released on the Fab platform in April.
Its core function is to streamline the cumbersome processes that arise after using AI voice generation. In current tests, it saves over 70% of the time compared to the old workflow.
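To make "cumbersome processes" concrete: after a generation batch finishes, the audio still has to be named consistently, sorted by character, and imported into the engine by hand. Below is a minimal sketch of that kind of step using Unreal's editor Python scripting. It is an illustration of the manual work being automated, not the plugin itself, and the export folder, content path, and the "Character_LineId.wav" naming convention are assumptions for the example.

```python
# Minimal sketch (illustrative, not the plugin): batch-import AI-generated
# dialogue clips into Unreal through the editor's Python scripting API.
# Assumes files follow a hypothetical "<Character>_<LineId>.wav" convention.
import os
import unreal

SOURCE_DIR = r"D:/VoiceExports"        # assumed export folder from the AI platform
DEST_ROOT = "/Game/Audio/Dialogue"     # assumed content folder in the project

tasks = []
for file_name in os.listdir(SOURCE_DIR):
    if not file_name.lower().endswith(".wav"):
        continue
    character = file_name.split("_")[0]             # "Guard_Line042.wav" -> "Guard"
    task = unreal.AssetImportTask()
    task.filename = os.path.join(SOURCE_DIR, file_name)
    task.destination_path = f"{DEST_ROOT}/{character}"
    task.automated = True                           # skip import dialogs
    task.save = True                                # save the new SoundWave assets
    tasks.append(task)

unreal.AssetToolsHelpers.get_asset_tools().import_asset_tasks(tasks)
```

Multiply this by every character and every revision pass, and the time savings add up quickly.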
If you are involved in any of the following roles:
Animation production teams
Game development teams
Animators
Game development engineers
Sound designers
Engine integration engineers
and you’re interested in incorporating AI voice generation into your development process, you’re welcome to join the whitelist (limited to 50 people). (For details, please refer to the form.)
Given my "figure it out as I go" personality, with only two months left I started to reflect on what was missing from the overall development process. Besides ramping up the search for promotional channels, I realized I also lacked a thorough analysis and discussion to demonstrate the feasibility and value of this plugin.
So, during the Lunar New Year holiday, I gathered data and tried to apply the business analysis knowledge I had previously only learned from the internet and books.
The Role of Sound in Games
At a recent gaming meetup, I casually asked several development team members: “What are the primary skills of your top three teammates?” In nine out of ten cases, the answers were planning, programming, and art—with music always coming last. Even in larger game teams, the importance and production process of music are still ranked after those three areas.
Recently, I watched a YouTube video by Ali Elzoheiry titled "10 Ways to Make Combat Feel Better in Unreal Engine." The video carefully walks developers through how the cumulative effect of each layer produces positive feedback. From my non-professional, subjective perspective, sound effects contribute roughly 20% to 30% of the final game experience, which underscores the importance of pairing a game with the right sound environment.
I have also followed numerous AI technologies and explored how they might be applied in games. However, in reality, there are few cases of successful application—most technologies remain at the inspirational stage, with more than half the journey left before practical implementation. In my view, music and voice-over are the areas closest to real-world game applications.
There are three main reasons for this:
Limited and Constrained Resources
For small teams, historically, very few resources have been allocated to this area—a trade-off made due to technical and resource limitations.
Simple, Standardized Formats
Compared to 3D models, audio file formats such as .wav and .mp3 are much more standardized, which streamlines the workflow from AI generation to importing assets into the game engine. For instance, with AI-generated models, aside from quality issues, textures or materials often require manual adjustment and configuration. This puts AI-generated models at a disadvantage compared to professionally produced 3D models in cross-platform scenarios.
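As a rough illustration of why this matters: because .wav is a simple, well-specified container, a few lines of standard-library Python can sanity-check an entire batch of generated clips against a project's audio spec before import, with no per-asset manual setup. The folder name and target specs below are assumptions.

```python
# Minimal sketch: validate a batch of AI-generated .wav clips against an
# assumed project audio spec before importing them into the engine.
import os
import wave

SOURCE_DIR = "voice_exports"   # assumed export folder
TARGET_RATE = 44100            # assumed project sample rate (Hz)
TARGET_CHANNELS = 1            # assumed mono dialogue clips

for name in sorted(os.listdir(SOURCE_DIR)):
    if not name.lower().endswith(".wav"):
        continue
    with wave.open(os.path.join(SOURCE_DIR, name), "rb") as clip:
        ok = (clip.getframerate() == TARGET_RATE
              and clip.getnchannels() == TARGET_CHANNELS)
        print(f"{name}: {clip.getframerate()} Hz, {clip.getnchannels()} ch"
              f" -> {'OK' if ok else 'needs conversion'}")
```

There is no equivalent one-screen script for an AI-generated 3D model, where textures and materials still need human judgment before the asset is usable.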
Mature AI Technology
I believe that the quality of AI-generated voice-over and sound effects is already at a level acceptable to gamers. Other industries—such as social media, audiobooks, and healthcare—already showcase viable use cases. Personally, I often play games in foreign languages with subtitles, and in that context, it’s nearly impossible to distinguish between AI-generated and human-generated foreign language audio.
All these factors sparked my interest in the overall state of the dubbing industry and the impact of AI voice generation on the gaming industry. Below is what I found.
Industry Overview
According to 2022 data from AI voice service provider Speechify, the dubbing market is roughly a $4.4 billion industry, relatively small compared to other sectors (for example, the gaming industry, which generates around $200 billion annually). (Data for 2024 is expected to be similar.)
The dubbing market encompasses two service categories:
Dubbing: This refers to completely replacing the original voice with another language or sound, while ensuring that the character's lip-sync, emotions, and tone are appropriately matched.
Voice-over: Mainly used for narration or documentaries, where the original audio might still be audible, or the voice does not need to perfectly correspond to the character’s lip movements.
The entertainment media sector represents the largest share of this industry, covering films, television, podcasts, and online entertainment videos. In 2024, these segments are expected to account for over 32.5% of the total market share, mainly because films, TV shows, and streaming platforms not only require original dubbing but also benefit from localized content. North America dominates the dubbing and voice-over market, holding over 43.5% of the share and generating approximately $1.8 billion in revenue.
This growth is largely driven by the demand for multilingual content resulting from localization efforts, which creates demand for various languages and boosts overall market revenue.
Live-recorded production processes accounted for 58.2% of the market in 2024. This indicates that AI technology still cannot meet the threshold required for traditional commercial services, underscoring the importance of authentic and contextually relevant voice performance.
New Technology, New Markets
Although several reports indicate that the streaming boom is a major driver of market growth, this customer segment is not easy for AI SaaS providers to tap into, primarily because of quality thresholds and pricing mechanisms: streaming content is largely an extension of traditional audio-visual formats.
The current best entry points focus on two sectors: social media and audiobooks. The former faces pressures related to production volume and costs, while the latter represents a hybrid model that bridges text and voice. According to forecasts, the audiobook market is expected to reach a value of $19.4 billion by 2027 (source).
The Impact of AI Generation on Voice Acting
Below is my analysis, which attempts to follow the theories presented in "Seeing What's Next: Using the Theories of Innovation to Predict Industry Change".
Due to limits in available data and first-hand experience, the scope of this analysis is relatively narrow.
Some terminology borrows concepts from the book; if you'd like to dig deeper, please refer to it directly.
In the following text, “AI generation” specifically refers to AI-generated dubbing and sound effects.
Stakeholders
Incumbents: Voice acting companies
New Entrants: AI voice platforms
Changes Brought by New Entrants
Type (Conclusion): Disruptive Innovation
Characteristics: "Low cost," "Convenience," "Room for service development," and "Creating a market that did not previously exist."
The gaming market generally does not place enough emphasis on voice, which appears to stem from two factors:
Funding
In development teams with limited resources, dubbing often provides only a marginal improvement to the player's experience. The enhancement depends not only on the game genre but also on the development costs involved. To create a high-quality audio-visual experience, teams usually need to acquire dubbing materials, then split and organize these assets according to system requirements before integrating them into the game engine.
Workflow
From an overall production process perspective, even when resources are abundant, dubbing is closely intertwined with the script and textual content. It is easy to imagine that a game's script must undergo repeated revisions and testing. Unless the production team has a highly efficient voice acting collaboration process, the repetitive communication and revisions can result in significant cumulative costs.
AI generation technology offers development teams a better option than before, with the potential to solve a problem that was previously unsolvable (assuming canned sound effects are not considered a viable solution).
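To make the contrast concrete: when a script line changes, regeneration can become a batch call rather than a new recording session. The sketch below assumes a purely hypothetical text-to-speech HTTP endpoint and request format; real platforms differ, but the shape of the loop is similar.

```python
# Minimal sketch: re-generate only the dialogue lines whose text changed.
# The endpoint, parameters, and response handling are hypothetical placeholders;
# a real integration would follow the chosen AI voice platform's own API.
import json
import os
import urllib.request

API_URL = "https://api.example-voice.com/v1/generate"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder credential

def regenerate(line_id: str, character: str, text: str) -> None:
    payload = json.dumps({"voice": character, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    os.makedirs("voice_exports", exist_ok=True)
    with urllib.request.urlopen(request) as response:
        # Assumes the hypothetical API returns raw audio bytes.
        with open(f"voice_exports/{character}_{line_id}.wav", "wb") as out:
            out.write(response.read())

# Only the lines revised in the latest script pass need to be regenerated.
revised_lines = [
    ("Line042", "Guard", "Halt! The gate is closed after sundown."),
]
for line_id, character, text in revised_lines:
    regenerate(line_id, character, text)
```

In principle, a revision that once meant rebooking studio time becomes a short batch job folded into the existing iteration loop.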
Resources, Workflow, and Values
Incumbents
Voice acting companies serve a wide variety of media formats—games, anime, films, and more. Their workflow encompasses early-stage business development, mid-stage asset recording, and post-production; well-established companies can manage the entire process in-house.
In terms of resources, many voice acting companies remain privately held, which leads to two inferences: 1) limited scale, and 2) a reluctance to expand. According to Wikipedia, several Japanese voice acting companies have been operating since the 1950s, evolving from the era when radio became popular. In Japan, family-run businesses may have a slightly higher chance of going public, but overall, the industry's scale is inherently limited, making it difficult to create companies large enough for an IPO.
To cultivate "star voice actors," these companies’ overall values center on "management" and "production." Much like idols or actors, nurturing talented and well-known voice actors is a major challenge that companies must overcome. This also implies the need for a multi-step, well-planned process aimed at addressing the aspects of "training" and "exposure."
New Entrants
In contrast, teams building AI SaaS services reshape their value proposition around "software" and "platforms" rather than relying on "star power." They focus on how to provide platform users with better (more dynamic), more abundant (a larger pool of voices), and cheaper (subscription-based) solutions.
AI teams follow the Silicon Valley startup model, where a single round of funding can reach hundreds of millions (e.g., ElevenLabs Series C). This implies that to recoup these investments, their vision must extend far beyond traditional, narrow boundaries.
What Kind of Innovation Do New Entrants Represent?
Incumbents
Voice acting companies primarily target mainstream projects that already have a certain level of finished production. Their works are predominantly non-realistic in nature, and for live-action projects, the demand shifts from dubbing to voice-over—demand that may not exist unless the program is aimed at international markets.
New Entrants
Based on advertisements and official videos, new entrants not only create voices for non-realistic characters but also emphasize that productions featuring real-life scenes can offer significant "production flexibility."
Use Cases:
Individuals: During the post-visual planning phase, synchronizing visuals with the text design and using generative voice-over can lower costs while avoiding common issues associated with human-recorded audio, such as mouth noises or mispronunciations.
Teams: Generative voice-over addresses the high post-production costs that arise when the main audio tracks are recorded by different people and material has to be re-recorded.
New entrants are attempting to create a customer base distinct from that of traditional voice acting companies. This tool may not be feasible within conventional television production workflows, where professional specialization divides responsibilities (e.g., planning teams, hosts, and post-production teams operating separately).
With the rise of social media, production teams have become leaner, and audience viewing preferences have shifted. Cross-team production is increasingly scarce and expensive. Often, on-screen personnel are core team members who are also the owners, making the notion of treating a “voice model” as a team asset more acceptable.
SaaS solutions create demand for improvements in existing production workflows that are both "convenient" and "affordable," without directly conflicting with the business models of incumbents. In this context, SaaS embodies the characteristics of disruptive innovation as described in the literature.
If you enjoyed this issue's content, please like and follow my fan page—it fuels my ongoing creativity. Alternatively, subscribe to the newsletter to receive more timely Unreal-related tech updates.