Cloud-based Wavenet TTS for UNITY
- teejaydub
- May 25, 2018
- 4 min read
Updated: May 30, 2018
It's the next morning, and, of course, right back into this "weekend" project of real-time, cloud-based, neural-network-generated voices in a video game. It's a mouthful, but put simply, the goal is to create an in-game chat system whereby messages sent to and from a player are read out loud by a highly realistic synthesized voice. The final touch is that the voice will be passed through a filter to sound as if it is spoken through a walkie-talkie. Before we start, here's the semi-finished idea in action!
If you haven't read part one, go do that now, as it will help this blog post make sense.
First, I wanted to get the same level of success that I had with the Python script, but this time through a C# script. To do that, I created a new script called TTS and discovered the UnityWebRequest class, which lets you send data to an HTTP server and read back the response.

Referring to the documentation, I was able to put together what looked like a correct way to send data to and receive data from the Google server, but it wasn't successful.
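For context, a documentation-style attempt along these lines (a sketch, not the original code; the endpoint and key are placeholders) sends the data as an ordinary form, which isn't what the API wants:

```csharp
// Sketch of a documentation-style POST. WWWForm sends the fields as
// application/x-www-form-urlencoded data, which is not the JSON the API expects.
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class NaiveTTSAttempt : MonoBehaviour
{
    // Placeholder endpoint and key, not the real values from the project.
    const string url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=YOUR_API_KEY";

    IEnumerator Speak(string text)
    {
        WWWForm form = new WWWForm();
        form.AddField("input", text); // the API wants nested JSON, not a flat form field

        UnityWebRequest request = UnityWebRequest.Post(url, form);
        yield return request.SendWebRequest();

        Debug.Log(request.downloadHandler.text); // comes back as an error, not audio
    }
}
```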
I had to head back to Google and direct my attention to some forums that dealt specifically with sending JSON data to a server. It seemed like, to accomplish that, you had to send the body as raw data rather than as a form. Rewriting the script on the forum's advice, I ended up with the following:
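In essence, a minimal sketch of that raw-data approach, assuming Google's Text-to-Speech REST endpoint and request format (the endpoint, key, and voice name below are placeholders rather than the exact script):

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class TTS : MonoBehaviour
{
    // Placeholder endpoint, API key, and voice; swap in your own values.
    const string url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=YOUR_API_KEY";

    public IEnumerator Speak(string text)
    {
        // Build the JSON request body by hand.
        string body = "{\"input\":{\"text\":\"" + text + "\"}," +
                      "\"voice\":{\"languageCode\":\"en-US\",\"name\":\"en-US-Wavenet-D\"}," +
                      "\"audioConfig\":{\"audioEncoding\":\"MP3\"}}";

        // Send the body as raw bytes rather than as a WWWForm.
        UnityWebRequest request = new UnityWebRequest(url, "POST");
        request.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Content-Type", "application/json");

        yield return request.SendWebRequest();

        // The response is JSON containing a base64-encoded "audioContent" field.
        Debug.Log(request.downloadHandler.text);
    }
}
```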

This time, I got a response, and it was in the correct format! Now it was just a matter of parsing the string response, decoding it, and saving it as an mp3. I saved the file to the Resources folder in the Unity project, but to my dismay, the process of loading the mp3 was inconsistent in both its success rate and the time it took to start playing back. Sometimes it would play back the previously generated text, and it would take anywhere from two to three seconds to actually play the audio.
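The response is a small JSON object whose audioContent field holds the mp3 as a base64 string, so the decode-and-save step can be sketched like this (the class and method names are my own placeholders, not the project's):

```csharp
using System;
using System.IO;
using UnityEngine;

// Matches the shape of Google's response: { "audioContent": "<base64 mp3>" }.
[Serializable]
public class SynthesisResponse
{
    public string audioContent;
}

public static class TTSDecoder
{
    // Decodes the base64 audio out of the JSON response and writes it to disk.
    public static void SaveMp3(string responseJson, string path)
    {
        SynthesisResponse response = JsonUtility.FromJson<SynthesisResponse>(responseJson);
        byte[] mp3Bytes = Convert.FromBase64String(response.audioContent);
        File.WriteAllBytes(path, mp3Bytes);
    }
}
```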
Then I thought about other ways of accessing the file besides loading it from the Resources folder. I came across some documentation on streaming media in Unity using the WWW class.

As you might imagine, the WWW class normally deals with locations on the internet, but much to our benefit, you can also point it to a local file location. Once implemented, accessing the audio clip after saving it becomes a matter of a couple of lines of code.
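Something along these lines, assuming the mp3 has been written to a known path and that the platform can decode mp3 through WWW (a sketch, not the project's exact lines):

```csharp
using System.Collections;
using UnityEngine;

public class ClipLoader : MonoBehaviour
{
    // Loads a saved mp3 back in as an AudioClip and hands it to a callback.
    public IEnumerator LoadClip(string filePath, System.Action<AudioClip> onLoaded)
    {
        // WWW happily takes a local path as long as it's expressed as a file:// URL.
        WWW www = new WWW("file://" + filePath);
        yield return www;

        AudioClip clip = www.GetAudioClip(false, false, AudioType.MPEG);
        onLoaded(clip);
    }
}
```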

The audio clip is sent to the Sound Manager object in my project, which contains functions for playing all sorts of different sounds, even with different effects. One effect that I implemented is a "walkie-talkie" filter made up of a low-pass filter, a high-pass filter, and a distortion effect. With the effects in place and the Text to Speech script functional, the incoming message sounds as if it is being spoken by a character in the game trying to communicate over a walkie-talkie. Awesome.
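I won't list the exact settings here, but the general shape of a walkie-talkie chain using Unity's built-in audio filter components looks something like this (the cutoff and distortion numbers are guesses, not the project's actual values):

```csharp
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class WalkieTalkieVoice : MonoBehaviour
{
    void Awake()
    {
        // Band-limit the voice the way a cheap radio speaker would.
        AudioHighPassFilter highPass = gameObject.AddComponent<AudioHighPassFilter>();
        highPass.cutoffFrequency = 400f;   // drop the low rumble

        AudioLowPassFilter lowPass = gameObject.AddComponent<AudioLowPassFilter>();
        lowPass.cutoffFrequency = 3500f;   // cut the highs

        // A bit of distortion sells the over-the-radio effect.
        AudioDistortionFilter distortion = gameObject.AddComponent<AudioDistortionFilter>();
        distortion.distortionLevel = 0.4f;
    }

    public void Play(AudioClip clip)
    {
        GetComponent<AudioSource>().PlayOneShot(clip);
    }
}
```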
Time for a chat system! I created an empty game object and began writing a script to facilitate the conversations. Pressing enter opens up the chat system and turns on the chat object. The script monitors key presses, lengthening and shortening the string and displaying it on a 3D Text Mesh in the scene.
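A bare-bones version of that input loop might look like the following; the field names and the TextMesh reference are stand-ins for whatever the real chat script uses:

```csharp
using UnityEngine;

public class ChatInput : MonoBehaviour
{
    public TextMesh inputDisplay;   // 3D Text Mesh showing the line being typed
    string currentMessage = "";

    void Update()
    {
        // Input.inputString contains the characters typed this frame.
        foreach (char c in Input.inputString)
        {
            if (c == '\b')
            {
                // Backspace: shorten the string.
                if (currentMessage.Length > 0)
                    currentMessage = currentMessage.Substring(0, currentMessage.Length - 1);
            }
            else if (c != '\n' && c != '\r')
            {
                // Regular character: lengthen the string. (Return is handled by the submit step.)
                currentMessage += c;
            }
        }
        inputDisplay.text = currentMessage;
    }
}
```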

Upon hitting the return key, a few things happen (roughly sketched in the code after this list):
1. The entered string is sent to the "speak" function we wrote in the TTS script, which contacts Google for the audio clip and passes it to the Sound Manager to play back.
2. The string gets sent to the Network Manager script to be delivered to the opponent's computer and run through the same function in their TTS script.
3. The string gets stored in the "chat history" list of strings, and the Text Mesh for the chat history gets updated to include the latest line of the conversation.
4. The chat window becomes disabled.
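Pulled together, the submit step might look roughly like this; TTS.Speak, the commented-out Network Manager call, and the field names are stand-ins rather than the project's real ones:

```csharp
using System.Collections.Generic;
using UnityEngine;

public class ChatSubmit : MonoBehaviour
{
    public TTS tts;                    // the text-to-speech script sketched earlier
    public TextMesh historyDisplay;    // Text Mesh showing the chat history
    public GameObject chatWindow;      // the chat object toggled by Enter

    List<string> chatHistory = new List<string>();

    public void Submit(string message)
    {
        // 1. Speak it locally.
        StartCoroutine(tts.Speak(message));

        // 2. Ship it to the opponent (placeholder for the Network Manager call),
        //    where it runs through the same TTS function on their machine.
        // networkManager.SendChatMessage(message);

        // 3. Store it and refresh the visible chat history.
        chatHistory.Add(message);
        historyDisplay.text = string.Join("\n", chatHistory.ToArray());

        // 4. Close the chat window.
        chatWindow.SetActive(false);
    }
}
```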
When you receive a message from a connected opponent, steps 1 and 3 occur, although slightly differently:
In step 3, the text is added to the Text Mesh for the chat history in a red color, but is not added to the "chat history" list of strings. That's because an additional feature I implemented is the ability to step through your personal chat history by tapping the up and down arrow keys, placing a previous message in the active text box, ready to be sent again. Since we probably don't want that feature to include re-sending messages our opponent sent to us, we don't add the opponent's text to that list.
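The history-recall part could be sketched like so (my own approximation of the behaviour described above, not the actual script):

```csharp
using System.Collections.Generic;
using UnityEngine;

public class ChatHistoryRecall : MonoBehaviour
{
    public List<string> chatHistory = new List<string>(); // only our own sent messages
    public TextMesh inputDisplay;

    int recallIndex = -1; // -1 means "not browsing history"

    void Update()
    {
        if (chatHistory.Count == 0) return;

        if (Input.GetKeyDown(KeyCode.UpArrow))
        {
            // Step backwards through previously sent messages.
            recallIndex = recallIndex < 0 ? chatHistory.Count - 1 : Mathf.Max(recallIndex - 1, 0);
            inputDisplay.text = chatHistory[recallIndex];
        }
        else if (Input.GetKeyDown(KeyCode.DownArrow) && recallIndex >= 0)
        {
            // Step forwards again.
            recallIndex = Mathf.Min(recallIndex + 1, chatHistory.Count - 1);
            inputDisplay.text = chatHistory[recallIndex];
        }
    }
}
```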

I've also added string variables to represent each of the options Google offers when requesting speech synthesis. This isn't visible to the user at the moment, but through the Unity editor, I can specify which voice to use, the pitch, the speed, and the language!
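Those options map straight onto fields in the request body, so the script ends up with editor-tweakable variables along these lines (the defaults here are mine, not necessarily the project's):

```csharp
using System.Globalization;
using UnityEngine;

public class TTSVoiceSettings : MonoBehaviour
{
    // Exposed in the Inspector so the voice can be tuned without touching code.
    public string languageCode = "en-US";
    public string voiceName = "en-US-Wavenet-D";
    public float pitch = 0f;          // offset around the default pitch
    public float speakingRate = 1f;   // 1.0 = normal speed

    // Folds the settings into the JSON body sent off to Google.
    public string BuildRequestBody(string text)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{{\"input\":{{\"text\":\"{0}\"}}," +
            "\"voice\":{{\"languageCode\":\"{1}\",\"name\":\"{2}\"}}," +
            "\"audioConfig\":{{\"audioEncoding\":\"MP3\",\"pitch\":{3},\"speakingRate\":{4}}}}}",
            text, languageCode, voiceName, pitch, speakingRate);
    }
}
```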
A useful note is that variables on scripts can be accessed even if the object they are attached to is disabled. This is fortunate because we need to update the visible chat log even when the chat window isn't open.
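In practice, that means something like the following is fine even while the chat object is switched off; only callbacks like Update() and coroutines stop running on a disabled object (the ChatSubmit name here is the placeholder from the earlier sketch):

```csharp
using UnityEngine;

public class ChatLogUpdater : MonoBehaviour
{
    public GameObject chatWindow; // may be inactive most of the time

    // Called whenever a message arrives, whether or not the chat window is open.
    public void AppendLine(string line)
    {
        // GetComponent and field access both work on an inactive GameObject.
        ChatSubmit chat = chatWindow.GetComponent<ChatSubmit>();
        chat.historyDisplay.text += "\n" + line;
    }
}
```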
Further considerations: I think I might implement a system whereby you enter a short code like "/t" or "tts" before the message you wish to send if you want it read aloud by text to speech; otherwise the text is sent with no voice.
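If I go that route, the check could be as simple as stripping a known prefix before deciding whether to call the TTS script (purely illustrative; none of this exists yet):

```csharp
using UnityEngine;

public class ChatCommands : MonoBehaviour
{
    public TTS tts;

    // Hypothetical opt-in: only lines starting with "/t " get read aloud.
    public void Send(string raw)
    {
        if (raw.StartsWith("/t "))
        {
            string spoken = raw.Substring(3);
            StartCoroutine(tts.Speak(spoken)); // speak it, then send as usual
        }
        // else: send the text silently (network call omitted from this sketch)
    }
}
```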
There is also a more comprehensive type of text input that Google accepts called SSML, or Speech Synthesis Markup Language. This allows for "pauses, acronym pronunciations, or other additional details into the audio data" to really refine how the final speech sounds. Adding that functionality to the chat system would open up the possibility of some pretty wild outcomes.
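As a rough illustration (my own example, not something from the project), the request's input object would use an "ssml" field instead of "text", and the string itself would look something like this:

```csharp
public static class SsmlExample
{
    // Illustrative SSML: a pause and a letter-by-letter readout of "B7".
    public const string Sample =
        "<speak>" +
        "Enemy spotted. <break time=\"400ms\"/> " +
        "Requesting backup at grid <say-as interpret-as=\"characters\">B7</say-as>." +
        "</speak>";
}
```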
A quick thought on Google's cloud-based, Wavenet-powered speech synthesis tool: as much as I'd like it to be, it isn't a free tool. I'm currently signed up for a 12-month trial and was granted the standard $300 credit for use in any of their APIs. The first million characters of synthesized text per month are free; beyond that, Google's Wavenet-generated voices cost $16 per million characters. The practicality of putting that in a multiplayer game is uncertain, as it could be abused, or would at least have to be limited to avoid abuse.
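For a rough sense of scale using those numbers: at $16 per million characters, the $300 credit works out to roughly 18.75 million Wavenet characters beyond the free monthly million.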