Cloud-based WaveNet TTS for Python

teejaydub
May 24, 2018

Okay, it's about to be 2AM. Do you ever get that feeling, when you have this supreme urge to finish something? Maybe it's a book you're reading or a TV show you're watching and you just can't stop doing the thing until you feel some sense of accomplishment (or shame, in the case of TV) when you complete it? Yeah, that's the general mood for the past 2 hours. Let's talk about it.

Speech synthesis is bad, right? It doesn't sound like the real thing... y'know, good ol' human speech. Well that notion is changing because of the smart people at, you probably guessed it, Google. They're combining their immense research in the field of neural networks and machine learning and applying it to speech, and speech synthesis. The result? Eerily realistic speech.

You can even try it yourself if you scroll down on that page. The top of that page, though, is what caught my attention. They have this Text-to-Speech tool available to use in the cloud through the use of their API. Right away I thought of a challenge. Get this to work and stick it in a game. Speech synthesis in a multiplayer game format. We've seen this before and it's fucking hilarious. But this time, with a lil' more realistic audio.

So, the goal is set, off to the races! Except I really don't have a clue what I'm doing. On that page I linked, to the left of the "Speak It" button, is a collapsible menu that says "Show JSON". I think I've heard of JSON a few times but I'm really not familiar with it. Looks like this:

Seems like, you send a request to that URL and submit the body and it somehow returns data in the form of audio? My conclusion is that it must work in some way like that. If you try and navigate directly to that request URL, you get a 404 error. Noted.
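For reference, here is a rough sketch of what a request body for this endpoint looks like when built in Python. It is based on the public `text:synthesize` request format, not copied from the demo page, and the sample text and voice name are placeholder assumptions.

```python
import json

# Assumed shape of the body for the text:synthesize endpoint.
# The text and voice name are placeholders, not values from the demo page.
body = {
    "input": {"text": "Hello from the cloud."},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"},
}

# JSON is just text, so the dict is serialized to a string before sending.
payload = json.dumps(body)
print(payload)
```

JSON turns out to be nothing more than a text format for nested keys and values, which is why a plain Python dict maps onto it so directly.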

I thought, well, before we get into Unity and start fiddling around, let's take a step back to try and understand what I'm working with. Google APIs, JSON, and since I figure it had to work in some capacity, Python.

After Googling such choice word combinations as "json request body python" I come across, whaddaya know, a Python module named "Requests".

At this point I'm not sure if it's even what I need, or if JSON is related to HTTP, but I pursue this avenue. After updating, installing, and testing the latest version of Python and installing auxiliary modules to then install this Requests module, it seems like it's ready. In my ongoing search, I come across a list of examples of Request uses, and get some confidence restored that this is the right tool.

I don't have to really understand the language to gather from this example that there's a built-in function, requests.post, that takes in a URL argument and a data argument that seems to be in JSON format. It looks like this might be the right tool after all. So, I start deconstructing this example to get my Python script started.
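A minimal sketch of that kind of call, assuming the Requests package is installed. The helper names `build_body` and `post_synthesize` are made up for illustration, and the body shape is an assumption based on the public request format, not the original script.

```python
import json

URL = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"


def build_body(text):
    """Return the JSON-serializable request body (assumed shape)."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": "en-US"},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def post_synthesize(text, url=URL):
    """POST the body and return (status_code, response_text)."""
    # Imported here so build_body() still works without Requests installed.
    import requests

    response = requests.post(url, data=json.dumps(build_body(text)))
    return response.status_code, response.text
```

Calling `post_synthesize("hi")` against the bare URL is not expected to return audio; without any credentials the service should answer with an error status and a message.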

To my disbelief, I actually got a response from this code in the form of an error message.

It was something to the tune of "You need an Authorization Key to use this Google API"

Well, getting that much of a response gave me a lot of motivation to get this thing working. So I headed back to the Google text to speech page and hit the shiny "Try for free" button at the top. This would give me access to their Cloud API and hopefully I could figure out a way to set up some authentication to authorize the use of their Cloud API.

After reading some brief documentation, I "activated" the Text to Speech Cloud API and generated an authentication key. The documentation as to how to implement the authentication key was less straightforward and basically said "Just append the key to the end of the request."

A bit of trial and error and Googling led me to believe that you append it at the end of the actual URL you are trying to contact, so instead of

url = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize'

I tried

url = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize&key=mysupersecretkey'

And all I was getting was a 404 error. Turns out, it's not supposed to be a "&", it's actually a "?"

So the correct format is:

url = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=mysupersecretkey'
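The reason for the "?" is that it marks the start of a URL's query string, while "&" only separates additional parameters after the first one. A small sketch using the standard library to build the URL, where `with_key` is a hypothetical helper name:

```python
from urllib.parse import urlencode

BASE_URL = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"


def with_key(base_url, api_key):
    """Append the API key as a query parameter."""
    # "?" starts the query string; "&" would only separate later parameters.
    return base_url + "?" + urlencode({"key": api_key})


url = with_key(BASE_URL, "mysupersecretkey")
print(url)
```

Requests can also do this step itself: passing `params={"key": api_key}` to `requests.post` adds the query string to the URL.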

What printed next was, erm, interesting. Lots and lots of seemingly gibberish characters, so much so that it was filling the terminal and I couldn't see the start of it.

Turns out this is actually correct! From the Google documentation:

Very conveniently, Python has the base64 module installed and so I could decode it that way. I felt like I was so friggin close. And I was!
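A quick sketch of what base64 does, using stand-in bytes instead of real audio: binary data gets encoded as plain text characters, and decoding turns that text back into the original bytes.

```python
import base64

# Stand-in bytes; a real response would carry MP3 data instead.
original = b"not really an mp3"

# Encoding produces plain ASCII text, which is what fills the terminal.
encoded = base64.b64encode(original).decode("ascii")

# Decoding turns that text back into the raw bytes.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == original)
```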

Now the code looked something like this.

Once I get the response, I decode the data and save it to an mp3 file. It took me a little while to figure out I needed to write response.content to access the, uh, content of the response. Took some trial and error as well to learn to open the file with the argument "wb" which says we're going to write to it, and write in binary as opposed to strings.
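The "wb" part is easy to try in isolation. This sketch writes a few placeholder bytes (not real audio) to a temporary file and reads them back; with plain "w", Python expects strings and raises a TypeError when handed bytes.

```python
import os
import tempfile

audio_bytes = bytes([0xFF, 0xFB, 0x90, 0x00])  # placeholder bytes, not real audio

# A temporary folder is an assumption made to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "Output.mp3")

# "wb" = write + binary: the bytes go to disk untouched.
with open(path, "wb") as f:
    f.write(audio_bytes)

# Reading back in binary mode returns exactly what was written.
with open(path, "rb") as f:
    print(f.read() == audio_bytes)
```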

Running the program aaaaand….no dice. The files have a size but no audio playback in VLC.

To help me debug the problem, I save a text file of the "content" before I decode it, and discover what might be the cause of the problem. Remember how I was saying that the crazy letters filled up the terminal so I couldn't see the start of the message? Well, before all of the gibberish letters, there's the following:

The actual sound file is represented by everything that's inside of the quotations containing the gibberish. If we leave that first bracket and the "audioContent": in the file, when it's time to be decoded into an mp3, it breaks the file and doesn't allow it to play. So, let's do some string magic and get just the contents of those quotes.
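To see why the wrapper matters, here is a small sketch with stand-in data. By default Python's `base64.b64decode` quietly skips characters that are not part of the base64 alphabet, so the letters of "audioContent" get decoded as if they were audio data. In this toy case that produces extra junk bytes at the front; with a real response it may just as well end in a padding error.

```python
import base64
import json

audio = b"hello"  # stand-in for real MP3 bytes
response_text = json.dumps(
    {"audioContent": base64.b64encode(audio).decode("ascii")}
)

# Decoding the whole response: the braces, quotes and colon are skipped,
# but the letters of "audioContent" are treated as base64 data too.
whole = base64.b64decode(response_text)

# Decoding only the field value gives back the original bytes.
field_only = base64.b64decode(json.loads(response_text)["audioContent"])

print(len(whole), len(field_only))
print(field_only == audio)
```

That lines up with the symptom: a file that has a size but starts with bytes no player recognizes.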

After splitting the text based on the quotation mark, we get an array, and one of those array slots is going to contain all that good gibberish data that we need (It's array location 4, or ra[3]).
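A sketch of that split on a stand-in response with the same layout, next to an alternative that parses the JSON and looks the field up by name. The sample string is made up; only the variable name `ra` comes from the post.

```python
import base64
import json

# Stand-in response with the same layout as the API output.
response_text = '{\n  "audioContent": "aGVsbG8="\n}'

# The string-splitting approach: pieces between quotation marks.
ra = response_text.split('"')
from_split = ra[3]

# Alternative: parse the JSON and look the field up by name.
from_json = json.loads(response_text)["audioContent"]

print(ra)
print(base64.b64decode(from_json))
```

The split works as long as the response keeps exactly this layout. Parsing is less fragile, and Requests offers `response.json()` for the same purpose.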

Note that I also added an input line so you can define what you want to say every time you run it! And when you run it, it works!

To finish it off, I added a line to open the finished mp3 file and stuck the whole thing in a loop so you could keep on writing after running the Python script.
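One way to sketch that loop is shown below. The function name `speak_loop`, the `synthesize` callback, and stopping on an empty line are all assumptions for illustration, not the original script. The demo feeds canned lines instead of keyboard input so it runs without a prompt.

```python
def speak_loop(synthesize, read_line=input):
    """Ask for text repeatedly; stop on an empty line. Returns the count spoken."""
    spoken = 0
    while True:
        text = read_line("What should I say? ").strip()
        if not text:
            break
        synthesize(text)  # hypothetical callback: request, decode, save, play
        spoken += 1
    return spoken


# Non-interactive demo: canned lines stand in for the keyboard.
lines = iter(["hello", "goodbye", ""])
said = []
count = speak_loop(said.append, read_line=lambda prompt: next(lines))
print(count, said)
```

In a real run, `read_line` stays at its default of `input`, and the callback would end by opening the saved file. Note that `os.startfile` only exists on Windows.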

os.startfile(r"c:\project\Output.mp3", "open")

One other note: the Requests module for Python is weird in that I couldn't simply run the script from IDLE. I had to open an elevated command prompt, navigate to the project, then run the Python executable with arguments to start a virtual environment, which allowed the Requests module to function properly. To make life easier I just wrote a quick batch file that enters all those commands to open up the script.

cd c:\project

c:\python36\python -m pipenv run python

In the end, I got this thing operational in a couple hours, and there's almost no latency. Next up is to figure how to rewrite this code in C# so that I could integrate this into Unity and therefore into a friggin video game. How cool is that?
