As Ascension's Audio Lead, I'll be releasing several dev logs detailing the audio side of the project. Naturally, I'll start with the speech recognition system and how we got it working. This first post goes into detail about IBM Watson's features, both its capabilities and its limitations, and how they've affected us so far.
Ascension’s speech recognition system relies on IBM Watson’s Unity SDK. We’re using IBM Watson’s Speech to Text and Conversation services. Speech to Text parses the player’s voice input into text that is fed into the Conversation service, which then figures out the player’s “intent” in order for the Unity game engine to play back the correct voiceline in response.
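The flow above can be sketched in plain C#. To be clear, the interfaces and names here are illustrative stand-ins for the pipeline, not the actual Watson Unity SDK types:

```csharp
using System;

// Illustrative sketch of the voice pipeline: audio in -> transcript -> intent
// -> key of the voiceline for the game to play. Not real Watson SDK classes.
interface ISpeechToText { string Transcribe(byte[] audio); }
interface IConversation { string ClassifyIntent(string transcript); }

class VoicePipeline
{
    readonly ISpeechToText stt;
    readonly IConversation conversation;

    public VoicePipeline(ISpeechToText stt, IConversation conversation)
    {
        this.stt = stt;
        this.conversation = conversation;
    }

    public string Handle(byte[] audio)
    {
        // Speech to Text parses the player's voice input into text...
        string transcript = stt.Transcribe(audio);
        // ...the Conversation service figures out the player's intent...
        string intent = conversation.ClassifyIntent(transcript);
        // ...and the intent maps to the voiceline Unity should play back.
        return "voiceline/" + intent;
    }
}
```

In the real project the two services are network calls made through the SDK; the point of the sketch is just that the game code only ever sees an intent, never raw audio.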
The Speech to Text service is relatively simple to set up. After grabbing the auto-generated credentials from IBM Watson’s website and adding them to the C# script in the Unity SDK, Speech to Text was up and running. The Conversation service, on the other hand, takes more setup and requires more maintenance as the project grows in scope. Conversation logic lives in Workspaces, which hold everything the game needs. We mainly work with Intents: words and/or phrases that IBM Watson will specifically listen for as the player interacts with the experience.
IBM Watson is smart and does a fantastic job figuring out what the player says and matching it to an Intent. This is a bit of a double-edged sword. On one hand, it takes the pressure off our shoulders: we don’t have to manually type in every single way a player might phrase something, along with all of its countless variations. However, it sometimes does too good a job. A prime example is the countdown mechanic we use to introduce voice input in the opening sequence. The Countdown Intent is “321” (three two one). It reliably fires whenever the player says “three two one.” Unfortunately, IBM Watson also matches the Countdown Intent when the player says only “three.” Although the prompt specifically asks the player to say “three two one,” edge cases can occur where players say something that is not the countdown but is close enough to still trigger it, breaking immersion. We have noticed there is a feature for declaring “conflicts,” which we suspect is meant for cases like these where only the exact phrase should trigger the Intent, but we have not had a chance to test that theory since the feature is only available on premium plans of IBM Watson.
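Since the conflicts feature is out of reach on the free plan, one workaround is to double-check Watson's answer on our side. This is a hypothetical guard of our own, not a Watson feature: for intents that need the exact phrase, only accept the match when the transcript actually contains that phrase.

```csharp
using System;

// Hypothetical client-side guard: even if Watson returns the Countdown
// intent, only accept it when the transcript contains the full phrase.
class IntentGuard
{
    // Intents whose transcript must contain an exact phrase to be accepted.
    static readonly (string Intent, string Phrase)[] ExactPhrases =
    {
        ("Countdown", "three two one"),
    };

    public static bool Accept(string intent, string transcript)
    {
        foreach (var (name, phrase) in ExactPhrases)
        {
            if (name == intent)
                return transcript.ToLowerInvariant().Contains(phrase);
        }
        return true; // all other intents pass through unchanged
    }
}
```

With this in place, a transcript of just “three” still matches the Countdown Intent on Watson's side, but the game ignores it and keeps waiting for the full phrase.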
As a student project, we have been operating on the free/lite plans for the IBM Watson services, meaning we are limited to 10,000 API calls to Conversation and 100 minutes of Speech to Text usage, resetting at the start of every month. So far, we have not come close to hitting the Conversation limit, but Speech to Text has been quite a bit of trouble. Because Speech to Text runs in the background and activates at the slightest bit of voice input, something as simple as forgetting to disable it when it is not needed while testing in Unity can eat through the minutes at a ridiculous rate. Currently, we are scrambling to find a solution, whether that be the 6-month academic license (which did not look promising when we tried it out) or biting the bullet, upgrading to a standard plan, and paying $0.02 per minute of use. Through this process, we have found IBM Watson’s customer service to be unfortunately slow: more than two weeks after opening our ticket, we still do not have an effective resolution to what seems like a rather simple problem.
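To keep ourselves honest about the limits, the budget math is simple enough to encode. This is a hypothetical back-of-the-envelope tracker we could keep in the test harness, not anything provided by the Watson SDK, using the lite-plan allowance and standard-plan rate mentioned above:

```csharp
using System;

// Hypothetical usage tracker for Speech to Text minutes.
// FreeMinutes and PerMinuteRate come from the plans described above.
class SpeechBudget
{
    public const double FreeMinutes = 100.0;   // lite-plan allowance per month
    public const double PerMinuteRate = 0.02;  // standard-plan rate in USD

    public double MinutesUsed { get; private set; }

    public void Record(double minutes) => MinutesUsed += minutes;

    public double RemainingFreeMinutes =>
        Math.Max(0.0, FreeMinutes - MinutesUsed);

    // What the same usage pattern would cost after upgrading.
    public double ProjectedMonthlyCost(double minutesPerMonth) =>
        minutesPerMonth * PerMinuteRate;
}
```

Even 100 minutes of real use would only be about $2 a month on the standard plan; the real cost risk is the accidental always-on listening during testing, which is why disabling Speech to Text when it is not needed matters so much.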
Furthermore, because we are on the free plan for the Conversation service, we are also limited to one Workspace. Luckily, this has not been an issue thus far. A Workspace is where we set up all the Intents the game needs to scan for, as well as dialogue branches and “entities”: specific types of input such as numbers, locations, or dates, which we have not found a use for yet. The way we have the game set up, a single Workspace holds all the information we need.
To be continued.
Next up, Part 2: Designing for Voice Input.