An integral part of building this speech recognition system is planning how IBM Watson and Unity communicate with one another. A single Workspace is sufficient for our purposes because every character shares the same set of Intents; only in code does the game react differently based on which character the player is speaking to. Within the C# script that handles the speech recognition system, we have a state machine set up so that the player can only have a conversation with one character at any given time; while the player is talking to that character, that character's voice lines play when Intents are triggered.
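As a rough illustration (not our actual script), the idea can be sketched as a small state machine that tracks the one active conversation and routes every Intent to whichever character is active. The character names, intent ids, and lines below are placeholders:

```csharp
using System.Collections.Generic;

public enum ConversationState { Idle, Talking }

public class ConversationManager
{
    public ConversationState State { get; private set; } = ConversationState.Idle;
    public string ActiveCharacter { get; private set; }

    // Every character responds to the same Intents; only the lines differ.
    // These entries are illustrative placeholders.
    private readonly Dictionary<string, Dictionary<string, string>> _lines =
        new Dictionary<string, Dictionary<string, string>>
        {
            ["Guard"]    = new Dictionary<string, string> { ["greeting"] = "Halt! Who goes there?" },
            ["Merchant"] = new Dictionary<string, string> { ["greeting"] = "Welcome, friend!" },
        };

    // Only one conversation at a time: starting a new one replaces the old.
    public void StartConversation(string character)
    {
        ActiveCharacter = character;
        State = ConversationState.Talking;
    }

    public void EndConversation()
    {
        ActiveCharacter = null;
        State = ConversationState.Idle;
    }

    // Called when Watson reports an Intent; returns the active character's
    // line for it, or null when no conversation is active or no line exists.
    public string OnIntent(string intent)
    {
        if (State != ConversationState.Talking) return null;
        if (!_lines.TryGetValue(ActiveCharacter, out var table)) return null;
        return table.TryGetValue(intent, out var line) ? line : null;
    }
}
```

The payoff of this shape is that the Watson side never needs to know which character is being addressed; the same `greeting` Intent simply resolves to a different line depending on the active conversation.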
Currently, we are still designing how the player starts and ends a conversation with a character. Some ideas include saying the character's name or title, greeting them in their native language, tapping them on the shoulder, looking at them for a set amount of time, or some combination of these. Figuring out how conversations should begin also leads us to think about how to give the player a more controlled way to talk, and how the system should handle anything that is said that does not register as an Intent.
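One way to handle speech that does not register as an Intent is a confidence cutoff: Watson Assistant returns candidate intents with confidence scores, and anything below a threshold can be treated as unrecognized and answered with a stock fallback line. This is a hypothetical sketch, not our current implementation; the threshold value and `FallbackLine` are our own placeholders:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class IntentFilter
{
    public const float MinConfidence = 0.5f;   // tuning assumption, not a Watson default
    public const string FallbackLine = "Sorry, could you repeat that?";

    // Pick the top-scoring intent, or null if nothing clears the bar
    // (which the caller would answer with FallbackLine).
    public static string ResolveIntent(List<(string Intent, float Confidence)> results)
    {
        var top = results.OrderByDescending(r => r.Confidence).FirstOrDefault();
        return (top.Intent != null && top.Confidence >= MinConfidence) ? top.Intent : null;
    }
}
```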
A recurring problem while showing demos has been environments that are too loud, with too many background conversations, which makes the speech recognition system behave poorly. The game runs flawlessly when it is quiet and only one person is speaking. Once it picks up multiple voices, however, it starts to fall apart: it does not know which voice is the player's, so stray words from other conversations get added to the Speech to Text parsing, or parsing finishes long after the player stops talking because the other voices are mistaken for part of the input.
Some possible solutions we are looking into: setting up a volume threshold on the microphone so that nothing below a certain level (like background conversations) gets parsed; requiring the player to hold down a button, or raise the controller to their ear as if using a comm radio (or both), whenever they talk, so the game only listens while they signal that they are speaking and automatically ignores all other speech and noise; creating a calibration step where the player requests a radio check to test microphone clarity and the game responds with a result; or training the player to follow a convention, such as always saying "Over" to end their voice input.
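The volume-threshold idea can be sketched as a simple RMS gate over the most recent microphone samples: audio is only forwarded to Speech to Text while its level stays above a floor, so quieter background chatter is dropped. This is a sketch under assumptions, not tested code; the threshold value is a placeholder that would need calibration (which the radio-check idea could provide):

```csharp
using System;

public static class VoiceGate
{
    public const float RmsThreshold = 0.02f;  // assumed floor; would be tuned per venue

    // Root-mean-square amplitude of a buffer of samples in [-1, 1],
    // e.g. pulled from the microphone's AudioClip each frame.
    public static float Rms(float[] samples)
    {
        if (samples.Length == 0) return 0f;
        double sum = 0;
        foreach (var s in samples) sum += s * s;
        return (float)Math.Sqrt(sum / samples.Length);
    }

    // True when the buffer is loud enough to count as the player speaking;
    // quieter audio (background conversations) would simply not be parsed.
    public static bool IsPlayerSpeaking(float[] samples) =>
        Rms(samples) >= RmsThreshold;
}
```

Push-to-talk composes naturally with this: the gate only runs while the button is held, so both conditions must hold before anything reaches Speech to Text.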
For the thesis show, we will be building a small booth, in addition to situating ourselves in a corner next to a project that requires a large projection screen, in order to mask as much of the noise as possible. We also hope this booth will give people demoing the game enough privacy to feel comfortable and less self-conscious about using their voice, since talking is such an unconventional mode of interaction within games.
To be continued.
Next up, Part 3: Findings.