Libraries for speech recognition and speech synthesis are needed to improve the ability of the HOAP-2 robot to interact with humans.
For years, scientists have been trying to make software imitate humans as closely as possible. The goal of this project was not to implement a new solution for speech
processing, but to take existing software and adapt it to interaction with HOAP-2.
A hardware device was also developed to amplify the audio output of the computer.
Speech processing software
The first stage was to look for open-source programs for speech recognition and speech synthesis.
Sphinx 2 is a speech recognition program developed at Carnegie Mellon University
(http://cmusphinx.sourceforge.net). Based on a hidden Markov model (HMM) architecture and
the Viterbi algorithm, it tries to recognise the spoken word from a given vocabulary.
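
To illustrate the principle only, here is a minimal sketch of Viterbi decoding on a toy HMM. It is not Sphinx code: the two hidden states, the three observation symbols and all probabilities are invented for the example, whereas a real recogniser works with acoustic models trained on speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return prob, path


# Toy model (all names and numbers are hypothetical).
states = ("vowel", "fricative")
start_p = {"vowel": 0.6, "fricative": 0.4}
trans_p = {
    "vowel": {"vowel": 0.7, "fricative": 0.3},
    "fricative": {"vowel": 0.4, "fricative": 0.6},
}
emit_p = {
    "vowel": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "fricative": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

prob, path = viterbi(("low", "mid", "high"), states, start_p, trans_p, emit_p)
print(path, prob)
```

For each observation, the algorithm keeps only the best path into every state, which is why the cost grows linearly with the length of the utterance rather than exponentially.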
Festival is a text-to-speech program developed at the University of Edinburgh (http://www.cstr.ed.ac.uk/projects/festival/). It analyses the phrase
to be spoken and cuts it into a list of diphones. It then looks these diphones up in its database and joins the corresponding voice samples together.
Speech system
The second stage consisted in incorporating these programs into an application able to interact with humans, programs and machines. To
do this, a client-server architecture was chosen, because it lets the speech system run autonomously and makes it easy to adapt to new speech processing applications.
In the figure below, we can see the two entities, communicating over the LAN. The microphone and the loudspeaker are plugged into the server, which only handles
raw strings. The strings recognised by Sphinx are sent to the client, and the strings sent by the client are redirected to Festival. The exceptions are the three
commands executed directly by the server: change the speaker used by Festival, pause recognition, and shut down.
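
The routing rule for strings arriving from the client can be sketched as follows. This is a minimal illustration, not the project's server code: the "/" prefix, the command keywords and the action names are all assumptions, since the actual command strings are not given here.

```python
# Hypothetical command keywords mapped to the three server-side actions.
SERVER_COMMANDS = {
    "/voice": "change_speaker",     # argument: name of the Festival speaker
    "/pause": "pause_recognition",  # temporarily stop recognition
    "/quit": "shutdown",            # stop the server
}


def route_client_message(message):
    """Decide what the server does with a raw string received from the client.

    Returns an (action, payload) pair. Anything that is not one of the
    three server commands is treated as text to be spoken by Festival.
    """
    text = message.strip()
    keyword, _, argument = text.partition(" ")
    if keyword in SERVER_COMMANDS:
        return SERVER_COMMANDS[keyword], argument.strip()
    return "speak", text


print(route_client_message("/voice speaker_2"))
print(route_client_message("Hello, I am HOAP-2"))
```

Keeping the decision in one pure function separates it from the network and audio code, so the same rule can be reused whatever transport carries the strings.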

The client handles the strings recognised by Sphinx. An example of decoding is provided in the demonstration application. First, the general category of
the sentence is extracted: question, action… Then, depending on the category, the exact meaning is determined, so that the application can react correctly and
interact with the human.
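
The two-step decoding described above can be sketched as follows. This is not the code of the demonstration application: the word lists, the known questions and the returned labels are hypothetical and serve only to show the structure (category first, then exact meaning).

```python
# Hypothetical vocabularies for the sketch.
QUESTION_WORDS = {"what", "who", "where", "how"}
ACTION_WORDS = {"walk", "stop", "turn", "sit", "stand"}
KNOWN_QUESTIONS = {
    "what is your name": "name",
    "how are you": "mood",
}


def categorise(words):
    """Step 1: extract the general category of the sentence."""
    if not words:
        return "unknown"
    if words[0] in QUESTION_WORDS:
        return "question"
    if any(w in ACTION_WORDS for w in words):
        return "action"
    return "unknown"


def decode(sentence):
    """Step 2: find the exact meaning, depending on the category."""
    words = sentence.lower().split()
    category = categorise(words)
    if category == "question":
        # Meaning is the topic of a known question, or None if not known.
        return category, KNOWN_QUESTIONS.get(" ".join(words))
    if category == "action":
        # Meaning is the first action verb found in the sentence.
        return category, next(w for w in words if w in ACTION_WORDS)
    return category, None


print(decode("WHAT IS YOUR NAME"))
print(decode("please walk forward"))
```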
Results
The whole system works correctly. The developed applications are simple to use. Their execution is reliable, fast and light on resources.
Festival met our expectations, but we have to be lenient with Sphinx. For good results, we have to speak distinctly, with an American accent and, above all, the smaller the vocabulary in the dictionary, the better the chances of getting a correct answer. Male and female voices are recognised equally well.
With an appropriate algorithm, we can handle incorrectly recognised sentences and still react in the right
way.
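
The algorithm used in the project is not detailed here. As one possible approach, a misrecognised sentence can be mapped to the closest sentence the application knows, using approximate string matching from the Python standard library. The list of known sentences and the 0.6 similarity threshold are assumptions made for this sketch.

```python
import difflib

# Hypothetical list of sentences the application knows how to react to.
KNOWN_SENTENCES = [
    "walk forward",
    "turn left",
    "turn right",
    "stop",
    "what is your name",
]


def closest_sentence(recognised, known=KNOWN_SENTENCES, cutoff=0.6):
    """Return the known sentence most similar to the recognised one.

    Returns None when nothing is similar enough, so that the application
    can ask the speaker to repeat instead of reacting wrongly.
    """
    matches = difflib.get_close_matches(
        recognised.lower(), known, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(closest_sentence("walk for word"))
print(closest_sentence("turn lift"))
```

A higher cutoff makes the system reject more sentences, while a lower one makes it more tolerant but also more likely to pick a wrong command.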