Question

I'm working on a program for tone-deaf people. I've been working with SAPI and a TTS engine. The program plays a 3D animation of a hand at the same time as the speech. The problem is that the voices, even at their slowest rate, are too fast for what I want. So I've thought about using speech recognition instead, but I have to do a lot of processing on the text before the animation starts.

So I'd like to know whether it would be possible to run speech recognition on my own voice (recorded in a .wav file) and afterwards do the same processing I do with TTS (using SAPI events...), but driven by the .wav file of my voice.

If it's possible, please tell me how. If you think there are better alternatives, let me know.

Thanks for your time (and excuse my English)

Jesuskiewicz

Answer 1

Now that I understand what you want to happen, I can say that, as far as I know, the SAPI SR engine doesn't really provide phoneme-level markup that's synchronized to the incoming text.

What you could try (although I have no real expectation that this will work) would be to take the audio, run it through a pronunciation grammar to generate phonemes, and then use the text elements to find the corresponding bits of audio.

When I say a "pronunciation grammar", I mean a dictation grammar with the pronunciation model loaded. Set it up like this:

CComPtr<ISpRecoGrammar> cpGrammar;
// ... initialize SR engine and create a grammar ...
// Load the dictation grammar with the pronunciation topic.
HRESULT hr = cpGrammar->LoadDictation(L"Pronunciation", SPLO_STATIC);

In your recognition handler, you would need to parse out the elements:

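ISpRecoResult* ipReco;  // the result delivered to your recognition handler
SPPHRASE* pPhrase = NULL;
if (SUCCEEDED(ipReco->GetPhrase(&pPhrase)))
{
    // ulCountOfElements is a ULONG, so use an unsigned loop index.
    for (ULONG i = 0; i < pPhrase->Rule.ulCountOfElements; ++i)
    {
        const SPPHRASEELEMENT* pElem = pPhrase->pElements + i;
        // examine pElem->ulAudioTimeOffset, pElem->ulAudioSizeTime, etc.
    }
    // GetPhrase allocates the SPPHRASE with CoTaskMemAlloc; the caller frees it.
    ::CoTaskMemFree(pPhrase);
}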
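
Once you have the element timings, turning them into something the animation can use might look like the sketch below. It is only an illustration: ElementTiming, AnimationCue and BuildCues are made-up names, not SAPI types. ElementTiming just mirrors the ulAudioTimeOffset and ulAudioSizeTime fields, which SAPI reports in 100-nanosecond units measured from the start of the phrase, and the sketch converts those to milliseconds.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two timing fields of SPPHRASEELEMENT.
struct ElementTiming {
    std::uint32_t audioTimeOffset; // copy of ulAudioTimeOffset (100-ns units from phrase start)
    std::uint32_t audioSizeTime;   // copy of ulAudioSizeTime (100-ns units)
};

// Hypothetical animation cue: when a gesture starts and how long it lasts.
struct AnimationCue {
    double startMs;
    double durationMs;
};

// 1 ms = 10,000 units of 100 ns.
constexpr double kTicksPerMillisecond = 10000.0;

// Convert per-element audio timings into millisecond cues for the animation.
std::vector<AnimationCue> BuildCues(const std::vector<ElementTiming>& elements) {
    std::vector<AnimationCue> cues;
    cues.reserve(elements.size());
    for (const ElementTiming& e : elements) {
        cues.push_back({e.audioTimeOffset / kTicksPerMillisecond,
                        e.audioSizeTime / kTicksPerMillisecond});
    }
    return cues;
}
```

In the loop above you would fill one ElementTiming per pElem, then call BuildCues after the loop and hand the result to whatever schedules the hand animation while the .wav file plays.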

I hope this is enough to get you started...
