Estimating The Cost of Speech to Text and Natural Language Understanding Integration

When using the estimators from IBM and Google for Speech to Text (STT), the standard input is an estimated number of minutes. For the most accurate estimate, should this be based on handled calls (talk + hold times), or should we expand it to received calls and/or AWC, etc.?

In short, what data is sent to IBM or Google for transcription?

For the Natural Language Understanding (NLU), once the audio is transcribed, does anyone know of an estimated character count per minute of talk time? I'm just trying to get an idea of how many NLU units to anticipate per call.


Hi Brian,
For transcription, only the talk time is actually sent, unless you're using conversational IVR in the scenario. In that case it also includes the speech recognition in the IVR.
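
To make the talk-time rule concrete, here is a minimal Python sketch of the minutes you would enter into the estimator. The function names, the example volumes, and the per-minute price are all hypothetical placeholders, not figures from IBM or Google; plug in your own reporting data and the current published rate.

```python
def estimate_stt_minutes(calls, avg_talk_seconds, avg_ivr_speech_seconds=0.0):
    """Estimate the minutes of audio sent for transcription.

    Only talk time is counted (hold time is excluded). Speech recognized
    in a conversational IVR is added only if that scenario applies.
    """
    per_call_seconds = avg_talk_seconds + avg_ivr_speech_seconds
    return calls * per_call_seconds / 60.0


def estimate_stt_cost(minutes, price_per_minute):
    """Multiply estimated minutes by a per-minute price (placeholder value)."""
    return minutes * price_per_minute


# Hypothetical monthly volume: 10,000 handled calls, 4 minutes of talk time
# each, plus 30 seconds of IVR speech recognition per call.
minutes = estimate_stt_minutes(calls=10_000, avg_talk_seconds=240,
                               avg_ivr_speech_seconds=30)
print(f"Estimated STT minutes: {minutes:,.0f}")
print(f"Estimated cost at an assumed 0.02 per minute: {estimate_stt_cost(minutes, 0.02):,.2f}")
```

If you are not using conversational IVR, leave `avg_ivr_speech_seconds` at its default of zero so that only talk time is counted.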

NLU is much more challenging to estimate. It is counted in NLU units (messages) multiplied by the number of features requested, such as bot, sentiment, and others. It is still an open exercise for me. If anyone has ideas on how to calculate it, please let me know :))
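
For what it's worth, the "messages multiplied by features" rule can be sketched in Python as below. This is only a rough starting point under that stated assumption: the function name, the feature labels, and the message counts are hypothetical, and the real difficulty is still finding a realistic messages-per-call figure.

```python
def estimate_nlu_units(messages_per_call, features, calls=1):
    """Estimate NLU units as messages x number of features requested.

    `features` is a list of feature labels (hypothetical names here);
    each message is assumed to be billed once per requested feature.
    """
    return calls * messages_per_call * len(features)


# Hypothetical example: 12 messages per call, analyzed for bot and
# sentiment, across 10,000 calls.
features = ["bot", "sentiment"]
per_call = estimate_nlu_units(messages_per_call=12, features=features)
total = estimate_nlu_units(messages_per_call=12, features=features, calls=10_000)
print(f"NLU units per call: {per_call}")
print(f"NLU units for 10,000 calls: {total:,}")
```

Running the same calculation with a low and a high guess for messages per call gives a range rather than a single number, which is probably the most honest way to present the estimate.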
