
> Once we have individual tracks to work with, we begin transcription. This is the most resource-intensive part of the process. We rely on the Whisper AI transcription model from OpenAI, via WhisperX. The WhisperX project also uses wav2vec2 to provide accurate word-level timestamps, which is important for sentence-level synchronization. The transcription process is fairly standard; the only interesting addition Storyteller makes is to supply an "initial prompt" to the transcription model, outlining its task as transcribing an audiobook chapter and providing, as hints, a list of words from the book that don't exist in the English dictionary.

https://smoores.gitlab.io/storyteller/docs/how-it-works/the-...
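The "initial prompt" trick described above can be sketched as below. This is an illustrative reconstruction, not Storyteller's actual code: the function name and prompt wording are invented, and the way the prompt is threaded into WhisperX (shown only in a comment) is an assumption about its API.

```python
# Sketch of the "initial prompt" idea: tell Whisper what it is
# transcribing, and list out-of-dictionary words from the book as
# spelling hints. Names and wording here are illustrative.

def build_initial_prompt(rare_words):
    """Compose an initial prompt for the transcription model.

    rare_words: words from the book that a spell-checker would not
    recognize (character names, invented terms, etc.).
    """
    prompt = "The following is a transcription of a chapter from an audiobook."
    if rare_words:
        # Deduplicate and sort so the prompt is stable across runs.
        hints = ", ".join(sorted(set(rare_words)))
        prompt += f" It may contain these uncommon words: {hints}."
    return prompt

prompt = build_initial_prompt(["Hogwarts", "Quidditch", "Dumbledore"])
print(prompt)

# In WhisperX the prompt would be handed to the underlying model;
# the exact parameter path is an assumption, roughly:
#   model = whisperx.load_model("large-v2", device,
#                               asr_options={"initial_prompt": prompt})
```

Biasing the decoder this way makes it far more likely that proper nouns and invented words are transcribed with the book's spelling, which matters later when transcript text is matched against the ebook text.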


