Scribe v1 Speech to Text

The world's most accurate speech-to-text model with 96.7% accuracy in English. Outperforms Whisper v3, Gemini 2.0 Flash, and Deepgram Nova-3 across 99 languages.

Supports: Speech to Text

Key Features

96.7% Accuracy

Industry-leading accuracy with just 3.3% word error rate in English, outperforming all major competitors

99 Languages

Transcribe speech in 99 languages with automatic language detection and code-switching support

32-Speaker Diarization

Identify and label up to 32 different speakers in a single recording with pinpoint accuracy

Word-Level Timestamps

Get precise word-level timestamps for perfect synchronization and subtitle generation

Audio Event Tagging

Automatically tag non-verbal sounds like (laughter), (applause), (footsteps) for richer context

Code Switching

Seamlessly handle switching between different languages within the same audio file

Pricing

Transparent credit-based pricing

4 credits per 60 seconds
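
As a rough illustration of the listed rate, the sketch below estimates the credits for a recording of a given length. The function name and the `round_up_to_minute` switch are hypothetical: the page does not say whether partial minutes are billed pro rata or rounded up, so both behaviours are shown as assumptions.

```python
import math


def estimate_credits(duration_seconds: float,
                     credits_per_minute: int = 4,
                     round_up_to_minute: bool = False) -> float:
    """Estimate credits at the listed rate of 4 credits per 60 seconds.

    Billing granularity is an assumption: by default partial minutes are
    charged pro rata; set round_up_to_minute=True to bill whole minutes.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = duration_seconds / 60
    if round_up_to_minute:
        minutes = math.ceil(minutes)
    return minutes * credits_per_minute


# A 90-second clip under each assumption
print(estimate_credits(90))                           # pro rata
print(estimate_credits(90, round_up_to_minute=True))  # whole minutes
```

Under the pro rata assumption a 90-second clip costs 6 credits; if whole minutes are billed it costs 8.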

How to Use

Transcribe audio with world-class accuracy in three steps

1. Upload Audio

Upload your audio or video file in any major format, up to 3 GB in size

2. Configure Options

Enable speaker diarization, timestamps, and audio event tagging as needed

3. Get Results

Receive structured transcription with speaker labels, timestamps, and event tags
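
Before uploading, the limits mentioned on this page (3 GB file size, up to 32 speakers) can be checked on the client side. The sketch below is one possible pre-flight check, not the platform's actual API: `TranscriptionOptions`, `validate_request`, and every field name are hypothetical, and 3 GB is assumed to mean 3 × 10^9 bytes.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits taken from this page; the byte interpretation of "3 GB" is an assumption.
MAX_FILE_BYTES = 3 * 1000**3
MAX_SPEAKERS = 32


@dataclass
class TranscriptionOptions:
    """Hypothetical container for the options named in step 2."""
    diarize: bool = False
    timestamps: bool = True
    tag_audio_events: bool = False
    num_speakers: Optional[int] = None  # only meaningful when diarize is True


def validate_request(file_size_bytes: int,
                     options: TranscriptionOptions) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if file_size_bytes <= 0:
        problems.append("file is empty")
    elif file_size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds the 3 GB limit")
    if options.num_speakers is not None:
        if not options.diarize:
            problems.append("num_speakers is set but diarization is disabled")
        elif not 1 <= options.num_speakers <= MAX_SPEAKERS:
            problems.append("num_speakers must be between 1 and 32")
    return problems


opts = TranscriptionOptions(diarize=True, num_speakers=40)
print(validate_request(500_000_000, opts))
```

Returning a list of problems rather than raising on the first one lets a form show every issue at once.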

Technical Specifications

Provider: ElevenLabs
Model ID: scribe_v1
Accuracy (English): 96.7% (3.3% WER)
Languages Supported: 99 languages
Max File Size: 3 GB
Max Speakers: 32 speakers
Supported Formats: All major audio/video formats
Output Format: Structured JSON with timestamps

Use Cases

Enterprise Meetings

Transcribe complex multi-speaker meetings with accurate speaker identification for up to 32 participants

Global Content

Process multilingual content with code-switching support for international teams and audiences

Media Production

Generate precise subtitles with word-level timestamps and audio event markers

Podcast Transcription

Create searchable transcripts with speaker labels for podcast archives and SEO

Research & Analysis

Transcribe interviews and focus groups with high accuracy for qualitative research

Accessibility

Generate accurate captions for deaf and hard-of-hearing audiences across 99 languages
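
For the subtitle and captioning use cases above, word-level timestamps can be grouped into SubRip (SRT) cues. The sketch below assumes each word arrives as a dictionary with `text`, `start`, and `end` keys (times in seconds); that shape, the function names, and the fixed words-per-cue rule are illustrative assumptions, not the documented output schema.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(chunk[0]["start"])  # first word opens the cue
        end = format_timestamp(chunk[-1]["end"])     # last word closes it
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word entries, for illustration only
sample = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(sample))
```

A production subtitle pipeline would usually also break cues at pauses and sentence boundaries and cap line length; the fixed word count keeps the sketch short.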

Model Comparison

Feature | Scribe v1 | Whisper v3
English Accuracy | 96.7% | ~94%
Languages | 99 | 97
Speaker Diarization | Up to 32 | Not built-in
Audio Events | Yes | No
Code Switching | Yes | Limited
Max File Size | 3 GB | Varies

Frequently Asked Questions

Find answers to common questions about this model

Scribe v1 is ElevenLabs' state-of-the-art automatic speech recognition (ASR) model. It achieves 96.7% accuracy in English and consistently outperforms leading models like OpenAI Whisper v3, Google Gemini 2.0 Flash, and Deepgram Nova-3 across 99 languages.

Scribe v1 achieves a word error rate (WER) of just 3.3% in English and 1.3% in Italian according to FLEURS benchmarks. Taking accuracy as 100% minus WER, the English figure corresponds to approximately 96.7% accuracy, making it the most accurate publicly available ASR model.
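
To make the WER figure concrete, the sketch below computes word error rate in the standard way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. It is a minimal illustration only; benchmark suites typically apply more thorough text normalization than the lowercasing used here, and the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}, accuracy: {1 - wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% on very poor output, which is why "accuracy" derived from it is only an approximation.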

Scribe v1 supports 99 languages with automatic language detection. It also handles code-switching, meaning it can accurately transcribe audio that switches between different languages within the same recording.

Scribe v1 can identify and label up to 32 different speakers in a single recording. Each speaker is labeled accurately, making it ideal for complex meetings, panel discussions, and multi-participant conversations.
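
Diarized output is usually easier to read once consecutive words from the same speaker are merged into turns. The sketch below assumes each word is a dictionary with `text`, `start`, `end`, and `speaker_id` keys; those field names and the sample data are assumptions made for illustration.

```python
from typing import Dict, List


def group_speaker_turns(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words with the same speaker_id into one turn."""
    turns: List[Dict] = []
    for word in words:
        speaker = word["speaker_id"]
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = word["end"]
            turns[-1]["text"] += " " + word["text"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker_id": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns


# Hypothetical diarized words, for illustration only
sample = [
    {"text": "Good", "start": 0.0, "end": 0.2, "speaker_id": "speaker_0"},
    {"text": "morning", "start": 0.2, "end": 0.6, "speaker_id": "speaker_0"},
    {"text": "Hi", "start": 0.9, "end": 1.1, "speaker_id": "speaker_1"},
]
for turn in group_speaker_turns(sample):
    print(f'{turn["speaker_id"]}: {turn["text"]}')
```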

Audio event tagging automatically detects and labels non-verbal sounds in your transcription, such as (laughter), (applause), (footsteps), or (music). This adds valuable context that pure speech transcription misses.
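
If event tags appear inline in the transcript text in the parenthesized form shown above, they can be separated from the spoken words with a regular expression. This is a sketch under that assumption; the function name is hypothetical, and any ordinary parenthetical in the text would also be treated as an event.

```python
import re
from typing import List, Tuple

# Matches a single non-nested parenthesized tag such as "(laughter)".
EVENT_PATTERN = re.compile(r"\(([^()]+)\)")


def split_audio_events(transcript: str) -> Tuple[str, List[str]]:
    """Return (speech_without_tags, list_of_event_labels)."""
    events = [label.strip().lower() for label in EVENT_PATTERN.findall(transcript)]
    speech = EVENT_PATTERN.sub(" ", transcript)
    speech = re.sub(r"\s+", " ", speech).strip()  # collapse leftover whitespace
    return speech, events


speech, events = split_audio_events(
    "Thanks everyone (applause) for coming. (laughter)"
)
print(speech)
print(events)
```

Where the structured output marks events with their own entry type, filtering on that field would be more reliable than pattern matching on text.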

Scribe v1 supports all major audio and video formats including MP3, WAV, AAC, M4A, OGG, MP4, WebM, and more. The maximum file size is 3 GB.

In benchmark tests (FLEURS and Common Voice), Scribe v1 consistently outperforms OpenAI Whisper Large v3 in accuracy across all 99 supported languages. It also provides speaker diarization, which is not built into Whisper v3.

All transcriptions generated through our platform can be used for commercial purposes, including business meetings, podcasts, video subtitles, and content production, without any additional licensing fees.
