POST /v3/elevenlabs-scribe-v1
ElevenLabs Speech to Text V1
curl --request POST \
  --url https://api.highwayapi.ai/v3/elevenlabs-scribe-v1 \
  --header 'Authorization: Bearer <API key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "seed": 123,
  "diarize": true,
  "file_format": "other",
  "temperature": 0,
  "num_speakers": 2,
  "language_code": "eng",
  "tag_audio_events": true,
  "cloud_storage_url": "<HTTPS URL of the file>",
  "use_multi_channel": false,
  "timestamps_granularity": "word"
}
'
Transcribe audio or video files. When use_multi_channel is true and the uploaded audio has multiple channels, a ‘transcripts’ object is returned, with one transcript per channel. Otherwise, a single transcription result is returned.

Request Headers

Content-Type
string
required
Enum value: application/json
Authorization
string
required
Bearer authentication format: Bearer {{API key}}.
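
The two required headers can be set in code as in the following minimal Python sketch, which uses only the standard library. The endpoint URL is taken from the curl example on this page; the API key and file URL are hypothetical placeholders, and the request is built but not sent.

```python
import json
import urllib.request

# Endpoint URL taken from the curl example on this page.
ENDPOINT = "https://api.highwayapi.ai/v3/elevenlabs-scribe-v1"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request with the two required headers."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical key and file URL, for illustration only.
request = build_request(
    "YOUR_API_KEY",
    {"cloud_storage_url": "https://example.com/audio.mp3"},
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before any network call is made.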

Request Body

seed
integer
If specified, the system will make a best effort to sample deterministically. Requests with the same seed and parameters should return the same result, but absolute determinism is not guaranteed.
Value range: [0, 2147483647]
diarize
boolean
default:false
Whether to annotate who is speaking in the uploaded file.
file_format
string
default:"other"
The input audio format. pcm_s16le_16 requires 16 kHz, 16-bit integer, mono, little-endian audio, and has lower latency than encoded waveforms.
Allowed values: pcm_s16le_16, other
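
Before choosing pcm_s16le_16, it may help to confirm that the audio meets the stated sample parameters. The sketch below checks a WAV file with Python's standard wave module. It assumes the source audio is available as a WAV file; this page does not say whether the API expects raw PCM or a WAV container, so the function (a hypothetical helper) only verifies the sample rate, sample width, and channel count.

```python
import wave


def matches_pcm_s16le_16(wav_file) -> bool:
    """Return True if a WAV file holds 16 kHz, 16-bit, mono, uncompressed PCM.

    `wav_file` is a path string or a binary file object. WAV stores PCM
    samples little-endian, so byte order needs no separate check.
    """
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == 16000      # 16 kHz sample rate
            and wav.getsampwidth() == 2      # 2 bytes = 16-bit samples
            and wav.getnchannels() == 1      # mono
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )
```

If the check fails, sending the file with file_format set to other is the safer choice.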
temperature
number
Controls the randomness of the transcription output. Higher values make results more diverse and less deterministic. If omitted, the selected model’s default temperature is used (usually 0).
Value range: [0, 2]
num_speakers
integer
The maximum number of speakers in the uploaded file. Setting it can help the model distinguish speakers. Up to 32 speakers are supported.
Value range: [1, 32]
language_code
string
The ISO-639-1 or ISO-639-3 language code of the audio file. Providing it in advance can sometimes improve transcription performance. When null (the default), the language is detected automatically.
tag_audio_events
boolean
default:true
Whether to tag audio events such as (laughter) and (footsteps) in the transcription.
cloud_storage_url
string
required
The HTTPS link to the file to be transcribed. Exactly one of file and cloud_storage_url must be provided; because the request body documented here has no file field, cloud_storage_url is required. The file must be accessible over HTTPS and smaller than 2 GB. Any valid HTTPS address is supported, including cloud storage (AWS S3, GCS, Cloudflare R2, etc.), CDNs, or other HTTPS sources. Presigned links with tokens or authentication via URL query parameters are supported.
use_multi_channel
boolean
default:false
Whether the audio file is multi-channel and each channel contains only a single speaker. When enabled, each channel is transcribed independently and the results are combined. Each word in the output content contains a channel_index field. Up to 5 channels are supported.
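
As an illustration of how the channel_index field might be used, the sketch below groups words by channel. Only channel_index is documented on this page; the "text" field on each word and the sample data are assumptions made for the example, not confirmed API output.

```python
from collections import defaultdict


def words_by_channel(words: list[dict]) -> dict[int, list[str]]:
    """Group word texts by their channel_index, preserving input order."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for word in words:
        # "text" is an assumed field name; only channel_index is documented here.
        grouped[word["channel_index"]].append(word["text"])
    return dict(grouped)


# Hypothetical sample data, not real API output.
sample = [
    {"text": "Hello", "channel_index": 0},
    {"text": "Hi", "channel_index": 1},
    {"text": "there", "channel_index": 0},
]
print(words_by_channel(sample))  # {0: ['Hello', 'there'], 1: ['Hi']}
```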
diarization_threshold
number
The diarization threshold. A higher value makes it less likely that one person is split into multiple speakers, but more likely that different people are merged into one speaker (fewer speakers detected). A lower value has the opposite effect (more speakers detected). Can only be set when diarize is true and num_speakers is not set. If omitted, the threshold is selected based on the model ID (usually 0.22).
Value range: [0.1, 0.4]
timestamps_granularity
string
default:"word"
The granularity of timestamps in the transcription. ‘word’ provides word-level timestamps, while ‘character’ provides timestamps for each character.
Allowed values: none, word, character
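
The value ranges, allowed values, and the diarization_threshold constraint listed above can be checked on the client before a request is sent. The following is a minimal sketch of such a check, not the server's actual validation logic; validate_payload is a hypothetical helper, and the server may apply further rules not documented here.

```python
MAX_SEED = 2147483647
FILE_FORMATS = {"pcm_s16le_16", "other"}
GRANULARITIES = {"none", "word", "character"}


def validate_payload(payload: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    errors = []

    def check_range(name, low, high):
        # Optional numeric fields are only checked when present.
        value = payload.get(name)
        if value is not None and not (low <= value <= high):
            errors.append(f"{name} must be in [{low}, {high}]")

    check_range("seed", 0, MAX_SEED)
    check_range("temperature", 0, 2)
    check_range("num_speakers", 1, 32)
    check_range("diarization_threshold", 0.1, 0.4)

    if not str(payload.get("cloud_storage_url", "")).startswith("https://"):
        errors.append("cloud_storage_url is required and must be an HTTPS link")
    if payload.get("file_format", "other") not in FILE_FORMATS:
        errors.append("file_format must be pcm_s16le_16 or other")
    if payload.get("timestamps_granularity", "word") not in GRANULARITIES:
        errors.append("timestamps_granularity must be none, word, or character")

    # diarization_threshold is only valid with diarize=true and no num_speakers.
    if payload.get("diarization_threshold") is not None:
        if not payload.get("diarize", False) or payload.get("num_speakers") is not None:
            errors.append(
                "diarization_threshold requires diarize=true and no num_speakers"
            )
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.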

Response Information

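The response is one of the following two types: a single transcription result (text, words, channel_index, language_code, transcription_id, language_probability) or, when use_multi_channel is true and the audio has multiple channels, a multi-channel result (transcripts, transcription_id).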
text
string
required
The raw transcribed text.
words
array
required
A list of words and their timing information.
channel_index
integer
The channel index corresponding to this transcript (valid for multi-channel audio).
language_code
string
required
The detected language code (for example, ‘eng’ for English).
transcription_id
string
The unique transcription ID for this response.
language_probability
number
required
The confidence of language detection (between 0 and 1).
transcripts
array
required
A list of transcripts corresponding to each audio channel. Each transcript contains the text for its channel and word-level details.
transcription_id
string
The unique transcription ID for this response.
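
Because the response can take either shape, client code needs to tell them apart. One way to do this, sketched below, is to test for the transcripts key. The helper name and the sample dictionaries in the usage lines are hypothetical; the field names text and transcripts come from the descriptions above.

```python
def transcript_texts(response: dict) -> list[str]:
    """Return one text per channel, or a single-item list for a single result."""
    if "transcripts" in response:
        # Multi-channel response: one transcript object per audio channel.
        return [transcript["text"] for transcript in response["transcripts"]]
    # Single transcription result.
    return [response["text"]]


# Hypothetical responses, trimmed to the fields used here.
single = {"text": "Hello there", "words": []}
multi = {"transcripts": [{"text": "Hello"}, {"text": "Hi"}]}
print(transcript_texts(single))  # ['Hello there']
print(transcript_texts(multi))   # ['Hello', 'Hi']
```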